Educational and Psychological Measurement, Ahead of Print.
This simulation study investigated to what extent departures from construct similarity, as well as differences in the difficulty and targeting of scales, affect the score transformation when scales are equated by means of concurrent calibration using the partial credit model with a common-person design. Practical implications of the simulation results are discussed with a focus on scale equating in health-related research settings. The study simulated data for two scales, varying the number of items and the sample sizes. The factor correlation between the scales was used to operationalize construct similarity. Targeting of the scales was operationalized through increasing departures from equal difficulty and by varying the dispersion of the item and person parameters in each scale. The results show that low similarity between scales is associated with lower transformation precision. At equal levels of similarity, precision improves in settings where the range of the item parameters encompasses the range of the person parameters. With decreasing similarity, score transformation precision benefits more from good targeting. Difficulty shifts of up to two logits somewhat increased estimation bias without affecting transformation precision. This robustness against difficulty shifts supports the advantage of applying a true-score equating method over identity equating, which was used as a naive baseline method for comparison. Finally, larger sample sizes did not improve transformation precision in this study, and longer scales improved the quality of the equating only marginally. The insights from the simulation study are applied in a real-data example.
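For readers unfamiliar with true-score equating under the partial credit model, the following minimal numpy sketch shows the basic idea: once concurrent calibration has placed both scales on a common logit metric, each raw score on scale A is mapped through the latent metric to an expected raw score on scale B. The step difficulties below are invented for illustration; this is not the study's code.

```python
import numpy as np

def pcm_category_probs(theta, deltas):
    """Partial credit model category probabilities for one item.
    theta: person location (logits); deltas: array of step difficulties."""
    cumulative = np.concatenate(([0.0], np.cumsum(theta - deltas)))
    expo = np.exp(cumulative - cumulative.max())   # stabilized softmax
    return expo / expo.sum()

def expected_score(theta, item_deltas):
    """Expected raw score on a scale at a given theta."""
    return sum(np.dot(np.arange(len(p)), p)
               for p in (pcm_category_probs(theta, np.asarray(d)) for d in item_deltas))

def true_score_transformation(deltas_a, deltas_b, grid=np.linspace(-6, 6, 601)):
    """Map interior raw scores on scale A to expected raw scores on scale B,
    assuming both item sets already sit on a common logit metric
    (as after concurrent calibration)."""
    exp_a = np.array([expected_score(t, deltas_a) for t in grid])
    exp_b = np.array([expected_score(t, deltas_b) for t in grid])
    raw_a = np.arange(1, sum(len(d) for d in deltas_a))  # interior raw scores on A
    theta_at_a = np.interp(raw_a, exp_a, grid)           # invert A's test characteristic curve
    return raw_a, np.round(np.interp(theta_at_a, grid, exp_b), 2)

# Made-up step difficulties for two short polytomous scales on a shared metric
scale_a = [[-1.0, 0.5], [0.0, 1.2], [-0.5, 0.8]]
scale_b = [[-0.8, 0.3], [0.2, 1.0], [0.1, 1.5]]
print(true_score_transformation(scale_a, scale_b))
```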
Equating Oral Reading Fluency Scores: A Model-Based Approach
Educational and Psychological Measurement, Ahead of Print.
Words read correctly per minute (WCPM) is the reporting score metric in oral reading fluency (ORF) assessments, which are widely used as part of curriculum-based measurement to screen at-risk readers and to monitor the progress of students who receive interventions. As with other types of assessments with multiple forms, equating is necessary when WCPM scores obtained from multiple ORF passages are to be compared both between and within students. This article proposes a model-based approach for equating WCPM scores. A simulation study was conducted to evaluate the performance of the model-based equating approach along with some observed-score equating methods under an external anchor test design.
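The article's model-based approach is not reproduced here, but as a point of reference, the sketch below illustrates one generic observed-score method with an external anchor (chained linear equating); whether this particular method is among those examined in the article is not stated in the abstract, and the WCPM distributions are simulated and purely hypothetical.

```python
import numpy as np

def linear_link(x, mean_from, sd_from, mean_to, sd_to):
    """Linear transformation matching the means and SDs of two score scales."""
    return mean_to + (sd_to / sd_from) * (x - mean_from)

def chained_linear_equate(x, form_x, anchor_x, form_y, anchor_y):
    """Chained linear equating of scores on form X to the scale of form Y
    through an external anchor taken by both (non-equivalent) groups."""
    form_x, anchor_x = np.asarray(form_x, float), np.asarray(anchor_x, float)
    form_y, anchor_y = np.asarray(form_y, float), np.asarray(anchor_y, float)
    # Step 1: X -> anchor scale, estimated in the group that took form X
    v = linear_link(x, form_x.mean(), form_x.std(ddof=1),
                    anchor_x.mean(), anchor_x.std(ddof=1))
    # Step 2: anchor -> Y scale, estimated in the group that took form Y
    return linear_link(v, anchor_y.mean(), anchor_y.std(ddof=1),
                       form_y.mean(), form_y.std(ddof=1))

# Hypothetical WCPM scores: each group reads its own passage plus the anchor passage
rng = np.random.default_rng(0)
group1_passage = rng.normal(110, 25, 500)   # form X WCPM
group1_anchor  = rng.normal(100, 22, 500)
group2_passage = rng.normal(118, 27, 500)   # form Y WCPM
group2_anchor  = rng.normal(102, 23, 500)
print(chained_linear_equate(115, group1_passage, group1_anchor,
                            group2_passage, group2_anchor))
```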
Functional Approaches for Modeling Unfolding Data
Educational and Psychological Measurement, Ahead of Print.
The purpose of this study is to introduce a functional approach for modeling unfolding response data. Functional data analysis (FDA) has been used for examining cumulative item response data, but a functional approach has not been systematically used with unfolding response processes. A brief overview of FDA is presented and illustrated within the context of unfolding data. Seven decision parameters are described that can provide a guide to conducting FDA in this context. These decision parameters are illustrated with real data using two scales that are designed to measure attitude toward capital punishment and attitude toward censorship. The analyses suggest that FDA offers a useful set of tools for examining unfolding response processes.
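As a concrete, simplified illustration of treating item response data as smooth functions, the sketch below estimates a single-peaked (unfolding) item response function with a kernel smoother. The article's FDA toolkit and its seven decision parameters are richer than this; the data-generating model, item location, and bandwidth here are invented.

```python
import numpy as np

def smooth_response_function(scores, responses, grid, bandwidth=0.5):
    """Nadaraya-Watson kernel smoother: estimates an item response function
    as a smooth function of a proficiency or attitude proxy."""
    scores = np.asarray(scores, float)
    responses = np.asarray(responses, float)
    weights = np.exp(-0.5 * ((grid[:, None] - scores[None, :]) / bandwidth) ** 2)
    return (weights @ responses) / weights.sum(axis=1)

# Hypothetical unfolding (ideal-point) data: endorsement peaks near the item location
rng = np.random.default_rng(1)
theta = rng.normal(0, 1, 1000)                     # person attitudes
item_location = 0.5
p_endorse = np.exp(-(theta - item_location) ** 2)  # single-peaked response process
responses = rng.binomial(1, p_endorse)

grid = np.linspace(-3, 3, 61)
irf = smooth_response_function(theta, responses, grid)
print(grid[np.argmax(irf)])                        # peak should sit near 0.5
```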
Why Do Regular and Reversed Items Load on Separate Factors? Response Difficulty vs. Item Extremity
Educational and Psychological Measurement, Ahead of Print.
When constructing measurement scales, regular and reversed items are often used (e.g., “I am satisfied with my job”/“I am not satisfied with my job”). Some methodologists recommend excluding reversed items because they are more difficult to understand and therefore engender a second, artificial factor distinct from the regular-item factor. The current study compares two explanations for why a construct’s dimensionality may become distorted: response difficulty and item extremity. Two types of reversed items were created: negation items (“The conditions of my life are not good”) and polar opposites (“The conditions of my life are bad”), with the former type having higher response difficulty. When extreme wording was used (e.g., “excellent/terrible” instead of “good/bad”), negation items did not load on a factor distinct from regular items, but polar opposites did. Results thus support item extremity over response difficulty as an explanation for dimensionality distortion. Given that scale developers seldom check for extremity, it is unsurprising that regular and polar opposite items often load on distinct factors.
On Modeling Missing Data in Structural Investigations Based on Tetrachoric Correlations With Free and Fixed Factor Loadings
Educational and Psychological Measurement, Ahead of Print.
In modeling missing data, the missing data latent variable of the confirmatory factor model accounts for systematic variation associated with missing data, so that replacement of what is missing is not required. This study aimed at extending the modeling missing data approach to tetrachoric correlations as input and at exploring the consequences of switching between models with free and fixed factor loadings. In a simulation study, confirmatory factor analysis (CFA) models with and without a missing data latent variable were used to investigate the structure of data with and without missing data. In addition, the number of columns of the data sets with missing data and the amount of missing data were varied. The root mean square error of approximation (RMSEA) results revealed that an additional missing data latent variable recovered the degree of model fit characterizing complete data when tetrachoric correlations served as input, whereas comparative fit index (CFI) results showed overestimation of this degree of model fit. Whereas the results for fixed factor loadings were in line with the assumptions of modeling missing data, the other results showed only partial agreement. Therefore, modeling missing data with fixed factor loadings is recommended.
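As background on the input the study works with, the sketch below shows how a tetrachoric correlation can be estimated from a 2x2 table by maximum likelihood under the bivariate normal model. It is a generic illustration, not the study's estimation routine, and the example counts are made up.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal, norm

def tetrachoric(table):
    """ML estimate of the tetrachoric correlation from a 2x2 table of counts,
    where table[i, j] is the number of respondents with (item1 = i, item2 = j)."""
    table = np.asarray(table, float)
    n = table.sum()
    t1 = norm.ppf(table[0, :].sum() / n)   # threshold for item 1
    t2 = norm.ppf(table[:, 0].sum() / n)   # threshold for item 2

    def neg_loglik(rho):
        # Cell probabilities implied by dichotomizing a bivariate normal at (t1, t2)
        p00 = multivariate_normal.cdf([t1, t2], mean=[0.0, 0.0],
                                      cov=[[1.0, rho], [rho, 1.0]])
        p0_, p_0 = norm.cdf(t1), norm.cdf(t2)
        cell_probs = np.array([[p00, p0_ - p00],
                               [p_0 - p00, 1.0 - p0_ - p_0 + p00]])
        return -(table * np.log(np.clip(cell_probs, 1e-12, 1.0))).sum()

    return minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded").x

# Made-up 2x2 cross-tabulation of two dichotomously scored items
print(round(tetrachoric([[40, 10], [15, 35]]), 3))
```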
An Explanatory Multidimensional Random Item Effects Rating Scale Model
Educational and Psychological Measurement, Ahead of Print.
Random item effects item response theory (IRT) models, which treat both person and item effects as random, have received much attention for more than a decade. The random item effects approach has several advantages in many practical settings. The present study introduced an explanatory multidimensional random item effects rating scale model. The proposed model was formulated under a novel parameterization of the nominal response model (NRM), and allows for flexible inclusion of person-related and item-related covariates (e.g., person characteristics and item features) to study their impacts on the person and item latent variables. A new variant of the Metropolis-Hastings Robbins-Monro (MH-RM) algorithm designed for latent variable models with crossed random effects was applied to obtain parameter estimates for the proposed model. A preliminary simulation study was conducted to evaluate the performance of the MH-RM algorithm for estimating the proposed model. Results indicated that the model parameters were well recovered. An empirical data set was analyzed to further illustrate the usage of the proposed model.
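To make the data structure concrete, the following sketch simulates rating scale responses with a person covariate acting on the latent trait and an item covariate acting on random item locations. The covariates, effect sizes, and thresholds are invented, and the sketch implements neither the article's NRM parameterization nor the MH-RM estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n_persons, n_items, n_cats = 500, 10, 4

# Explanatory person side: latent trait depends on a hypothetical person covariate
x_person = rng.binomial(1, 0.5, n_persons)            # e.g., group membership
theta = 0.6 * x_person + rng.normal(0, 1, n_persons)  # latent trait with residual

# Random item side: item locations drawn from a distribution,
# partly explained by a hypothetical item feature
w_item = rng.binomial(1, 0.5, n_items)
delta = -0.4 * w_item + rng.normal(0, 0.5, n_items)   # random item locations
tau = np.array([-1.0, 0.0, 1.0])                      # shared category thresholds

def rsm_probs(theta_p, delta_i, tau):
    """Rating scale model category probabilities for one person-item pair."""
    cum = np.concatenate(([0.0], np.cumsum(theta_p - delta_i - tau)))
    e = np.exp(cum - cum.max())
    return e / e.sum()

responses = np.array([[rng.choice(n_cats, p=rsm_probs(theta[p], delta[i], tau))
                       for i in range(n_items)] for p in range(n_persons)])
print(responses.shape, responses.mean())
```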
Evaluating the Effects of Missing Data Handling Methods on Scale Linking Accuracy
Educational and Psychological Measurement, Ahead of Print.
For large-scale assessments, data are often collected with missing responses. Despite the wide use of item response theory (IRT) in many testing programs, however, the existing literature offers little insight into the effectiveness of various approaches to handling missing responses in the context of scale linking. Scale linking is commonly used in large-scale assessments to maintain scale comparability over multiple forms of a test. Under a common-item nonequivalent groups design (CINEG), missing data that occur on common items potentially influence the linking coefficients and, consequently, may affect scale comparability, test validity, and reliability. The objective of this study was to evaluate the effect of six missing data handling approaches, including listwise deletion (LWD), treating missing data as incorrect responses (IN), corrected item mean imputation (CM), imputing with a response function (RF), multiple imputation (MI), and full information maximum likelihood (FIML), on IRT scale linking accuracy when missing data occur on common items. Under a set of simulation conditions, the relative performance of the six missing data treatment methods under two missing data mechanisms was explored. Results showed that RF, MI, and FIML produced fewer errors in scale linking, whereas LWD was associated with the most errors across the testing conditions.
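For illustration, the sketch below implements three of the simpler treatments (LWD, IN, and one common variant of corrected item mean imputation) on a toy response matrix. RF, MI, and FIML are not shown, and the exact CM formula used in the study may differ from the variant assumed here.

```python
import numpy as np

def listwise_deletion(resp):
    """LWD: keep only examinees with no missing responses (NaN = missing)."""
    return resp[~np.isnan(resp).any(axis=1)]

def missing_as_incorrect(resp):
    """IN: score every missing response as 0 (incorrect)."""
    return np.nan_to_num(resp, nan=0.0)

def corrected_item_mean(resp):
    """CM (one common variant): impute the item mean, rescaled by how well the
    examinee did on the items they answered relative to those items' means."""
    resp = resp.copy()
    item_means = np.nanmean(resp, axis=0)
    for i, row in enumerate(resp):
        obs = ~np.isnan(row)
        if obs.all():
            continue
        correction = np.mean(row[obs]) / np.mean(item_means[obs])
        resp[i, ~obs] = np.clip(item_means[~obs] * correction, 0.0, 1.0)
    return resp

# Toy 0/1 response matrix with a few missing responses
resp = np.array([[1, 0, 1, np.nan],
                 [1, 1, np.nan, 1],
                 [0, 0, 1, 0],
                 [1, 1, 1, 1]], dtype=float)
print(listwise_deletion(resp).shape)
print(missing_as_incorrect(resp))
print(corrected_item_mean(resp).round(2))
```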
Detecting Preknowledge Cheating via Innovative Measures: A Mixture Hierarchical Model for Jointly Modeling Item Responses, Response Times, and Visual Fixation Counts
Educational and Psychological Measurement, Volume 83, Issue 5, Page 1059-1080, October 2023.
Preknowledge cheating jeopardizes the validity of inferences based on test results. Many methods have been developed to detect preknowledge cheating by jointly analyzing item responses and response times. Gaze fixations, an essential eye-tracker measure, can be utilized to help detect aberrant testing behavior with improved accuracy beyond using product and process data types in isolation. As such, this study proposes a mixture hierarchical model that integrates item responses, response times, and visual fixation counts collected from an eye-tracker (a) to detect aberrant test takers who have different levels of preknowledge and (b) to account for nuances in behavioral patterns between normally behaved and aberrant examinees. A Bayesian approach to estimating model parameters is carried out via an MCMC algorithm. Finally, the proposed model is applied to experimental data to illustrate how the model can be used to identify test takers having preknowledge on the test items.
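The following toy sketch conveys the mixture idea at its simplest: given class-specific distributions for responses, response times, and fixation counts, it computes the posterior probability that an examinee belongs to the preknowledge class. All parameters are invented, and the sketch omits the hierarchical person and item structure of the proposed model.

```python
import numpy as np
from scipy.stats import bernoulli, lognorm, poisson

def class_posterior(resp, rt, fix, params, prior_cheat=0.1):
    """Posterior probability of the 'preknowledge' class under a toy two-class
    mixture with conditionally independent responses (Bernoulli), response
    times (lognormal), and fixation counts (Poisson)."""
    def loglik(cls):
        p, mu_rt, sd_rt, lam = params[cls]
        return (bernoulli.logpmf(resp, p).sum()
                + lognorm.logpdf(rt, s=sd_rt, scale=np.exp(mu_rt)).sum()
                + poisson.logpmf(fix, lam).sum())

    log_post = np.array([np.log(1 - prior_cheat) + loglik("normal"),
                         np.log(prior_cheat) + loglik("cheat")])
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post[1] / post.sum()

# Made-up class parameters: (P(correct), mean log-RT, SD log-RT, fixation rate)
params = {"normal": (0.6, np.log(30.0), 0.5, 8.0),
          "cheat":  (0.9, np.log(10.0), 0.5, 3.0)}

resp = np.array([1, 1, 1, 0, 1, 1])             # mostly correct
rt   = np.array([9., 12., 8., 11., 10., 9.])    # fast responses (seconds)
fix  = np.array([2, 3, 4, 2, 3, 3])             # few fixations
print(round(class_posterior(resp, rt, fix, params), 3))
```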
Position of Correct Option and Distractors Impacts Responses to Multiple-Choice Items: Evidence From a National Test
Educational and Psychological Measurement, Volume 83, Issue 5, Page 861-884, October 2023.
Even though the impact of the position of response options on answers to multiple-choice items has been investigated for decades, it remains debated. Research on this topic is inconclusive, perhaps because too few studies have obtained experimental data from large samples in a real-world context while manipulating the position of both the correct response and the distractors. Since the outcomes of multiple-choice tests can be strikingly consequential and option position effects constitute a potential source of measurement error, these effects should be clarified. In this study, two experiments in which the position of the correct response and the distractors was carefully manipulated were performed within a Chilean national high-stakes standardized test taken by 195,715 examinees. Results show small but clear and systematic effects of option position on examinees' responses in both experiments. They consistently indicate that a five-option item is slightly easier when the correct response is in position A rather than position E and when the most attractive distractor appears after, and far from, the correct response. They clarify and extend previous findings, showing that the appeal of all options is influenced by position. The existence and nature of a potential interference phenomenon between the processing of the options are discussed, and implications for test development are considered.
Detecting Cheating in Large-Scale Assessment: The Transfer of Detectors to New Tests
Educational and Psychological Measurement, Volume 83, Issue 5, Page 1033-1058, October 2023.
Recent approaches to the detection of cheaters in tests employ detectors from the field of machine learning. Detectors based on supervised learning algorithms achieve high accuracy but require labeled data sets with identified cheaters for training. Labeled data sets are usually not available at an early stage of the assessment period. In this article, we discuss the approach of adapting a detector that was previously trained with a labeled training data set to a new, unlabeled data set. The training data set and the new data set may contain data from different tests. The adaptation of detectors to new data or tasks is known as transfer learning in the field of machine learning. We first discuss the conditions under which a detector of cheating can be transferred. We then investigate whether these conditions are met in a real data set. We finally evaluate the benefits of transferring a detector of cheating. We find that a transferred detector has higher accuracy than an unsupervised detector of cheating. A naive transfer, consisting of a simple reuse of the detector, increases the accuracy considerably. A transfer via a self-labeling (SETRED) algorithm increases the accuracy slightly more than the naive transfer. The findings suggest that the detection of cheating might be improved by using existing detectors of cheating at an early stage of an assessment period.
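The naive transfer discussed in the abstract amounts to reusing a detector trained on one test for a new test. The sketch below illustrates this with simulated examinee-level features and a scikit-learn classifier; the features, the shift between tests, and the classifier choice are all assumptions, and the SETRED self-labeling step is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

def simulate_test(n, shift=0.0):
    """Hypothetical examinee-level features (e.g., mean log response time,
    score residual, answer-change rate) with 10% cheaters."""
    cheat = rng.binomial(1, 0.10, n)
    feats = rng.normal(0, 1, (n, 3))
    feats[cheat == 1] += np.array([-1.2, 1.0, 0.8])   # cheaters look different
    return feats + shift, cheat                        # 'shift' mimics a new test

# Labeled data from an earlier, already-investigated test
X_old, y_old = simulate_test(2000)
# Data from a new test (labels used here only to evaluate the transfer)
X_new, y_new = simulate_test(2000, shift=0.3)

# Naive transfer: simply reuse the detector trained on the old test
detector = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_old, y_old)
print("accuracy on new test:", detector.score(X_new, y_new))
```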