Educational and Psychological Measurement, Ahead of Print.
The forced-choice response format is often considered superior to the standard Likert-type format for controlling social desirability in personality inventories. We performed simulations and found that the trait information based on the two formats converges when the number of items is high and forced-choice items are mixed with regard to positively and negatively keyed items. Given that forced-choice items extract the same personality information as Likert-type items do, including socially desirable responding, other means are needed to counteract social desirability. We propose using evaluatively neutralized items in personality measurement, as they can counteract social desirability regardless of response format.
Category Archives: Educational and Psychological Measurement
Evaluating Equating Methods for Varying Levels of Form Difference
Educational and Psychological Measurement, Ahead of Print.
Equating is a statistical procedure used to adjust for differences in form difficulty so that scores on those forms can be used and interpreted comparably. In practice, however, equating methods are often implemented without considering the extent to which two forms differ in difficulty. This study examines the effect of the magnitude of a form difficulty difference on equating results under random group (RG) and common-item nonequivalent group (CINEG) designs. Specifically, it evaluates the performance of six equating methods under a set of simulation conditions that include varying levels of form difference. Results showed that, under the RG design, mean equating was the most accurate method when there was no or only a small form difference, whereas equipercentile equating was the most accurate method when the difficulty difference was medium or large. Under the CINEG design, Tucker linear equating was the most accurate method when the difficulty difference was medium or small, and either chained equipercentile or frequency estimation is preferred when the difficulty difference is large. The study provides practitioners with evidence-based guidance on choosing equating methods under varying levels of form difference. Because the condition of no form difficulty difference is also included, the study can also inform testing companies about appropriate equating methods when two forms are similar in difficulty.
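To make the simplest of these methods concrete, the following is a minimal sketch of mean and linear equating under a random groups design, using the standard textbook formulas rather than the article's simulation code. The function names and the score lists are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def mean_equate(x, form_x_scores, form_y_scores):
    """Mean equating: shift a Form X score by the difference in form means."""
    return x - mean(form_x_scores) + mean(form_y_scores)


def linear_equate(x, form_x_scores, form_y_scores):
    """Linear equating: match both the mean and the standard deviation of Form Y."""
    slope = pstdev(form_y_scores) / pstdev(form_x_scores)
    return mean(form_y_scores) + slope * (x - mean(form_x_scores))


# Hypothetical raw scores from two randomly equivalent groups (illustrative only).
form_x = [10, 12, 14, 16, 18]
form_y = [13, 14, 15, 16, 17]

mean_equated = mean_equate(12, form_x, form_y)
linear_equated = linear_equate(12, form_x, form_y)
```

Mean equating assumes the forms differ by a constant across the score scale, whereas linear equating also allows the spread of scores to differ. Equipercentile equating, which the abstract favors for larger difficulty differences, relaxes both assumptions and is not shown here.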
Iterative Item Selection of Neighborhood Clusters: A Nonparametric and Non-IRT Method for Generating Miniature Computer Adaptive Questionnaires
Educational and Psychological Measurement, Ahead of Print.
The questionnaire method has always been an important research method in psychology. The increasing prevalence of multidimensional trait measures in psychological research has led researchers to use longer questionnaires. However, questionnaires that are too long will inevitably reduce the quality of the completed questionnaires and the efficiency of collection. Computer adaptive testing (CAT) can be used to reduce the test length while preserving the measurement accuracy. However, it is more often used in aptitude testing and involves a large number of parametric assumptions. Applying CAT to psychological questionnaires often requires question-specific model design and preexperimentation. The present article proposes a nonparametric and item response theory (IRT)-independent CAT algorithm. The new algorithm is simple and highly generalizable. It can be quickly used in a variety of questionnaires and tests without being limited by theoretical assumptions in different research areas. Simulation and empirical studies were conducted to demonstrate the validity of the new algorithm in aptitude tests and personality measures.
An Item Response Theory Model for Incorporating Response Times in Forced-Choice Measures
Educational and Psychological Measurement, Ahead of Print.
Forced-choice (FC) measures, which employ comparative rather than absolute judgments, have been widely used in personality and attitude tests as an alternative to rating scales. They can effectively reduce several response biases, such as social desirability, response styles, and acquiescence bias. Another type of data linked with comparative judgments is response time (RT), which contains potential information about respondents’ decision-making process. Combining RT with FC measures to better reveal respondents’ behaviors or preferences in personality measurement would be challenging but exciting. This study therefore proposes a new item response theory (IRT) model that incorporates RT into FC measures to improve personality assessment. Simulation studies show that the proposed model can effectively improve the estimation accuracy of personality traits by using the ancillary information contained in RT. In addition, an application to a real data set shows that the proposed model yields parameter estimates that are similar, but not identical, to those of the conventional Thurstonian IRT model. The RT information can explain these differences.
The Accuracy of Bayesian Model Fit Indices in Selecting Among Multidimensional Item Response Theory Models
Educational and Psychological Measurement, Ahead of Print.
Item response theory (IRT) models are often compared with respect to predictive performance to determine the dimensionality of rating scale data. However, such model comparisons could be biased toward nested-dimensionality IRT models (e.g., the bifactor model) when comparing those models with non-nested-dimensionality IRT models (e.g., a unidimensional or a between-item-dimensionality model). The reason is that, compared with non-nested-dimensionality models, nested-dimensionality models could have a greater propensity to fit data that do not represent a specific dimensional structure. However, it is unclear as to what degree model comparison results are biased toward nested-dimensionality IRT models when the data represent specific dimensional structures and when Bayesian estimation and model comparison indices are used. We conducted a simulation study to add clarity to this issue. We examined the accuracy of four Bayesian predictive performance indices at differentiating among non-nested- and nested-dimensionality IRT models. The deviance information criterion (DIC), a commonly used index to compare Bayesian models, was extremely biased toward nested-dimensionality IRT models, favoring them even when non-nested-dimensionality models were the correct models. The Pareto-smoothed importance sampling approximation of the leave-one-out cross-validation was the least biased, with the Watanabe information criterion and the log-predicted marginal likelihood closely following. The findings demonstrate that nested-dimensionality IRT models are not automatically favored when the data represent specific dimensional structures as long as an appropriate predictive performance index is used.
Dominance Analysis for Latent Variable Models: A Comparison of Methods With Categorical Indicators and Misspecified Models
Educational and Psychological Measurement, Ahead of Print.
Dominance analysis (DA) is a very useful tool for ordering independent variables in a regression model based on their relative importance in explaining variance in the dependent variable. This approach, which was originally described by Budescu, has recently been extended for use with structural equation models examining relationships among latent variables. Research demonstrated that this approach yields accurate results for latent variable models involving normally distributed indicator variables and correctly specified models. The purpose of the current simulation study was to compare this DA approach with two alternatives, a method based on observed regression DA and DA applied when the latent variable model is estimated using two-stage least squares, for latent variable models with categorical indicators and/or model misspecification. Results indicated that the DA approach for latent variable models can provide accurate ordering of the variables and correct hypothesis selection when indicators are categorical and models are misspecified. A discussion of implications from this study is provided.
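As a rough illustration of how DA orders predictors, the sketch below computes general dominance weights for an observed-variable regression. It is not the latent variable procedure studied in the article. The function name, the predictor names, and the R-squared values are all hypothetical; in practice the R-squared for each predictor subset would come from fitted models.

```python
from itertools import combinations


def general_dominance(predictors, r2):
    """Average each predictor's incremental R^2 within subset sizes, then across sizes.

    `r2` maps a frozenset of predictor names to the R^2 of the model that
    contains exactly those predictors (the empty set maps to 0.0).
    """
    weights = {}
    for p in predictors:
        others = [q for q in predictors if q != p]
        size_means = []
        for k in range(len(others) + 1):
            # Gain in R^2 from adding p to every subset of the others of size k.
            gains = [
                r2[frozenset(s) | {p}] - r2[frozenset(s)]
                for s in combinations(others, k)
            ]
            size_means.append(sum(gains) / len(gains))
        weights[p] = sum(size_means) / len(size_means)
    return weights


# Hypothetical R^2 values for every subset of two predictors (illustrative only).
r2_by_subset = {
    frozenset(): 0.0,
    frozenset({"x1"}): 0.25,
    frozenset({"x2"}): 0.09,
    frozenset({"x1", "x2"}): 0.30,
}

weights = general_dominance(["x1", "x2"], r2_by_subset)
```

A useful check on any implementation is that the general dominance weights sum to the R-squared of the full model, so they can be read as a decomposition of explained variance.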
The Trade-Off Between Factor Score Determinacy and the Preservation of Inter-Factor Correlations
Educational and Psychological Measurement, Ahead of Print.
Regression factor score predictors have the maximum factor score determinacy, that is, the maximum correlation with the corresponding factor, but they do not have the same inter-correlations as the factors. Because it might be useful to compute factor score predictors that have the same inter-correlations as the factors, correlation-preserving factor score predictors have been proposed. However, correlation-preserving factor score predictors have smaller correlations with the corresponding factors (factor score determinacy) than regression factor score predictors. Thus, higher factor score determinacy goes along with biased inter-correlations, and unbiased inter-correlations go along with lower factor score determinacy. The aim of the present study was therefore to investigate the size and conditions of the trade-off between factor score determinacy and bias of inter-correlations by means of algebraic considerations and a simulation study. It turns out that, under several conditions, very small gains in factor score determinacy from the regression factor score predictor go along with a large bias of inter-correlations. Instead of using the regression factor score predictor by default, it is proposed to check whether substantial bias of inter-correlations can be avoided without substantial loss of factor score determinacy by using a correlation-preserving factor score predictor. Syntax for computing correlation-preserving factor score predictors from regression factor score predictors, and for comparing the factor score determinacy and inter-correlations of the factor score predictors, is given in the Appendix.
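The two quantities being traded off can be written out under standard common factor model assumptions. The notation here is an assumption for illustration and is not taken from the article: centered observed variables x, standardized factors f, loading matrix Λ, factor correlation matrix Φ, and model-implied covariance matrix Σ of the observed variables.

```latex
\hat{\mathbf{f}} = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{x},
\qquad
\mathbf{C} = \operatorname{Cov}(\hat{\mathbf{f}})
           = \operatorname{Cov}(\hat{\mathbf{f}}, \mathbf{f})
           = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Lambda}\boldsymbol{\Phi}

\operatorname{Corr}(\hat{f}_j, f_j) = \sqrt{C_{jj}},
\qquad
\operatorname{Corr}(\hat{f}_j, \hat{f}_k) = \frac{C_{jk}}{\sqrt{C_{jj}\,C_{kk}}}
```

The first correlation is the factor score determinacy of the regression predictor for factor j. The second is the inter-correlation of two regression predictors, which in general differs from the corresponding factor correlation in Φ; that difference is the bias the abstract refers to.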
Identifying Disengaged Responding in Multiple-Choice Items: Extending a Latent Class Item Response Model With Novel Process Data Indicators
Educational and Psychological Measurement, Ahead of Print.
Disengaged responding poses a severe threat to the validity of educational large-scale assessments because item responses from unmotivated test-takers do not reflect their actual ability. Existing identification approaches rely primarily on item response times, which carries the risk of misclassifying fast engaged or slow disengaged responses. Process data, with its rich pool of additional information on the test-taking process, could thus be used to improve existing identification approaches. In this study, three process data variables (text reread, item revisit, and answer change) were introduced as potential indicators of response engagement for multiple-choice items in a reading comprehension test. An extended latent class item response model for disengaged responding was developed by including the three new indicators as additional predictors of response engagement. In a sample of 1,932 German university students, the extended model showed better model fit than the baseline model, which included item response time as the only indicator of response engagement. In the extended model, both item response time and text reread were significant predictors of response engagement. However, graphical analyses revealed no systematic differences in item and person parameter estimation or item response classification between the models. These results suggest that the new indicators improve the identification of disengaged responding only marginally. Implications of these results for future research on disengaged responding with process data are discussed.
A Comparison of Response Time Threshold Scoring Procedures in Mitigating Bias From Rapid Guessing Behavior
Educational and Psychological Measurement, Ahead of Print.
Rapid guessing (RG) is a form of non-effortful responding that is characterized by short response latencies. This construct-irrelevant behavior has been shown in previous research to bias inferences concerning measurement properties and scores. To mitigate these deleterious effects, a number of response time threshold scoring procedures have been proposed, which recode RG responses (e.g., treat them as incorrect or missing, or impute probable values) and then estimate parameters for the recoded dataset using a unidimensional or multidimensional IRT model. To date, there have been limited attempts to compare these methods under the possibility that RG may be misclassified in practice. To address this shortcoming, the present simulation study compared item and ability parameter recovery for four scoring procedures by manipulating sample size, the linear relationship between RG propensity and ability, the percentage of RG responses, and the type and rate of RG misclassifications. Results demonstrated two general trends. First, across all conditions, treating RG responses as incorrect produced the largest degree of combined systematic and random error (larger than ignoring RG). Second, the remaining scoring approaches generally provided equal accuracy in parameter recovery when RG was perfectly identified; however, the multidimensional IRT approach was susceptible to increased error as misclassification rates grew. Overall, the findings suggest that recoding RG as missing and employing a unidimensional IRT model is a promising approach.
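The recoding step that these scoring procedures share can be sketched as follows. This is a minimal illustration, not the study's code: it assumes a single fixed response time threshold, and the function name, threshold, and data are hypothetical. The IRT estimation that follows the recoding is not shown.

```python
def recode_rapid_guesses(responses, response_times, threshold, treat_as="missing"):
    """Flag responses faster than `threshold` as rapid guesses and recode them.

    treat_as="missing"   replaces flagged responses with None.
    treat_as="incorrect" replaces flagged responses with 0.
    """
    if treat_as not in ("missing", "incorrect"):
        raise ValueError("treat_as must be 'missing' or 'incorrect'")
    replacement = None if treat_as == "missing" else 0
    return [
        replacement if rt < threshold else resp
        for resp, rt in zip(responses, response_times)
    ]


# Hypothetical scored responses (1 = correct) and response times in seconds.
responses = [1, 0, 1, 1, 0]
response_times = [12.4, 1.1, 0.8, 9.7, 15.0]

as_missing = recode_rapid_guesses(responses, response_times, threshold=2.0)
as_incorrect = recode_rapid_guesses(
    responses, response_times, threshold=2.0, treat_as="incorrect"
)
```

In this example the third response is a correct answer given in 0.8 seconds. Scoring it as incorrect changes the observed data, whereas recoding it as missing removes it from estimation, which is the option the abstract describes as promising.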
Modeling Misspecification as a Parameter in Bayesian Structural Equation Models
Educational and Psychological Measurement, Ahead of Print.
Accounting for model misspecification in Bayesian structural equation models is an active area of research. We present a uniquely Bayesian approach to misspecification that models the degree of misspecification as a parameter—a parameter akin to the correlation root mean squared residual. The misspecification parameter can be interpreted on its own terms as a measure of absolute model fit and allows for comparing different models fit to the same data. By estimating the degree of misspecification simultaneously with structural parameters, the uncertainty about structural parameters reflects the degree of model misspecification. This results in a model that produces more reliable inference than extant Bayesian structural equation modeling. In addition, the approach estimates the residual covariance matrix that can be the basis for diagnosing misspecifications and updating a hypothesized model. These features are confirmed using simulation studies. Demonstrations with a variety of real-world examples show additional properties of the approach.