Educational and Psychological Measurement, Ahead of Print.
This note demonstrates that the widely used Bayesian Information Criterion (BIC) should not be viewed as a routinely dependable index for model selection when the bifactor and second-order factor models are examined as rival means of data description and explanation. To this end, we use an empirically relevant setting with multidimensional measuring instrument components, in which the bifactor model is consistently found inferior to the second-order model in terms of the BIC even though the data, across a large number of replications at different sample sizes, were generated from the bifactor model. We therefore caution researchers that routine reliance on the BIC to discriminate between these two widely used models may not always lead to correct model choices.
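For reference, the BIC comparison described in this abstract follows the standard definition sketched below; the notation (maximized log-likelihood, number of free parameters q, sample size N) is supplied here for illustration and is not taken from the article.

```latex
% For a fitted model with maximized log-likelihood \ln\hat{L}, q free
% parameters, and sample size N, the standard definition is
\mathrm{BIC} = -2\ln\hat{L} + q\ln N
% The model with the smaller BIC is preferred. Because the bifactor model
% uses more free parameters than the second-order model, its larger penalty
% term q\ln N can favor the second-order model even when the data were
% generated from the bifactor structure.
```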
Artificial Neural Networks for Short-Form Development of Psychometric Tests: A Study on Synthetic Populations Using Autoencoders
Educational and Psychological Measurement, Ahead of Print.
Short-form development is an important topic in psychometric research that requires researchers to make methodological choices at several steps. The statistical techniques traditionally used for shortening tests, which belong to the so-called exploratory model, make assumptions that are not always verified in psychological data. This article proposes a machine learning–based autonomous procedure for short-form development that combines explanatory and predictive techniques in an integrative approach. The study investigates the item-selection performance of two autoencoders, a particular type of artificial neural network that is comparable to principal component analysis. The procedure is tested on artificial data simulated from a factor-based population and is compared with existing computational approaches to short-form development. Autoencoders require mild assumptions about data characteristics and provide a method to predict long-form item responses from the short form. Indeed, the results show that they can help the researcher develop a short form by automatically selecting a subset of items that best reconstructs the original item responses and preserves the internal structure of the long form.
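As a rough illustration of the kind of procedure described in this abstract, and not the authors' implementation, the following sketch trains a linear autoencoder on simulated item responses and ranks items by how much masking each one degrades reconstruction; the architecture, toy data, and item-ranking heuristic are all assumptions made for the example.

```python
# Minimal illustrative sketch (not the authors' procedure): a linear autoencoder
# trained on long-form item responses, used to score items for a short form.
import torch
import torch.nn as nn

torch.manual_seed(0)

n_persons, n_items, n_latent = 500, 20, 3

# Toy factor-based data: latent scores times loadings plus noise.
latent = torch.randn(n_persons, n_latent)
loadings = torch.randn(n_latent, n_items) * 0.7
responses = latent @ loadings + 0.5 * torch.randn(n_persons, n_items)

# Linear autoencoder with a bottleneck; comparable to PCA when activations are linear.
model = nn.Sequential(
    nn.Linear(n_items, n_latent, bias=False),  # encoder
    nn.Linear(n_latent, n_items, bias=False),  # decoder
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(responses), responses)
    loss.backward()
    optimizer.step()

# One possible item-selection heuristic (an assumption, not the article's rule):
# zero out each item in turn and keep the items whose removal hurts reconstruction most.
with torch.no_grad():
    baseline = loss_fn(model(responses), responses)
    importance = []
    for j in range(n_items):
        masked = responses.clone()
        masked[:, j] = 0.0
        importance.append((loss_fn(model(masked), responses) - baseline).item())

short_form = sorted(range(n_items), key=lambda j: -importance[j])[:8]
print("Items retained for the short form:", sorted(short_form))
```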
Are the Steps on Likert Scales Equidistant? Responses on Visual Analog Scales Allow Estimating Their Distances
Educational and Psychological Measurement, Ahead of Print.
A recurring question regarding Likert items is whether the discrete steps that this response format allows represent constant increments along the underlying continuum. This question appears unsolvable because Likert responses carry no direct information to this effect. Yet, any item administered in Likert format can identically be administered with a continuous response format such as a visual analog scale (VAS) in which respondents mark a position along a continuous line. Then, the operating characteristics of the item would manifest under both VAS and Likert formats, although perhaps differently as captured by the continuous response model (CRM) and the graded response model (GRM) in item response theory. This article shows that CRM and GRM item parameters hold a formal relation that is mediated by the form in which the continuous dimension is partitioned into intervals to render the discrete Likert responses. Then, CRM and GRM characterizations of the items in a test administered with VAS and Likert formats allow estimating the boundaries of the partition that renders Likert responses for each item and, thus, the distance between consecutive steps. The validity of this approach is first documented via simulation studies. Subsequently, the same approach is used on public data from three personality scales with 12, eight, and six items, respectively. The results indicate the expected correspondence between VAS and Likert responses and reveal unequal distances between successive pairs of Likert steps that also vary greatly across items. Implications for the scoring of Likert items are discussed.
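For readers unfamiliar with the notation, the sketch below states the standard GRM boundary form and the way a partition of the continuous response scale yields Likert responses; the symbols are supplied here for illustration, and the article's exact CRM–GRM mapping is not reproduced.

```latex
% Standard graded response model (GRM) for item j with discrimination a_j and
% ordered thresholds b_{j1} < \dots < b_{j,K-1} (notation assumed here):
P(X_j \ge k \mid \theta) = \frac{1}{1 + \exp\{-a_j(\theta - b_{jk})\}}, \qquad k = 1,\dots,K-1

% If the continuous (VAS) response z_j is discretized by boundaries
% \tau_{j,0} < \tau_{j,1} < \dots < \tau_{j,K}, the Likert response is
X_j = k \quad \Longleftrightarrow \quad \tau_{j,k-1} \le z_j < \tau_{j,k}

% so the width of step k, the quantity estimated from paired VAS and Likert
% administrations, is
\Delta_{jk} = \tau_{j,k} - \tau_{j,k-1}
```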
Evaluating Model Fit of Measurement Models in Confirmatory Factor Analysis
Educational and Psychological Measurement, Ahead of Print.
Confirmatory factor analyses (CFA) are often used in psychological research when developing measurement models for psychological constructs. Evaluating CFA model fit can be quite challenging, as tests for exact model fit may focus on negligible deviations, while fit indices cannot be interpreted in absolute terms without specifying thresholds or cutoffs. In this study, we review how model fit in CFA is evaluated in psychological research using fit indices and compare the reported values with established cutoff rules. For this, we collected data on all CFA models published in Psychological Assessment from 2015 to 2020. In addition, we reevaluate model fit with newly developed methods that derive fit index cutoffs tailored to the respective measurement model and the data characteristics at hand. The results of our review indicate that the model fit reported in many studies must be viewed critically, especially with regard to the commonly imposed independent-clusters constraints. In addition, many studies do not fully report the results needed to reevaluate model fit. We discuss these findings against the background of new developments in model fit evaluation and methods for specification search.
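For context, the fixed cutoff rules referred to above are typically applied to indices such as the RMSEA and CFI, whose standard sample formulas are sketched below (RMSEA ≤ .06 and CFI ≥ .95 are commonly cited thresholds); the tailored-cutoff methods evaluated in the article replace such fixed values and are not reproduced here.

```latex
% Sample RMSEA for the target model M with chi-square statistic \chi^2_M,
% degrees of freedom df_M, and sample size N (some programs use N instead of N - 1):
\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\; 0)}{df_M\,(N - 1)}}

% CFI compares the target model M with the baseline (independence) model B:
\mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\; 0)}{\max(\chi^2_B - df_B,\;\; \chi^2_M - df_M,\;\; 0)}
```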
Model Specification Searches in Structural Equation Modeling Using Bee Swarm Optimization
Educational and Psychological Measurement, Ahead of Print.
Metaheuristics are optimization algorithms that efficiently solve a variety of complex combinatorial problems. In psychological research, metaheuristics have been applied in short-scale construction and model specification search. In the present study, we propose a bee swarm optimization (BSO) algorithm to explore the structure underlying a psychological measurement instrument. The algorithm assigns items to an unknown number of nested factors in a confirmatory bifactor model, while simultaneously selecting items for the final scale. To achieve this, the algorithm follows the biological template of bees’ foraging behavior: Scout bees explore new food sources, whereas onlooker bees search in the vicinity of previously explored, promising food sources. Analogously, scout bees in BSO introduce major changes to a model specification (e.g., adding or removing a specific factor), whereas onlooker bees make only minor changes (e.g., adding an item to a factor or swapping items between specific factors). Through this division of labor in an artificial bee colony, the algorithm aims to strike a balance between two opposing strategies: diversification (or exploration) and intensification (or exploitation). We demonstrate the usefulness of the algorithm for finding the underlying structure in two empirical data sets (Holzinger–Swineford and the short dark triad questionnaire, SDQ3). Furthermore, we illustrate the influence of relevant hyperparameters such as the number of bees in the hive, the proportion of scouts to onlookers, and the number of top solutions to be followed. Finally, useful applications of the new algorithm are discussed, as well as limitations and possible future research opportunities.
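The scout/onlooker division of labor can be sketched as a generic search skeleton. The toy objective, move operators, and hyperparameter values below are placeholders for illustration and do not reproduce the authors' model specification search.

```python
# Toy skeleton of the scout/onlooker division of labor described above.
# Objective, moves, and hyperparameters are illustrative placeholders only.
import random

random.seed(0)

N_ITEMS, N_FACTORS = 12, 3
TARGET = [i % N_FACTORS for i in range(N_ITEMS)]  # "true" assignment for the toy objective


def objective(assignment):
    """Toy fitness: number of items assigned to their target factor."""
    return sum(a == t for a, t in zip(assignment, TARGET))


def scout_move(assignment):
    """Major change: reassign a random block of items (exploration)."""
    new = assignment[:]
    for i in random.sample(range(N_ITEMS), k=4):
        new[i] = random.randrange(N_FACTORS)
    return new


def onlooker_move(assignment):
    """Minor change: move a single item to another factor (exploitation)."""
    new = assignment[:]
    i = random.randrange(N_ITEMS)
    new[i] = random.randrange(N_FACTORS)
    return new


def bee_swarm_search(n_bees=20, scout_share=0.3, n_top=5, n_iter=200):
    hive = [[random.randrange(N_FACTORS) for _ in range(N_ITEMS)] for _ in range(n_bees)]
    for _ in range(n_iter):
        hive.sort(key=objective, reverse=True)
        top = hive[:n_top]                        # promising "food sources" to follow
        new_hive = top[:]
        while len(new_hive) < n_bees:
            if random.random() < scout_share:
                new_hive.append(scout_move(random.choice(hive)))
            else:
                new_hive.append(onlooker_move(random.choice(top)))
        hive = new_hive
    return max(hive, key=objective)


best = bee_swarm_search()
print("Best assignment:", best, "fitness:", objective(best))
```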
Evaluating Close Fit in Ordinal Factor Analysis Models With Multiply Imputed Data
Educational and Psychological Measurement, Ahead of Print.
Multiple imputation (MI) is one of the recommended techniques for handling missing data in ordinal factor analysis models. However, methods for computing MI-based fit indices under ordinal factor analysis models have yet to be developed. In this short note, we introduced the methods of using the standardized root mean squared residual (SRMR) and the root mean square error of approximation (RMSEA) to assess the fit of ordinal factor analysis models with multiply imputed data. Specifically, we described the procedure for computing the MI-based sample estimates and constructing the confidence intervals. Simulation results showed that the proposed methods could yield sufficiently accurate point and interval estimates for both SRMR and RMSEA, especially in conditions with larger sample sizes, less missing data, more response categories, and higher degrees of misfit. Based on the findings, implications and recommendations were discussed.
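As a minimal illustration of pooling a fit index over imputations, the sketch below computes the SRMR for each of several simulated "imputed" data sets against a hypothetical model-implied correlation matrix and then averages the values; simple averaging is shown only as one plausible pooling rule and may differ from the procedure described in the article.

```python
# Minimal sketch of pooling a fit index over multiply imputed data sets
# (simple averaging shown; the authors' exact pooling rule may differ).
import numpy as np

rng = np.random.default_rng(0)


def srmr(sample_corr, implied_corr):
    """Standardized root mean squared residual between two correlation matrices."""
    p = sample_corr.shape[0]
    idx = np.tril_indices(p)                      # p(p+1)/2 unique elements
    resid = sample_corr[idx] - implied_corr[idx]
    return np.sqrt(np.mean(resid ** 2))


# Hypothetical model-implied correlation matrix (stand-in for a fitted factor model).
loadings = np.full((4, 1), 0.7)
implied = loadings @ loadings.T
np.fill_diagonal(implied, 1.0)

# Pretend we have m = 5 imputed data sets; each yields its own sample correlation matrix.
m = 5
srmr_values = []
for _ in range(m):
    data = rng.multivariate_normal(np.zeros(4), implied, size=300)
    sample_corr = np.corrcoef(data, rowvar=False)
    srmr_values.append(srmr(sample_corr, implied))

print("Per-imputation SRMR:", np.round(srmr_values, 4))
print("Pooled (averaged) SRMR:", round(float(np.mean(srmr_values)), 4))
```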
The Impact of Measurement Model Misspecification on Coefficient Omega Estimates of Composite Reliability
Educational and Psychological Measurement, Ahead of Print.
Coefficient omega indices are model-based composite reliability estimates that have become increasingly popular. A coefficient omega index estimates how reliably an observed composite score measures a target construct as represented by a factor in a factor-analysis model; as such, the accuracy of omega estimates is likely to depend on correct model specification. The current paper presents a simulation study that investigates the performance of omega-unidimensional (based on the parameters of a one-factor model) and omega-hierarchical (based on a bifactor model) under correct and incorrect model specification, for high- and low-reliability composites and different scale lengths. Our results show that coefficient omega estimates are unbiased when calculated from the parameter estimates of a properly specified model. However, omega-unidimensional produced positively biased estimates when the population model was characterized by unmodeled error correlations or multidimensionality, whereas omega-hierarchical was only slightly biased when the population model was either a one-factor model with correlated errors or a higher-order model. These biases were larger when population reliability was lower and increased with scale length. Researchers should carefully evaluate the feasibility of a one-factor model before estimating and reporting omega-unidimensional.
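For reference, the sketch below applies the standard formulas for coefficient omega from a one-factor solution and omega-hierarchical from an orthogonal bifactor solution; the loadings are made-up illustrative values, and in practice the one-factor loadings would come from actually fitting a one-factor model rather than being reused as they are here.

```python
# Coefficient omega from a one-factor model and omega-hierarchical from a
# bifactor model, computed from standardized parameter estimates (standard
# formulas; the loadings below are made-up illustrative values).
import numpy as np


def omega_unidimensional(loadings, error_vars):
    """omega = (sum of loadings)^2 / [(sum of loadings)^2 + sum of error variances]."""
    num = np.sum(loadings) ** 2
    return num / (num + np.sum(error_vars))


def omega_hierarchical(general_loadings, specific_loadings, error_vars):
    """omega_h = (sum of general loadings)^2 / total composite variance (orthogonal bifactor)."""
    num = np.sum(general_loadings) ** 2
    total = num + sum(np.sum(s) ** 2 for s in specific_loadings) + np.sum(error_vars)
    return num / total


# Illustrative values for a 6-item scale with two specific factors of 3 items each.
general = np.array([0.6, 0.6, 0.6, 0.5, 0.5, 0.5])
specific = [np.array([0.4, 0.4, 0.4]), np.array([0.3, 0.3, 0.3])]
errors = 1 - general ** 2 - np.concatenate([s ** 2 for s in specific])

print("omega (one-factor):", round(omega_unidimensional(general, errors), 3))
print("omega-hierarchical:", round(omega_hierarchical(general, specific, errors), 3))
```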
Correcting for Extreme Response Style: Model Choice Matters
Educational and Psychological Measurement, Ahead of Print.
Extreme response style (ERS), the tendency of participants to select extreme item categories regardless of the item content, has frequently been found to decrease the validity of Likert-type questionnaire results. For this reason, various item response theory (IRT) models have been proposed to model ERS and correct for it. Comparisons of these models are, however, rare in the literature, especially in the context of cross-cultural comparisons, where ERS is even more relevant because of cultural differences between groups. To remedy this issue, the current article examines two frequently used IRT models that can be estimated with standard software: a multidimensional nominal response model (MNRM) and an IRTree model. Studying the conceptual differences between these models reveals that they differ substantially in their conceptualization of ERS, and these differences result in different category probabilities between the models. To evaluate the impact of these differences in a multigroup context, a simulation study is conducted. Our results show that when the groups differ in their average ERS, the IRTree model and the MNRM can differ drastically in their conclusions about the size and presence of differences in the substantive trait between these groups. An empirical example is given, and implications for the future use of both models and the conceptualization of ERS are discussed.
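One frequently used IRTree coding, shown below, decomposes a 5-point response into binary pseudo-items for midpoint responding, direction, and extremity; this is a common variant offered for illustration and is not necessarily the specific tree examined in the article.

```python
# One common IRTree coding of a 5-point Likert response (1..5) into binary
# pseudo-items: midpoint vs. not, direction (agree vs. disagree), and
# extremity. None marks branches that do not apply to a given response.
def irtree_code(response):
    midpoint = 1 if response == 3 else 0
    direction = None if response == 3 else (1 if response >= 4 else 0)
    extreme = None if response == 3 else (1 if response in (1, 5) else 0)
    return {"midpoint": midpoint, "direction": direction, "extreme": extreme}


for r in range(1, 6):
    print(r, irtree_code(r))
```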
Procedures for Analyzing Multidimensional Mixture Data
Educational and Psychological Measurement, Ahead of Print.
The multidimensional mixture data structure exists in many test (or inventory) conditions, and heterogeneity also commonly exists in populations. Researchers are therefore often interested in deciding to which subpopulation a participant belongs according to the participant’s factor pattern. Thus, in this study, we proposed three analysis procedures based on the factor mixture model for analyzing data in the multidimensional mixture context. Simulations manipulated the number of factors, the factor correlations, the number of latent classes, and the degree of class separation. Issues regarding model selection were discussed first. The results showed that in the two-class situations, the procedures “factor structure first, then class number” (Procedure 1) and “factor structure and class number considered simultaneously” (Procedure 3) performed better than “class number first, then factor structure” (Procedure 2) and yielded precise parameter estimates and classification accuracy. It is appropriate to choose Procedure 1 or 3 when strong measurement invariance is assumed and an information criterion is used, but Procedure 1 saved more time than Procedure 3. In the three-class situations, the performance of all three procedures was limited. Implications and suggestions are addressed in this research.
A Note on Statistical Hypothesis Testing: Probabilifying Modus Tollens Invalidates Its Force? Not True!
Educational and Psychological Measurement, Ahead of Print.
The import or force of the result of a statistical test has long been portrayed as consistent with deductive reasoning. The simplest form of deductive argument has a first premise with conditional form, such as p→q, which means that “if p is true, then q must be true.” Given the first premise, one can either affirm or deny the antecedent clause (p) or affirm or deny the consequent clause (q). This leads to four forms of deductive argument, two of which are valid forms of reasoning and two of which are invalid. The typical conclusion is that only a single form of argument—denying the consequent, also known as modus tollens—is a reasonable analog of decisions based on statistical hypothesis testing. Now, statistical evidence is never certain but is associated with a probability (i.e., a p-level). Some have argued that modus tollens, when probabilified, loses its force and leads to ridiculous, nonsensical conclusions. Their argument is based on a specious problem setup. This note is intended to correct this error and restore the position of modus tollens as a valid form of deductive inference in statistical matters, even when it is probabilified.
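For clarity, the schematic below states modus tollens and one common probabilified reading of a significance test; the wording of the probabilified analog is supplied here for illustration rather than quoted from the note.

```latex
% Deductive form (modus tollens): from p \rightarrow q and \neg q, infer \neg p.
\frac{p \rightarrow q \qquad \neg q}{\therefore\ \neg p}

% A schematic probabilified analog in hypothesis testing:
% If H_0 is true, then the observed result D is improbable, P(D \mid H_0) \le \alpha;
% D was observed; therefore either H_0 is false or an event of probability at
% most \alpha has occurred.
```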