Thinking About Sum Scores Yet Again, Maybe the Last Time, We Don’t Know, Oh No . . .: A Comment on

Educational and Psychological Measurement, Ahead of Print.
The relative advantages and disadvantages of sum scores and estimated factor scores are issues of concern for substantive research in psychology. Recently, while championing estimated factor scores over sum scores, McNeish offered a trenchant rejoinder to an article by Widaman and Revelle, which had critiqued an earlier paper by McNeish and Wolf. In the recent contribution, McNeish misrepresented a number of claims by Widaman and Revelle, rendering moot his criticisms of Widaman and Revelle. Notably, McNeish chose not to confront a key strength of sum scores stressed by Widaman and Revelle: the greater comparability of results across studies if sum scores are used. Instead, McNeish pivoted to present a host of simulation studies to identify relative strengths of estimated factor scores. Here, we review our prior claims and, in the process, deflect purported criticisms by McNeish. We briefly discuss issues related to simulated and empirical data that provide evidence of the strengths of each type of score. In doing so, we identify a second strength of sum scores: superior cross-validation of results across independent samples of empirical data, at least for samples of moderate size. We close with consideration of four general issues concerning sum scores and estimated factor scores that highlight the contrasts between the positions offered by McNeish and by us, issues of importance when pursuing applied research in our field.

Multimodal Data Fusion to Detect Preknowledge Test-Taking Behavior Using Machine Learning

Educational and Psychological Measurement, Ahead of Print.
In various fields, including college admissions, medical board certification, and military recruitment, high-stakes decisions are frequently made based on scores obtained from large-scale assessments. These decisions necessitate precise and reliable scores that enable valid inferences to be drawn about test-takers. However, the ability of such tests to provide reliable, accurate inferences about a test-taker's performance could be jeopardized by aberrant test-taking practices, such as practicing real items prior to the test. As a result, it is crucial for administrators of such assessments to develop strategies that detect potentially aberrant test-takers after data collection. The aim of this study is to explore the implementation of machine learning methods in combination with multimodal data fusion strategies that integrate bio-information technology, such as eye-tracking, and psychometric measures, including response times and item responses, to detect aberrant test-taking behaviors in technology-assisted remote testing settings.
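
As a concrete, hypothetical illustration of feature-level data fusion of the kind described above, the sketch below concatenates simulated eye-tracking summaries, response-time summaries, and item responses into a single feature matrix and scores a classifier with cross-validated AUC. The column names, simulated data, and choice of classifier are assumptions for illustration, not the authors' pipeline.

```python
# Minimal sketch of feature-level (early) fusion for aberrant-behavior detection.
# Column names, simulated data, and the classifier choice are illustrative
# assumptions, not the authors' implementation.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200  # simulated test-takers

# Simulated modality blocks: eye-tracking, response times, item responses.
eye = pd.DataFrame(rng.normal(size=(n, 3)), columns=["fix_count", "fix_dur", "saccade_len"])
rt = pd.DataFrame(rng.lognormal(size=(n, 2)), columns=["median_rt", "rt_sd"])
items = pd.DataFrame(rng.integers(0, 2, size=(n, 5)), columns=[f"item_{i}" for i in range(5)])
y = rng.integers(0, 2, size=n)  # 1 = flagged as aberrant (simulated labels)

# Early fusion: concatenate the modality blocks into one feature matrix.
X = pd.concat([eye, rt, items], axis=1)

clf = GradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```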

A Relative Normed Effect-Size Difference Index for Determining the Number of Common Factors in Exploratory Solutions

Educational and Psychological Measurement, Ahead of Print.
Descriptive fit indices that do not require a formal statistical basis and do not specifically depend on a given estimation criterion are useful as auxiliary devices for judging the appropriateness of unrestricted or exploratory factor analytical (UFA) solutions, when the problem is to decide the most appropriate number of common factors. While overall indices of this type are well known in UFA applications, especially those intended for item analysis, difference indices are much scarcer. Recently, Raykov and collaborators proposed a family of effect-size-type descriptive difference indices that are promising for UFA applications. As a starting point, we considered the simplest measure of this family, which (a) can be viewed as absolute and (b) from which only tentative cutoffs and reference values have been provided so far. In this situation, this article has three aims. The first is to propose a relative version of Raykov's effect-size measure, intended to be used as a complement to the original measure, in which the increase in explained common variance is related to the overall prior estimated amount of common factor variance. The second is to establish reference values for both indices in item-analysis scenarios using simulation. The third, instrumental aim is to implement the proposal in both the R language and a well-known non-commercial factor analysis program. The functioning and usefulness of the proposal are illustrated using an existing empirical dataset.
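
To make the idea concrete, here is a rough numerical sketch based only on the abstract's wording: it assumes the absolute index is the increase in explained common variance when adding a factor, and the relative version norms that increase by the common variance already explained. The exact definitions in the article may differ, and the simulated data and use of scikit-learn's FactorAnalysis are illustrative stand-ins.

```python
# Illustrative sketch of an absolute vs. relative difference index for judging
# the number of common factors. The formulas are an assumed reading of the
# abstract, not the article's exact definitions.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
F = rng.normal(size=(500, 2))                        # two latent factors
loadings_true = rng.uniform(0.4, 0.8, size=(10, 2))  # 10 items
X = F @ loadings_true.T + rng.normal(scale=0.6, size=(500, 10))

def explained_common_variance(X, k):
    """Sum of squared loadings of a k-factor unrestricted solution."""
    fa = FactorAnalysis(n_components=k, random_state=0).fit(X)
    return float((fa.components_ ** 2).sum())

for k in range(1, 4):
    ecv_k = explained_common_variance(X, k)
    ecv_next = explained_common_variance(X, k + 1)
    absolute = ecv_next - ecv_k    # absolute difference index (illustrative)
    relative = absolute / ecv_k    # relative (normed) version (illustrative)
    print(f"{k} -> {k + 1} factors: absolute = {absolute:.3f}, relative = {relative:.3f}")
```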

An Ensemble Learning Approach Based on TabNet and Machine Learning Models for Cheating Detection in Educational Tests

Educational and Psychological Measurement, Ahead of Print.
Cheating in educational tests has emerged as a paramount concern in education, prompting scholars to explore diverse methodologies for identifying potential cheaters. While machine learning models have been extensively investigated for this purpose, TabNet, a deep neural network model for tabular data, has received little attention. In this study, a comprehensive evaluation and comparison of 12 base models (naive Bayes, linear discriminant analysis, Gaussian process, support vector machine, decision tree, random forest, Extreme Gradient Boosting (XGBoost), AdaBoost, logistic regression, k-nearest neighbors, multilayer perceptron, and TabNet) was undertaken to scrutinize their predictive capabilities. The area under the receiver operating characteristic curve (AUC) was employed as the performance metric. The findings showed that TabNet (AUC = 0.85) outperformed its counterparts, indicating that deep neural network models are well suited to tabular tasks such as the detection of academic dishonesty. Encouraged by these outcomes, we combined the two most effective models, TabNet (AUC = 0.85) and AdaBoost (AUC = 0.81), into an ensemble model, TabNet-AdaBoost (AUC = 0.92). This hybrid approach shows considerable promise for research in this domain. Importantly, our investigation offers fresh insights into the use of deep neural network models for identifying cheating in educational tests.
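
The sketch below illustrates the general ensemble idea of averaging two models' predicted probabilities and comparing AUCs. To keep it dependency-light, a multilayer perceptron stands in for TabNet (a TabNet implementation such as the pytorch-tabnet package would take its place), and the simulated data and combination rule are illustrative assumptions rather than the authors' exact procedure.

```python
# Minimal sketch of a probability-averaging ensemble of the kind described above.
# An MLP stands in for TabNet to keep the example dependency-light.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ada = AdaBoostClassifier(random_state=0).fit(X_tr, y_tr)
net = MLPClassifier(max_iter=1000, random_state=0).fit(X_tr, y_tr)

# Soft-voting ensemble: average the two models' predicted probabilities.
p_ens = (ada.predict_proba(X_te)[:, 1] + net.predict_proba(X_te)[:, 1]) / 2

for name, p in [("AdaBoost", ada.predict_proba(X_te)[:, 1]),
                ("MLP", net.predict_proba(X_te)[:, 1]),
                ("Ensemble", p_ens)]:
    print(name, round(roc_auc_score(y_te, p), 3))
```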

An Illustration of an IRTree Model for Disengagement

Educational and Psychological Measurement, Ahead of Print.
Low-stakes test performance commonly reflects examinee ability and effort. Examinees exhibiting low effort may be identified through rapid guessing behavior throughout an assessment. Many methods have been proposed to adjust scores once rapid guesses have been identified, but these have been plagued by strong assumptions or the removal of examinees. In this study, we illustrate how an IRTree model can be used to adjust examinee ability for rapid guessing behavior. Our approach is flexible as it does not assume independence between rapid guessing behavior and the trait of interest (e.g., ability), nor does it necessitate the removal of examinees who engage in rapid guessing. In addition, our method uniquely allows for the simultaneous modeling of a disengagement latent trait in addition to the trait of interest. The results indicate that the model is quite useful for estimating individual differences among examinees in the disengagement latent trait and in providing more precise measurement of examinee ability relative to models that ignore rapid guesses or accommodate them in different ways. A simulation study reveals that our model results in less biased estimates of the trait of interest for individuals with rapid responses, regardless of sample size and rapid response rate in the sample. We conclude with a discussion of extensions of the model and directions for future research.
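
As a generic illustration of the two-node tree structure described here, the sketch below computes the category probabilities implied by a first node for rapid guessing (driven by a disengagement trait) and a second node for accuracy given an effortful response (driven by ability). The 2PL form and the parameter values are assumptions for illustration, not the authors' exact specification.

```python
# Hedged sketch of a two-node IRTree for disengagement. Node 1 models whether a
# response is a rapid guess (disengagement trait eta); node 2 models accuracy
# given an effortful response (ability theta). Parameter values are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def irtree_probs(theta, eta, a_d=1.0, b_d=0.0, a=1.2, b=-0.3):
    """Return P(rapid guess), P(engaged & correct), P(engaged & incorrect)."""
    p_rapid = sigmoid(a_d * (eta - b_d))       # node 1: disengagement
    p_correct = sigmoid(a * (theta - b))       # node 2: accuracy if engaged
    return p_rapid, (1 - p_rapid) * p_correct, (1 - p_rapid) * (1 - p_correct)

print(irtree_probs(theta=0.5, eta=-1.0))  # low disengagement, moderate ability
```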

Wald χ2 Test for Differential Item Functioning Detection with Polytomous Items in Multilevel Data

Educational and Psychological Measurement, Ahead of Print.
Identifying items with differential item functioning (DIF) in an assessment is a crucial step for achieving equitable measurement. One critical issue that has not been fully addressed with existing studies is how DIF items can be detected when data are multilevel. In the present study, we introduced a Lord's Wald χ2 test-based procedure for detecting both uniform and non-uniform DIF with polytomous items in the presence of the ubiquitous multilevel data structure. The proposed approach is a multilevel extension of a two-stage procedure, which identifies anchor items in its first stage and formally evaluates candidate items in the second stage. We applied the Metropolis–Hastings Robbins–Monro (MH-RM) algorithm to estimate multilevel polytomous item response theory (IRT) models and to obtain accurate covariance matrices. To evaluate the performance of the proposed approach, we conducted a preliminary simulation study that considered various conditions to mimic real-world scenarios. The simulation results indicated that the proposed approach has great power for identifying DIF items and controls the Type I error rate well. Limitations and future research directions were also discussed.
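
As a generic numerical illustration of a Wald chi-square test of this kind (not the authors' multilevel implementation), the statistic below is the covariance-weighted squared distance between focal- and reference-group item parameter estimates, referred to a chi-square distribution. The parameter differences and covariance matrix are made up; in the study they would come from the multilevel polytomous IRT model estimated with the MH-RM algorithm.

```python
# Hedged numerical sketch of a Wald chi-square DIF test. Values are illustrative.
import numpy as np
from scipy.stats import chi2

d = np.array([0.15, -0.22, 0.10])      # focal - reference parameter differences
V = np.diag([0.010, 0.012, 0.008])     # covariance of the differences (illustrative)

W = float(d @ np.linalg.inv(V) @ d)    # Wald statistic
df = d.size
print(W, chi2.sf(W, df))               # statistic and p-value
```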

Can People With Higher Versus Lower Scores on Impression Management or Self-Monitoring Be Identified Through Different Traces Under Faking?

Educational and Psychological Measurement, Ahead of Print.
According to faking models, personality variables and faking are related. Most prominently, people’s tendency to try to make an appropriate impression (impression management; IM) and their tendency to adjust the impression they make (self-monitoring; SM) have been suggested to be associated with faking. Nevertheless, empirical findings connecting these personality variables to faking have been contradictory, partly because different studies have given individuals different tests to fake and different faking directions (to fake low vs. high scores). Importantly, whereas past research has focused on faking by examining test scores, recent advances have suggested that the faking process could be better understood by analyzing individuals’ responses at the item level (response pattern). Using machine learning (elastic net and random forest regression), we reanalyzed a data set (N = 260) to investigate whether individuals’ faked response patterns on extraversion (features; i.e., input variables) could reveal their IM and SM scores. We found that individuals had similar response patterns when they faked, irrespective of their IM scores (excluding the faking of high scores when random forest regression was used). Elastic net and random forest regression converged in revealing that individuals higher on SM differed from individuals lower on SM in how they faked. Thus, response patterns were able to reveal individuals’ SM, but not IM. Feature importance analyses showed that whereas some items were faked differently by individuals with higher versus lower SM scores, others were faked similarly. Our results imply that analyses of response patterns offer valuable new insights into the faking process.
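
As a sketch of this analytic strategy, the hypothetical example below predicts a personality score from item-level response patterns with elastic net and random forest regression and then inspects item-level feature importances. The simulated data, variable names, and model settings are illustrative assumptions, not the study's data or tuning.

```python
# Minimal sketch: predict a personality score (stand-in for SM) from item-level
# (faked) response patterns with elastic net and random forest regression, then
# inspect feature importances. Data are simulated for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_persons, n_items = 260, 12
items = rng.integers(1, 6, size=(n_persons, n_items)).astype(float)  # faked Likert responses
sm = items[:, :3].mean(axis=1) + rng.normal(scale=0.5, size=n_persons)  # stand-in SM scores

enet = ElasticNetCV(cv=5)
forest = RandomForestRegressor(n_estimators=200, random_state=0)

for name, model in [("elastic net", enet), ("random forest", forest)]:
    r2 = cross_val_score(model, items, sm, cv=5, scoring="r2").mean()
    print(f"{name}: mean cross-validated R^2 = {r2:.2f}")

# Item-level importance: which items are answered differently by high vs. low scorers?
forest.fit(items, sm)
print(np.round(forest.feature_importances_, 3))
```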

An Evaluation of Fit Indices Used in Model Selection of Dichotomous Mixture IRT Models

Educational and Psychological Measurement, Ahead of Print.
A Monte Carlo simulation study was conducted to compare fit indices used for detecting the correct latent class in three dichotomous mixture item response theory (IRT) models. Ten indices were considered: Akaike's information criterion (AIC), the corrected AIC (AICc), Bayesian information criterion (BIC), consistent AIC (CAIC), Draper's information criterion (DIC), sample-size-adjusted BIC (SABIC), relative entropy, the integrated classification likelihood criterion (ICL-BIC), the adjusted Lo–Mendell–Rubin (LMR), and Vuong–Lo–Mendell–Rubin (VLMR). The accuracy of the fit indices was assessed for correct detection of the number of latent classes for different simulation conditions including sample size (2,500 and 5,000), test length (15, 30, and 45), mixture proportions (equal and unequal), number of latent classes (2, 3, and 4), and latent class separation (no-separation and small separation). Simulation study results indicated that as the number of examinees or number of items increased, correct identification rates also increased for most of the indices. Correct identification rates by the different fit indices, however, decreased as the number of estimated latent classes or parameters (i.e., model complexity) increased. Results were good for BIC, CAIC, DIC, SABIC, ICL-BIC, LMR, and VLMR, and the relative entropy index tended to select correct models most of the time. Consistent with previous studies, AIC and AICc showed poor performance. Most of these indices had limited utility for three-class and four-class mixture 3PL model conditions.
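
For reference, several of the information criteria compared here are simple functions of a model's maximized log-likelihood, number of free parameters, and sample size. The sketch below computes them from made-up values; entropy- and likelihood-ratio-based indices (relative entropy, LMR, VLMR) require class posteriors or nested-model comparisons and are not shown.

```python
# Standard information-criterion formulas, computed from a model's maximized
# log-likelihood (logL), number of free parameters (p), and sample size (n).
import numpy as np

def information_criteria(logL, p, n):
    aic = -2 * logL + 2 * p
    return {
        "AIC": aic,
        "AICc": aic + 2 * p * (p + 1) / (n - p - 1),
        "BIC": -2 * logL + p * np.log(n),
        "CAIC": -2 * logL + p * (np.log(n) + 1),
        "SABIC": -2 * logL + p * np.log((n + 2) / 24),
    }

# Example: two- vs. three-class solutions (made-up log-likelihoods); smaller is better.
print(information_criteria(logL=-41250.7, p=61, n=2500))
print(information_criteria(logL=-41198.3, p=92, n=2500))
```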

Measuring Unipolar Traits With Continuous Response Items: Some Methodological and Substantive Developments

Educational and Psychological Measurement, Ahead of Print.
In recent years, some models for binary and graded format responses have been proposed to assess unipolar variables or "quasi-traits." These studies have mainly focused on clinical variables that have traditionally been treated as bipolar traits. In the present study, we propose a model for unipolar traits measured with continuous response items. The proposed log-logistic continuous unipolar model (LL-C) is remarkably simple and is more similar to the original binary formulation than the graded extensions, which is an advantage. Furthermore, considering that irrational, extreme, or polarizing beliefs could be another domain of unipolar variables, we have applied this proposal to an empirical example of superstitious beliefs. The results suggest that, in certain cases, the standard linear model can be a good approximation to the LL-C model in terms of parameter estimation and goodness of fit, but not in terms of trait estimates and their accuracy. The results also show the importance of considering the unipolar nature of this kind of trait when predicting criterion variables, since the validity results were clearly different.

Using Multiple Imputation to Account for the Uncertainty Due to Missing Data in the Context of Factor Retention

Educational and Psychological Measurement, Ahead of Print.
Although parallel analysis has been found to be an accurate method for determining the number of factors in many conditions with complete data, its application under missing data is limited. The existing literature recommends that, after using an appropriate multiple imputation method, researchers either apply parallel analysis to every imputed data set and use the number of factors suggested by most of the data copies, or average the correlation matrices across all data copies and then apply parallel analysis to the averaged correlation matrix. Both approaches for pooling the results provide a single suggested number without reflecting the uncertainty introduced by missing values. The present study proposes the use of an alternative approach, which calculates the proportion of imputed data sets that result in k (k = 1, 2, 3 . . .) factors. This approach will inform applied researchers of the degree of uncertainty due to missingness. Results from a simulation experiment show that the proposed method is more likely to suggest the correct number of factors when missingness contributes to a large amount of uncertainty.
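
A minimal sketch of the proposed pooling approach follows, under simplifying assumptions (a basic iterative imputer and an eigenvalue-based parallel analysis rather than the procedures used in the study): impute the data m times, run parallel analysis on each imputed data set, and report the proportion of imputations suggesting each number of factors.

```python
# Hedged sketch: proportion of imputed data sets suggesting k factors.
# The imputation model and parallel-analysis implementation are simplified stand-ins.
from collections import Counter

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def parallel_analysis(X, n_sims=100, seed=0):
    """Count leading observed eigenvalues that exceed the mean random eigenvalues."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    rand = np.array([
        np.linalg.eigvalsh(np.corrcoef(rng.normal(size=(n, p)), rowvar=False))[::-1]
        for _ in range(n_sims)
    ]).mean(axis=0)
    exceeds = obs > rand
    return int(exceeds.argmin()) if not exceeds.all() else len(exceeds)

rng = np.random.default_rng(0)
F = rng.normal(size=(300, 2))                                   # two latent factors
X = F @ rng.uniform(0.4, 0.8, size=(10, 2)).T + rng.normal(scale=0.6, size=(300, 10))
X[rng.random(X.shape) < 0.15] = np.nan                          # impose 15% missingness

m = 20                                                          # number of imputations
counts = Counter(
    parallel_analysis(IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X))
    for i in range(m)
)
for k in sorted(counts):
    print(f"{k} factors suggested in {counts[k] / m:.0%} of imputed data sets")
```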