The Identification of Guessing Patterns in Progress Testing as a Machine Learning Classification Problem

doi:10.21203/rs.3.rs-4731140/v1

Download PDF

Research Article

The Identification of Guessing Patterns in Progress Testing as a Machine Learning Classification Problem

https://doi.org/10.21203/rs.3.rs-4731140/v1

This work is licensed under a CC BY 4.0 License

Version 1

posted

You are reading this older preprint version

Read the latest preprint version →

Background

The detection of guessing patterns in low-stakes progress testing could naturally be understood as a statistical classification problem where test takers are assigned to groups according to probabilities given by a machine learning model. However, the relevant literature on this topic does not include many examples where this approach is discussed; to date, the strategies applied to tackle this problem have been mostly based either on rapid response counting or the detection of unusual answer patterns.

Methods

On the basis of 14,897 participations in the Progress Test Medizin test – which takes place twice a year since 1999 in selected medical schools of Germany, Austria and Switzerland - we formulate the identification of guessing patterns as a binary classification problem. Next, we compare the performance of a logistic regression algorithm in this setup to that of the nonparametric person-fit indices included in R´s PerFit package. Finally, we determine probability thresholds based on the values of the logistic regression functions obtained from the algorithm.

Results

Comparison of logistic regression algorithm with person-fit indices

The logistic regression algorithm included in Python´s Scikit-Learn reached ROC-AUC scores of 0.886 to 0.903 depending on the dataset, while the 11 person-fit indices analysed returned ROC-AUC scores of 0.548 to 0.761.

Best feature set

Datasets based on aggregate scores yielded better results than those were the sets of answers to every item were considered as individual features. The best results were reached with a feature set containing only two parameters (self-monitoring accuracy and number of answered questions); considering the amount of time spent on the test did not lead to any performance improvement.

Probability thresholds

Based on the values of the logistic regression function generated by the applied algorithm, it is possible to establish thresholds above which there is at least a 90% chance of having guessed most answers.

Conclusions

In our setting, logistic regression clearly outperformed nonparametric person-fit indices in the task of identifying guessing patterns. We attribute this result to the greater flexibility of machine learning methods, which makes them more adaptable to diverse test environments than person-fit indices.

Progress test

Low-stakes tests

Disengaged test taking

Guessing patterns

Supervised machine learning

Binary classification

Logistic regression

Nonparametric person-fit indices

Research object

Despite the many advantages of the low stakes approach to progress testing, one obvious consequence of it is that test takers are for the most part free to show their lack of motivation by delivering a suboptimal performance; as a consequence of this, test scores may fail to accurately reflect the knowledge they are intended to gauge [1] [2] [3]. Education theorists, psychometricians and test organizers are then challenged to find distinctive traits that might help identifying disengaged participants. Among these traits, guessing is one of the most conspicuous; it has been frequently mentioned in the literature as both a construct-irrelevant factor [3] and a problem that affects the validity of test scores [4], being also widely regarded as a clear indicator of low test-taking effort, particularly when done within a short time span (“rapid guessing”) [2].

The earliest statistics intended to measure disengaged test taking were person-fit indices aimed at identifying unusual or deviant response patterns, which could hypothetically be attributed to “nonfitting response behavior such as guessing, cheating, or extremely creative behavior” [5]. By 1981, Harnisch and Linn [6] had identified two types of person-fit indices, depending on whether or not they were based on Item Response Theory; IRT-based indices like those developed by Drasgow, Levine et al. [7] came to be known as parametric person-fit statistics, whereas indices not related to IRT were termed non-parametric person fit statistics [8].

In 1995, Deborah L. Schnipke submitted her doctoral dissertation for approval, titled “Assessing speededness in computer-based tests using item response times” [9]. Steven L. Wise and Xiaojing Kong [2] used Schnipke´s research as a basis to claim that rapid guessing can be used to assess test taking effort; under this premise, they introduced an index named response time effort (RTE), which, given a participant p in a test T, measured the ratio of items responded by p without rapid guessing to the total number of items in T. The work of Wise and Kong gave birth to a whole new family of methods and models built upon the identification of rapid guessing patterns; Wise went on to produce a sizable body of work where this approach is formalized and articulated, with particular attention to the determination of the response time thresholds under which rapid guessing is presumed [10] [11] [12] [13] [14]. Other authors have opted for constructing mixture models where responses showing test-taking engagement (“solution behaviour") follow the IRT norm, while rapid responses are distributed according to random patterns [15] [16] [17].

In any case, both person-fit indices and methods based on rapid guessing present well-documented shortcomings and implementation issues [1] [15]; all in all, one can say that none of these two approaches has resulted in a standard method that is used by default in real life assessments as of 2024. This is why we claim there is room for a third generation of methods to detect disengaged or anomalous test taking; this new generation might very well be defined by the usage of machine learning techniques with the potential to be fine-tuned to identify any kind of irregular exam behaviour. The detection of test participants with unusual response patterns can easily be viewed as a supervised machine learning classification problem; one only needs to determine beforehand the exact type of response pattern to be identified.

Indeed, some research has been published recently on the performance of machine learning methods in the identification of different groups of non-serious test takers; for example, Zopluoglu [18] has used XGBoost to detect item preknowledge in large-scale tests, while Zhen and Zhu [19] have developed an ensemble learning method to detect cheating in educational tests, a problem which Kamalov et al. [20] have approached by training a recurrent neural network with sequential exam data. Of particular interest to us are the results of Nazari et al. [21], who have compared the results of random forests and person-fit indices in the detection of careless responders, concluding that random forests predict careless responding more accurately than any person-fit index included in the “PerFit” package of the R programming language.

The Progress Test Medizin (PTM) test is an ongoing, regularly scheduled test that provides a data corpus containing more than 10,000 instances per academic year. The model we want to construct is meant to be fully operational in the day-to-day practice of the PTM test; hence it must be not only theoretically sound, but also user-friendly and transparent. This practical orientation defines the gap we want to help closing with this study.

Research questions

The main purpose of this paper is to determine whether it is possible to identify guessing test takers through the application of a logistic regression algorithm. To this end, the following research questions will be examined throughout this paper:

Is logistic regression an adequate algorithm for identifying guessing patterns?
Which set of features delivers the best results in this setting?
Is it possible to identify guessing patterns on aggregate scores as opposed to considering all items as features?
How does a machine learning model compare to person-fit indices as well as methods based on rapid responses when it comes to identify guessing patterns?

Guessing test takers – The PTM case

The Progress Test Medizin (PTM) is a computer-based test issued by the Charité-Universitätsmedizin Berlin and administered at 17 German, Austrian and Swiss medical schools [22]; it consists of 200 multiple choice questions at graduate level [23], the same for all participating faculties. The exam is conducted through four different platforms (ePT, Moodle, LPlus and ILIAS), among which the ePT platform collects the most extensive information about student performance and test-taking behaviour. The ePT platform features an answering system in which students must report their certainty that their answers are indeed true. They are shown a Likert scale with three confidence levels (sure, likely and guessed); questions marked as “guessed” are not scored and do not count towards their global result. This allows students to make guesses without having to expect any penalties or gains from them [22] [24].

The PTM is graded according to a formula scoring method whereby every correct answer marked as “sure” or “likely” is awarded 1 point, but every incorrect answer marked as “sure” or “likely” receives a -1 score [24] [25]. Therefore, the penalty for a wrong guess is much harsher than the 1/(C-1) points – with C being the number of response alternatives – deducted in the formula scoring method used for example in the SAT test [4]. While formula scoring methods are on the whole intended to discourage guessing [26], this choice of penalty suggests that the designers of the PTM test were strongly interested in reducing guessed responses to an absolute minimum, probably because adequate self-assessment is of capital importance in medical practice; as Kämmer et al. [24] point out, highly confident physicians might potentially be in danger of committing diagnostic errors as they request fewer diagnostic tests. Moreover, participants with unusual answer patterns (as detected by a person-fit index or other methods) have been repeatedly excluded from studies based on PTM data [23] [24] [25]; this provides further support to the notion that guessing is regarded in the context of the PTM test as a factor that might jeopardise its validity, both as a formative tool and as an instrument of measure.

It is also worth mentioning that some faculties in the PTM consortium do not collect data about response times per question. This means that methods based on the detection of rapid guessing are not directly applicable in these faculties; hence the need for a procedure to identify guessing test takers that might suit all participating medical schools.

Definitions

Any supervised binary classification problem needs two predefined classification groups to be formulated. When it came to define these two groups for our classification problem, we considered that the propensity to guess was the most identifiable indicator for this purpose given the two sources of information that we could use, namely the Likert scales provided by participants and the response time data collected by the test platform. Therefore, we divided responses into non-guessed answers, self-reported guesses (obtained from the Likert scales) and rapid guesses (extracted from data on response time). Self-reported guesses and rapid guesses may overlap, but none of these groups include any non-guessed response; therefore, it makes sense to conceive non-guessed and guessed answers (including both self-reported and rapid guesses) as opposing response categories. It is then straightforward to follow a majority rule and allocate response patterns to one group or another depending on whether more than 50% of their individual responses were guessed. This reasoning leads to the following instrumental definitions (which are meant to be valid only within the context of this paper):

Self-reported guess: We define “self-reported guess” as any answer marked as guessed by the test participants themselves.

Rapid guess: We define “rapid guess” as any response submitted less than x seconds after the previous one (or after starting the test in the case of responses to the first question), where x is defined by applying the normative threshold method with a 10 percent threshold (NT10) described by Wise and Ma in [12]. We have also set a lower threshold at 4 seconds in view of the very low share of correct answers among those submitted in 3 seconds or less (cf. Table 1).

Guessed answer: We define “guessed answer” as any answer labelled as either “self-reported guess” or “rapid guess”.

Guessing test taker: Participants are identified as “guessing test takers” when more than 50% of their responses were classified as guessed answers.

Guessing pattern: We define “guessing pattern” as a sequence of answers pertaining to a guessing test taker.

The idea that there might be a difference between self-reported guesses and rapid guesses is supported by the breakdown of test answers by confidence levels, which shows that the percentage of correct answers among self-reported guesses was 36.18% for the test administered in the winter term 2020 [22]. This share is significantly higher than the expected percentage of correct answers for rapid guesses or random answer patterns; the probability of randomly choosing the right answer for a question also picked at random amounts to just 0.2353 for the test administered in the winter term 2020 [22]. This corroborates the hypothesis, also postulated by Kämmer et al. [24], that guessed answers were often not chosen entirely at random, but rather given by test takers who choose the option they consider most likely to be correct according to what they know, even if they do not yet possess enough knowledge to provide a more confident response.

Features

We decided to construct a baseline feature set with only two variables (self-monitoring accuracy and share of answered questions) and evaluated to which extent adding the time spent on the exam to the input features improves the model results (if at all). Self-monitoring accuracy can be measured by the share of correct answers among those submitted (that is, excluding omitted responses) [24]. It can be considered almost tautological that guessing participants will usually have lower levels of self-monitoring accuracy than non-guessing ones, but the implications of this obvious fact have not been explored much further; the usage of self-monitoring accuracy as a criterion to identify guessing patterns is rarely mentioned in the literature. Karay et al. [25] do acknowledge the existence of a relation between self-monitoring accuracy and guessing propensity in the context of formative tests with a “don´t know” option, going on to assume that all incorrect answers in the analysed test (the PTM as of 2011) were guessed.

The importance given to rapid responses in the literature about detection of careless test takers prompted us to determine the convenience of including the time spent on the exam as a model feature, under the assumption that guessing test takers would typically devote less time to the test. In any case, it is possible that the baseline feature set already succeeds in capturing test-taking effort, thereby making it unnecessary to consider the time spent of the exam as well; this is why we want to compare how our logistic regression models fare with and without this feature.

Subsequently, we investigated if considering each question as a feature offers any improvement with respect to the baseline model. Our reason to do so is that appropriateness measurement methods are based on assessing patterns extracted from the sequence of answers given by each participant; therefore, every single response (or even the absence of it) is relevant to the result. We wanted to explore whether the performance of a logistic regression model is any better when every response is taken as input.

Logistic regression

We chose logistic regression as binary classification method because it offers the advantage of delivering results that would be eventually easier to explain to other interested parties (e.g. students) due to the relative simplicity of its logistic function; the inner workings of the other algorithms would be much harder to grasp for non-experts. This advantage in transparency makes logistic regression the foremost candidate if we are required to communicate our results to the public on a regular basis. Logistic regression has been described as “a parametric method used for examining the relationship between a binary response variable (one that is categorical having only two categories) and a set of independent predictor variables that can be either continuous or categorical” [27]. It can be conceived as a linear regression model where the dependent variable is a natural logarithm [28], according to the expression [29]. In the context of this study, the variables stand for the possible features (number of answers, self-monitoring accuracy, time spent on the test); and represent the respective probabilities of each binary outcome. This makes it possible to express the threshold between guessing and non-guessing patterns as a simple linear equation, for example , where a would stand for the number of answers, s for the self-monitoring accuracy, and t for the time spent on the test.

Self-reporting on confidence

Within our framework, the labelling of answers as “guessed” depends mostly on the self-reporting of participants rather than on response time, to such extent that 94.84% of all answers marked as guessed in our dataset were self-reported guesses. In order to build reliable training and test sets for our models, we needed to remove data entries that are either inaccurate or inconsistent, or whose credibility cannot be verified due to lack of relevant information. In particular, we had to check whether self-reporting by participants matched a consensual definition of the confidence levels they were shown (“sure”,”likely” and “guessed”)

Our idea was to assign a fixed meaning to the labels “sure”, “likely” and “guessed”, that is, to define them in such a way that they all match numerically quantifiable degrees of certainty based on the average share of correct answers associated to each level. Thus, “sure” would imply a probability of 0.812 that an answer is correct; this probability would be 0.5847 for “likely” and 0.38 for “guessed”. For each participation, we postulated the null hypothesis that the underlying probability distribution for the number of correct answers among those marked as “sure” is the binomial probability distribution B(N,P), where N is the total number of answers labelled as “sure”, and P=0.812 is given by the proportion of correct answers among those labelled as “sure” in the whole dataset (excluding rapid responses). This null hypothesis would be verified via a two-tailed test with a p-value of 0.0027, related to the three-sigma interval [0.00135; 0.99865] of a normal probability distribution; our intention here was to detect only the most extreme discrepancies from the norm given by P(correct | sure)=0.812, so that we had the possibility of discarding participations that are truly incompatible with this value. Same with answers labelled as “likely” (P(correct | likely) = 0.5847) and “guessed” (P(correct | guessed) = 0.38). In short, we carried out three hypothesis tests per participation, that is, one for each level in our Likert scale for confidence.

Before conducting these hypothesis tests, one must ensure that there is enough information available to do so, that is, there are enough answers so that at least one event E such that P(E)<p/2, where p is the p-value, is possible. For example, this does not happen if we have 3 guessed answers with P(correct)=0.38; in this case, the least likely event would be to answer all three questions correctly, and the probability of this event would be P(3 correct answers) = 0.38³= 0.054872>0.0135=p/2. Since this is the least likely event, we see that there would be no event that could refute the null hypothesis, so we would need a larger sample of answers in order to test it. But we are working with real data collected within the framework of an established formative test, so we cannot obtain any more answers from the participants who did not provide them in the first place.

In summary, these hypothesis tests could return three possible outcomes: non-rejection of the null hypothesis, rejection of the null hypothesis, or inconclusiveness due to lack of information. We removed all entries where none of the three hypothesis tests could be carried out (that is, all tests were inconclusive due to lack of information) or the null hypothesis was rejected in at least one of the hypothesis tests, suggesting that the particular meaning they assign to at least one of the labels “sure”, “likely” and “guessed” differs significantly from their consensual definition based on the average values mentioned above.

We tested all 24,084 PTM participations registered through the ePT platform from October 2020 to June 2022 in order to examine their suitability for our study; 15,347 participations (63.72%) passed all three hypothesis tests, while 6,258 (25.98%) failed at least one of them and were thus not included in our dataset. Finally, there were 2,479 participations (10.29%) for which the test was inconclusive due to lack of information; 1,152 such cases (4.78%) concerned participants who did not respond any question.

Case-specific accuracy

We do not make public allegations of guessing for all possible or likely cases, but only for the most flagrant instances of this behaviour. Therefore, we are also interested in determining whether our method might provide a degree of certainty in identifying guessing test takers. We have then ranked all test set instances according to their algorithm-assigned probabilities of a positive identification, focusing ourselves on the results of the highest percentiles, so we can establish a threshold above which the algorithm´s decision can be regarded as fully dependable.

Comparison with person-fit indices

In order to answer our research question about the performance of logistic regression vis-à-vis person-fit indices and methods based on rapid responses, we tested 11 dichotomous non-parametric person-fit indices included in the R package PerFit [30]. Since the PTM is not an IRT-based test, parametric person-fit indices were not assessed; furthermore, we decided not to analyse the NCI statistic on the grounds that it is linearly related to another person-fit index also included in the PerFit package (the GNormed statistic, also known as “normed Guttman errors”). The procedure employed for this comparison is roughly similar to the one applied by Nazari et al. in [21]: for the test set of every PTM test included in our study, we computed the ROC-AUC scores of the prediction vectors of each method (i.e., the vectors containing the values with which predictions are made) against the vectors containing the actual response pattern labels (1 for guessing patterns, 0 for other patterns).

While we have chosen to set the guessing test taker threshold at a basis value of 50%, we have also explored how our method compares to person-fit indices when this threshold is set at a different value. To avoid large imbalances, we have only considered hypothetical thresholds where none of the two classification groups accounts for more than 80% of the data entries; under this rule, the range of possible threshold values goes from T=19% to T=64%. We therefore repeated the procedure described in the previous paragraph for the 46 integer values in this range (T {19%, 20%, 21%,…,64%} ), with guessing test takers defined as participants whose share of guessed answers exceeds T. This helped us determine whether the performance of our method against person-fit indices is independent of possible cutoff values.

Pipeline

Dataset

We have used an anonymized dataset containing data from the four PTM tests administered from October 2020 to June 2022, known as PT43, PT44, PT45 and PT46 [23], that we found out to be reliable according to the considerations explained in the “Self-reporting on confidence” section of this paper; this dataset contains records of 15,347 participations from ten medical schools in Germany, Austria and Switzerland. We have also considered the partial datasets corresponding to each PTM test from PT43 to PT46; these five datasets were combined with five different sets of features (see Table 2) to produce 22 logistic regression models. Combinations involving the global dataset and the set of individual answers to each question were not considered, since participants in different tests were not asked the same questions; therefore, the inclusion of features based on specific questions does not make sense for datasets with data from multiple tests.

As a necessary preprocessing step, we decided to discard all data entries belonging to the following categories:

1. Participations by students enrolled in their “practical year” (junior residency). Students in their “practical year” are not considered for this study because their participation in the PTM test is voluntary; moreover, not all universities in the PTM consortium give these students the possibility to take part in the test [22]. Hence, we discarded 380 participations associated to such students.

2. Participations lacking reliable data about the amount of time spent on the test. 70 further participations were discarded because data about the amount of time spent on the test was unreliable or missing.

Our final dataset includes 14,897 participations, of which5,116 have been submitted by guessing test takers according to our definition.

Model selection and hyperparameter setting

Logistic regression was implemented using the sklearn.linear_model.LogisticRegression module of the Python library scikit-learn [31] [32]. All datasets used in this study were split into a training set with 80% of the data and a test set with 20% of the data. Parameter optimization was carried out with scikit-learn's RandomizedSearchCV, which implements a randomized search over parameters, where each setting is sampled from a distribution over possible hyperparameter values [33]. We ran RandomizedSearchCV 1000 times, keeping the hyperparameter sets that provided the best ROC-AUC score for each of the 22 combinations of data and features examined in this study.

Metrics

The accuracy levels shown in tables 3, 4 and 5 under “Accuracy (cv)” correspond to the highest accuracies reached with each algorithm-input combination as determined by 10-fold cross validation performed using the scikit-learn function cross_val_score(). 10-fold cross validation is a procedure whereby the training set is divided into 10 smaller sets (“folds”); then, each fold is used in turn as a test set while the other nine function as training sets. The value returned by 10-fold cross validation is the mean of the values computed for every iteration of the procedure [34].

The results shown in tables 6, 7 and 8 refer to the final evaluation of the test set after the classification task. Thresholds to identify the most likely cases of guessing patterns in a real setting were derived from the precision values for the subsets including the 5%, 10%, 15% and 20% of test set items with the highest algorithm-assigned probability to correspond to guessing test takers. These values, together with their associated confidence intervals, are shown in tables 6, 7 and 8 respectively under “Precision (95th percentile)”, “Precision (90th percentile)”, “Precision (85th percentile)” and “Precision (80th percentile)”.

Since many person-fit indices rely strongly on the identification of uncommon answer sequences, we have based our comparison between logistic regression and person-fit indices on the four partial datasets, in order to avoid comparing sequences which do not refer to the same questions. All person-fit index computations were carried out with the R package PerFit; the computation of the ROC-AUC scores was made with the R package pROC [35].

Overall performance

As we can see in Tables 3 and 6, the results of the two models involving the complete dataset are very similar to each other, with identical cross-validated ROC-AUC and accuracy scores (0.894 and 0.824, respectively) and minimal differences in precision for the cases most clearly identified by the algorithm as guessing test takers. Therefore, adding the time spent on the exam as a feature does not bring any significant performance gain.

Table 4 shows that models built with data from partial datasets (that is, datasets including participations from only one test) and aggregate features (i.e. without taking into account results of individual questions) returned roughly similar results to those related to the complete dataset; cross-validated ROC-AUC scores varied from 0.887 to 0.902, with a mean value of 0.895. Again, there is no steady improvement when one adds the time spent on the exam to the feature set.

On the other hand, all models where every question was individually considered as a feature performed worse than the models built on the same datasets with aggregate results (cf. Tables 5 and 8). This detrimental effect of considering whole answer sequences instead of aggregate results might also be relevant when assessing person-fit indexes, which are mostly based on evaluating series of individual answers in order to identify anomalous response patterns.

Case-specific accuracy

When we consider the 10% of test takers with the highest algorithm-determined probabilities of being guessing test takers, we find that our algorithm identifies at least 93.52% of them correctly taking confidence intervals into account (see Table 3); it is thus feasible to set a threshold at the value of the logistic regression function for the percentile 90 and state that there is a more than 93% chance that a test participant exceeding this threshold is a guessing test taker.

As a rule, models based on the complete dataset will be more convenient for us to set the thresholds that we will use in our day-to-day practice, since the larger size of this dataset implies narrower confidence intervals.

Comparison with person-fit indices

All the person-fit indices we tested on our partial datasets returned clearly worse results than the logistic regression algorithm used throughout this study (cf. Table 9, Figure 1, Figure 2). The highest ROC-AUC score reached with any person-fit index was 0.761, with a mean of 0.676 for the 44 experiments conducted; under the same experimental settings, the logistic regression algorithm achieved ROC-AUC scores ranging from 0.886 to 0.903, with a mean of 0.891 for the 8 experiments conducted. The four best performing person-fit indexes were Kane and Brennan´s agreement, disagreement and dependability statistics and the Guttman errors index, with ROC-AUC scores ranging from 0.671 to 0.761. The poorest results were recorded for the Personal Biserial index (0.591 for the PT44 dataset) and the Modified Caution Statistic (0.592 for the PT46 dataset).

According to our results, this performance advantage of logistic regression does not depend on the position of the classification threshold; Figures 3 to 6 show that the ROC-AUC scores of the logistic regression models with the “baseline” and “baseline+time” feature sets are consistently higher than those of person-fit indices throughout the range of thresholds from 19% to 64%, with differences of at least 0.1 between any implementation of the logistic regression algorithm and the best placed person-fit index.

We have applied the logistic regression algorithm from the Python library ScikitLearn on 22 different combinations of datasets and feature sets (see Table 2); we have subsequently measured the performance of this algorithm as shown by the ROC-AUC scores obtained for each combination of dataset and feature set, also comparing these scores to those yielded by the person-fit indices included in R´s PerFit package. In addition, we have explored whether a probability threshold can be set at some value of the logistic regression function, in order to identify test participations with a particularly high likelihood of having guessed most of their answers.

In the preceding paragraphs we have also given a detailed account of the methods we applied, starting with the necessary definitions and following with the rationale according to which we selected the datasets and feature sets used in this study. Further, we have explained how we dealt with the self-reported information on confidence levels provided by PTM participants; our pipeline and preprocessing steps were also described.

These efforts are intended to serve the overall purpose of this study, which is to determine the suitability of a logistic regression algorithm to identify guessing patterns within the context of the PTM test. This goal is structured around the four research questions listed at the end of the Introduction, whose answers will be discussed in the following section.

Answers to research questions

Is logistic regression an adequate algorithm for identifying guessing patterns?

Overall, our framework for identifying random responders with machine learning classification algorithms has fulfilled our expectations. We have observed that it delivers a better performance in our experimental setting than all non-parametric person-fit indices analysed in this study, as we can see in Table 9; Table 6 shows that its accuracy is even close to 1 for the test takers with the highest algorithm-determined probabilities of being a guessing test taker. An analysis of rapid responses might as well deliver an imprecise depiction of the state of affairs on guessing; while 70.04% of the answers provided by the subset of guessing test takers in our final dataset were either guessed or rapid responses, the share of rapid responses in this subset amounts to just 5.8%, which indicates that using rapid responses alone to identify guessing test takers could lead to underestimating the problem.

A successful machine learning method captures the structure of the training set and fits the items of theoretically similar datasets onto this structure, so that the implemented procedure morphs to match the training set and the classification problem to be solved. As one can see in [30], person-fit indices offer fewer options for parameter adjustment, which might make them less suitable for datasets with specific properties.

Which set of features delivers the best results in our setting?

Tables 3, 4 and 5 show that the best ROC-AUC scores overall were reached with just two or three features: either self-monitoring accuracy and number of answered items, or these two plus the amount of time spent on the test. Removing the time parameter does not usually lead to worse scores; using only two features could be a viable option if data on time spent on the test are incomplete or unreliable.

Is it possible to identify guessing patterns on aggregate scores as opposed to considering all items as features?

Our experiments with aggregate scores have returned better outcomes than those performed on the same datasets including all items as features; the results of person-fit indices with our datasets also indicate that the differential behaviour in answering specific questions is not that relevant for us when it comes to identify guessing test takers.

How does a machine learning model compare to person-fit indices as well as methods based on rapid responses when it comes to identify guessing patterns?

Hosmer, Lemeshow and Sturdivant [36] have stated that a test intended towards discriminating items must show a ROC-AUC score of at least 0.70 to be considered acceptable for that purpose; a ROC-AUC score of 0.50 indicates that an index does not discriminate at all between categories. Our results with the classification threshold set at 50% (cf. Table 9) show that logistic regression models surpassed the 0.70 boundary in all cases, with ROC-AUC scores ranging from 0.886 to 0.903 depending on the dataset and features. This is also true for Kane and Brennan´s dependability statistic, but the ROC-AUC scores of this index range from 0.703 to 0.761, thus being nowhere near those of logistic regression. Seven further person-fit indices surpassed the 0.70 boundary only in some cases, whereas the three worst performing person-fit indices (Personal Biserial, U3 and ZU3) always returned ROC-AUC scores below 0.70 in this setting. While the ROC-AUC scores of most person-fit indices improve as we raise the classification threshold, so do the ROC-AUC scores of logistic regression models, so that the advantage of the latter remains almost constant all throughout (cf. Figures 3 to 6).

Many of the nonparametric indices included in the PerFit package are based one way or another on Guttman errors [6], defined as instances where a test taker, confronted with a pair of questions Q₁ and Q₂ where P(Q₁ right) < P(Q₂ right), gives a correct answer to Q₁, but not to Q₂ [5]. It is true that random responders might tend to fall into this more than serious test takers, because the difficulty level of a question does not matter much when the answer is to be randomly chosen. Nevertheless, non-serious test takers are not the only ones who might commit Guttman errors; for example, students who are particularly proficient in certain subjects would easily commit these errors by giving correct answers to difficult questions belonging to their field of expertise while failing to respond less demanding questions about other topics. As a formative test intended to cover the whole curriculum of a five-year medical degree, the PTM might be a case in point about these limitations of non-parametric person-fit indices.

As for methods based on rapid responses, their main shortcoming in the context of our dataset could arguably lie on the fact that not all guessing test takers are rapid responders [37]; depending on the test environment, students who do not resort to rapid responses could even represent the majority among guessing participants. This is something we also observe in our data: among the 5% of test takers with the largest share of guessed answers, only 29.2% showed an above-average share of rapid responses. It is true that methods based on rapid responses might boast a relatively high precision rate; 91.6% of all test takers with a share of rapid responses higher than 10% also have a share of random responses higher than 50%, a fact which supports the idea that rapid responders are often also guessing test takers. But since the opposite is not necessarily true in most cases, methods based on rapid responses might lack from an insufficient recall rate, particularly in the case of datasets like ours, where the rapid response rate is relatively low.

Limitations and challenges

PTM formula scoring

PTM participants are strongly encouraged not to respond questions whose answer they do not know; the formula scoring used is also very intent on penalizing wrong answers much more than admissions of ignorance. This means, however, that our results might not be generalizable to examination designs where test takers are instructed to answer all questions and every response is relevant to the final score. In any case, this paper is not intended as a guidebook to be followed to the letter, but rather as a set of experiments whose design could eventually be adjusted to fit the conditions of other tests.

Allocation of a fixed numerical meaning to Likert scale levels

We have allocated a fixed numerical value to the each of the three levels of our Likert scale on confidence, subsequently excluding participations that diverge significantly from these values. It might be argued that the resulting dataset would be one where serious participants are overrepresented with respect to their real share among test takers. While this might be true to some extent, we needed to ensure a certain level of consistency in our dataset; the discarded data entries corresponded either to participants who were not present in the test to the same degree as engaged test takers (for example early leavers or non-responders) or to those who did not understand the distinction between the answer categories “sure”, “likely” and “guessed” as most other participants did.

Practical implications

It is an established practice in the PTM consortium to identify participations showing unusual answer patterns and exclude them from the calculation of mean values and aggregate figures [23] [24] [25]; the method described in this study is intended to be used for this exact purpose. All participants whose answer pattern is identified as unusual are informed of it by means of a warning notice in their personalized test reports, which says “You have possibly not completed the test in full or just guessed. Your result is therefore excluded from the comparison groups” (in German: “Du hast den Test möglicherweise unvollständig bearbeitet oder nur geraten. Dein Ergebnis ist daher von der Vergleichsgruppenbildung ausgeschlossen“). By being told that their results are excluded from the comparison groups, PTM participants also get the implicit message that they did not approach the test as expected; therefore, they did not make the most of it as a formative tool.

This means that PTM participants will receive an opinion from their faculty about their exam behaviour based on the results of the method we present here. Hence our method is meant to transcend the realm of academic discussion and become a part of the exchange between faculty and students in a medical education setting; against this background, we must make it clear that our claims are backed by strong evidence. This is the reason why we aimed to detect participants with a probability higher than 90% of having guessed most of their answers; this definition is empirically justifiable, which might make it easier to be understood and assimilated by test participants.

We have formulated the identification of guessing patterns in the PTM test as a machine learning classification problem; a logistic regression algorithm was tested on this framework. The results hint at the possibility that machine learning algorithms could substantially improve the performance of non-parametric person-fit indices. The main advantage of machine learning methods in this regard lies on their configuration flexibility, which makes them appropriate for diverse and untypical test environments, and also for identifying guessing patterns on the basis of real data.

PTM

Progress Test Medizin

Progress Test

AUC

Area under the curve

ROC

Receiver Operating Curve

Confusion matrix

Cross-validated

lrb

Logistic regression, baseline

lrbt

Logistic regression, baseline + time

KBA

Kane and Brennan´s agreement statistic

KBD

Kane and Brennan´s disagreement statistic

KBE

Kane and Brennan´s dependability statistic

Guttman errors

Normed Guttman errors

Personal biserial

Caution statistic

Mcs

Modified caution statistic

Baseline

b + t

Baseline + time

b + q

Baseline + questions

b + t + q

Baseline + time + questions

Questions only

Schüttpelz-Brauns, K., Kadmon, M., Kiessling, C., Karay, Y., Gestmann, M. & Kämmer, J.E.(2018). Identifying low test-taking effort during low-stakes tests with the new Test-taking Effort Short Scale (TESS) – development and psychometrics. BMC Med Educ18, 101, doi:10.1186/s12909-018-1196-0.
Wise, S.L. & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18, 163–183.
Wise, S.L. & DeMars, C.E. (2010). Examinee Noneffort and the Validity of Program Assessment Results, Educational Assessment, 15:1, 27-41, doi:10.1080/10627191003673216
Bereby‐Meyer, Y., Meyer, J., & Flascher, O.M. (2002). Prospect theory analysis of guessing in multiple choice tests. Journal of Behavioral Decision Making, 15(4), 313-327, doi: 10.1002/bdm.417
Meijer, R. (1994). The Number of Guttman Errors as a Simple and Powerful Person-Fit Statistic. Applied Psychological Measurement - APPL PSYCHOL MEAS. 18. 311-314, doi: 10.1177/014662169401800402.
Harnisch, D.L., & Linn, R.L. (1981). Analysis of Item Response Patterns: Questionable Test Data and Dissimilar Curriculum Practices. Journal of Educational Measurement, 18(3), 133–146.
Drasgow, F., Levine, M.V., & McLaughlin, M.E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.
Karabatsos, G. (2003) Comparing the Aberrant Response Detection Performance of Thirty-Six Person-Fit Statistics, Applied Measurement in Education, 16:4, 277-298, doi: 10.1207/S15324818AME1604_2
Schnipke, D.L. (1995). Assessing speededness in computer-based tests using item response times (Unpublished doctoral dissertation). Johns Hopkins University, Baltimore, MD
Wise, S.L. (2018): Controlling construct-irrelevant factors through computer-based testing: disengagement, anxiety, & cheating, Education Inquiry,doi:10.1080/20004508.2018.1490127
Wise, S.L. & Gao, L. (2017). A general approach to measuring test-taking effort on computer-based tests. Applied Measurement in Education, 30(4), 343–354, doi:10.1080/08957347.2017.1353992
Wise, S.L. & Ma, L. (2012). Setting response time thresholds for a CAT Item Pool: The Normative Threshold Method; Annual meeting of the National Council on Measurement in Education, Vancouver, Canada: 163-183
Wise, S.L. (2019) An Information-Based Approach to Identifying Rapid-Guessing Thresholds, Applied Measurement in Education, 32:4, 325-336, doi:10.1080/08957347.2019.1660350
Wise, S.L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52–61, doi:10.1111/emip.12165
Ulitzsch, E., Von Davier, M., & Pohl, S. (2020). A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item-level non-response. British Journal of Mathematical and Statistical Psychology, 73(S1), 83–112, doi:10.1111/bmsp.12188.
van der Linden, W.J. (2007). A Hierarchical Framework for Modeling Speed and Accuracy on Test Items. Psychometrika72, 287–308, doi:10.1007/s11336-006-1478-z
Pokropek, A. (2016). Grade of membership response time model for detecting guessing behaviors. Journal of Educational and Behavioral Statistics, 41(3), 300–325, doi:10.3102/1076998616636618
Zopluoglu, C. (2019). Detecting Examinees With Item Preknowledge in Large-Scale Testing Using Extreme Gradient Boosting (XGBoost). Educational and Psychological Measurement. 79, doi:10.1177/0013164419839439.
Zhen, Y., & Zhu, X. (2023). An Ensemble Learning Approach Based on TabNet and Machine Learning Models for Cheating Detection in Educational Tests. Educational and Psychological Measurement, 0(0). https://doi.org/10.1177/00131644231191298
Kamalov, F., Sulieman, H. & Santandreu Calonge, D. (2021) Machine learning based approach to exam cheating detection. PLoS ONE16(8): e0254340, doi:10.1371/journal.pone.0254340
Nazari, S., Leite, W.L. & Huggins-Manley, A. C. (2021). Detecting Careless Responding to Assessment Items in a Virtual Learning Environment Using Person-fit Indices and Random Forest. In Hsiao, I., Sahebi, S., Couchet, B., and Vie J.,(Eds.), Proceedings of the 14th International Conference on Educational Data Mining (pp. 635–640). International Educational Data Mining Society.
Sieg, M., Roselló Atanet, I., Tomova, M.T., Schoeneberg, U., Sehy, V., Mäder, P. & März, M. (2023). Discovering unknown response patterns in progress test data to improve the estimation of student performance. BMC Med Educ23, 193, doi:10.1186/s12909-023-04172-w
Sehy, V., Roselló Atanet, I., Sieg, M., Struzena, J. & März, M. (2022). "Effects of COVID-19 Pandemic on Progress Test Performance in German-Speaking Countries", Education Research International, vol. 2022, Article ID 3023026, doi: 10.1155/2022/3023026
Kämmer, J.E., Hautz, W.E. & März M. (2020). Self-monitoring accuracy does not increase throughout undergraduate medical education. Med Educ.2020; 1–8, doi: 10.1111/medu.14057
Karay, Y., Schauber, S.K., Stosch, C. & Schüttpelz-Brauns, K. (2015). Karay, Y., Schauber, S. K., Stosch, C. & Schüttpelz-Brauns, K. (2015). Computer Versus Paper—Does It Make Any Difference in Test Performance? Teaching and Learning in Medicine, 27(1), 57–62. doi:10.1080/10401334.2014.979175
Espinosa, M.P. & Gardeazábal, J. (2005). On the strategic equivalence of multiple-choice test scoring rules. DFAE-II WP Series, ISSN1988-088X, No. 20, 2005. WP 2005-20.
Wang, C., Keith, S.W., Fontaine, K.R. & Allison, D.B. (2006). 14 - Statistical Issues for Longevity Studies in Animal Models, In Conn, P.M. (Ed.), Handbook of Models for Human Aging, Academic Press (pp. 153-164), doi:/10.1016/B978-012369391-4/50015-1.
LaValley, M. P. (2008). Logistic regression. Circulation 117, 2395–2399, doi: 10.1161/CIRCULATIONAHA.106.682658
Sperandei, S. (2014). Understanding logistic regression analysis. Biochem Med. 24:12–8, doi: 10.11613/BM.2014.003
Tendeiro, J.N. (2022), Package ‘PerFit’. https://cran.r-project.org/web/packages/PerFit/PerFit.pdf. Accessed 17 Oct 2023.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011) Scikit-learn: Machine Learning in Python. J Mach Learn Res. 12: 2825–30, doi:10.48550/arXiv.1201.0490
scikit-learn (2023), https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Accessed 3 Apr 2023.
scikit-learn (2023), https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search. Accessed 3 Apr 2023.
scikit-learn (2023), https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation. Accessed 3 Apr 2023.
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.C. & Müller, M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77, doi:10.1186/147121051277
Hosmer, D. W, Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. Third edition / Hoboken (N.J.): Wiley.
Nagy, G., Ulitzsch, E., & Lindner, M. A. (2023). The role of rapid guessing and test-taking persistence in modelling test-taking engagement. Journal of Computer Assisted Learning, 39(3), 751–766.doi: 10.1111/jcal.12719

Tables 1 to 9 are available in the Supplementary Files section

The authors declare no competing interests.

Tables.docx

Download PDF

Version 1

posted

You are reading this older preprint version

Read the latest preprint version →

The Identification of Guessing Patterns in Progress Testing as a Machine Learning Classification Problem

Status:

Version 1

Abstract

Figures

Introduction

Methods/Experimental

Results

Discussion

Answers to research questions

Is logistic regression an adequate algorithm for identifying guessing patterns?

Which set of features delivers the best results in our setting?

Limitations and challenges

PTM formula scoring

Allocation of a fixed numerical meaning to Likert scale levels

Practical implications

Conclusion

Abbreviations

References

Tables

Additional Declarations

Supplementary Files

Status:

Version 1