Tables 3 and 4 present the trait-specific test-set performance, in terms of quadratic weighted kappa (QWK), obtained after the random-search cross-validation of our main experiments for ASAP and MEWS, respectively. The best-performing hyperparameter settings for each model are listed in the Appendix (Table A3).
4.1 Features Versus Contextual Embeddings
In RQ 1, we aimed to compare the performance of a feature-based and a contextual embedding-based model. In Tables 3 and 4, the QWKs of the feature-based and the embedding-based DNN predictions on the test data can be found in the (prompt-specific) third and fourth rows, respectively. For comparison, the same results measured in PCC can be found in the Appendix (Tables A4 and A5). Across traits and prompts, the feature-based model outperformed the embedding-based model in 11 out of 16 cases. However, the performance of both models was similar: the feature-based model achieved an overall average of \(\overline{\mathrm{QWK}}_{\mathrm{features}} = .614\) (\(\overline{\mathrm{PCC}} = .673\)), and the embedding-based model achieved an overall average of \(\overline{\mathrm{QWK}}_{\mathrm{embeddings}} = .563\) (\(\overline{\mathrm{PCC}} = .625\)). A t-test across traits and prompts indicated no significant difference between the two approaches (p = .345). In addition, the embedding-based model fell short especially on the organization trait of the two MEWS prompts (\(\Delta\mathrm{QWK}_{\mathrm{MEWS\,1}} = .31\) and \(\Delta\mathrm{QWK}_{\mathrm{MEWS\,2}} = .33\), respectively), while performing almost equally well across all other traits (and prompts). The same pattern for the organization trait was evident in ASAP 2 but not in ASAP 1. Nevertheless, this finding seems plausible, as (even contextual) embeddings might not carry information about an essay's (meta-)structure, which is relevant for human annotators judging student essays. In contrast, such information, for instance the number of paragraphs, is represented in the feature set.
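All model comparisons in this section rely on QWK (with PCC reported in the Appendix). As a point of reference, both metrics can be computed directly with standard libraries; the following is a minimal sketch with made-up score vectors, not the authors' evaluation code:

```python
# Compute QWK and PCC between human and predicted trait scores.
# The score vectors below are illustrative placeholders.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 3, 1, 4, 2, 3]   # gold ratings for one trait
model_scores = [2, 3, 3, 3, 2, 4, 2, 4]   # rounded model predictions

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
pcc, _ = pearsonr(human_scores, model_scores)
print(f"QWK = {qwk:.3f}, PCC = {pcc:.3f}")
```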
Table 3
Quadratic Weighted Kappa Across ASAP Essay Traits
| Model | Content | Organization | Word choice | Sentence fluency | Conventions |
|---|---|---|---|---|---|
| *ASAP 1* | | | | | |
| N-Gram reg. | .536 | .511 | .515 | .491 | .481 |
| Feature reg. | .678 | .635 | .672 | .636 | .623 |
| Feature DNN | .693 | .657 | **.690** | .645 | .639 |
| DistilBERT | .713 | .666 | .677 | .675 | **.666** |
| Hybrid | **.743** | **.672** | .673 | **.681** | .648 |
| M. & B. (2018)¹ | .67 | .60 | .64 | .62 | .61 |
| M. & B. (2020)² | .703 | .664 | .675 | .648 | .638 |
| *ASAP 2* | | | | | |
| N-Gram reg. | .552 | .541 | .548 | .396 | .402 |
| Feature reg. | .637 | .658 | .686 | .672 | .684 |
| Feature DNN | .664 | .662 | .698 | .688 | **.699** |
| DistilBERT | .651 | .591 | .686 | .674 | .685 |
| Hybrid | **.688** | **.686** | **.715** | **.736** | .685 |
Note. The best-performing model for each trait and prompt is printed in bold. reg. = ridge regression.
¹ Performance benchmarks in terms of QWK from Mathias & Bhattacharyya (2018).
² Performance benchmarks in terms of QWK from Mathias & Bhattacharyya (2020).
Table 4
Model Performances Across MEWS Essay Traits
| Model | Content | Organization | Language quality |
|---|---|---|---|
| *MEWS 1 (AD)* | | | |
| N-Gram reg. | .330 | .142 | .442 |
| Feature reg. | .423 | .509 | .662 |
| Feature DNN | .380 | .482 | .648 |
| DistilBERT | .396 | .171 | .556 |
| Hybrid | **.463** | **.521** | **.698** |
| Human threshold¹ | .66 | .68 | .71 |
| *MEWS 2 (TE)* | | | |
| N-Gram reg. | .289 | .167 | .464 |
| Feature reg. | **.435** | .507 | .654 |
| Feature DNN | .377 | .517 | .688 |
| DistilBERT | .355 | .192 | .667 |
| Hybrid | .376 | **.528** | **.723** |
| Human threshold¹ | .52 | .77 | .72 |

Note. The best-performing model for each trait and prompt is printed in bold. reg. = ridge regression.
¹ Human rater agreement in terms of QWK.
Besides these differences in the organization trait, no systematic superiority of either approach was found across traits. For ASAP, the QWKs implied more systematic differences between prompts than between traits: while the embedding-based DNN outperformed the feature-based DNN in four out of five traits of ASAP 1, the feature-based DNN outperformed the embedding-based DNN in all traits of ASAP 2.
Furthermore, we compared these two models against two simpler baseline models, both ridge regressions: one with n-gram input and one with the feature input. The prompt-specific first and second rows of Tables 3 and 4 represent their test-set performance for each trait. Both target DNNs consistently outperformed the n-gram baseline, and one-sided t-tests indicated significant performance advantages (\(p_{\mathrm{features}} < .001\) and \(p_{\mathrm{embeddings}} = .010\), respectively). However, comparisons with the feature-based ridge baseline only partly revealed advantages for the target models, and the t-tests were not significant (\(p_{\mathrm{features}} = .818\), \(p_{\mathrm{embeddings}} = .465\)). The feature-based ridge baseline even performed consistently above the embedding-based DNN across all traits of the MEWS prompts (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{embeddings\ vs.\ baseline\,1}} = -.01\)). The feature-based DNN also fell short of the feature baseline in four out of six traits of the MEWS prompts (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{features\ vs.\ baseline\,1}} = -.02\)). For the two ASAP prompts, in contrast, the two DNN approaches almost consistently performed above the feature-based baseline, although the differences were small (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{features\ vs.\ baseline\,2}} = .02\) and \(\overline{\Delta\mathrm{QWK}}_{\mathrm{embeddings\ vs.\ baseline\,2}} = .01\)). These relatively small advantages imply that nonlinearities and interactions among features (as well as embeddings) were of minor importance when scoring the essay traits (see also Table A3 in the Appendix). This finding matches expectations, as raters typically follow strict judgment guidelines for benchmark scoring, and such guidelines are almost exclusively based on linear, additive scoring rules.
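The two baselines and the significance test can be sketched as follows. This is a toy illustration: the data are synthetic, TF-IDF weighting is an assumption (the paper does not specify the n-gram weighting), the tuned n-gram range and regularization strength come from the random search (Table A3), and we assume a paired one-sided test over trait/prompt-specific QWKs:

```python
# Sketch of the two ridge baselines and the model-comparison t-test.
# All data below are synthetic placeholders, not the ASAP/MEWS corpora.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
essays = ["One toy essay about school uniforms.",
          "Another toy essay arguing the opposite."] * 20
scores = rng.randint(1, 5, size=len(essays))   # one trait's gold scores
features = rng.rand(len(essays), 37)           # stand-in linguistic features

# Baseline 1: ridge regression on word n-grams (range is a placeholder).
ngram_baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                               Ridge(alpha=1.0)).fit(essays, scores)
# Baseline 2: ridge regression on the linguistic feature set.
feature_baseline = Ridge(alpha=1.0).fit(features, scores)

# One-sided paired t-test over per-trait QWKs of two models
# (illustrative values, not the reported results).
qwk_target = [.69, .66, .69, .65, .64]
qwk_baseline = [.54, .51, .52, .49, .48]
t, p = ttest_rel(qwk_target, qwk_baseline, alternative="greater")
print(f"t = {t:.2f}, one-sided p = {p:.3f}")
```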
4.2 Hybrid Architecture
The goal of RQ 2 was to compare a hybrid model architecture containing both input types – linguistic features and contextual embeddings – to the single-resource models. The trait-specific test-set performance of the hybrid model is represented in the fifth row of each prompt in Tables 3 and 4. The hybrid model achieved an average performance of \(\overline{\mathrm{QWK}}_{\mathrm{hybrid}} = .640\) (\(\overline{\mathrm{PCC}} = .681\)). As expected, the hybrid model outperformed the single-resource models in most traits (12 out of 16) across prompts. However, the one-sided t-test comparing the feature-based model to the hybrid was not significant (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{hybrid-features}} = .03\), \(p = .507\)), and the difference between the embedding-based DNN and the hybrid was not significant either (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{hybrid-embeddings}} = .08\), \(p = .156\)). Despite the non-significant results, the hybrid model descriptively outperformed the single-resource models in most traits and prompts. This finding meets expectations and is in line with recent findings from holistic scoring (Bai & Stede, 2022; Uto et al., 2020). Furthermore, it implies that both types of input indeed capture partially different text information relevant for scoring essay traits. Thus, the two input types complemented each other to a certain extent, even though most of the text information relevant for assessing essay traits seemed to be captured by both. This is plausible considering that both single-resource models already achieved high QWKs in almost all traits and prompts.
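A minimal sketch of such a hybrid architecture is shown below. The authors' exact layer sizes, pooling strategy, and training setup come from the random search (Table A3) and are not specified here; the [CLS]-token pooling, head dimensions, and feature count are assumptions for illustration:

```python
# Hybrid sketch: DistilBERT embedding concatenated with a linguistic feature
# vector, followed by a small regression head. Dimensions are assumptions.
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast

class HybridScorer(nn.Module):
    def __init__(self, n_features: int, hidden: int = 128):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        dim = self.encoder.config.hidden_size  # 768 for DistilBERT
        self.head = nn.Sequential(
            nn.Linear(dim + n_features, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, input_ids, attention_mask, features):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS]-token embedding
        return self.head(torch.cat([cls, features], dim=-1)).squeeze(-1)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
batch = tokenizer(["An example essay ..."], return_tensors="pt",
                  truncation=True, padding=True)
model = HybridScorer(n_features=10)            # feature count is a placeholder
score = model(batch["input_ids"], batch["attention_mask"], torch.rand(1, 10))
```

Dropping the feature input (or the encoder branch) from the concatenation recovers the two single-resource models, which is what makes the later ablation tests straightforward.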
A closer look at the different traits revealed that the largest average gains of the hybrid over the feature-based model appeared for the content and language traits (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{content}} = .04\), \(\overline{\Delta\mathrm{QWK}}_{\mathrm{language}} = .04\)). However, the advantages of the hybrid model were only slightly smaller for the organization traits on average (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{organization}} = .02\)). For this comparison, the three language traits of the ASAP++ analytic scoring rubric – word choice, sentence fluency, and conventions – were all treated as measures of language (to match the less detailed dimensionality of the MEWS rubric). A closer look at these three ASAP++ language traits revealed that performance on the conventions trait was least likely to benefit from the combined input of the hybrid model.
These findings are not surprising, as the employed features hardly capture content-related information, so the contextual embeddings made a decisive contribution in this respect; accordingly, the largest performance gains had been expected for the content trait. However, the strengths of contextual embeddings regarding language and writing style have also been repeatedly demonstrated in recent years. Against this background, the successful interplay of features and contextual embeddings for the language traits is also to be expected.
A closer look at the trait-specific gains of the hybrid over the embedding-based model revealed the highest performance gains for the organization traits (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{organization}} = .16\), \(\overline{\Delta\mathrm{QWK}}_{\mathrm{content}} = .02\), \(\overline{\Delta\mathrm{QWK}}_{\mathrm{language}} = .02\)). As mentioned above, this result corresponds to our expectations, since embeddings hardly capture any information about the meta-structure of essays.
4.3 Ablation Tests
To shed more light on the interplay of contextual embeddings and specific feature types when scoring certain essay traits, we ran two series of ablation tests. In the first series, we iteratively supplemented the embedding-based DNN with one feature type at a time and tracked the performance gains of these extended models over the DNNs that relied on embeddings alone. These comparisons allowed us to identify essay characteristics that were hardly covered by the contextual embeddings but were covered by the respective features, thus improving model performance. Figure 4 presents the respective results in terms of QWK change (i.e., \(\Delta\mathrm{QWK}\)) for each trait and prompt. Across prompts, performance gains on the content traits appeared most often when morphological complexity features supplemented the contextual embeddings. Length features turned out to be the most important supplement to the embedding input for the organization traits. Furthermore, lexical sophistication, error, and occurrence features were most likely to yield performance advantages across the language traits. Again, these findings seem reasonable. Length features describing the meta-structure of the essays provide structural information that embeddings cannot capture. For the assessment of language traits, text characteristics such as spelling or grammar errors are likewise not naturally captured by embeddings but are undoubtedly important for judging the language quality of student essays. The same applies to lexical sophistication and occurrence features, which describe aspects of language quality inaccessible to contextual embeddings. An interesting finding is that morphological complexity features were most relevant for the content traits. On the one hand, morphological complexity might not carry content-related information. On the other hand, comparatives and superlatives might be highly relevant inflections in argumentative writing: for example, they matter when different arguments are contrasted or weighed against each other to draw conclusions, and students' ability to contrast and weigh arguments is essential for good argumentative writing.
In the second series of ablation tests, we explored the unique contribution of single feature types. We used the complete hybrid architecture and iteratively removed one of the nine feature types at a time. Figure 5 shows the performance drops for each trait- and prompt-specific model across the nine re-analyses. Consistent performance drops across prompts indicate that a particular feature type contains trait-relevant information that neither the contextual embeddings nor the other features capture. The results imply that the performance of the content-trait models dropped across all four prompts when readability and syntactic complexity features were removed; both thus seem to contain unique information relevant to the assessment of content that the other feature types and the contextual embeddings could not capture. Consistent performance drops were also apparent when removing cohesion and, again, readability features from the organization-trait models. When removing occurrence, length, or error features, performance decreased almost consistently across the language traits. Across traits, length features in particular emerged as an essential feature type capturing important and unique text characteristics for judging the student essays.
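Both ablation series reduce to simple loops over feature types. The following schematic mocks the training pipeline: `train_and_eval` is a placeholder for the full train-and-score procedure, and the feature-type names follow the eight types named in the text (the paper uses nine):

```python
# Schematic of the two ablation series. `train_and_eval` stands in for the
# real pipeline, which trains a trait-specific DNN on the selected inputs
# and returns its test-set QWK; here it returns a random number.
import random

FEATURE_TYPES = ["length", "occurrence", "error", "readability",
                 "syntactic complexity", "lexical sophistication",
                 "cohesion", "morphological complexity"]  # 8 of 9 types named

def train_and_eval(feature_types, use_embeddings=True):
    return random.random()  # placeholder QWK

# Series 1: add one feature type at a time to the embeddings-only model
# and track the gain over the embeddings-only baseline (Figure 4).
base = train_and_eval([], use_embeddings=True)
gains = {ft: train_and_eval([ft]) - base for ft in FEATURE_TYPES}

# Series 2: remove one feature type at a time from the full hybrid model
# and track the drop relative to the complete hybrid (Figure 5).
full = train_and_eval(FEATURE_TYPES)
drops = {ft: full - train_and_eval([t for t in FEATURE_TYPES if t != ft])
         for ft in FEATURE_TYPES}
```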
Figure 4
Ablation Tests Tracking Highest Performance Gains (\(\Delta\mathrm{QWK}\)) by Adding One Type of Features to the Embedding-Based Models
Figure 5
Ablation Tests Tracking Highest Performance Drops (\(\Delta\mathrm{QWK}\)) by Removing One Type of Features from the Hybrid Models
4.4 Cross-Prompt Scoring
Tables 5 and 6 present the cross-prompt performance of the DNN models trained on the ASAP and MEWS corpora, respectively. Across models and traits, the performance drop from within-prompt to cross-prompt scoring (\(\Delta\mathrm{QWK}\)) ranged between −.01 and −.30 (\(\Delta\mathrm{PCC}\) range: [−.15; −.01]). For the organization trait, the embedding-based model trained on MEWS 2 even performed slightly better on the MEWS 1 test data than on its own prompt's test data (\(\Delta\mathrm{QWK} = .02\); i.e., the cross-prompt performance was better than the within-prompt performance). However, the embedding-based models generally worked very poorly for the MEWS organization trait.
Regarding the models trained on the MEWS prompts and on ASAP 2, the feature-based models outperformed the embedding-based models in cross-prompt performance across traits. In contrast, the embedding-based models trained on ASAP 1 consistently outperformed the feature-based models in cross-prompt performance. T-tests revealed no significant cross-prompt scoring advantage for the feature-based DNN (\(\overline{\mathrm{QWK}}_{\mathrm{features}} = .49\); \(\overline{\mathrm{PCC}}_{\mathrm{features}} = .52\)) over the embedding-based model (\(\overline{\mathrm{QWK}}_{\mathrm{embeddings}} = .42\); \(\overline{\mathrm{PCC}}_{\mathrm{embeddings}} = .45\)) (p = .131). Unsurprisingly, these cross-prompt comparisons are in line with the within-prompt patterns (see Tables 3 and 4). However, adjusting for within-prompt performance still implies slight advantages for the feature approach (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{features}} = -.12\) vs. \(\overline{\Delta\mathrm{QWK}}_{\mathrm{embeddings}} = -.15\); \(\overline{\Delta\mathrm{PCC}}_{\mathrm{features}} = -.09\) vs. \(\overline{\Delta\mathrm{PCC}}_{\mathrm{embeddings}} = -.11\)).
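The cross-prompt protocol behind Tables 5 and 6 can be sketched as follows. The data, the toy ridge model, and the 1–4 score scale are placeholders (the paper trains DNNs); within-prompt performance is measured on the training prompt's held-out test split, cross-prompt performance on the other prompt's test split:

```python
# Toy illustration of the cross-prompt evaluation scheme.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import cohen_kappa_score

def qwk(y_true, y_pred):
    y_hat = np.clip(np.rint(y_pred), 1, 4).astype(int)  # round to score scale
    return cohen_kappa_score(y_true, y_hat, weights="quadratic")

def make_prompt(seed):
    r = np.random.RandomState(seed)
    X, y = r.rand(200, 10), r.randint(1, 5, 200)        # synthetic essays
    return (X[:150], y[:150]), (X[150:], y[150:])       # train / test split

prompts = {"P1": make_prompt(1), "P2": make_prompt(2)}

for train_p, ((X_tr, y_tr), (X_te, y_te)) in prompts.items():
    model = Ridge().fit(X_tr, y_tr)
    within = qwk(y_te, model.predict(X_te))             # within-prompt QWK
    for test_p, (_, (X_c, y_c)) in prompts.items():
        if test_p == train_p:
            continue
        cross = qwk(y_c, model.predict(X_c))            # cross-prompt QWK
        print(f"{train_p} -> {test_p}: dQWK = {cross - within:+.2f}")
```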
Table 5
ASAP Cross-Prompt Scoring Performance in Terms of QWK (and Comparing Cross-Prompt and Within-Prompt Performance)
Training: ASAP 1

| Model | Test | Content | Organization | Word choice | Sentence fluency | Conventions |
|---|---|---|---|---|---|---|
| Features | ASAP 1 | .69 | .66 | .69 | .65 | .64 |
| Features | ASAP 2 | .56 (-.13) | .52 (-.14) | .55 (-.14) | .56 (-.09) | .54 (-.10) |
| DistilBERT | ASAP 1 | .71 | .67 | .68 | .68 | .67 |
| DistilBERT | ASAP 2 | .60 (-.11) | .56 (-.09) | .58 (-.10) | .62 (-.16) | .56 (-.11) |
| Hybrid | ASAP 1 | .74 | .67 | .67 | .68 | .65 |
| Hybrid | ASAP 2 | .61 (-.13) | .50 (-.17) | .51 (-.16) | .63 (-.05) | .56 (-.09) |

Training: ASAP 2

| Model | Test | Content | Organization | Word choice | Sentence fluency | Conventions |
|---|---|---|---|---|---|---|
| Features | ASAP 2 | .66 | .66 | .70 | .69 | .70 |
| Features | ASAP 1 | .54 (-.12) | .51 (-.15) | .54 (-.16) | .54 (-.15) | .50 (-.20) |
| DistilBERT | ASAP 2 | .65 | .59 | .69 | .67 | .69 |
| DistilBERT | ASAP 1 | .43 (-.22) | .36 (-.23) | .46 (-.23) | .50 (-.27) | .49 (-.20) |
| Hybrid | ASAP 2 | .69 | .69 | .72 | .74 | .69 |
| Hybrid | ASAP 1 | .43 (-.26) | .45 (-.24) | .48 (-.24) | .51 (-.23) | .51 (-.18) |

Note. Differences between cross-prompt and within-prompt performance are represented in brackets (\(\Delta\mathrm{QWK}\)).
Table 6
MEWS Cross-Prompt Scoring Performance in Terms of QWK (and Comparing Cross-Prompt and Within-Prompt Performance)
Training: MEWS 1

| Model | Test | Content | Organization | Language quality |
|---|---|---|---|---|
| Features | MEWS 1 | .38 | .48 | .65 |
| Features | MEWS 2 | .22 (-.16) | .40 (-.08) | .56 (-.09) |
| DistilBERT | MEWS 1 | .40 | .17 | .56 |
| DistilBERT | MEWS 2 | .12 (-.28) | .16 (-.01) | .47 (-.09) |
| Hybrid | MEWS 1 | .46 | .52 | .70 |
| Hybrid | MEWS 2 | .16 (-.30) | .43 (-.09) | .59 (-.11) |

Training: MEWS 2

| Model | Test | Content | Organization | Language quality |
|---|---|---|---|---|
| Features | MEWS 2 | .38 | .52 | .69 |
| Features | MEWS 1 | .33 (-.05) | .49 (-.03) | .62 (-.07) |
| DistilBERT | MEWS 2 | .36 | .19 | .67 |
| DistilBERT | MEWS 1 | .18 (-.18) | .21 (.02) | .54 (-.13) |
| Hybrid | MEWS 2 | .38 | .53 | .72 |
| Hybrid | MEWS 1 | .30 (-.08) | .46 (-.07) | .66 (-.06) |

Note. Differences between cross-prompt and within-prompt performance are represented in brackets (\(\Delta\mathrm{QWK}\)).
The hybrid model also outperformed the embedding-based approach in cross-prompt scoring but fell slightly short of the feature-based model on average (\(\overline{\mathrm{QWK}}_{\mathrm{hybrid}} = .48\); \(\overline{\mathrm{PCC}}_{\mathrm{hybrid}} = .52\)). Surprisingly, when adjusting for within-prompt performance, the hybrid model even performed worse than both single-resource approaches (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{hybrid}} = -.16\), \(\overline{\Delta\mathrm{PCC}}_{\mathrm{hybrid}} = -.12\)). However, t-tests revealed that these differences were not statistically significant.
Furthermore, we explored trait-specific cross-prompt performance losses. Across models, the largest drop from within-prompt to cross-prompt scoring emerged for the content traits (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{content}} = -.17\); \(\overline{\Delta\mathrm{PCC}}_{\mathrm{content}} = -.09\)). The comparably smallest drop was apparent for the organization traits (\(\overline{\Delta\mathrm{QWK}}_{\mathrm{organization}} = -.10\); \(\overline{\Delta\mathrm{PCC}}_{\mathrm{organization}} = -.05\)). This finding is again in line with expectations: the topic changes between prompts, and thus feature importance might vary depending on the prompt, whereas indicators of language and organizational text quality might be more stable across different writing prompts.