Neural Networks or Linguistic Features? - Comparing Different Machine-Learning Approaches for Automated Assessment of Text Quality Traits Among L1- and L2-Learners’ Argumentative Essays

doi:10.21203/rs.3.rs-3979182/v1

Download PDF

Research Article

Neural Networks or Linguistic Features? - Comparing Different Machine-Learning Approaches for Automated Assessment of Text Quality Traits Among L1- and L2-Learners’ Argumentative Essays

https://doi.org/10.21203/rs.3.rs-3979182/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 13 Sep, 2024

Read the published version in International Journal of Artificial Intelligence in Education →

You are reading this latest preprint version

Recent investigations in automated essay scoring research imply that hybrid models, which combine feature engineering and the powerful tools of deep neural networks (DNNs), reach state-of-the-art performance. However, most of these findings are from holistic scoring tasks. In the present study, we use a total of four prompts from two different corpora consisting of both L1 and L2 learner essays annotated with three trait scores (e.g., content, organization and language quality). In our main experiments, we compare three variants of trait-specific models using different inputs: (1) models based on 220 linguistic features, (2) models using essay-level contextual embeddings from the distilled version of the pre-trained transformer BERT (DistilBERT), and (3) a hybrid model using both types of features. Results imply that when trait-specific models are trained based on a single-resource, the feature-based models slightly outperform the embedding-based models. These differences are most prominent for the organization traits. The hybrid models outperform the single-resource models, indicating that linguistic features and embeddings indeed capture partially different aspects relevant for the assessment of essay traits. To gain more insights into the interplay between both feature types, we run ablation tests for single feature groups. Trait-specific ablation tests across prompts indicate that the embedding-based models can most consistently be enhanced in content assessment when combined with morphological complexity features. Most consistent performance gains in the organization traits are achieved when embeddings are combined with length features, and most consistent performance gains in the assessment of the language traits when combined with lexical complexity, error, and occurrence features. Cross-prompt scoring again reveals slight advantages for the feature-based models.

automated essay scoring

student essays

essay traits

feature engineering

pre-trained transformers

Deep-Neural-Networks

Assessing students’ free-text answers (e.g., argumentative essays) is an important task for artificial intelligence (AI) and natural language processing (NLP) in education. This also involves developing tutoring systems based on AI-driven assessment procedures (e.g., Bai & Stede, 2022; Mathias & Bhattacharyya, 2020). Advantages of such systems that are mentioned in existing research involve (1) reduced workload for teachers, (2) immediate information about the performance level of their students without extensive manual correction effort, (3) more frequent and instant feedback for students, and (4) a consistent assessment procedure that is, for instance, not bound to human attention processes (see, e.g., Ramesh & Sanampudi, 2022; Uto, 2021; Yan, 2020). Recent research has emphasized the promise of AI-based tutoring systems in supporting students closely during writing and learning processes (e.g., Hussein et al., 2019; Injadat et al., 2021). However, to support students in the context of complex writing tasks such as argumentative essays, an accurate and comprehensive assessment of several aspects of writing is necessary (Bai & Stede, 2022). Such fine-grained scoring of different aspects is a challenging problem that is largely unresolved in the field of automated essay scoring (AES) (see, Horbach et al., 2017; Kumar & Boulanger, 2021). Different AES approaches using machine learning methods have been proposed to face the challenges of AES. Like in many NLP tasks, two general model types have been proposed: feature engineering and deep neural networks (DNN) (Bai & Stede, 2022; Ke & Ng, 2019; Kusuma et al., 2022; Uto, 2021). Recent studies have shown that hybrid models, combining both approaches, can outperform models based on a single resource (e.g., Dasgupta et al., 2018; Uto et al., 2020; see also Mizumoto & Eguchi, 2023). However, most of these comparisons used holistic scoring methods (i.e., assigning one overall grade per essay) (Lagakis & Demetriadis, 2021). The holistic approach, however, provides assessment on a superordinate level that is not suitable for meaningful tutorial feedback or in-depth diagnosis of students’ writing abilities (Condon & Elliot, 2022; Narciss, 2008). Moreover, from a methodological point of view, holistic scoring makes it impossible to disentangle possible strengths (and weaknesses) of the different AES approaches regarding certain aspects of text quality (Andrade, 2018). In the current study, we therefore compare the performance of different AES approaches scoring analytic essay rubrics (also referred to as traits in the following). For this purpose, we use four different argumentative prompts, containing English essays written by L1 students (from the ASAP corpus) and L2 students (from the MEWS corpus, Keller et al., 2020; Rupp et al., 2019). We consider analytic benchmark scores assigned by trained human raters representing different aspects of text quality, such as language quality, organization, and content. In doing so, we compare the two single-input approaches (linguistic features vs. essay-level contextual embeddings from DistilBERT) and explore which approach to analytic scoring is superior regarding a given aspects of text quality. Furthermore, we investigate whether the hybrid model outperforms the two single-resource approaches across prompts and traits. The hybrid model is expected to be superior to the single-source models as it uses both types of inputs, while the single-source models are each based on a single type of input. In addition, we use ablation tests (i.e., stepwise removal/addition of certain model input components) to uncover types of linguistic features that are especially important for specific traits and that can hardly be captured by (essay-level) contextual embeddings and are thus particularly relevant for the hybrid architecture. Finally, we examine the cross-prompt performance of the models within the L1 and L2 corpora. These research goals lead us to the following research questions (RQ) that guide our experiments:

How do feature-based and text-level contextual-embedding-based models differ regarding their performance on scoring certain aspects of text quality?

Under which conditions does a hybrid approach outperform the single models across different aspects of text quality?

Which linguistic feature types carry information not covered by the contextual embeddings of DistilBERT and are therefore most important for the hybrid approach?

How do the different model architectures differ regarding cross-prompt performance?

To answer these questions we have organized the article into five sections, as follows. Section 2 introduces different AES approaches and discusses their respective presumed advantages and disadvantages. Section 3 describes the datasets, the different model architectures, and the training procedures used in the present study. In Section 4, we present the results of our experiments. Finally, we discuss our results and outline limitations along possible directions for future research in Section 5.

Most machine learning (ML) approaches to AES follow a supervised learning strategy (Ke & Ng, 2019), where humans’ assessments of a given set of student essays are used as benchmark scores to train ML models. Afterwards, these models are used to assign scores to new texts written in response to the same or new prompts.

Key characteristics of ML models used in supervised learning AES tasks are (1) the way texts (i.e., student essays) are represented as numerical features and (2) the actual ML architecture being employed (e.g., linear regression or DNN; see, Ramesh & Sanampudi, 2022). Both aspects are related and two overarching AES approaches have been distinguished in recent studies: first, feature engineering where domain experts decide which features to use for a specific task, and second, DNN approaches (Ke & Ng, 2019; Kusuma et al., 2022), where the model learns a suitable representation on its own. In the upcoming section, we will outline key characteristics of both approaches. Our focus will be on key model architectures most relevant for the comparisons carried out in the present study (see Bai & Stede, 2022; Uto, 2021). A more detailed and comprehensive overview can be found in Ramesh and Sanampudi (2022), Lagakis and Demetriadis (2021), and Uto (2021).

2.1 Feature Engineering Approach

AES models following the feature engineering approach are based on a theory-driven way of translating text into numerical data using NLP methods (X. Chen & Meurers, 2016; Crossley, 2020; McNamara et al., 2014; Zesch et al., 2015). Those features range from simple length-based representations, such as the number of words or paragraphs, to highly elaborated linguistic constructs, such as coherence (Mesgar & Strube, 2018) or cohesion measures (e.g., Crossley et al., 2017).

The feature engineering approach is the traditional AES approach (Lagakis & Demetriadis, 2021). Over the last decade, many tools have been proposed that provide the user with a vast range of linguistic features (X. Chen & Meurers, 2016; Crossley, 2019; Kumar & Boulanger, 2021; Kyle et al., 2018). In the following sections, we will provide a brief overview of common types of features applied in AES tasks.

2.1.1 Length and Occurrence Features

In the context of formal education, student essays are written under a specific time limit (time writing, see, Weigle, 2002). Therefore, essay length in terms of words, sentences, or paragraphs, has been demonstrated to be a powerful predictor of human scores (see, e.g., Fleckenstein, Meyer, et al., 2020; Zesch et al., 2015). Furthermore, ratios of words per sentence or sentences per paragraph can be interpreted as a proxy for syntactic complexity. Other length features typically used in AES models are mean word length (in characters) or the total number of unique words (e.g., J. Chen et al., 2016).

Occurrence features are closely related to length-based features and include, for instance, the counting of nouns, proper nouns, verbs, adjectives as well as special characters. Classification of words into word types is known as part-of-speech tagging (POS; e.g., Mitkov & Voutilainen, 2012). To do POS tagging, natural language processing (NLP) tools such as the Python libraries spaCy or NLTK can be applied. In addition, the ratios of specific word types to the total number of words are also frequently used (X. Chen & Meurers, 2016).

2.1.2 Error Features

Another important aspect of student writing that usually factors into performance evaluations is the number of errors (e.g., typos or grammar errors). Thereby, error ratios are often calculated, such as the proportion of spelling errors to the total number of words. LanguageTool is a powerful tool to automatize error counts.

2.1.3 Features Relating to Lexical Diversity and Sophistication

Common indicators for the lexical diversity of student essays are type-token ratios. Tokens are defined as all individual words in a text whereas types are defined as unique words. Thus, if the type-token ratio is close to one, lexical diversity is high. If the type-token ratio is close to zero, lexical diversity is low.

Common features to represent the lexical sophistication of essays are (weighted) counting of occurrences on large word-frequency corpora such as the British National Corpus (BNC) or the Brown frequency list. For example, to determine the predominant use of "easy words", the top 1000 of the BNC word list have been used as a reference (X. Chen & Meurers, 2016). Conversely, "difficult words", for example, have been defined as not being included in the top 2000 of the BNC word list (X. Chen & Meurers, 2016). However, these values are rather arbitrary and might also be adapted and aligned with the respective students’ characteristics.

2.1.4 Morphological Complexity Features

Morphological complexity measures are related to type-token ratios, but instead of reflecting lexical diversity, they capture the range of different inflections used (Brenzina & Pallotti, 2019). For instance, a text with diverse inflected forms such as “writing, wrote, writes” would be deemed to have a higher morphological complexity than one that merely repeats the same form like “writing, writing, writing”. Morphological complexity measures can be calculated by taking the ratio of unique inflectional forms to the sum of all tokens of a given word class (per text, paragraph, or sentence), offering a quantitative insight into the diversity of morphological forms.

2.1.5 Syntactic Complexity Features

Dependency parsing captures the grammatical relationships between words, offering a structured representation of sentences that reveals their underlying syntactic properties and is also implemented in NLP tools like spaCy (e.g., Nivre, 2010). Dependency parsing has been used in AES context to count, for instance, the number of fragment clauses, prepositional phrases, coordinate clauses, or relative clauses (e.g., X. Chen & Meurers, 2016). This can provide valuable insight into a student's ability to compose sophisticated sentences and present complex topics and ideas.

2.1.6 Cohesion Features

Cohesion refers to the lexical linking within a text, providing the reader with a sense of flow and consistency. In general, more cohesive texts allow the reader to better follow the ideas presented and to understand links between different topics (Crossley et al., 2017). The lexical overlap between consecutive text segments, such as sentences or paragraphs, can numerically operationalize cohesion. Another measure for cohesion is, for instance, the frequent usage of connectors that structures the essay.

2.1.7 Feature Engineering Approach – Advantages and Challenges

One of the main advantages of the feature engineering approach is the theory-driven way of pre-processing the text-inherent information before feeding it into an ML model. This process of creating features provides a high degree of control over what information may be used by the algorithm. Feature engineering and feature selection might also be adapted to specific types of essays or learner populations. Furthermore, the explicit, theory-based approach of feature selection forms a prerequisite for explainable AES scores (answering how a given score is determined). Therefore, the feature-based approach has usually been combined with ML model architectures that allow a high amount of explainability, such as linear regression, logistic regression, random forests, or decision trees. This approach, however, is rarely combined with DNNs, whose hidden layers (often referred to as the “black box” of DNNs) make it difficult to understand and interpret the calculation of scores from a subjective point of view.

One primary challenge of the feature-based approach is the adequate representation of content, an element many consider pivotal, if not the most critical aspect of essays (e.g., Ramesh & Sanampudi, 2022; see also Perelman, 2014). A potential strategy to incorporate content in the realm of feature engineering without a loss of explainability is the application of bag-of-words or n-gram techniques (e.g., Ke & Ng, 2019). These methods employ either word frequencies (uni-grams) or word sequence frequencies (n-grams) to represent an essay's content in AES tasks. Typically, these approaches are used in conjunction with stop-word filtering and lemmatization. Nevertheless, the representation of content through bag-of-words or n-gram techniques remains significantly limited, reducing it merely to the occurrence of specific words or chunks. This fails to account for the contextual nature of language, wherein a word's meaning heavily relies on its surrounding lexical environment. Additionally, n-gram techniques pose a risk of feature explosion when every word and word sequence within a given text corpus is represented as an independent feature. However, a powerful alternative to process text data and encode the content of a text has been proposed in the context of DNNs, namely word embeddings.

2.2 Deep-Neural-Networks

Recent applications of DNNs in AES primarily rely on word embeddings (e.g., Beseiso & Alzahrani, 2020; Rodriguez et al., 2019; Uto et al., 2020). The basic idea of word embeddings is to represent the meaning of words with specific loadings (i.e., numerical values) on several latent dimensions. Each dimension represents a different (and largely unknown) semantic meaning. Each word has a unique set of loadings representing its meaning as a vector in an N-dimensional semantic space. Words with similar meanings have similar loading patterns (i.e., a similar vector representation in the semantic space). The number of latent dimensions N serves as a hyperparameter and can be set to arbitrary values. For instance, the embedding layer of the BERT-base model consists of 768 dimensions. Training embedding models involves a DNN that learns to predict words based on their surrounding words (i.e., the context). After extensive training on large samples of authentic texts, the final embeddings capture nuanced semantic relationships, such as syntactic and thematic similarity between words. Currently, pre-trained vector spaces, such as Word2Vec, are accessible and have been trained on extensive text corpora.

Based on such word embeddings, several text-processing DNN architectures, like recurrent-neural networks (RNNs) or long-short-term models (LSTMs), have been developed, and many of them have also been adopted for AES tasks (e.g., Alikaniotis et al., 2016; Taghipour & Ng, 2016; Uto & Okano, 2020). One further challenge in processing text arises from the fact that the meaning of a word is never fixed, but highly affected by the context in which it appears. Thus, the words’ latent representations should not be fixed either, but rather changed and adapted according to context. To tackle this issue, various advanced model architectures have been proposed, with attention mechanisms representing a groundbreaking development (Vaswani et al., 2017). Attention mechanisms facilitate the dynamic adjustment of word embeddings based on the surrounding words, enabling models to better capture the meaning of words in a given context. The implementation of such attention mechanisms in large pre-trained transformer models has recently led to significant improvements and breakthroughs in various NLP tasks. In AES, the application of transformer models has also led to state-of-the-art performances (Bai & Stede, 2022; Uto et al., 2020; Wang et al., 2022; Xue et al., 2021). On the one hand, DNN models provide a powerful approach to AES with no need for elaborated feature engineering and with the promise of capturing content much better than n-gram or other content feature approaches such as prompt-similarity analysis or topic dictionaries. On the other hand, contextual embeddings are latent representations of textual information, which complicates the goal of explainable AES.

2.3 Hybrid Models

Several recent AES applications have suggested that contextual embedding-based DNNs and feature engineering approaches should not be considered as competing (see, e.g., Ke & Ng, 2019). Instead, relevant research has indicated that they could complement each other to form a combined model (Bai & Stede, 2022; Kusuma et al., 2022). As demonstrated, for instance, by Uto et al. (2020) or Beseiso and Alzahrani (2020), such combined models, typically referred to as hybrid models, can outperform single-resource models (see also, e.g., Dasgupta et al., 2018 and Mizumoto & Eguchi, 2023). This result seems quite intuitive, as both approaches use different strategies to process text data and thus might capture different aspects of essay quality. However, the application of hybrid models has so far only been applied to holistic scoring. Using holistic scoring tasks makes it impossible to determine which approach has its merits in terms of which aspects of text quality. This limitation might be overcome with analytic AES applications.

To address our research questions, we analyzed argumentative student essays written in response to different argumentative writing prompts. Using only argumentative prompts ensured that similar aspects of text quality (also referred to as traits in the following), were relevant across prompts and corpora. Additionally, the aspect of content is generally of particular importance in argumentative essays. We used English essays from L1 and L2 learners to assess the generalizability of the results across different learner populations (see, e.g., Crossley, 2020).

3.1 Datasets

To compare the performance of different AES approaches regarding different aspects of text quality, we used four argumentative prompts from two different corpora. Two of these prompts (\({N}_{1}=1783;{N}_{2}=1800)\)stem from the widely-used ASAP competition. These two prompts are the only ones from the ASAP corpus that involve argumentative writing. Both prompts contain essays written by American L1 learners.Mathias and Bhattacharyya(2018) introduced analytic labels for the two argumentative prompts from ASAP via the so-called ASAP + + annotation. This system of annotation contains five aspects of text quality: content, organization, word choice, sentence fluency, and conventions (more details can be found in the Appendix and in Mathias & Bhattacharyya, 2018).

To expand our analyses to L2 learners, we also included two prompts (\({N}_{3}=1179;{N}_{4}=1112)\)from the MEWS corpus (Measuring Writing Skills in English as a Second Language; Fleckenstein, Keller, et al., 2020; Rupp et al., 2019). These essays were labeled analytically in the context of the TrACE project (Training Assessment Competencies in English as a Foreign Language; Keller et al., under review). The MEWS corpus contains argumentative essays written by German and Swiss L2 upper secondary school students (Keller et al., 2020). The analytic labels contain three traits: content, organization, and language quality. This dataset is available on OSF.

The four writing prompts, as well as further information and descriptive statistics, can be found in Table 1. Figure 1 represents the prompt-specific distributions of essay lengths. The essays of the L1 learners are slightly longer on average and the distribution is noticeably wider.

Table 1

*Prompts of ASAP and MEWS used in the Present Study*
Corpus	Learners	Number of labelled essays	Mean essay length in tokens (SD)	Essay traits	Score range	Prompt
ASAP	L1	1783	357.3 (115.7)	Content Organization Word Choice Sentence Fluency Conventions	1–6	More and more people use computers, but not everyone agrees that this benefits society. Those who support advances in technology believe that computers have a positive effect on people. They teach hand-eye coordination, give people the ability to learn about faraway places and people, and even allow people to talk online with other people. Others have different ideas. Some experts are concerned that people are spending too much time on their computers and less time exercising, enjoying nature, and interacting with family and friends. Write a letter to the editor of a newspaper about how computers affect society today.
ASAP	L1	1800	377.4 (153.5)	Content Organization Word Choice Sentence Fluency Conventions	1–6	Write a persuasive essay to a newspaper reflecting your views on censorship in libraries. Do you believe that certain materials, such as books, music, movies, magazines, etc., should be removed from the shelves if they are found offensive? Support your position with convincing arguments from your own experience, observations, and/or reading.
MEWS	L2	1179	303.4 (84.0)	Content Organization Language	0–6	Do you agree or disagree to the following statement: Television advertising directed toward young children (aged two to five) should not be allowed.
MEWS	L2	1112	308.0 (82.1)	Content Organization Language	0–6	Do you agree or disagree to the following statement: A teacher’s ability to relate well with students is more important than excellent knowledge of the subject being taught.

Note

The white dashed lines mark the respective mean values

3.1.1 Benchmark Scores and Rater Effects

While the ASAP + + trait scores are already provided as adjudicated true scores, the TrACE trait scores were available as double-rated raw rater data (i.e., at least two scores per essay and analytic rubric; details can be found in Appendix Table A1 and in Keller et al., under review). As proposed by Uto and Okano (2020), we employed an IRT-based rater model to account for systematic rater effects. To do so, we used the software facets (Linacre, 2023). However, to keep model complexity low, we decided to account for rater severity effects only (not, for instance, for rater centrality/extremity or consistency, but see Uto & Okano, 2020 or Robitzsch & Steinfeld, 2017, for alternative rater modeling approaches, which can also be combined with AES models).

Figure 2 shows the distributions of the analytic target labels of each essay corpus. All targets are approximately normally distributed, except for the two organization traits of MEWS 1 and 2, which are highly negatively skewed.

3.2 Model Inputs

Our guiding research questions focused on the performance of different ML approaches to AES that rely on different input resources. We created a standard DNN architecture (Fig. 3A and 3B) in tensorflow, which was used for all input types. However, it is well known that model performance depends not only on the characteristics of the input vectors (e.g., linguistic features vs. contextual embeddings vs. hybrid) but also on the model architecture (e.g., depths of the DNN) and the fit between the input vector and the model architecture. We, therefore, systematically varied the hyperparameters of the model architecture in a random search procure (e.g., Bergstra, & Bengio, 2012). This procedure is described in detail in the subsection Training procedures. In the following, we first introduce the two different types of model inputs – linguistic features and contextual embeddings.

Note

Number of layers, dropout rate, and number of units per Dense layer were varied during the random search procedure.

3.2.1 Linguistic Features

We created a set of 220 different linguistic features representing all relevant text features typically included in feature-based AES models following X. Chen & Meurers, 2016; Ke & Ng, 2019; Kumar & Boulanger, 2021; Zesch et al., 2015. Table 2 presents all feature types with examples from our feature set (a comprehensive list of all 220 features can be found in the Appendix Table A2 and on OSF). We used the Python library spaCy for POS tagging, LanguageTool for error detection, and the BNC, SUBLEX, and NGSL word lists as well as word lists from the psycholinguistic database (e.g., brown frequency list) to count easy words (i.e., frequently used words in large text corpora) and difficult words (i.e., less frequently used words in large text corpora).

3.2.2 DistilBERT’s Contextual Embeddings

Recent comparisons and reviews of AES applications employing pre-trained transformer models indicated that performance hardly increases when these models are fine-tuned (Mayfield & Black, 2020; Rodriguez et al., 2019). Additionally, runtime and computational demands largely increase due to the extensive fine-tuning processes when using such large models. To keep runtime low, we followed suggestions by Mayfield and Black (2020) and used the distilled version of BERT (DistilBERT; Sanh et al., 2019). Furthermore, we kept the DistilBERT layers frozen (i.e., not trainable). However, we supplemented these non-trainable DistilBERT layers with an essay-level maximum pooling layer. In doing so, we received a contextual embedding vector of length 768 for each essay, which served as the input vector for our AES architecture (Fig. 3B).

3.3 Training Procedure and Model Architectures

3.3.1 Main Experiments

To train and evaluate our models, we followed a five-fold cross-validation strategy. For the ASAP datasets, we employed the predefined splits introduced by Ke & Ng (2016) that imply a 60-20-20 split in training, validation, and test data. These predefined splits had also been used for trait scoring of the ASAP + + dataset by Mathias & Bhattacharyya (2018, 2020).

For the MEWS corpus, we also employed five-fold cross-validation. However, because of the considerably smaller datasets in MEWS, we decided to use 70% of each dataset as the training set, 10% of the data as the validation set, and 20% as the test data in each fold. To find the best epoch for each run, we used an early-stopping callback function that tracked the validation loss. The model showing the best performance on the validation set across folds was finally used for evaluation with the test data.

All DNN model architectures were set up in tensorflow (python code can be found on the OSF repository). We designed our AES models as regression models. The values indicating the trait-specific essay qualities are ordinally scaled. Since they range, for instance, from 1 = high quality to 6 = low quality and thus can be assumed to be continuous, we decided against classification approaches (see Beseiso & Alzahrani, 2020, for a comparison of classification vs. regression AES models). Thus, we used a single unit with linear activation in the output layer and the mean squared error (MSE) as the loss function in all DNN architectures.

Table 2

*Feature Types with Examples from the Feature Set*
Feature type	Number of features	Example features
Length features	15	Number of words Number of paragraphs Number of sentences
Occurrence features	30	Number of nouns Number of formal words Number of unique nouns
Error features	9	Error ratio Grammar error ratio Punctuation error ratio
Morphological complexity features	11	Number of finite verbs Number of non-third person singular verb Ratio of comparatives
Cohesion features	8	Number of connectors Number of unique connectors Mean noun overlap with previous sentence
Readability features	15	Flesch score Integration cost Average number of sentences per 100 words
Lexical diversity features	59	Type-token ratio Type-token ratio lexical words Global edit distance
Lexical sophistication features	62	BNC easy word ratio SUBLEX easy word ratio Brown Frequencies lexical word ratio
Syntactic complexity features	6	Number of subordinate clauses Number of fragment sentences Mean tokens before main verb

We employed the Adam optimizer and the Mean Squared Error (MSE) loss function. For each trait of each prompt, different models were trained varying the type of input (features vs. embeddings vs. hybrid). In addition, we systematically changed the hyperparameters defining the model architectures to ensure valid comparisons across the different AES approaches. In doing so, we used a random search procedure (Bergstra & Bergio, 2012) varying the hyperparameters, learning rate (5e^− 3, 1e^− 3, 5e^− 4, 1e^− 4), number of dense layers (0, 1, 2) units per dense layer (64, 128, 256) and dropout rates (0.2, 0.3, 0.4, 0.5, 0.6). During the random search, we tested 50% of this parameter space.

3.3.2 Hybrid Model

The hybrid architecture used both input resources – linguistic features and essay-level contextual embeddings from DistilBERT. In the first step, the two types of inputs were separately processed through additional Dense and Dropout layers in two parallel model parts (Fig. 3C). The hyperparameters of the feature input part were determined by the best performing models from the corresponding feature-based models. However, it turned out that the hybrid architecture was much more difficult to optimize and that additional Dense Layers hardly improved model performance of the embedding-based models (see Appendix Table A1). Therefore, we decided to employ a reduced second parallel model part for the embeddings. This second part using the embedding input was only fed through one additional dropout layer and then directly into the concatenation layer (Fig. 3C). As the first part of the model architecture was fixed, we only varied the dropout rate for the second (i.e., the embedding input) part of the model (0.3, 0.4, 0.5, 0.6, 0.7) and the learning rate of the Adam optimizer (5e^− 3, 1e^− 3, 5e^− 4, 1e^− 4). The concatenation layer was incorporated as the last stage of the hybrid model before the (single-unit) output layer. This implies that interactions between linguistic features and contextual embeddings were enabled in this hybrid architecture (Fig. 3C).

3.3.3 Ablation Tests

To answer RQ 3, we ran two types of ablation tests to gain more insights into the interplay of contextual embeddings and linguistic features. In the first series of ablation tests, we always started with the embedding-based DNN. From there on, we ran a reduced form of the hybrid model, supplementing the contextual embeddings with one feature group. In doing so, we distinguished nine types of features (see Table 2). Thus, we reran each trait- and prompt-specific model nine times using the same cross-validation procedure as in the main experiment.

For the second series of ablation tests, we took the full hybrid model as the starting point. In an iterative process, we reran each model nine times, each time removing a different feature group from the input. Again, we applied the same cross-validation procedure as in the main experiments.

3.3.4 Cross-Prompt Scoring

For cross-prompt scoring (RQ 3), we relied on the hyperparameter settings of the respective best-performing model of each prompt and trait. Again, we employed the cross-validation procedure outlined above but used the complete data from the respective other prompt within a given corpus as the test data instead.

3.3.5 Two Linear Regression Baselines

We additionally compared the three DNNs to two simpler baseline models from Scikit-learn. In doing so, we (1) combined a linear ridge regression with an N-gram-vectorizer with stop word filtering as input and (2) a linear ridge regression with our feature set as input (which will be described in the following sub-section). The n-gram baseline models allowed us to disentangle whether and to what extent more complex text processing, as provided by pre-trained transformer models using embeddings and attention mechanisms, are superior to simpler text processing using the classical n-gram approach in automated essay trait scoring tasks. Furthermore, the feature baseline model allowed us to explore whether (and to what extent) complex model architectures (i.e., DNN architectures) are superior compared to simple linear models in automated essay trait scoring. To fit the baseline models, we used the same cross-validation procedure and a grid search approach varying n-gram range (unigrams, bigrams, trigrams) and the alpha parameter of the ridge regression (1e^− 4, 1e^− 3, 1e^− 2, 1e^− 1, 1, 10, 100). The best-performing model on the validation sets across folds was evaluated with the test data.

3.4 Evaluation Metrics

We used quadratic weighted kappa (QWK) as evaluation metric. QWK is the most frequently used metric in AES tasks and had also been reported in the course of previous analyses of the ASAP + + datasets (Mathias & Bhattacharyya, 2018, 2020).

A QWK value of one indicates perfect agreement between predicted scores and benchmarks, a value of zero corresponds to a chance agreement and a negative value represents systematic disagreement, with minus one as the extremum corresponding to complete disagreement.

To also compare model- and trait-specific performance across prompts, we employed average QWK (\(\stackrel{-}{\text{Q}\text{W}\text{K}}\); see, e.g., Taghipour & Ng, 2016) as well as mean QWK differences (\(\stackrel{-}{\varDelta \text{Q}\text{W}\text{K}}\)). However, as averaging QWK across different scales is notoriously problematic (e.g., Doewes et al., 2023), we also used average Pearson Correlation Coefficient (PCC) as an additional metric to compare model performance across traits.

T-Tests

To examine whether one approach (feature vs. embedding vs. hybrid) performed significantly better than the other approaches, we used pairwise T-tests. T-tests were calculated based on the QWK scores across datasets and traits (e.g., Uto et al., 2020). We employed one-sided testing for comparisons against the baseline models.

Table 3 and Table 4 present the trait-specific test data performances in QWK after the random search cross-validations of our main experiments for ASAP and MEWS, respectively. The best-performing hyperparameter settings for each model can be found in the Appendix (Table A3).

4.1 Features versus contextual embeddings

In RQ 1, we aimed to compare the performance of a feature-based and a contextual embedding-based model. In Tables 3 and 4, the QWKs of the feature- and the embedding-based DNN predictions on the test data can be found in the (prompt-specific) third and fourth rows, respectively. For comparison, the same results measures in PCC can be found in the Appendix (Table A4 and A5). Across traits and prompts, the feature-based model outperformed the embedding-based model in 11 out of 16 cases. However, the performance of both models was similar. The feature-based model achieved an overall average QWK of \({\stackrel{-}{\text{Q}\text{W}\text{K}}}_{features}= .614\) (\(\stackrel{-}{\text{P}\text{C}\text{C}}=.673\)), and the embedding-based model achieved an overall average of \({\stackrel{-}{\text{Q}\text{W}\text{K}}}_{embeddings}= .563\) (\(\stackrel{-}{\text{P}\text{C}\text{C}}=.625\)). The T-test across traits and prompts implied no significant differences between the two approaches (p = .345). In addition, it became apparent that the embedding-based model fell short, especially in the trait organization of the two MEWS prompts (QWK differences of \({{\Delta }\text{Q}\text{W}\text{K}}_{MEWS 1}=.31\) and \({{\Delta }\text{Q}\text{W}\text{K}}_{MEWS 2}=.33\), respectively), while performing almost equally well across all other traits (and prompts). Furthermore, the same pattern for the trait organization was evident in ASAP 2 but not in ASAP 1. Nevertheless, this finding seems plausible as (even contextual) embeddings might not carry information about an essay's (meta-) structure, which is relevant for human annotators judging student essays. In contrast, such information, for instance the number of paragraphs, is represented in the feature set.

Table 3

*Quadratic Weighted Kappa Across ASAP Essay Traits*
	Content	Organization	Word choice	Sentence fluency	Conventions
ASAP 1
N-Gram reg.	.536	.511	.515	.491	.481
Feature reg.	.678	.635	.672	.636	.623
Feature DNN	.693	.657	.690	.645	.639
DistilBERT	.713	.666	.677	.675	.666
Hybrid	.743	.672	.673	.681	.648
M. & B. (2018)¹	.67	.60	.64	.62	.61
M. & B. (2020)²	.703	.664	.675	.648	.638
ASAP 2
N-Gram reg.	.552	.541	.548	.396	.402
Feature reg.	.637	.658	.686	.672	.684
Feature DNN	.664	.662	.698	.688	.699
DistilBERT	.651	.591	.686	.674	.685
Hybrid	.688	.686	.715	.736	.685
M. & B. (2018)¹	.61	.58	.60	.59	.62
M. & B. (2020)²	.617	.623	.630	.603	.601

Note. The best performing model for each trait and prompt is printed in bold. reg. = ridge regression.

¹ Performance benchmarks in terms of QWK from Mathias & Bhattacharyya (2018).

² Performance benchmarks in terms of QWK from Mathias & Bhattacharyya (2020).

Table 4

*Model Performances Across MEWS Essay Traits*
	Content	Organization	Language quality
MEWS 1 (AD)
N-Gram reg.	.330	.142	.442
Feature reg.	.423	.509	.662
Feature DNN	.380	.482	.648
DistilBERT	.396	.171	.556
Hybrid	.463	.521	.698
Human Threshold¹	.66	.68	.71
MEWS 2 (TE)
N-Gram reg.	.289	.167	.464
Feature reg.	.435	.507	.654
Feature DNN	.377	.517	.688
DistilBERT	.355	.192	.667
Hybrid	.376	.528	.723
Human threshold¹	.52	.77	.72

Note. The best performing model for each trait and prompt is printed in bold. reg. = ridge regression.

¹ Human rater agreement in terms of QWK.

Beside these differences in the trait organization, no systematic superiority of one approach was found across traits. For ASAP, QWKs even implied more systematic differences between prompts than between traits. While the embedding-based DNN outperforms the feature-based DNN in four out of five traits of ASAP 1, the feature-based DNN outperforms the embedding-based DNN in all traits of ASAP 2.

Furthermore, we compared these two models against two simpler baseline models. These baseline models used a ridge regression with n-grams versus the feature input. The prompt-specific first and second rows of Tables 3 and 4 Model represent the test data performance for each trait. The comparison of our DNN target models with the n-gram baseline model revealed that both target models consistently outperformed. The one-sided T-tests indicated significant performance advantages (\({p}_{features}\) < .001 and \({p}_{embeddings}\) = .010, respectively). However, comparisons to the feature-based linear regression baseline only partly revealed advantages for the target models, and T-tests were not significant (\({p}_{features}=.818\), \({p}_{embeddings}=.465\)). The feature-based linear baseline model even performed consistently above the embedding-based DNN across all traits of the MEWS prompts (\({\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{embeddings vs. baseline 1}=-.01\)). The feature-based DNN also fell short in four out of six traits in the MEWS prompts compared to the feature baseline (\({\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{features vs. baseline 1}=-.02\)). Regarding the two ASAP prompts, however, the two DNN approaches almost consistently performed above the feature-based baseline. However, the differences were small (\({\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{features vs. baseline 2}=.02\) and \({\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{embeddings vs. baseline 2}=.01\)). These relatively small advantages imply that nonlinearities and interactions among features (as well as embeddings) were of minor importance when scoring the essay traits (see also Table A3 in the Appendix). This finding also matches expectations as raters typically follow strict judgment guidelines for benchmark scoring. Such guidelines are almost exclusively based on linear, additive scoring rules.

4.2 Hybrid Architecture

The goal of RQ 2 was to compare a hybrid model architecture containing both feature types – linguistic features and contextual embeddings – to the single-resource models. The trait-specific test set performance of the hybrid model is represented in the fifth row of each prompt in Tables 3 and 4. The hybrid model achieved an average performance of \({\stackrel{-}{\text{Q}\text{W}\text{K}}}_{hybrid}= .640\) (\(\stackrel{-}{\text{P}\text{C}\text{C}}=.681\)). As expected, the hybrid model outperformed the single-resource models in most traits (12 out of 16) across prompts. However, the one-sided T-test comparing the performance of the feature-based model to the hybrid was not significant (\({\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{hybrid-features}=.03\), \(p= .507\)). The difference between the embedding-based DNN and the hybrid also failed significance (\({\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{hybrid-embeddings}=.08\), \(p= .156\)). Despite the non-significant results, the hybrid consistently proved to perform better than the single-resource models across prompts and traits. This finding meets expectations and is in line with recent findings from holistic scoring (Bai & Stede, 2022; Uto et al., 2020). Furthermore, it implies that both types of input indeed capture partially different text information relevant for scoring essay traits. Thus, both types of input complemented each other to a certain extent, even when most of the text information relevant for assessing essay traits seemed to be captured by both input types. This is plausible when considering that both single- resource models already achieved high QWK in almost all traits and prompts.

A closer look at the different traits revealed that the largest average gains comparing the feature model to the hybrid were apparent in the content and language traits (\({\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{content}= .04, {\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{language}= .04)\). However, the advantages of the hybrid model were only slightly smaller for the organization traits on average (\({\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{organization}= .02\)). For this comparison, the three language traits word choice, sentence fluency, and conventions, used in the ASAP + + analytic scoring rubric, were all used as measures for language (to match the less detailed dimensionality of the MEWS rubric). However, a closer look at these three language traits in ASAP + + revealed that the performance on the trait conventions was least likely to benefit from the combined input of the hybrid model.

These findings are not surprising as the employed features hardly capture content-related information, and the contextual embeddings were a decisive contribution in this respect. Therefore, the most considerable performance gains had been expected for trait content. However, the powerful properties of contextual embeddings regarding language and writing style have also been repeatedly proven in recent years. In this context, the successful interplay of features and contextual embeddings for the language traits also seems to be expectable.

A closer look at the trait-specific gains comparing the embeddings-based and the hybrid model revealed the highest performance gains for the organization traits (\({\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{organization}= .16\), \({\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{content}= .02, {\stackrel{-}{{\Delta }\text{Q}\text{W}\text{K}}}_{language}= .02\)). As mentioned above, this result also corresponds to our expectations, since embeddings hardly capture any information about the meta-structure of essays.

4.3 Ablation Tests

To shed more light on the interplay of contextual embeddings and specific feature types when scoring certain essay traits, we ran two series of ablation tests. In the first series, we iteratively supplemented the embedding-based DNN with one feature type. In doing so, we tracked the performance gains of these extended models compared to the DNNs that only relied on embeddings. These comparisons allowed us to explore essay characteristics that could hardly be covered by the contextual embeddings but by the appropriate features, thus improving model performance. Figure 4 presents the respective results in terms of QWK change (i.e., \({\Delta }\text{Q}\text{W}\text{K}\)) for each trait and prompt. Across prompts, performance gains on the content traits appeared most often when the morphological complexity features supplemented the contextual embeddings. Supplementing the contextual embedding input, length features turned out to be most important for the organization traits. Furthermore, lexical sophistication, error, and occurrence features were most likely to achieve performance advantages across the language traits. Again, these findings seem reasonable. Length features describing the meta-structure of the essays provide structural information that embeddings cannot capture. In the context of the assessment of language traits, text characteristics, such as spelling or grammar errors, are also no natural ingredients of embeddings but are undoubtedly important to judge the language quality of student essays. The same applies to lexical sophistication and occurrence features, which describe aspects of language quality inaccessible by contextual embeddings. An interesting finding is that morphological complexity features were most relevant for the content traits. On the one hand, morphological complexity might not carry content-related information. On the other hand, comparatives and superlatives might be highly relevant inflections in argumentative writing. For example, these inflections can be relevant when different arguments are contrasted or weighted to draw conclusions. Students’ ability to contrast and weight is essential for good argumentative writing.

In the second series of ablation tests, we explored the unique contribution of single feature types. We used the complete hybrid architecture and iteratively removed one of the nine feature types. Figure 5 shows the performance drops for each trait- and prompt-specific model and the nine re-analyses. Consistent performance drops across prompts indicate that a particular feature type contains trait-relevant information and that the contextual embeddings and the other features do not capture this information. The results imply that the performance of models for the content traits dropped across all four prompts when readability and syntactic complexity features were removed. Therefore, both seem to contain unique information relevant to the assessment of content that the other feature types or contextual embeddings could not captured. Consistent performance drops were apparent when removing cohesion and, again, readability features from the trait models for organization. When removing occurrence, length, or error features, performance almost consistently decreased across the language traits. Throughout traits, length features, in particular, emerged as an essential feature type capturing important and unique text characteristics for judging the student essays.

Figure 4

Ablation Tests Tracking Highest Performance Gains ( \(\varDelta QWK\) ) by Adding one Type of Features to the Embedding-based Models

Ablation Tests Tracking Highest Performance Drops ( \(\varDelta QWK\) ) by Removing one Type of Features from the Hybrid Models

4.4 Cross-Prompt Scoring

Tables 5 and 6 present the cross-prompt performance of the DNN models trained on the ASAP and MEWS corpora respectively. The analyses show that across models and traits, the performance drop (\(\varDelta \text{Q}\text{W}\text{K}\)) when comparing the within-prompt performance to the cross-prompt performance was between − .01 and − .30 (\(\varDelta \text{P}\text{C}\text{C} \text{r}\text{a}\text{n}\text{g}\text{e}=[-.15;-.01]\)). For the test data of the MEWS 2 organization trait, the embedding-based model trained on MEWS 1 even slightly outperformed the embedding-based model trained on MEWS 2 (\(\varDelta \text{Q}\text{W}\text{K}=.02;\) i.e., the cross-prompt performance was better than the within-prompt performance). However, the embedding-based models generally worked very poorly for the MEWS organization trait.

Regarding the models trained on the MEWS prompts and ASAP 1, the feature-based models outperformed the embedding-based models in cross-prompt performance across traits. However, the embedding-based models trained on ASAP 2 consistently outperformed the feature-based model in cross-prompt performance. T-tests revealed no significant cross-prompt scoring advantages for the feature-based DNN (\({\stackrel{-}{\text{Q}\text{W}\text{K}}}_{features}= .49\); \({\stackrel{-}{\text{P}\text{C}\text{C}}}_{features}= .52\)) compared to the embedding-based model (\({\stackrel{-}{\text{Q}\text{W}\text{K}}}_{embeddings}= .42; {\stackrel{-}{\text{P}\text{C}\text{C}}}_{embeddings}= .45\)) (p = .131). Unsurprisingly, the results of these cross-prompt performance comparisons are in line with the within-prompt patterns (see Tables 3 and 4). However, adjusting for the within-prompt performance still implies slight advantages for the feature approach (\({\stackrel{-}{\varDelta \text{Q}\text{W}\text{K}}}_{features}= -.12\), \({\stackrel{-}{\varDelta \text{Q}\text{W}\text{K}}}_{embeddings}= -.15\); \({\stackrel{-}{\varDelta \text{P}\text{C}\text{C}}}_{features}= -.09\), \({\stackrel{-}{\varDelta \text{P}\text{C}\text{C}}}_{embeddings}= -.11\)).

Table 5

*ASAP Cross-Prompt Scoring Performance in Terms of QWK (and Comparing Cross-Prompt and Within-Prompt Performance)*
	Content		Organization		Word choice		Sentence fluency		Conventions
	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2
Training: ASAP 1
Features	.69	.56 (-.13)	.66	.52 (-.14)	.69	.55 (-.14)	.65	.56 (-.09)	.64	.54 (-.10)
DistilBERT	.71	.60 (-.11)	.67	.56 (-.09)	.68	.58 (-.10)	.68	.62 (-.16)	.67	.56 (-.11)
Hybrid	.74	.61 (-.13)	.67	.50 (-.17)	.67	.51 (-.16)	.68	.63 (-.05)	.65	.56 (-.09)
	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1
Training: ASAP 2
Features	.66	.54 (-.12)	.66	.51 (-.15)	.70	.54 (-.16)	.69	.54 (-.15)	.70	.50 (-.20)
DistilBERT	.65	.43 (-.22)	.59	.36 (-.23)	.69	.46 (-.23)	.67	.50 (-.27)	.69	.49 (-.20)
Hybrid	.69	.43 (-.26)	.69	.45 (-.24)	.72	.48 (-.24)	.74	.51 (-.23)	.69	.51 (-.18)

Note. Differences between cross-prompt and within-prompt performance are represented in brackets (\(\varDelta\)QWK).

Table 6

*MEWS Cross-Prompt Scoring Performance in Terms of QWK (and Comparing Cross-Prompt and Within-Prompt Performance)*
	Content		Organization		Language quality
	Test: MEWS 1	Test: MEWS 2	Test: MEWS 1	Test: MEWS 2	Test: MEWS 1	Test: MEWS 2
Training: MEWS 1
Features	.38	.22 (-.16)	.48	.40 (-.08)	.65	.56 (-.09)
DistilBERT	.40	.12 (-.28)	.17	.16 (-.01)	.56	.47 (-.09)
Hybrid	.46	.16 (-.30)	.52	.43 (-.09)	.70	.59 (-.11)
	Test: MEWS 2	Test: MEWS 1	Test: MEWS 2	Test: MEWS 1	Test: MEWS 2	Test: MEWS 1
Training: MEWS 2
Features	.38	.33 (-.05)	.52	.49 (-.03)	.69	.62 (-.07)
DistilBERT	.36	.18 (-.18)	.19	.21 (.02)	.67	.54 (-.13)
Hybrid	.38	.30 (-.08)	.53	.46 (-.07)	.72	.66 (-.06)

Note. Differences between cross-prompt and within-prompt performance are represented in brackets (\(\varDelta\)QWK).

The hybrid model also outperformed the embeddings-based approach regarding cross-prompt scoring but also just fell short of the feature-based model on average (\({\stackrel{-}{\text{Q}\text{W}\text{K}}}_{hybrid}= .48; {\stackrel{-}{\text{P}\text{C}\text{C}}}_{hybrid}= .52\)). Surprisingly, adjusting for the within-prompt performance, the hybrid model even performed worse than both single approaches (\({\stackrel{-}{\varDelta \text{Q}\text{W}\text{K}}}_{hybrid}= -.16\), \({\stackrel{-}{\varDelta \text{P}\text{C}\text{C}}}_{hybrid}= -.12\)). However, T-tests revealed that these differences were not statistically significantly different from zero.

Furthermore, we also explored trait-specific cross-prompt performance losses. Across models, the most remarkable drop in model performance from within-prompt to cross-prompt scoring was revealed for the content traits (\({\stackrel{-}{\varDelta \text{Q}\text{W}\text{K}}}_{content}= -.17\); \({\stackrel{-}{\varDelta \text{P}\text{C}\text{C}}}_{content}=-.09\)). The comparably smallest drop was apparent for the organization traits (\({\stackrel{-}{\varDelta \text{Q}\text{W}\text{K}}}_{language}=-.10\); \({\stackrel{-}{\varDelta \text{P}\text{C}\text{C}}}_{language}=-.05\)). This finding is again in line with expectations, as the topics changes between prompts and thus also feature importance might vary depending on the prompt. In addition, indicators for language and organizational text quality might be more stable across different writing prompts.

In the present study, we compared different supervised ML models for automated trait scoring of student essays using four argumentative prompts from L1 and L2 upper secondary students. Results implied small performance advantages for trait-specific models based on an extensive set of features compared to models based on contextual embeddings that stem from the pre-trained transformer DistilBERT. The differences between the two approaches were particularly evident in the organization traits. However, since contextual embeddings do not require extensive feature engineering, this approach can serve as a valuable baseline model for essay trait scoring, performing significantly better than an n-gram baseline model in our experiments. The hybrid approach, using both input types, consistently outperformed the two single resource models across traits. Ablation tests revealed that the performance of the embedding-based models was consistently enhanced in content assessment when combined with morphological complexity features. In addition, performance gains were consistently achieved in organization assessment when combined with length features and in the assessment of language traits when combined with lexical complexity, error, and occurrence features. The feature-based models exhibited slight advantages in cross-prompt scoring over the embedding-based and hybrid models. When comparing trait-specific cross-prompt and within-prompt performance, losses were slightly larger in trait content across ML approaches and prompts compared to organization and language traits.

5.1 Limitations and Future Research

Despite the various models considered and the extensive experiments run, the present study also has limitations that imply several directions for future research.

First, even considering L1 and L2 learners’ essays, the present investigation is limited to upper high school / secondary school students of three countries (American L1 students and German and Swiss L2 students). The performance of different models might vary with learner populations and should be extended, for instance, to primary school (e.g., Trüb et al., under review) or higher education contexts (e.g., Beseiso et al., 2021).

Second, pooling contextualized embeddings on the essay level indeed implies a loss of information that is captured by transformer models. This essay-level pooling approach is only one possibility of using transformer models in AES tasks (see, e.g., Xue et al., 2021). Future studies might explore transformer models’ potential, for example, for feature engineering. Valuable strategies might be to use section-level embeddings or cosine similarities with prompts or best-practice solutions (see, Bexte et al., 2022, 2023). Furthermore, sentence-level embeddings can be used for calculating cohesion measures.

Third, there are other essential topics in AES applications such as fairness and algorithms’ vulnerability to cheating behavior. Future studies could compare feature-based and embedding-based AES models regarding fairness and cheating behavior in trait assessment (see, e.g., Ding et al., 2020; also Bai & Stede, 2022).

Fourth, performance of supervised ML models highly depends on the number of training examples. This might explain to a certain extent the performance differences between ASAP and MEWS prompts in our experiments. However, further systematic experiments varying the amount of training data across ML approaches and prompts would be needed to quantify the relevance of training data size. Such investigations might also consider active learning approaches to minimize the required number of training examples (e.g., Firoozi et al., 2023; Horbach & Palmer, 2016).

Fifth, the power of large language models (LLMs; i.e., extensively pre-trained generative transformer models such as GPT-4) have recently entered the AI world. They also offer new possibilities to the field of AES applications. First approaches have, for instance, explored their potential to be included in an LLM-based hybrid model (Mizumoto & Eguchi, 2023).

5.2 Practical Implications

The present study has several implications, especially for creating feedback tools and tutoring systems in the context of student essay evaluation. In our experiments, the feature engineering approach performed as well or better than the embedding approach across essay traits. Since the feature approach can provide more explainability and, thus, more concrete practical information for student feedback, we consider the feature approach as the most promising alley for implementing real-life AES tools. However, in AES applications, an embedding-based DNN approach can serve as a valuable baseline that is easy to set up as no feature engineering is required. Furthermore, our experiments imply that a hybrid approach can increase performance compared to single-resource models. Feature engineering approaches can benefit from embedding-based model inputs, especially scoring content and language quality traits.

In future applications, the hybrid approaches could be chosen for the summative assessment of essay traits if a sufficiently large amount of training data is available. The feature engineering approach, on the other hand, could be used primarily for formative feedback due to its explainability.

Table A1

*Interrater Agreement of the TrACE Analytic Annotation of the MEWS Corpus*
	Interrater correlation Mean	Interrater correlation Median	Weighted Cohen’s Kappa Mean	Weighted Cohen’s Kappa Median
Language quality	0.75	0.77	0.72	0.74
MEWS 1 (AD)	0.73	0.73	0.71	0.71
MEWS 2 (TE)	0.75	0.76	0.72	0.73
Organization	0.72	0.73	0.72	0.73
MEWS 1 (AD)	0.68	0.70	0.68	0.70
MEWS 2 (TE)	0.76	0.76	0.77	0.76
Content	0.64	0.64	0.61	0.61
MEWS 1 (AD)	0.68	0.70	0.66	0.67
MEWS 2 (TE)	0.56	0.57	0.52	0.52

Table A2

*Feature Types*
Feature type	Features
Length Features	1. Nb. of words 2. Nb. of unique tokens 3. Nb. of letters 4. Nb. of sentences 5. Nb. of paragraphs 6. Nb. of syllables
Occurrence Features	1. Nb. of nouns 2. Nb. of verbs 3. Nb. of adjectives 4. Nb. of conjunctions 5. Nb. of adverbs 6. Nb. of possessive pronouns 7. Nb. of unique nouns 8. Nb. of unique verbs 9. Nb. of unique adjectives 10. Nb. of unique adverbs 11. Nb. of “wh”-adverbs 12. Nb. of determiners 13. Nb. of lexical words 14. Nb. of unique lexical words 15. Nb. of foreign words 16. Nb. of stopwords 17. Nb. of formal words 18. Nb. of deictic words 19. Nb. of symbols 20. Nb. of punctuations
Error Features	1. Nb. of errors 2. Nb. of grammar errors 3. Nb. of punctuation errors 4. Nb. of typos errors 5. Ratio Nb. of errors / words 6. Ratio Nb. of grammar errors / words 7. Ratio Nb. of punctuation errors / words 8. Ratio Nb. of typos errors / words
Morphological complexity	1. Nb. of comparatives 2. Nb. of superlatives 3. Nb. of finite verbs 4. Nb. of non-third person singular verb 5. Nb. of infinitive verbs 6. Ratio of comparatives 7. Ratio of superlatives 8. Ratio of finite verbs 9. Ratio of non-third person singular verb 10. Ratio of infinitive verbs
Cohesion	1. Nb. of connectors 2. Nb. of unique connectors 3. Mean noun overlap with previous sentence 4. Mean verb overlap with previous sentence 5. SD noun overlap with previous sentence 6. SD verb overlap with previous sentence 7. Ratio of connectors 8. Ratio of unique connectors
Readability	1. Flesch Score 2. Dale-Chall Score 3. Gunning-Flog Index 4. Integration Cost 5. Average nb. of sentences per 100 words 6. Average nb. of words per 100 letters 7. Words per sentences 8. Type-token ratio easy words 9. Type-token ratio easy nouns 10. Type-token ratio easy verbs 11. Type-token ratio easy adverbs 12. Type-token ratio easy adjectives 13. Integration cost 14. Heylinghen-F-Score
Lexical Diversity	1. Type-token ratio 2. Type-token ratio nouns 3. Type-token ratio verbs 4. Type-token ratio adjectives 5. Type-token ratio conjunctions 6. Type-token ratio lexical words 7. Type-token ratio functional words 8. Type-token ratio deictic words 9. Type-token ratio “wh”-adverbs 10. Type-token ratio infinitive verbs 11. Global edit distance
Lexical Sophistication	1. BNC easy words 2. NGSL easy words 3. SUBLEX easy words 4. BNC easy nouns 5. NGSL easy nouns 6. SUBLEX easy nouns 7. BNC easy verbs 8. NGSL easy verbs 9. SUBLEX easy verbs 10. Brown Frequencies token 11. Brown Frequencies type 12. Brown Frequencies lex. words 13. Brown Frequencies func. words 14. Thorndike Frequencies token 15. Thorndike Frequencies type 16. Thorndike Frequencies lex. words 17. Thorndike Frequencies func. words 18. MRC Frequencies token 19. MRC Frequencies type 20. MRC Frequencies lex. words 21. MRC Frequencies func. words
Syntactic complexity	1. Nb. subordinate clauses 2. Nb. fragment sentences 3. Nb. of noun phrases 4. Mean tokens before main verb 5. Nb. of complex noun phrases 6. Nb. of unknown constituents 7. Nb. of postnominal modifiers per complex noun phrase 8. Integration cost 9. Ratio subordinate clauses 10. Ratio fragment sentences 11. Ratio of noun phrases 12. SD tokens before main verb 13. Ratio of complex noun phrases 14. Ratio of unknown constituents 15. Ratio of postnominal modifiers per complex noun phrase

Note. For all features we additionally calculated several ratios and distribution parameters (i.e., means and standard deviations) for several features.

Table A3

*Best Hyperparameter Settings for each Model, Trait, and Dataset*
	Trait	Learning rate	Number of Layers	Number of Units	Dropout Rate
ASAP 1
Features	Content	0.001	1	128	0.3
	Organization	0.005	1	64	0.4
	Word Choice	0.0005	1	256	0.4
	Sentence Fluency	0.001	2	64	0.2
	Conventions	.001	1	256	0.3
DistilBERT	Content	0.001	0	128	0.3
	Organization	0.001	0	256	0.4
	Word Choice	0.001	0	128	0.4
	Sentence Fluency	0.001	0	128	0.4
	Conventions	0.001	0	128	0.4
Hybrid	Content	0.001	1	128	0.3/0.4
	Organization	0.005	1	64	0.4/0.4
	Word Choice	0.0005	1	256	0.4/0.4
	Sentence Fluency	0.001	2	64	0.2/0.4
	Conventions	0.001	1	256	0.3/0.4
ASAP 2
Features	Content	0.001	1	128	0.3
	Organization	0.0005	2	128	0.2
	Word Choice	0.001	2	256	0.3
	Sentence Fluency	0.0005	2	256	0.3
	Conventions	0.0005	2	256	0.4
DistilBERT	Content	0.001	0	256	0.4
	Organization	0.0005	2	256	0.3
	Word Choice	0.001	0	64	0.4
	Sentence Fluency	0.001	0	64	0.4
	Conventions	0.001	0	256	0.4
Hybrid	Content	0.001	1	128	0.3/0.4
	Organization	0.0005	2	128	0.2/0.4
	Word Choice	0.001	2	256	0.3/0.4
	Sentence Fluency	0.0005	2	256	0.3/0.4
	Conventions	0.0005	2	256	0.4/0.4
MEWS 1 (AD)
Features	Content	0.0005	1	128	0.4
	Organization	0.0005	1	256	0.3
	Language	0.0005	2	128	0.3
DistilBERT	Content	0.0005	2	64	0.4
	Organization	0.005	2	64	0.4
	Language	0.005	2	256	0.2
Hybrid	Content	0.001	1	128	0.4/0.7
	Organization	0.001	1	256	0.2/0.7
	Language	0.001	1	64	0.3/0.4
MEWS 2 (TE)
Features	Content	0.001	1	256	0.2
	Organization	0.0005	2	256	0.3
	Language	0.0005	2	128	0.3
DistilBERT	Content	0.005	2	64	0.3
	Organization	0.001	2	256	0.3
	Language	0.001	0	256	0.2
Hybrid	Content	0.001	1	256	0.2/0.7
	Organization	0.001	2	256	0.3/0.7
	Language	0.001	2	128	0.3/0.5

Table A4

*Pearson Correlation Coefficients for ASAP Within- and Cross-Prompt Performance*
	Content		Organization		Word Choice		Sentence Fluency		Conventions
	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2
Training: ASAP 1
Features	.711	.659 (-.05)	.674	.653 (-.02)	.709	.646 (-.06)	.665	.631 (-.04)	.667	.640 (-.03)
DistilBERT	.722	.634 (-.09)	.678	.644 (-.04)	.690	.655 (-.03)	.686	.654 (-.04)	.677	.643 (-.04)
Hybrid	.730	.654 (-.08)	.679	.612 (-.07)	.682	.624 (-.06)	.702	.665 (-.03)	.681	.664 (-.02)
	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1	Test: ASAP 2	Test: ASAP 1
Training: ASAP 2
Features	.688	.688 (-.00)	.687	.658 (-.02)	.719	.644 (-.08)	.707	.644 (-.07)	.722	.620 (-.10)
DistilBERT	.671	.658 (-.01)	.638	.599 (-.04)	.703	.639 (-.06)	.692	.636 (-.05)	.705	.613 (-.10)
Hybrid	.667	.666 (-.00)	.679	.648 (-.03)	.712	.649 (-.06)	.704	.641 (-.06)	.706	.632 (-.08)

Note. Differences between cross-prompt and within-prompt performance are represented in brackets (\(\varDelta\)PCC).

Table A5

*Pearson Correlation Coefficients for MEWS Within- and Cross-Prompt Performance*
	Content		Organization		Language quality
	Test: MEWS 1	Test: MEWS 2	Test: MEWS 1	Test: MEWS 2	Test: MEWS 1	Test: MEWS 2
Training: MEWS 1
Features	.440	.318 (-.12)	.556	.495 (-.06)	.680	.671 (-.01)
DistilBERT	.490	.260 (-.23)	.267	.243 (-.03)	.605	.589 (-.02)
Hybrid	.510	.381 (-.23)	.593	.559 (-.03)	.728	.708 (-.02)
	Test: MEWS 2	Test: MEWS 1	Test: MEWS 2	Test: MEWS 1	Test: MEWS 2	Test: MEWS 1
Training: MEWS 2
Features	.432	.388 (-.04)	.579	.569 (-.01)	.718	.651 (-.07)
DistilBERT	.368	.257 (-.11)	.289	.251 (-.04)	.694	.580 (-.11)
Hybrid	.420	.360 (-.06)	.594	.559 (-.03)	.746	.690 (-.06)

Note. Differences between cross-prompt and within-prompt performance are represented in brackets (\(\varDelta\)PCC).

Author Contribution

J.L. and A.H. wrote the main manuscript. J.L. and F.J. set up the code and conducted the experiments. J.L. and R.T. prepared the datasets. A.H., J.M., and S.K. supervised the project. S.K., J.L., J.M., and T.J. supported the essay rating procedure. J.L. created the figures and tables. J.M. and S.K. raised the financial support for the project that led to this publication. All authors reviewed the manuscript.

Funding

This work was supported by the German Research Foundation under Grant Number MO 648/25-2; and the Swiss National Science Foundation under Grant Number 197968.

Competing Interests

The authors have no relevant financial or non-financial interests to disclose.

Alikaniotis, D., Yannakoudakis, H., & Rei, M. (2016). Automatic Text Scoring Using Neural Networks. In K. Erk & N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 715–725). Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1068.
Andrade, H. L. (2018). Feedback in the Context of Self-Assessment. In A. A. Lipnevich, & J. K. Smith (Eds.), The Cambridge Handbook of Instructional Feedback (pp. 376–408). Cambridge University Press. https://doi.org/10.1017/9781316832134.019.
Bai, X., & Stede, M. (2022). A Survey of Current Machine Learning Approaches to Student Free-Text Evaluation for Intelligent Tutoring. International Journal of Artificial Intelligence in Education, 1–39. https://doi.org/10.1007/s40593-022-00323-0.
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of machine learning research, 13(2).
Beseiso, M., & Alzahrani, S. (2020). An Empirical Analysis of BERT Embedding for Automated Essay Scoring. International Journal of Advanced Computer Science and Applications, 11(10). https://doi.org/10.14569/IJACSA.2020.0111027.
Beseiso, M., Alzubi, O. A., & Rashaideh, H. (2021). A novel automated essay scoring approach for reliable higher educational assessments. Journal of Computing in Higher Education, 33(3), 727–746. https://doi.org/10.1007/s12528-021-09283-1.
Bexte, M., Horbach, A., & Zesch, T. (2022). Similarity-Based Content Scoring - How to Make S-BERT Keep Up With BERT. In E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, & T. Zesch (Eds.), Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022) (pp. 118–123). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.bea-1.16.
Bexte, M., Horbach, A., & Zesch, T. (2023). Similarity-Based Content Scoring - A more Classroom-Suitable Alternative to Instance-Based Scoring? In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023 (pp. 1892–1903). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.119.
Chen, J., Fife, J. H., Bejar, I. I., & Rupp, A. A. (2016). Building e-rater ® Scoring Models Using Machine Learning Methods. ETS Research Report Series, 2016(1), 1–12. https://doi.org/10.1002/ets2.12094.
Chen, X., & Meurers, D. (2016). CTAP: A Web-Based Tool Supporting Automatic Complexity Analysis. https://doi.org/10.17863/CAM.39630.
Condon, W., & Elliot, N. (2022). Liz Hamp Lyons: A life in Writing Assessment. Assessing Writing, 53, 100651. https://doi.org/10.1016/j.asw.2022.100651.
Crossley, S. A. (2019). Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. Journal of Writing Research, 11(vol(11 issue 2), 251–270. https://doi.org/10.17239/jowr-2019.11.02.01.
Crossley, S. A. (2020). Linguistic features in writing quality and development: An overview. Journal of Writing Research, 11(vol(11 issue 3), 415–443. https://doi.org/10.17239/jowr-2020.11.03.01.
Crossley, S. A., Kyle, K., & McNamara, D. S. (2017). Sentiment Analysis and Social Cognition Engine (SEANCE): An automatic tool for sentiment, social cognition, and social-order analysis. Behavior Research Methods, 49(3), 803–821. https://doi.org/10.3758/s13428-016-0743-z.
Dasgupta, T., Naskar, A., Dey, L., & Rupsa, S. (2018). Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, 93–102.
Ding, Y., Riordan, B., Horbach, A., Cahill, A., & Zesch, T. (2020). Don’t take nswvtnvakgxpm for an answer –The surprising vulnerability of automatic content scoring systems to adversarial input. In D. Scott, N. Bel, & C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (pp. 882–892). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.76.
Doewes, A., Kurdhi, N., & Saxena, A. (2023). Evaluating quadratic weighted kappa as the standard performance metric for automated essay scoring. In 16th International Conference on Educational Data Mining, EDM 2023 (pp. 103–113). International Educational Data Mining Society (IEDMS).
Firoozi, T., Mohammadi, H., & Gierl, M. J. (2023). Using Active Learning Methods to Strategically Select Essays for Automated Scoring. Educational Measurement: Issues and Practice, 42(1), 34–43. https://doi.org/10.1111/emip.12537.
Fleckenstein, J., Keller, S., Krüger, M., Tannenbaum, R. J., & Köller, O. (2020). Linking TOEFL iBT® writing rubrics to CEFR levels: Cut scores and validity evidence from a standard setting study. Assessing Writing, 43, 100420. https://doi.org/10.1016/j.asw.2019.100420.
Fleckenstein, J., Meyer, J., Jansen, T., Keller, S., & Köller, O. (2020). Is a Long Essay Always a Good Essay? The Effect of Text Length on Writing Assessment. Frontiers in Psychology, 11, 562462. https://doi.org/10.3389/fpsyg.2020.562462.
Horbach, A., & Palmer, A. (2016). Investigating Active Learning for Short-Answer Scoring. In J. Tetreault, J. Burstein, C. Leacock, & H. Yannakoudakis (Eds.), Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 301–311). Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-0535.
Horbach, A., Scholten-Akoun, D., Ding, Y., & Zesch, T. (2017). Fine-grained essay scoring of a complex writing task for native speakers. Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, 159–168.
Hussein, M. A., Hassan, H., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5, e208. https://doi.org/10.7717/peerj-cs.208.
Injadat, M., Moubayed, A., Nassif, A. B., & Shami, A. (2021). Machine learning towards intelligent systems: applications, challenges, and opportunities. Artificial Intelligence Review, 54(5), 3299–3348. https://doi.org/10.1007/s10462-020-09948-w.
Ke, Z., & Ng, V. (2019). Automated Essay Scoring: A Survey of the State of the Art. IJCAI, 6300–6308.
Keller, S. D., Fleckenstein, J., Krüger, M., Köller, O., & Rupp, A. A. (2020). English writing skills of students in upper secondary education: Results from an empirical study in Switzerland and Germany. Journal of Second Language Writing, 48, 100700. https://doi.org/10.1016/j.jslw.2019.100700.
Keller, S. D., Lohmann, J. F., Trueb, R., Fleckenstein, J., Meyer, J., Jansen, T., & Möller, J. (under review). Language quality, content, structure: what analytic ratings tell us about EFL writing skills at upper secondary school in Germany and Switzerland.
Kumar, V. S., & Boulanger, D. (2021). Automated Essay Scoring and the Deep Learning Black Box: How Are Rubric Scores Determined? International Journal of Artificial Intelligence in Education, 31(3), 538–584. https://doi.org/10.1007/s40593-020-00211-5.
Kusuma, J. S., Halim, K., Pranoto, E. J. P., Kanigoro, B., & Irwansyah, E. (2022). Automated Essay Scoring Using Machine Learning. In 2022 4th International Conference on Cybernetics and Intelligent System (ICORIS) (pp. 1–5). IEEE. https://doi.org/10.1109/ICORIS56080.2022.10031338.
Kyle, K., Crossley, S. A., & Berger, C. (2018). The tool for the automatic analysis of lexical sophistication (TAALES): Version 2.0. Behavior Research Methods, 50(3), 1030–1046. https://doi.org/10.3758/s13428-017-0924-4.
Lagakis, P., & Demetriadis, S. (2021). Automated essay scoring: A review of the field. In 2021 International Conference on Computer, Information and Telecommunication Systems (CITS) (pp. 1–6). IEEE. https://doi.org/10.1109/CITS52676.2021.9618476.
Linacre, J. M. (2023). A user’s guide to FACETS: Rasch-model computer programs [Software manual]. Winsteps.com. https://www.winsteps.com/manuals.htm.
Mathias, S., & Bhattacharyya, P. (2018). ASAP++: Enriching the ASAP automated essay grading dataset with essay attribute scores. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). https://aclanthology.org/L18-1187.
Mathias, S., & Bhattacharyya, P. (2020). Can Neural Networks Automatically Score Essay Traits? In J. Burstein, E. Kochmar, C. Leacock, N. Madnani, I. Pilán, H. Yannakoudakis, & T. Zesch (Eds.), Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 85–91). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.bea-1.8.
Mayfield, E., & Black, A. W. (2020). Should You Fine-Tune BERT for Automated Essay Scoring? In J. Burstein, E. Kochmar, C. Leacock, N. Madnani, I. Pilán, H. Yannakoudakis, & T. Zesch (Eds.), Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 151–162). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.bea-1.15.
McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated Evaluation of Text and Discourse with Coh-Metrix. Cambridge University Press. https://doi.org/10.1017/CBO9780511894664.
Mesgar, M., & Strube, M. (2018). A Neural Local Coherence Model for Text Quality Assessment. In E. Riloff, D. Chiang, J. Hockenmaier, & J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 4328–4339). Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1464.
Mitkov, R., & Voutilainen, A. (2012). Part-of-Speech Tagging (Vol. 1). Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199276349.013.0011.
Mizumoto, A., & Eguchi, M. (2023). Exploring the potential of using an AI language model for automated essay scoring. Research Methods in Applied Linguistics, 2(2), 100050. https://doi.org/10.1016/j.rmal.2023.100050.
Narciss, S. (2008). Feedback strategies for interactive learning tasks. In J. M. Spector, M. D. Merrill, J. J. G. van Merriënboer, & M. P. Driscoll (Eds.), Handbook of Research on Educational Communications and Technology (3rded.) (p. 125-144). Mahwah, NJ: Lawrence Erlbaum.
Nivre, J. (2010). Dependency Parsing. Language and Linguistics Compass, 4(3), 138–152. https://doi.org/10.1111/j.1749-818X.2010.00187.x.
Perelman, L. (2014). When the state of the art is counting words. Assessing Writing, 21, 104–111. https://doi.org/10.1016/j.asw.2014.05.001.
Ramesh, D., & Sanampudi, S. K. (2022). An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review, 55(3), 2495–2527. https://doi.org/10.1007/s10462-021-10068-2.
Rodriguez, P. U., Jafari, A., & Ormerod, C. M. (2019). Language models and Automated Essay Scoring. https://doi.org/10.48550/arXiv.1909.09482.
Rupp, A. A., Casabianca, J. M., Krüger, M., Keller, S., & Köller, O. (2019). Automated Essay Scoring at Scale: A Case Study in Switzerland and Germany. ETS Research Report Series, 2019(1), 1–23. https://doi.org/10.1002/ets2.12249.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019, October 2). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. http://arxiv.org/pdf/1910.01108v4.
Taghipour, K., & Ng, H. T. (2016). A Neural Approach to Automated Essay Scoring. In J. Su, K. Duh, & X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 1882–1891). Association for Computational Linguistics. https://doi.org/10.18653/v1/d16-1193.
Trueb, R., Möller, J., Lohmann, J. F., Jansen, T., & Keller, S. D. (under review). An empirical analysis of teacher judgment accuracy in the assessment of young foreign language learners’ texts at CEFR levels A1 and A2.
Uto, M. (2021). A review of deep-neural automated essay scoring models. Behaviormetrika, 48(2), 459–484. https://doi.org/10.1007/s41237-021-00142-y.
Uto, M., & Okano, M. (2020). Robust Neural Automated Essay Scoring Using Item Response Theory. In I. I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, & E. Millán (Eds.), Lecture Notes in Computer Science. Artificial Intelligence in Education (Vol. 12163, pp. 549–561). Springer International Publishing. https://doi.org/10.1007/978-3-030-52237-7_44.
Uto, M., Xie, Y., & Ueno, M. (2020). Neural automated essay scoring incorporating handcrafted features. Proceedings of the 28th International Conference on Computational Linguistics, 6077–6088.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need https://doi.org/10.48550/arXiv.1706.03762.
Wang, X., Lee, Y., & Park, J. (2022). Automated Evaluation for Student Argumentative Writing: A Survey. https://doi.org/10.48550/arXiv.2205.04083.
Weigle, S. C. (2002). Assessing Writing. Cambridge University Press.
Xue, J., Tang, X., & Zheng, L. (2021). A Hierarchical BERT-Based Transfer Learning Approach for Multi-Dimensional Essay Scoring. Ieee Access : Practical Innovations, Open Solutions, 9, 125403–125415. https://doi.org/10.1109/ACCESS.2021.3110683.
Yan, D. (2020). Handbook of automated scoring: Theory into practice. Chapman and Hall/CRC Statistics in the Social and Behavioral Sciences Ser. CRC Press LLC. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=6124217.
Zesch, T., Wojatzki, M., & Scholten-Akoun, D. (2015). Task-Independent Features for Automated Essay Grading. In J. Tetreault, J. Burstein, & C. Leacock (Eds.), Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 224–232). Association for Computational Linguistics. https://doi.org/10.3115/v1/W15-0626.

https://www.kaggle.com/c/asap-aes
https://spacy.io
www.nltk.org
https://osf.io/zbmxh/?view_only=c595c9dcdbca4262bfcec8a74a65e1e6
https://osf.io/zbmxh/?view_only=c595c9dcdbca4262bfcec8a74a65e1e6
https://languagetool.org
https://osf.io/zbmxh/?view_only=c595c9dcdbca4262bfcec8a74a65e1e6
A DNN with zero Dense layers corresponds to a standard linear regression. However, we added a dropout mechanism in our application (Fig. 3).
https://scikit-learn.org/stable/

No competing interests reported.

Download PDF

Journal Publication

published 13 Sep, 2024

Read the published version in International Journal of Artificial Intelligence in Education →

Editorial decision: Revision requested
19 Jul, 2024
Reviews received at journal
18 Jul, 2024
Reviewers agreed at journal
08 Jul, 2024
Reviews received at journal
12 Apr, 2024
Reviewers agreed at journal
03 Apr, 2024
Reviewers invited by journal
03 Apr, 2024
Editor assigned by journal
28 Feb, 2024
Submission checks completed at journal
28 Feb, 2024
First submitted to journal
22 Feb, 2024

You are reading this latest preprint version

Neural Networks or Linguistic Features? - Comparing Different Machine-Learning Approaches for Automated Assessment of Text Quality Traits Among L1- and L2-Learners’ Argumentative Essays

Status:

Journal Publication

Version 1

Abstract

Figures

1. Introduction

2. Overview of Different AES Approaches

2.1 Feature Engineering Approach

2.1.1 Length and Occurrence Features

2.1.2 Error Features

2.1.3 Features Relating to Lexical Diversity and Sophistication

2.1.4 Morphological Complexity Features

2.1.5 Syntactic Complexity Features

2.1.6 Cohesion Features

2.1.7 Feature Engineering Approach – Advantages and Challenges

2.2 Deep-Neural-Networks

2.3 Hybrid Models

3. Method

3.1 Datasets

3.1.1 Benchmark Scores and Rater Effects

3.2 Model Inputs

3.2.1 Linguistic Features

3.2.2 DistilBERT’s Contextual Embeddings

3.3 Training Procedure and Model Architectures

3.3.1 Main Experiments

3.3.2 Hybrid Model

3.3.3 Ablation Tests

3.3.4 Cross-Prompt Scoring

3.3.5 Two Linear Regression Baselines

3.4 Evaluation Metrics

4. Results

4.1 Features versus contextual embeddings

4.2 Hybrid Architecture

4.3 Ablation Tests

4.4 Cross-Prompt Scoring

5. Discussion

5.1 Limitations and Future Research

5.2 Practical Implications

Appendix

Declarations

Author Contribution

Funding

Competing Interests

References

Footnotes

Additional Declarations

Status:

Journal Publication

Version 1