2.1 Coding Frameworks of CR Assessment
According to the Framework for K-12 Science Education (NRC, 2011), fostering deep science understanding and reasoning in K-12 students requires supporting three dimensions of science learning: disciplinary core ideas, scientific and engineering practices, and crosscutting concepts. This paradigm offers educators a holistic means of nurturing students' knowledge-in-use skills as they construct explanations for phenomena and address real-world problems. Scientific CR assessments tied to these three dimensions help gauge students' proficiency in establishing coherence among, and making sense of, these elements (Underwood et al., 2018).
Obtaining a precise and reliable CR assessment involves an iterative process of developing, using, and refining an expert-validated scoring rubric along with expert scores (Nehm et al., 2010). Prior research has employed two primary coding schemes to characterize the quality of student CRs. Analytic rubrics consist of dichotomous criteria designed to ascertain the presence or absence of construct-relevant ideas in student responses (Kaldaras and Haudek, 2022). Each analytic rubric bin represents a distinct concept, and each response must be scored on every bin. A response may receive a score of “1” in multiple bins when multiple concepts co-occur, but some bins may be mutually exclusive; in those cases, not all bins can receive a “1”, reflecting how researchers and educators design the descriptors to assess students’ understanding of a specific item with precision. Analytic rubrics are often favored in educational settings because they are more reliable than other coding schemes, evaluate key content components of reasoning, and can provide specific feedback to students (Jönsson and Svingby, 2007; Jescovitch et al., 2019). A second coding approach—using a holistic rubric—employs multi-leveled coding schemes to provide a single, overall judgment of a student’s CR grounded in the accuracy or quality of the explanation or reasoning (Jescovitch et al., 2021). This holistic approach is typically most suitable when the overall quality of a response exceeds the combined merit of its individual components (Tomas, Whitt, Lavelle-Hill, and Severn, 2019). It aims to capture general features of answer quality (e.g., organization, style, and persuasiveness), relying on raters’ sensitivities to the construct (Klein et al., 1998) and on their backgrounds and knowledge (Zhai, Haudek, Stuhlsatz, and Wilson, 2020). It is widely documented that the reliability of holistic scores can be influenced by multiple sources of measurement error, such as rater effects, the writer’s individual characteristics, and the prompt used to elicit the writing sample (Barkaoui, 2007; Wang and Troia, 2023).
In science education, there is insufficient evidence that either coding scheme is universally superior for human coding (Tomas, Whitt, Lavelle-Hill, and Severn, 2019). The choice of coding method depends heavily on the specific writing constructs being assessed and on the intended purpose of the overall judgment. Nevertheless, a prevalent notion holds that quantitative measures are less susceptible to threats to internal validity than qualitative scoring rubrics (Troia, Shen, and Brandon, 2019). To mitigate potential bias, researchers have suggested deconstructing holistic rubrics, and reconstructing analytic rubrics, into singular levels so that the rubric descriptors measure the intended aspects in a quantitative, measurable manner (Jescovitch et al., 2019; Martin and Graulich, 2023). Previous studies have shown that holistic rubrics for multi-dimensional science CRs can be decomposed into discrete conceptual components that form the basis of analytic rubrics and can be applied with high interrater reliability (Kaldaras, Yoshida, and Haudek, 2022). Conversely, other studies acknowledge the sophisticated construct implied by the holistic nature of assessment. This perspective is evident in Jescovitch et al.’s (2021) study, in which analytic codes were amalgamated using validated Boolean logic to align with hypothesized learning progression levels in science education.
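To make these two coding schemes concrete, the following minimal Python sketch, with invented rubric bins and an invented mapping rule (not the validated Boolean logic of the cited studies), illustrates how dichotomous analytic codes for one response might be amalgamated into a single holistic level.

```python
# Toy illustration (invented bins and rule, not the validated logic of the
# cited studies): analytic codes are dichotomous per-bin scores, and a
# Boolean rule amalgamates them into one holistic level.
analytic_codes = {
    "identifies_claim": 1,   # 1 = idea present in the response, 0 = absent
    "cites_evidence": 1,
    "links_reasoning": 0,
}

def holistic_level(codes: dict) -> int:
    """Map analytic codes to a 0-3 holistic level with a simple Boolean rule."""
    if codes["identifies_claim"] and codes["cites_evidence"] and codes["links_reasoning"]:
        return 3
    if codes["identifies_claim"] and codes["cites_evidence"]:
        return 2
    if codes["identifies_claim"]:
        return 1
    return 0

print(holistic_level(analytic_codes))  # -> 2
```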
2.2 Automated Analysis of CR Assessment
A valid and reliable coding framework in science education not only assists educators in capturing students’ conceptual acquisition based on rubric descriptors but also holds promise for improving the reliability of automated scoring tools, because human raters’ codes can serve as labels for model training and validation. Supervised machine learning (ML) models have demonstrated significant success in both holistic and analytic scoring, using various algorithms to evaluate students’ contextualized science CRs (see Jescovitch et al., 2019; Zhai, Haudek, and Ma, 2023; Zhai, He, and Krajcik, 2022). In a systematic review by Zhai, Yin, et al. (2020), a synthesis of 45 studies substantiated the effectiveness of supervised ML models in scoring scientific responses composed by K-16 students, as indicated by a median Cohen’s Kappa of 0.72 across the investigated studies. Zhai and colleagues also made the thought-provoking observation that the predominant focus of existing ML studies in this synthesis was on replacing human effort rather than deepening it, for example by gaining a deeper understanding of what cognitive factors contribute to students’ performance on scientific tasks.
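As a minimal illustration of the agreement statistic used in that review, the sketch below computes Cohen’s Kappa between hypothetical human and machine codes with scikit-learn; the score vectors are invented for demonstration only.

```python
# Minimal sketch of machine-human agreement via Cohen's kappa, the statistic
# reported in the review by Zhai, Yin, et al. (2020). Toy data only.
from sklearn.metrics import cohen_kappa_score

human_codes   = [0, 1, 2, 2, 1, 0, 2, 1, 1, 0]  # expert-assigned levels
machine_codes = [0, 1, 2, 1, 1, 0, 2, 1, 2, 0]  # model-predicted levels

kappa = cohen_kappa_score(human_codes, machine_codes)
print(f"Cohen's kappa = {kappa:.2f}")  # chance-corrected agreement
```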
Given the limitations of automated scoring models, it is vital to consider integrating more nuanced writing constructs into CR assessments. These may include individual differences related to sociocultural and cognitive factors (e.g., Crossley, Allen, Snow, and McNamara, 2016), along with academic attributes (e.g., Murphy and Yancey, 2008; Wang and Troia, 2023) such as keywords, word frequency, sentence structure, and text length, given their substantial influence on essay quality and characteristics. For instance, coherent, high-quality CRs tend to demonstrate greater and more appropriate use of academic, sophisticated vocabulary, coupled with a more advanced level of syntactic complexity in explaining scientific phenomena (Wang et al., 2023). An exemplary approach in this domain is the Constructed Response Classifier (CRC; see Noyes et al., 2020), an ensemble of eight ML classification algorithms implemented in R (Jurka et al., 2012). In the CRC, each student response is treated as a document, and the coding rubric bins are treated as classes. Text features, extracted as word-level stemmed n-grams weighted by TF-IDF, serve as input variables for the classification algorithms, and expert-assigned codes serve as the target variable for training. The eight ML algorithms independently provide categorizations, and the final prediction is determined by a majority vote.
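The sketch below approximates this workflow in Python for illustration; the actual CRC is implemented in R with eight algorithms, whereas this version uses three scikit-learn classifiers, omits stemming, and relies on invented toy responses and codes.

```python
# Simplified analogue of the CRC pipeline: TF-IDF n-gram features feed an
# ensemble whose final code is decided by majority vote. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

responses = [
    "the water evaporates because heat increases molecular motion",
    "plants use sunlight to make food during photosynthesis",
    "the ice melts when energy is transferred from the warmer air",
    "leaves absorb light energy and convert carbon dioxide into glucose",
]
# Expert-assigned codes for one analytic rubric bin (1 = idea present).
codes = [1, 0, 1, 0]

# TF-IDF over word-level unigrams and bigrams stands in for the CRC's
# stemmed n-gram features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("forest", RandomForestClassifier(n_estimators=100)),
    ],
    voting="hard",  # final code decided by majority vote, as in the CRC
)

model = make_pipeline(vectorizer, ensemble)
model.fit(responses, codes)
print(model.predict(["warming the liquid speeds up evaporation"]))
```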
Unsupervised pre-trained language representation (LR) models have displayed considerable potential in text classification, particularly in assessing scoring levels within science education. An illustrative example of this application is the work of Cochran, Cohn, Hastings, et al. (2023), in which LR models were employed to identify the causal structure present in students’ scientific explanations. Transformer-based Natural Language Processing (NLP) models, exemplified by BERT and GPT, have become the de facto standard for a diverse range of downstream NLP tasks (Cochran, Cohn, Rouet, and Hastings, 2023; Wulff et al., 2023). Prior research (e.g., Cochran et al., 2022) has consistently highlighted the effectiveness of BERT-based transformers in evaluating students’ responses to STEM-related questions. These models undergo an initial phase of pre-training on extensive corpora to acquire generalized language representations, followed by fine-tuning on specific tasks to incorporate domain-specific knowledge. In contrast to conventional word vectorization methods (e.g., one-hot encoding, word2vec, GloVe), pretrained LR models such as BERT represent words dynamically rather than assigning each word a single, fixed embedding vector. BERT, for example, gained recognition for its outstanding performance across 11 downstream NLP tasks. However, directly using embeddings extracted from LR models to address domain-specific NLP problems often yields suboptimal performance, primarily because of the knowledge gap between the training corpora and domain-specific contexts (Liu et al., 2020). To tackle this challenge, specialized contextualized embedding LR models such as SciBERT (Beltagy, Lo, and Cohan, 2019) and BioBERT (Lee et al., 2020), trained on large-scale scientific and biomedical corpora, respectively, have been developed and used extensively. Additionally, SciEdBERT (Liu et al., 2023), explicitly designed for science education contexts, emphasizes the importance of domain-specific pretraining on data derived from prominent science education journals and introduces a generalized strategy for automating science education tasks, particularly those related to scoring and text classification.
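For illustration, the following sketch outlines how a BERT-family model might be fine-tuned to classify student responses into rubric levels using the Hugging Face transformers library; the model choice, toy data, and hyperparameters are assumptions rather than the configurations used in the cited studies.

```python
# Sketch of fine-tuning a BERT-family model for rubric-level classification.
# Model name, data, and hyperparameters are illustrative assumptions
# (e.g., "allenai/scibert_scivocab_uncased" could replace "bert-base-uncased").
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # e.g., three holistic rubric levels
)

responses = ["evaporation increases with temperature", "plants need water"]
labels = torch.tensor([2, 0])  # expert-assigned levels (toy data)

batch = tokenizer(responses, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy batch
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # predicted rubric levels
```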
2.3 Contextualized Word Embedding
Within the NLP realm, word embeddings are representations of words in a continuous space that preserve both semantic and syntactic similarities between them (Chen, Perozzi, Al-Rfou, and Skiena, 2013). The fundamental distinction between a word’s non-contextualized core meaning and the senses it expresses in specific linguistic contexts can be clarified through the analysis of contextualized word embeddings (Hofmann, Pierrehumbert, and Schütze, 2020), which involves aligning type-level representations with token-level representations based on the linguistic context. Integrating contextualized word embeddings into pretrained language models has markedly improved performance across diverse tasks relative to static word embeddings, which capture only type-level representations (see Selva Birunda and Kanniga Devi, 2021, for a review). Because the linguistic properties and meanings of words can vary across extralinguistic contexts (e.g., time and social space; see Rudolph and Blei, 2018), understanding contextualized word embeddings is particularly important. In educational assessment, understanding the contextualized word embeddings of domain-specific academic vocabulary holds significant value. For instance, Technical Language Processing techniques can efficiently extract text-based scientific information using word embeddings, which are then used to build language models that reflect students’ understanding of topical background, thereby enhancing overall modeling performance in practice (Kumar, Starly, and Lynch, 2023).
Similar to static word embeddings, contextualized word embeddings are usually generated by training on extensive unlabeled corpora with some variant of a large language model, as exemplified by the BERT architecture (Devlin et al., 2019). BERT employs transformer encoders with a self-attention mechanism and a masked language modeling objective, predicting missing words in a sentence by considering the left and right contexts of a target word simultaneously. This contextualization sets BERT apart from earlier language models; relative to static word embeddings, it significantly enhances performance by combining self-attention with the bidirectional nature of the masked language modeling task. Notably, the BERT tokenizer uses a predetermined vocabulary of 30,522 distinct tokens, each assigned a unique token ID. During pretraining on extensive corpora such as Wikipedia and the BooksCorpus, the model learns to convert each token ID into a contextualized embedding that captures nuanced information about the token within the provided sequence.
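The brief sketch below illustrates this property by extracting BERT’s last-hidden-state vectors for the same surface word in two different sentences; the model name and sentences are illustrative choices, not drawn from the studies discussed here.

```python
# Sketch showing that the same word receives different contextualized
# embeddings in different sentences. Model and sentences are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "The river bank was flooded after the storm.",
    "She deposited the check at the bank downtown.",
]

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the last-hidden-state vector for the given word's token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = embedding_of(sentences[0], "bank")
v2 = embedding_of(sentences[1], "bank")
similarity = torch.cosine_similarity(v1, v2, dim=0).item()
print(f"cosine similarity across contexts: {similarity:.2f}")  # below 1.0
```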
While Google’s BERT stands out as a leading contextualized embedding model for most domain-specific tasks, followed by ELMo, GPT, and XLNet (Yunianto, Permanasari, and Widyawan, 2020), it encounters challenges in certain domain-specific tasks due to limited knowledge connections between specific and open domains. These challenges are accentuated with small-scale and imbalanced datasets, where the data available for fine-tuning often lack the task-specific nuances and contextual details the model needs to capture. One potential solution is to pretrain a model with an emphasis on domain specificity, generating domain-specific contextualized embeddings rather than relying on publicly provided ones that offer generalized embeddings for a predetermined set of words and subwords. Nevertheless, pretraining such models from scratch is time-consuming and computationally expensive, rendering it impractical for most users. Moreover, implementing this approach in educational contexts poses significant challenges, including potential fine-tuning instabilities (Dodge et al., 2020; Lee, Cho, and Kang, 2019), and raises ethical concerns regarding diversity and representation (Baird and Schuller, 2020; Yan et al., 2023) given the inherent constraints of small and skewed datasets.
2.4 Environmental Ontology
The term “ontology” finds its origins in philosophy, where it denotes a collection of concepts used to depict tangible objects in the world (Smith, 2012). In the current era of information explosion and AI revolution, ontology has emerged as a potent method for storing, organizing, and retrieving valuable information (Asim et al., 2019). Functioning as an abstract description system for knowledge representation within specific domains, an ontology takes on the fundamental task of capturing domain knowledge and concepts. Its applicability now extends widely into the AI research community, facilitating the processing and reuse of existing data for communication among programs, services, agents, and users (Rahman and Hussain, 2020). In an ontology, definitions connect the names of entities in the universe of discourse (e.g., classes, relations, functions, or other objects), represented as nodes in a knowledge graph, with “human-readable text describing what the names mean, and formal axioms that constrain the interpretation and well-formed use of these terms” (Gruber, 1995, p. 908).
By organizing ontological classes (terms) hierarchically and describing the relationships between terms with a limited set of relational descriptors (e.g., part_of, is_a, located_in, instance_of), an ontology establishes a standardized vocabulary for representing entities in a given domain (Arp, Smith, and Spear, 2015). However, applications leveraging domain ontologies require quantifying the relationship between two terms, and the semantic similarity between terms, given the underlying domain ontology, serves as a suitable measure of such relationships. For example, in a given ontology such as a wildlife ontology, computing semantic distance with geometrical metrics such as cosine similarity reveals the relatedness between the embedding vector of a seed term (e.g., animal) and those of its sibling terms (e.g., mammal, reptile, insect, amphibian, mollusk) within the hierarchical structure of the domain. This nuanced representation proves particularly valuable for terms lacking exact synonyms, as ontologies provide related siblings that share overlapping meanings and relationships. The precision introduced by these relationships enhances the detailed and nuanced representation of information within the specified domain, extending beyond mere synonymy to capture subtle variations and distinctions among related terms.
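A minimal sketch of this computation appears below; the embedding vectors are invented stand-ins for vectors that would, in practice, come from a language model or ontology embedding, and the resulting sibling ranking is purely illustrative.

```python
# Illustrative sketch of ranking ontology sibling terms by cosine similarity
# to a seed term. The three-dimensional vectors are invented placeholders.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

embeddings = {
    "animal":  np.array([0.90, 0.40, 0.10]),
    "mammal":  np.array([0.85, 0.45, 0.15]),
    "reptile": np.array([0.70, 0.55, 0.20]),
    "mollusk": np.array([0.40, 0.30, 0.80]),
}

seed = embeddings["animal"]
siblings = {t: cosine(seed, v) for t, v in embeddings.items() if t != "animal"}
for term, score in sorted(siblings.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")  # more closely related siblings rank higher
```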
A diverse array of ontologies is available on the DBpedia website[1]. In environmental science, for example, the Environment Ontology (EnvO) is noteworthy; it comprises three hierarchies of classes: biome, environmental feature, and environmental material (refer to Buttigieg et al., 2013 for details). An optimal approach to annotating entities with EnvO combines classes from each hierarchy to describe an environmental system comprehensively from these three perspectives. Baker and colleagues (2009) have expounded on the significant contributions of ontologies such as EnvO to the development of educational assessments and learning design. By enhancing domain transparency for students, ontologies can effectively reveal important elements or concepts and their interrelationships. A notable feature of a domain ontology is its capability to convey the importance of a class through its level of connectivity, potentially guiding educational decisions about which ideas are central to the teaching and learning of a science domain. This integration of ontologies into educational assessments not only fosters a deeper understanding of domain-specific concepts but also provides a structured framework for developing and evaluating essential skills in alignment with contemporary educational goals (e.g., Libarkin and Kurdziel, 2006).