2.1 Coding Frameworks of CR Assessment
According to the Framework for K-12 Science Education (NRC, 2011), fostering deep science understanding and reasoning in K-12 students requires supporting three dimensions of science learning: disciplinary core ideas, scientific and engineering practices, and crosscutting concepts. This paradigm offers educators a holistic means of nurturing students' knowledge-in-use skills as they construct explanations for phenomena and address real-world problems. Scientific CR assessments tied to these three dimensions help gauge students' proficiency in establishing coherence among, and making sense of, these elements (Underwood et al., 2018).
Obtaining a precise and reliable CR assessment involves an iterative process of developing, using, and refining an expert-validated scoring rubric along with expert scores (Nehm et al., 2010). Prior research has employed two primary coding schemes to characterize the quality of student CRs. Analytic rubrics consist of dichotomous criteria designed to ascertain the presence or absence of construct-relevant ideas in student responses (Kaldaras and Haudek, 2022). Each analytic rubric bin represents a distinct concept, and each response must be scored on every bin. A response may receive a score of “1” in multiple bins when multiple concepts co-occur, but some bins may be mutually exclusive; in those cases, not all bins can receive a “1”, reflecting how researchers and educators design the descriptors to assess students’ understanding of a specific item with precision. Analytic rubrics are often favored in educational settings because they are more reliable than other coding schemes, evaluate key content components of reasoning, and can provide specific feedback to students (Jönsson and Svingby, 2007; Jescovitch et al., 2019). A second coding approach—using a holistic rubric—employs multi-leveled coding schemes to provide a single, overall judgment of a student’s CR grounded in the accuracy or quality of the explanation or reasoning (Jescovitch et al., 2021). This holistic approach is typically most suitable when the overall quality of a response exceeds the combined merit of its individual components (Tomas, Whitt, Lavelle-Hill, and Severn, 2019). It aims to capture general features of answer quality (e.g., organization, style, and persuasiveness), relying on raters’ sensitivities to the construct (Klein et al., 1998) and on their backgrounds and knowledge (Zhai, Haudek, Stuhlsatz, and Wilson, 2020). It is widely documented that the reliability of holistic scores can be influenced by multiple sources of measurement error, such as rater effects, the writer’s individual characteristics, and the prompt used to elicit the writing sample (Barkaoui, 2007; Wang and Troia, 2023).
In science education, there is insufficient evidence that either coding scheme is universally superior for human coding (Tomas, Whitt, Lavelle-Hill, and Severn, 2019). The choice of coding method depends heavily on the specific writing constructs being assessed and on the intended purpose of the overall judgment. Nevertheless, a prevalent notion holds that quantitative measures are less susceptible to threats to internal validity than qualitative scoring rubrics (Troia, Shen, and Brandon, 2019). To mitigate potential bias, researchers have suggested deconstructing holistic rubrics, and reconstructing analytic rubrics, into singular levels so that the rubric descriptors measure the intended aspects in a quantitative, measurable manner (Jescovitch et al., 2019; Martin and Graulich, 2023). Previous studies have shown that holistic rubrics for multi-dimensional science CRs can be decomposed into discrete conceptual components that form the basis of analytic rubrics and can be applied with high interrater reliability (Kaldaras, Yoshida, and Haudek, 2022). Conversely, other studies acknowledge the sophisticated construct implied by the holistic nature of assessment. This perspective is evident in Jescovitch et al.’s (2021) study, in which analytic codes were amalgamated using validated Boolean logic to align with hypothesized learning progression levels in science education.
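To make these two coding schemes concrete, the following minimal Python sketch, with invented rubric bins and an invented mapping rule (not the validated Boolean logic of the cited studies), illustrates how dichotomous analytic codes for one response might be amalgamated into a single holistic level.

```python
# Toy illustration (invented bins and rule, not the validated logic of the
# cited studies): analytic codes are dichotomous per-bin scores, and a
# Boolean rule amalgamates them into one holistic level.
analytic_codes = {
    "identifies_claim": 1,   # 1 = idea present in the response, 0 = absent
    "cites_evidence": 1,
    "links_reasoning": 0,
}

def holistic_level(codes: dict) -> int:
    """Map analytic codes to a 0-3 holistic level with a simple Boolean rule."""
    if codes["identifies_claim"] and codes["cites_evidence"] and codes["links_reasoning"]:
        return 3
    if codes["identifies_claim"] and codes["cites_evidence"]:
        return 2
    if codes["identifies_claim"]:
        return 1
    return 0

print(holistic_level(analytic_codes))  # -> 2
```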
2.2 Automated Analysis of CR Assessment
A valid and reliable coding framework in science education not only assists educators in capturing students’ conceptual acquisition based on rubric descriptors but also holds promise for improving the reliability of automated scoring tools, because human raters’ codes can serve as labels for model training and validation. Supervised machine learning (ML) models have demonstrated significant success in both holistic and analytic scoring, using various algorithms to evaluate students’ contextualized science CRs (see Jescovitch et al., 2019; Zhai, Haudek, and Ma, 2023; Zhai, He, and Krajcik, 2022). In a systematic review by Zhai, Yin, et al. (2020), a synthesis of 45 studies substantiated the effectiveness of supervised ML models in scoring scientific responses composed by K-16 students, as indicated by a median Cohen’s Kappa of 0.72 across the investigated studies. Zhai and colleagues also made the thought-provoking observation that the predominant focus of existing ML studies in this synthesis was on replacing human effort rather than deepening it, for example by gaining a deeper understanding of what cognitive factors contribute to students’ performance on scientific tasks.
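As a minimal illustration of the agreement statistic used in that review, the sketch below computes Cohen’s Kappa between hypothetical human and machine codes with scikit-learn; the score vectors are invented for demonstration only.

```python
# Minimal sketch of machine-human agreement via Cohen's kappa, the statistic
# reported in the review by Zhai, Yin, et al. (2020). Toy data only.
from sklearn.metrics import cohen_kappa_score

human_codes   = [0, 1, 2, 2, 1, 0, 2, 1, 1, 0]  # expert-assigned levels
machine_codes = [0, 1, 2, 1, 1, 0, 2, 1, 2, 0]  # model-predicted levels

kappa = cohen_kappa_score(human_codes, machine_codes)
print(f"Cohen's kappa = {kappa:.2f}")  # chance-corrected agreement
```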
Given the limitations of automated scoring models, it is vital to consider integrating more nuanced writing constructs into CR assessments. These may include individual differences related to sociocultural and cognitive factors (e.g., Crossley, Allen, Snow, and McNamara, 2016), along with academic attributes (e.g., Murphy and Yancey, 2008; Wang and Troia, 2023) such as keywords, word frequency, sentence structure, and text length, given their substantial influence on essay quality and characteristics. For instance, coherent, high-quality CRs tend to demonstrate greater and more appropriate use of academic, sophisticated vocabulary, coupled with a more advanced level of syntactic complexity in explaining scientific phenomena (Wang et al., 2023). An exemplary approach in this domain is the Constructed Response Classifier (CRC; see Noyes et al., 2020), an ensemble of eight ML classification algorithms implemented in R (Jurka et al., 2012). In the CRC, each student response is treated as a document, and the coding rubric bins are treated as classes. Text features, extracted as word-level stemmed n-grams weighted by TF-IDF, serve as input variables for the classification algorithms, and expert-assigned codes serve as the target variable for training. The eight ML algorithms independently provide categorizations, and the final prediction is determined by a majority vote.
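The sketch below approximates this workflow in Python for illustration; the actual CRC is implemented in R with eight algorithms, whereas this version uses three scikit-learn classifiers, omits stemming, and relies on invented toy responses and codes.

```python
# Simplified analogue of the CRC pipeline: TF-IDF n-gram features feed an
# ensemble whose final code is decided by majority vote. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

responses = [
    "the water evaporates because heat increases molecular motion",
    "plants use sunlight to make food during photosynthesis",
    "the ice melts when energy is transferred from the warmer air",
    "leaves absorb light energy and convert carbon dioxide into glucose",
]
# Expert-assigned codes for one analytic rubric bin (1 = idea present).
codes = [1, 0, 1, 0]

# TF-IDF over word-level unigrams and bigrams stands in for the CRC's
# stemmed n-gram features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("forest", RandomForestClassifier(n_estimators=100)),
    ],
    voting="hard",  # final code decided by majority vote, as in the CRC
)

model = make_pipeline(vectorizer, ensemble)
model.fit(responses, codes)
print(model.predict(["warming the liquid speeds up evaporation"]))
```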
Unsupervised pre-trained language representation (LR) models have displayed considerable potential in text classification, particularly in assessing scoring levels within science education. An illustrative example of this application is the work of Cochran, Cohn, Hastings, et al. (2023), in which LR models were employed to identify the causal structure present in students’ scientific explanations. Transformer-based Natural Language Processing (NLP) models, exemplified by BERT and GPT, have become the de facto standard for a diverse range of downstream NLP tasks (Cochran, Cohn, Rouet, and Hastings, 2023; Wulff et al., 2023). Prior research (e.g., Cochran et al., 2022) has consistently highlighted the effectiveness of BERT-based transformers in evaluating students’ responses to STEM-related questions. These models undergo an initial phase of pre-training on extensive corpora to acquire generalized language representations, followed by fine-tuning on specific tasks to incorporate domain-specific knowledge. In contrast to conventional word vectorization methods (e.g., one-hot encoding, word2vec, GloVe), pretrained LR models such as BERT represent words dynamically rather than assigning each word a single, fixed embedding vector. BERT, for example, gained recognition for its outstanding performance across 11 downstream NLP tasks. However, directly using embeddings extracted from LR models to address domain-specific NLP problems often yields suboptimal performance, primarily because of the knowledge gap between the training corpora and domain-specific contexts (Liu et al., 2020). To tackle this challenge, specialized contextualized embedding LR models such as SciBERT (Beltagy, Lo, and Cohan, 2019) and BioBERT (Lee et al., 2020), trained on large-scale scientific and biomedical corpora, respectively, have been developed and used extensively. Additionally, SciEdBERT (Liu et al., 2023), explicitly designed for science education contexts, emphasizes the importance of domain-specific pretraining on data derived from prominent science education journals and introduces a generalized strategy for automating science education tasks, particularly those related to scoring and text classification.
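For illustration, the following sketch outlines how a BERT-family model might be fine-tuned to classify student responses into rubric levels using the Hugging Face transformers library; the model choice, toy data, and hyperparameters are assumptions rather than the configurations used in the cited studies.

```python
# Sketch of fine-tuning a BERT-family model for rubric-level classification.
# Model name, data, and hyperparameters are illustrative assumptions
# (e.g., "allenai/scibert_scivocab_uncased" could replace "bert-base-uncased").
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # e.g., three holistic rubric levels
)

responses = ["evaporation increases with temperature", "plants need water"]
labels = torch.tensor([2, 0])  # expert-assigned levels (toy data)

batch = tokenizer(responses, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy batch
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # predicted rubric levels
```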
2.3 Contextualized Word Embedding
Within the NLP realm, word embeddings are representations of words in a continuous space that preserve both semantic and syntactic similarities between them (Chen, Perozzi, Al-Rfou, and Skiena, 2013). The fundamental distinction between a word’s non-contextualized core meaning and the senses it expresses in specific linguistic contexts can be clarified through the analysis of contextualized word embeddings (Hofmann, Pierrehumbert, and Schütze, 2020), which involves aligning type-level representations with token-level representations based on the linguistic context. Integrating contextualized word embeddings into pretrained language models has markedly improved performance across diverse tasks relative to static word embeddings, which capture only type-level representations (see Selva Birunda and Kanniga Devi, 2021, for a review). Because the linguistic properties and meanings of words can vary across extralinguistic contexts (e.g., time and social space; see Rudolph and Blei, 2018), understanding contextualized word embeddings is particularly important. In educational assessment, understanding the contextualized word embeddings of domain-specific academic vocabulary holds significant value. For instance, Technical Language Processing techniques can efficiently extract text-based scientific information using word embeddings, which are then used to build language models that reflect students’ understanding of topical background, thereby enhancing overall modeling performance in practice (Kumar, Starly, and Lynch, 2023).
Similar to static word embeddings, contextualized word embeddings are usually generated by training on extensive unlabeled corpora with some variant of a large language model, as exemplified by the BERT architecture (Devlin et al., 2019). BERT employs transformer encoders with a self-attention mechanism and a masked language modeling objective, predicting missing words in a sentence by considering the left and right contexts of a target word simultaneously. This contextualization sets BERT apart from earlier language models; relative to static word embeddings, it significantly enhances performance by combining self-attention with the bidirectional nature of the masked language modeling task. Notably, the BERT tokenizer uses a predetermined vocabulary of 30,522 distinct tokens, each assigned a unique token ID. During pretraining on extensive corpora such as Wikipedia and the BooksCorpus, the model learns to convert each token ID into a contextualized embedding that captures nuanced information about the token within the provided sequence.
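The brief sketch below illustrates this property by extracting BERT’s last-hidden-state vectors for the same surface word in two different sentences; the model name and sentences are illustrative choices, not drawn from the studies discussed here.

```python
# Sketch showing that the same word receives different contextualized
# embeddings in different sentences. Model and sentences are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "The river bank was flooded after the storm.",
    "She deposited the check at the bank downtown.",
]

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the last-hidden-state vector for the given word's token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = embedding_of(sentences[0], "bank")
v2 = embedding_of(sentences[1], "bank")
similarity = torch.cosine_similarity(v1, v2, dim=0).item()
print(f"cosine similarity across contexts: {similarity:.2f}")  # below 1.0
```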
While Google’s BERT stands out as a leading contextualized embedding model for most domain-specific tasks, followed by ELMo, GPT, and XLNet (Yunianto, Permanasari, and Widyawan, 2020), it encounters challenges in certain domain-specific tasks due to limited knowledge connections between specific and open domains. These challenges are accentuated with small-scale and imbalanced datasets, where the data available for fine-tuning often lack the task-specific nuances and contextual details the model needs to capture. One potential solution is to pretrain a model with an emphasis on domain specificity, generating domain-specific contextualized embeddings rather than relying on publicly provided ones that offer generalized embeddings for a predetermined set of words and subwords. Nevertheless, pretraining such models from scratch is time-consuming and computationally expensive, rendering it impractical for most users. Moreover, implementing this approach in educational contexts poses significant challenges, including potential fine-tuning instabilities (Dodge et al., 2020; Lee, Cho, and Kang, 2019), and raises ethical concerns regarding diversity and representation (Baird and Schuller, 2020; Yan et al., 2023) given the inherent constraints of small and skewed datasets.
2.4 Environmental Ontology
The term “ontology” finds its origins in philosophy, where it denotes a collection of concepts used to depict tangible objects in the world (Smith, 2012). In the current era of information explosion and AI revolution, ontology has emerged as a potent method for storing, organizing, and retrieving valuable information (Asim et al., 2019). Functioning as an abstract description system for knowledge representation within specific domains, an ontology takes on the fundamental task of capturing domain knowledge and concepts. Its applicability now extends widely into the AI research community, facilitating the processing and reuse of existing data for communication among programs, services, agents, and users (Rahman and Hussain, 2020). In an ontology, definitions connect the names of entities in the universe of discourse (e.g., classes, relations, functions, or other objects), represented as nodes in a knowledge graph, with “human-readable text describing what the names mean, and formal axioms that constrain the interpretation and well-formed use of these terms” (Gruber, 1995, p. 908).
By organizing ontological classes (terms) hierarchically and describing the relationships between terms with a limited set of relational descriptors (e.g., part_of, is_a, located_in, instance_of), an ontology establishes a standardized vocabulary for representing entities in a given domain (Arp, Smith, and Spear, 2015). However, applications leveraging domain ontologies require quantifying the relationship between two terms, and the semantic similarity between terms, given the underlying domain ontology, serves as a suitable measure of such relationships. For example, in a given ontology such as a wildlife ontology, computing semantic distance with geometrical metrics such as cosine similarity reveals the relatedness between the embedding vector of a seed term (e.g., animal) and those of its sibling terms (e.g., mammal, reptile, insect, amphibian, mollusk) within the hierarchical structure of the domain. This nuanced representation proves particularly valuable for terms lacking exact synonyms, as ontologies provide related siblings that share overlapping meanings and relationships. The precision introduced by these relationships enhances the detailed and nuanced representation of information within the specified domain, extending beyond mere synonymy to capture subtle variations and distinctions among related terms.
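A minimal sketch of this computation appears below; the embedding vectors are invented stand-ins for vectors that would, in practice, come from a language model or ontology embedding, and the resulting sibling ranking is purely illustrative.

```python
# Illustrative sketch of ranking ontology sibling terms by cosine similarity
# to a seed term. The three-dimensional vectors are invented placeholders.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

embeddings = {
    "animal":  np.array([0.90, 0.40, 0.10]),
    "mammal":  np.array([0.85, 0.45, 0.15]),
    "reptile": np.array([0.70, 0.55, 0.20]),
    "mollusk": np.array([0.40, 0.30, 0.80]),
}

seed = embeddings["animal"]
siblings = {t: cosine(seed, v) for t, v in embeddings.items() if t != "animal"}
for term, score in sorted(siblings.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")  # more closely related siblings rank higher
```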
A diverse array of ontologies is available on the DBpedia website[1]. In environmental science, for example, the Environment Ontology (EnvO) is noteworthy; it comprises three hierarchies of classes: biome, environmental feature, and environmental material (refer to Buttigieg et al., 2013 for details). An optimal approach to annotating entities with EnvO combines classes from each hierarchy to describe an environmental system comprehensively from these three perspectives. Baker and colleagues (2009) have expounded on the significant contributions of ontologies such as EnvO to the development of educational assessments and learning design. By enhancing domain transparency for students, ontologies can effectively reveal important elements or concepts and their interrelationships. A notable feature of a domain ontology is its capability to convey the importance of a class through its level of connectivity, potentially guiding educational decisions about which ideas are central to the teaching and learning of a science domain. This integration of ontologies into educational assessments not only fosters a deeper understanding of domain-specific concepts but also provides a structured framework for developing and evaluating essential skills in alignment with contemporary educational goals (e.g., Libarkin and Kurdziel, 2006).