Question asking is a common, everyday activity in which we spend considerable time engaged. Yet only a very rudimentary technical understanding of question asking currently exists (Kearsley, 1976). Question asking is central to learning (Chin & Osborne, 2008; Salmon & Barrera, 2021) and is an important component of educational programs (Chin & Brown, 2002). This is especially true for open-ended questions, which encourage longer and syntactically more complex answers, as opposed to closed-ended questions, which do not wholly reflect genuine communication and encourage short, restricted responses (Çakır & Cengiz, 2016). Question asking has also been linked to the creative process (Acar et al., 2023; Raz et al., 2023) and has been shown to be a valid measure of creativity (Yager, 1996). However, questions are difficult to assess, due to their sometimes open-ended nature and a common emphasis on the answers, rather than the questions, that people produce. Recent advances in natural language processing have led to the rapid application of such tools to the automatic and quantitative assessment of open-ended linguistic responses, such as assessing the originality of answers to a divergent thinking creativity task (Beaty & Johnson, 2021; Dumas et al., 2021). Nevertheless, few methods exist for the automatic scoring of question asking (Jayakodi et al., 2015; Mohammed & Omar, 2020), and even fewer utilize recent advances in AI and large language models (LLMs; Gani et al., 2023; Hwang et al., 2023), which perform better than previous linguistic approaches (Vaswani et al., 2017). Yet, to our knowledge, no such computational method has been applied to open-ended questions. In the present study, we extend research on automatic complexity scoring of a creative question asking task by developing and training an LLM (RoBERTa; Liu et al., 2019) to predict human-rated complexity scores for open-ended questions generated in a creative question asking task (Raz et al., 2023).
Question asking
Ronfard et al. (2018) examined research on question asking in childhood by focusing on the epistemic function of questions, that is, the use of questions to bridge a gap in knowledge or to resolve uncertainty. Such epistemic questions can be asked about diverse topics. For example, children (and adults) request information about labels, facts, procedures, and causal mechanisms, and they do so to obtain clarification, to rule out possible hypotheses, and out of curiosity or “wonderment”. Ronfard et al. (2018) highlighted the great variability in the quality and quantity of the questions people ask, noting that question quality is influenced by precision and wording, and that developments in cognitive skills and increases in prior knowledge may allow for deeper processing and more precise questions. The authors argue that question asking is a powerful learning strategy, yet research on questions has been relatively sparse and isolated across several disciplines (e.g., Gottlieb, 2021; Nelson, 2005; Rothe et al., 2018; Raz et al., 2023; Sasson & Kenett, 2023). Ortlieb et al. (2012) further argue that the ultimate goal of education should be to advance beyond closed questioning as the sole means of assessing learners, with Raphael (1994) stating that “If you only ever use closed questions, then you are never going to encourage your learners to think” (p. 114). The researchers advocate for advancing open-ended questions, which are high-level divergent questions that encourage the learner to contemplate and explore before determining an answer (Ortlieb et al., 2012).
Open-ended vs closed-ended questions
Open-ended and closed-ended questions differ in several characteristics, especially regarding the role of respondents when answering them. Closed-ended questions limit the respondent to the set of alternative answers offered by the question and require the respondent to engage in convergent thinking (i.e., converging on a single correct solution), while open-ended questions allow respondents to express an opinion without being largely influenced by the question designer and to engage in divergent thinking (i.e., diverging on multiple possible solutions). The advantages of open-ended questions include the possibility of discovering responses that individuals give spontaneously, thereby avoiding the biases that may result from suggesting responses, as may occur in closed-ended questions (Reja et al., 2003). Open-ended questions are information-seeking oriented and thus serve the purpose of acquiring information, as they genuinely seek knowledge (Çakır & Cengiz, 2016). This is especially pertinent in education, as teachers’ questions are indispensable components of classroom discourse and play an important role in facilitating student learning (Chin & Osborne, 2008; Salmon & Barrera, 2021).
Research on teachers’ questions reveals that closed-ended questions are used more than open-ended questions in whole-class teaching (Çakır & Cengiz, 2016), a practice which has been heavily criticized (Nunan, 1987; Brock, 1986). Open-ended questions are not only important tools for engaging children in cognitively challenging conversations and promoting higher-order thinking, but they have also been found to offer linguistic advantages, as they help develop children’s vocabulary and cognitive skills (Çakır & Cengiz, 2016). Baloche (1994) and Khan and Inamullah (2011) argue that a teacher's ability to ask open-ended questions is crucial for the development of their students' creative thinking skills and for nurturing higher-level thinking. Higher-level, or complex, open questions are regarded as particularly important in fostering creativity, as they involve more elaborate and abstract ideas, such as the creation of new topics and the expression of opinions. Framing questions that are complex, open-ended, focused, and uncluttered by irrelevant information is thus believed to support higher-level thinking. Although complexity, as understood through higher-level open questions and higher-level thinking, seems to be an essential part of question asking, it is not clear how best to measure and classify it.
The Bloom taxonomy
One common approach to evaluating question complexity is the Bloom taxonomy (Bloom et al., 1956), which has been widely accepted as a guideline for designing learning objectives of differing levels of cognitive complexity (Adams, 2015; Goh et al., 2020; Omar et al., 2012). Specifically, the taxonomy includes six cognitive levels, hierarchically ordered from simple to complex (Krathwohl, 2002). The Bloom taxonomy is thought to represent a cumulative hierarchy, in which mastery of each simpler category is a prerequisite for mastery of the next, more complex one (Krathwohl, 2002). An updated version of the taxonomy includes the following levels, in ascending order of complexity (Krathwohl, 2002): Remembering (retrieving relevant knowledge from long-term memory; e.g., when or how did X happen?), Understanding (determining the meaning of instructional messages, including oral, written, and graphic communication; e.g., how would you summarize X?), Applying (carrying out or using a procedure in a given situation; e.g., how would you solve X using what you have learned?), Analyzing (breaking material into its constituent parts and detecting how the parts relate to one another; e.g., how can you make a distinction between X and Y?), Evaluating (making judgments based on criteria and standards; e.g., how would you prove or disprove X?), and Creating (putting elements together to form a novel, coherent whole; e.g., what changes would you make to solve X?).
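To make this operationalization concrete, the minimal Python sketch below maps the six revised levels onto the ordinal one-to-six complexity scale described in the next paragraph; the dictionary, helper function, and example questions are illustrative assumptions rather than code from any of the cited studies.

```python
# Minimal sketch: the six levels of the revised Bloom taxonomy (Krathwohl, 2002)
# mapped onto an ordinal 1-6 complexity score; the example questions in the
# comments are illustrative only.
BLOOM_LEVELS = {
    "Remembering":   1,  # e.g., "When did X happen?"
    "Understanding": 2,  # e.g., "How would you summarize X?"
    "Applying":      3,  # e.g., "How would you solve X using what you learned?"
    "Analyzing":     4,  # e.g., "How can you distinguish between X and Y?"
    "Evaluating":    5,  # e.g., "How would you prove or disprove X?"
    "Creating":      6,  # e.g., "What changes would you make to solve X?"
}

def bloom_score(level_label: str) -> int:
    """Convert a rated Bloom level into its numeric complexity score."""
    return BLOOM_LEVELS[level_label]

print(bloom_score("Evaluating"))  # -> 5
```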
Previous studies have applied the Bloom taxonomy to the evaluation of question complexity, assigning each question a score from one (simple) to six (complex) and thereby allowing quantitative analyses of question complexity (Oliver et al., 2004; Plack et al., 2007; Zheng et al., 2008). Several attempts at using LLMs to predict Bloom taxonomy scores have been made (Gani et al., 2023; Hwang et al., 2023). In one case, researchers automated the quality evaluation of biology and chemistry multiple-choice questions (Hwang et al., 2023). This was only partially successful, as model accuracy ranged from 25–90% depending on question type. The findings of Hwang et al. (2023) are further complicated by the fact that the human-rated Bloom scores came from a single rater, raising potential issues of rater subjectivity, and that the questions themselves were generated by GPT-3.5, which has been shown to produce questions of at least partially flawed quality (Grévisse, 2024). In contrast, Gani et al. (2023) developed a Bloom taxonomy-based exam question classification approach using an LLM and over 2,000 labeled multiple-choice exam questions as training data, achieving good accuracy (86%) compared to previous computational models. The study compared six embedding approaches to determine which is best suited for classifying examination questions according to the Bloom taxonomy. Their results showed that RoBERTa performed best, and the authors suggested that future work could test RoBERTa with larger datasets to evaluate its scalability and generalizability.
Critically, multiple-choice questions are usually closed-ended, single-solution tasks (SST; de Vink et al., 2021): they are scored dichotomously for correctness and typically require convergent thinking. Open-ended questions, by contrast, require divergent thinking and involve multiple-solution tasks (MST), which are usually evaluated in terms of fluency (number of correct solutions), flexibility (diversity of solutions), and originality (novelty of solutions). Previous research has indicated that creative thinking is more strongly related to MST than to SST performance (de Vink et al., 2021), that asking questions is a key trait of creativity and an integral part of the creative process, and that question complexity is closely related to creativity (Acar et al., 2023; Raz et al., 2023). There is thus a clear need to integrate divergent, open-ended question asking with larger datasets when developing new LLM-based approaches to automatically predict question complexity, highlighting the important role of creativity in question asking.
Question asking and creativity
Asking questions is both a key part of creativity and an important component of the creative process (Acar et al., 2023; Raz et al., 2023) that likely facilitates information-seeking behavior (Kenett et al., 2023). It has been shown to be part of the creative problem-solving process in children (Torrance, 1970). Almeida et al. (2011) have argued for the importance of critical thinking skills in higher education. They proposed that individual differences in creativity are directly related to question asking, such that students ask questions that are consistent with their creative ability. Almeida et al. (2011) implemented several teaching and learning strategies in Chemistry and Geology courses as a way of encouraging students’ questioning. The authors found that lower creativity was associated with closed, less complex questions that relate to simple facts and concepts, and higher creativity with more complex, specialized questions that reveal working hypotheses and applications of new knowledge (see also Acar et al., 2023).

Recently, Raz et al. (2023) explored the relation between open-ended question asking and creativity using the alternative questions task (AQT). The AQT requires participants to generate creative and unusual questions about common objects (e.g., pen, book, shoe), such as “who invented the first pencil?” or “what library has the most books?”. Responses were rated separately in terms of their creativity, using a 1 (not at all creative) to 5 (very creative) scoring method (Runco & Mraz, 1992; Silvia et al., 2008), and their complexity, according to the Bloom taxonomy. The authors observed good inter-rater agreement (Cronbach’s alpha > 0.7) for the subjectively rated Bloom taxonomy scores. They also found that question complexity and creativity were positively related: questions that were higher on the Bloom taxonomy (i.e., more complex) were also scored as more creative, and those that were less complex were scored as less creative. Thus, this study provided empirical evidence that open-ended question complexity and creativity are related, such that stronger creative abilities accompany stronger question asking abilities.
However, Raz et al. (2023) noted that subjective scoring of responses according to the Bloom taxonomy may suffer from the same limitations as subjectively rated creativity scores (Kaufman, 2019; Silvia et al., 2008), such as inconsistent rater agreement and rater fatigue. Thus, automating Bloom taxonomy scoring by means of computational approaches may help overcome these limitations, accelerating empirical research on question asking.
Semantic distance and creativity
Recent advances in natural language processing (NLP) tools, such as semantic distance methods, have allowed for the automated scoring of psychological tasks (e.g., Beaty & Johnson, 2021; Demszky et al., 2023; Rathje et al., 2023). This has made it possible to overcome the typical bottlenecks of human scoring, such as high labor costs (Kaufman & Baer, 2012; Kaufman et al., 2013; Barbot, 2018; Forthmann et al., 2017; Reiter-Palmon et al., 2019). Semantic distance reflects the remoteness between the meanings of two words. The metric rests on distributional semantics theory, which holds that the meaning of words can be derived from the contexts in which they appear (Firth, 1957). Words that appear in the same contexts will have similar meanings, so that the relationship between the meanings of two words can be extracted from their co-occurrence frequencies in text (e.g., books; Lenci, 2018). For example, the words “dog” and “cat” are semantically related and will often co-occur in text, while “dog” and “plane” are semantically unrelated and less likely to co-occur. Semantic distance leverages word embeddings (i.e., word vectors) to quantitatively represent and compare word meanings (Landauer et al., 1998). These embeddings can be generated in several ways, including latent semantic analysis (LSA; Deerwester et al., 1990) and computational models such as Word2Vec (Mikolov, Chen, et al., 2013) and GloVe (Pennington et al., 2014). Once embeddings are extracted, the cosine similarity between two word embeddings provides a measure of semantic similarity, and semantic distance is defined as its complement (1 - semantic similarity), indicating the dissimilarity between word pairs (Beaty & Johnson, 2021). The associative theory of creativity provides a framework that highlights the relevance of semantic distance measures for creativity assessment. Creative ideas are thought to arise from the combination of semantically “distant” concepts in memory, such that ideas containing semantically unrelated concepts will tend to be more original (Beaty & Kenett, 2023; Kenett, 2019; Mednick, 1962).
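As a minimal sketch of the computation just described, the following Python snippet derives semantic distance as one minus the cosine similarity between two word vectors; the toy three-dimensional vectors are illustrative assumptions standing in for real pre-trained embeddings (e.g., GloVe).

```python
# Minimal sketch of semantic distance between two words, given word embeddings.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Semantic distance as 1 minus cosine similarity (Beaty & Johnson, 2021)."""
    return 1.0 - cosine_similarity(u, v)

# Illustrative toy vectors: "dog" and "cat" point in similar directions,
# "dog" and "plane" do not.
dog   = np.array([0.9, 0.8, 0.1])
cat   = np.array([0.85, 0.75, 0.2])
plane = np.array([0.1, 0.2, 0.95])

print(semantic_distance(dog, cat))    # small distance -> related meanings
print(semantic_distance(dog, plane))  # larger distance -> unrelated meanings
```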
One way of calculating idea originality, termed the maximum associative distance (MAD) method, involves taking the maximal semantic distance between a given prompt, such as pencil, and all words given in response to it, i.e., possible uses (Yu et al., 2023). MAD scores have been found to robustly predict human-rated originality scores (r = .74) on a classic creativity task, the alternative uses task (AUT; Guilford, 1967), which involves the production of unusual uses for commonplace objects. MAD scores predicted human-rated AUT responses better than previous compositional approaches, which involve the addition or multiplication of semantic distances between the prompt and each word in a response (Beaty & Johnson, 2021). Further, MAD scores have been shown to correlate with human-rated Bloom taxonomy scores of AQT responses (r = .51; Raz et al., 2023).
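The MAD logic can be sketched as follows, assuming a dictionary of pre-trained word embeddings; the function names and the embedding lookup are hypothetical placeholders for illustration, not the scoring pipeline used in the cited work.

```python
# Sketch of maximum associative distance (MAD): the largest semantic distance
# between a prompt word and any word of a response, given word embeddings.
from typing import Dict, List
import numpy as np

def semantic_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Semantic distance as 1 minus cosine similarity."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mad_score(prompt: str, response_words: List[str],
              embeddings: Dict[str, np.ndarray]) -> float:
    """Return the maximal prompt-to-word distance over the response words
    that have an embedding."""
    prompt_vec = embeddings[prompt]
    distances = [semantic_distance(prompt_vec, embeddings[w])
                 for w in response_words if w in embeddings]
    return max(distances)

# e.g., mad_score("pencil", ["stir", "your", "coffee"], embeddings)
```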
A different approach to calculating semantic distance scores is divergent semantic integration (DSI; Johnson et al., 2023). This technique was developed for longer text responses, such as short stories, and involves averaging the semantic distance between all possible pairs of words in a response. In this way, DSI does not consider the prompt in its calculation, but rather captures the dissimilarity of concepts within a response. Given the extensive work done on validating semantic distance as a measure of originality (e.g., Patterson et al., 2023), both MAD and DSI scores can serve as good baselines against which to compare other automated creativity assessment tools.
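The averaging step of DSI can be sketched in the same way; this simplified illustration uses static word embeddings purely to show the pairwise-averaging logic described above and should not be read as a reimplementation of Johnson et al. (2023).

```python
# Sketch of divergent semantic integration (DSI): the average semantic
# distance over all pairs of words within a response (the prompt is not used).
from itertools import combinations
from typing import Dict, List
import numpy as np

def semantic_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Semantic distance as 1 minus cosine similarity."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dsi_score(response_words: List[str],
              embeddings: Dict[str, np.ndarray]) -> float:
    """Average pairwise semantic distance among the response words that
    have an embedding."""
    vectors = [embeddings[w] for w in response_words if w in embeddings]
    pair_distances = [semantic_distance(u, v)
                      for u, v in combinations(vectors, 2)]
    return float(np.mean(pair_distances))
```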
An additional notable advancement has come in the form of large language models (LLMs). When fine-tuned, these models have demonstrated superior performance to previous assessment tools, such as semantic distance, on a number of creative thinking tasks (Dumas et al., 2021; DiStefano et al., 2023; Luchini et al., 2023). Although several computational approaches exist for automating Bloom taxonomy scoring, such as semantic distance approaches or newer language model methods (Stevenson et al., 2022; Yu et al., 2023; Organisciak et al., 2023), recent research has shown that LLMs can go beyond semantic distance methods and that automated scoring of divergent thinking greatly improves when using an LLM as opposed to semantic distance scores (Organisciak et al., 2023).
Large Language Models (LLMs) in psychological research
LLMs are computational tools used for a variety of tasks involving language data (Vaswani et al., 2017). They are a class of deep neural networks that undergo pre-training on large amounts of text data for the purpose of understanding and generating language. These models have shown considerable success across many fields of psychology by measuring task performance nearly instantaneously (Luchini et al., 2023; Demszky et al., 2023; Hardy et al., 2023). Thus, LLMs unlock possibilities for scale and efficiency in psychological research and practice that were unimaginable in the past (Demszky et al., 2023). For example, LLMs can be used to generate experimental stimuli (Laverghetta & Licato, 2023), model word learning across the lifespan (Portelance et al., 2020), identify emotions in text (Zhang et al., 2023), predict personality traits from text (Peters & Matz, 2023), and predict the creativity of metaphors (DiStefano et al., 2023) and of responses to problem-solving tasks (Luchini et al., 2023; Dumas et al., 2023).
The wide variety of uses for LLMs can largely be attributed to their emergent abilities: capabilities that large models gain only after being exposed to vast amounts of textual data (Brown et al., 2020; Wei et al., 2022). LLMs are typically pre-trained through unsupervised learning, a training procedure that involves automatic pattern detection from unlabeled data, i.e., text that is not assigned any tag or number that the model has to predict. For LLMs, this takes the form of iterative word prediction problems, where the model is required to predict a missing word from its context, or vice versa (Jiang et al., 2020). In this way, LLMs acquire an understanding of language and are thus able to outperform previous NLP tools on tasks they were never trained on (Vaswani et al., 2017). Of note, the impressive capabilities of LLMs come at the cost of interpretability (Barredo Arrieta et al., 2020; Dale, 2021; Gunning et al., 2019; Kojima et al., 2022). The architecture of LLMs is large and complex, typically containing millions to billions of parameters (i.e., weights) that are adjusted during training and that determine the computations applied to the input. It thus becomes challenging, if not impossible, to determine how features of the input relate to the output.
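The masked word prediction objective described above can be probed directly with a pre-trained model. The sketch below uses the Hugging Face transformers fill-mask pipeline with roberta-base purely as an illustration of the pre-training task; the example sentence is an arbitrary assumption, and the snippet is unrelated to the scoring model developed in the present study.

```python
# Sketch of the masked word prediction task underlying LLM pre-training.
# Requires the Hugging Face `transformers` library (downloads roberta-base).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa marks the word to be predicted with the special <mask> token.
predictions = fill_mask("The student asked a very creative <mask>.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
```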
LLMs have recently demonstrated superior performance compared to previous NLP tools at predicting human creativity ratings of AUT responses (Organisciak et al., 2023; Stevenson et al., 2022). To achieve this performance, Organisciak et al. (2023) implemented a fine-tuning procedure, a further round of training involving supervised learning. Fine-tuning adjusts the weights determined during pre-training by exposing the model to many labeled examples (e.g., AUT responses and their human ratings). Organisciak et al. (2023) provided approximately 27,000 examples to the model and found that it robustly predicted the creativity of responses it had not seen before (r = .81). They further demonstrated that model performance remained strong even when the number of fine-tuning examples was reduced, showing that a minimum of roughly 6,000 examples was needed to achieve strong performance (r > .75). This is particularly relevant when considering the difficulty of gathering human-rated responses to a creative thinking task. Finally, Organisciak et al. (2023) showed that the fine-tuned model outperformed semantic distance scores, which correlated poorly with the human creativity ratings (r = .20).
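For illustration, a fine-tuning setup of this kind can be sketched with the transformers library as a single-output regression head on top of RoBERTa; the toy examples, hyperparameters, and data handling below are assumptions made for the sketch and do not reproduce the configuration used by Organisciak et al. (2023).

```python
# Sketch: fine-tuning RoBERTa to predict a continuous human rating per response.
# The two toy (text, rating) pairs are placeholders for a real rated dataset.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["Who invented the first pencil?", "What library has the most books?"]
ratings = [2.0, 3.0]  # illustrative human-rated scores

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression")

class RatingDataset(torch.utils.data.Dataset):
    """Wraps (text, rating) pairs as tokenized inputs with float labels."""
    def __init__(self, texts, ratings):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.ratings = ratings
    def __len__(self):
        return len(self.ratings)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.ratings[i], dtype=torch.float)
        return item

args = TrainingArguments(output_dir="rating-model", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=RatingDataset(texts, ratings)).train()
```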
The Present Study
The present study aimed to address the gap in the literature on creative question asking and the role of complexity by developing an LLM capable of scoring participants’ open-ended questions from a divergent creative question asking task according to the Bloom taxonomy. This was done to advance the availability, cost-effectiveness, and reliability of question complexity and creativity scoring, to highlight the advantages of LLMs in education and psychology, and to demonstrate their potential for studying how we ask creative questions. The model was trained on more than ten thousand human-rated responses to the alternative questions task (AQT), a creative question asking task introduced by Raz et al. (2023). Responses were questions asked about everyday objects, spanning a total of six items taken from the suggested items provided by Beaty et al. (2022). To evaluate model performance, its predictions were compared to three other scoring methods: elaboration (i.e., word count) and two semantic distance methods (MAD and DSI), which reliably predict human creativity ratings in divergent thinking tasks (Luchini et al., 2023).
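To make this comparison concrete, the following sketch evaluates hypothetical predicted scores against human-rated Bloom scores and the baseline measures via Pearson correlations; all numbers, questions, and variable names are illustrative placeholders, not data or results from the present study.

```python
# Sketch: correlating automated scores with human-rated Bloom complexity.
# All values below are illustrative placeholders.
from scipy.stats import pearsonr

questions = ["when was it made?", "who invented the first pencil?",
             "how would you prove this pen writes upside down?",
             "what changes would make this book obsolete?",
             "what color is it?", "how could a shoe redesign reduce waste?"]

human_scores = [1, 2, 5, 6, 1, 6]                  # human-rated Bloom scores
llm_scores   = [1.3, 2.2, 4.8, 5.5, 1.4, 5.7]      # fine-tuned model predictions
mad_scores   = [0.55, 0.61, 0.83, 0.88, 0.52, 0.85]
dsi_scores   = [0.50, 0.58, 0.80, 0.85, 0.48, 0.82]
elab_scores  = [len(q.split()) for q in questions]  # elaboration = word count

for name, scores in [("LLM", llm_scores), ("MAD", mad_scores),
                     ("DSI", dsi_scores), ("Elaboration", elab_scores)]:
    r, p = pearsonr(human_scores, scores)
    print(f"{name}: r = {r:.2f}, p = {p:.3f}")
```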