Question asking is a common, everyday activity in which we spend considerable time engaged. Yet only a very rudimentary technical understanding of question asking currently exists (Kearsley, 1976). Question asking is central to learning (Chin & Osborne, 2008; Salmon & Barrera, 2021) and is an important component of educational programs (Chin & Brown, 2002). This is especially true for open-ended questions, which encourage longer and syntactically more complex answers, as opposed to closed-ended questions, which do not wholly reflect genuine communication and encourage short, restricted responses (Çakır & Cengiz, 2016). Question asking has also been linked to the creative process (Acar et al., 2023; Raz et al., 2023) and has been shown to be a valid measure of creativity (Yager, 1996). However, questions are difficult to assess, due to their sometimes open-ended nature and a common emphasis on the answers, rather than the questions, that people produce. Recent advances in natural language processing have led to the rapid application of such tools to the automatic and quantitative assessment of open-ended linguistic responses, such as assessing the originality of answers to a divergent thinking creativity task (Beaty & Johnson, 2021; Dumas et al., 2021). Nevertheless, few methods exist for the automatic scoring of question asking (Jayakodi et al., 2015; Mohammed & Omar, 2020), and even fewer utilize recent advances in AI and large language models (LLMs; Gani et al., 2023; Hwang et al., 2023), which perform better than previous linguistic approaches (Vaswani et al., 2017). Yet, to our knowledge, no such computational method has been applied to open-ended questions. In the present study, we extend research on automatic complexity scoring of a creative question asking task by developing and training an LLM (RoBERTa; Liu et al., 2019) to predict human-rated complexity scores for open-ended questions generated in a creative question asking task (Raz et al., 2023).
Question asking
Ronfard et al. (2018) examined research on question asking in childhood by focusing on the epistemic function of questions, that is, the use of questions to bridge a gap in knowledge or to resolve uncertainty. Such epistemic questions can be asked about diverse topics. For example, children (and adults) request information about labels, facts, procedures, and causal mechanisms, and they do so to obtain clarification, to rule out possible hypotheses, and out of curiosity or “wonderment”. Ronfard et al. (2018) highlighted the great variability in the quality and quantity of the questions people ask, noting that question quality is influenced by precision and wording, and that developments in cognitive skills and increases in prior knowledge may allow for deeper processing and more precise questions. The authors argue that question asking is a powerful learning strategy, yet research on questions has been relatively sparse and isolated across several disciplines (e.g., Gottlieb, 2021; Nelson, 2005; Rothe et al., 2018; Raz et al., 2023; Sasson & Kenett, 2023). Ortlieb et al. (2012) further argue that the ultimate goal of education should be to advance beyond closed questioning as the sole means of assessing learners, with Raphael (1994) stating that “If you only ever use closed questions, then you are never going to encourage your learners to think” (p. 114). The researchers advocate for advancing open-ended questions, which are high-level divergent questions that encourage the learner to contemplate and explore before determining an answer (Ortlieb et al., 2012).
Open-ended vs closed-ended questions
Open-ended and closed-ended questions differ in several characteristics, especially regarding the role of respondents when answering them. Closed-ended questions limit the respondent to the set of alternative answers offered by the question and require the respondent to engage in convergent thinking (i.e., converging on a single correct solution), while open-ended questions allow respondents to express an opinion without being largely influenced by the question designer and to engage in divergent thinking (i.e., diverging on multiple possible solutions). The advantages of open-ended questions include the possibility of discovering responses that individuals give spontaneously, thereby avoiding the biases that may result from suggesting responses, as may occur in closed-ended questions (Reja et al., 2003). Open-ended questions are information-seeking oriented and thus serve the purpose of acquiring information, as they genuinely seek knowledge (Çakır & Cengiz, 2016). This is especially pertinent in education, as teachers’ questions are indispensable components of classroom discourse and play an important role in facilitating student learning (Chin & Osborne, 2008; Salmon & Barrera, 2021).
Research on teachers’ questions reveals that closed-ended questions are used more than open-ended questions in whole-class teaching (Çakır & Cengiz, 2016), a practice which has been heavily criticized (Nunan, 1987; Brock, 1986). Open-ended questions are not only important tools for engaging children in cognitively challenging conversations and promoting higher-order thinking, but they have also been found to offer linguistic advantages, as they help develop children’s vocabulary and cognitive skills (Çakır & Cengiz, 2016). Baloche (1994) and Khan and Inamullah (2011) argue that a teacher's ability to ask open-ended questions is crucial for the development of their students' creative thinking skills and for nurturing higher-level thinking. Higher-level, or complex, open questions are regarded as particularly important in fostering creativity, as they involve more elaborate and abstract ideas, such as the creation of new topics and the expression of opinions. Framing questions that are complex, open-ended, focused, and uncluttered by irrelevant information is thus believed to support higher-level thinking. Although complexity, as understood through higher-level open questions and higher-level thinking, seems to be an essential part of question asking, it is not clear how best to measure and classify it.
The Bloom taxonomy
One common approach to evaluating question complexity is the Bloom taxonomy (Bloom et al., 1956), which has been widely accepted as a guideline for designing learning objectives of differing levels of cognitive complexity (Adams, 2015; Goh et al., 2020; Omar et al., 2012). Specifically, the taxonomy includes six cognitive levels, hierarchically ordered from simple to complex (Krathwohl, 2002). The Bloom taxonomy is thought to represent a cumulative hierarchy, in which mastery of each simpler category is a prerequisite for mastery of the next, more complex one (Krathwohl, 2002). An updated version of the taxonomy includes the following levels, in ascending order of complexity (Krathwohl, 2002): Remembering (retrieving relevant knowledge from long-term memory; e.g., when or how did X happen?), Understanding (determining the meaning of instructional messages, including oral, written, and graphic communication; e.g., how would you summarize X?), Applying (carrying out or using a procedure in a given situation; e.g., how would you solve X using what you have learned?), Analyzing (breaking material into its constituent parts and detecting how the parts relate to one another; e.g., how can you make a distinction between X and Y?), Evaluating (making judgments based on criteria and standards; e.g., how would you prove or disprove X?), and Creating (putting elements together to form a novel, coherent whole; e.g., what changes would you make to solve X?).
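To make this operationalization concrete, the minimal Python sketch below maps the six revised levels onto the ordinal one-to-six complexity scale described in the next paragraph; the dictionary, helper function, and example questions are illustrative assumptions rather than code from any of the cited studies.

```python
# Minimal sketch: the six levels of the revised Bloom taxonomy (Krathwohl, 2002)
# mapped onto an ordinal 1-6 complexity score; the example questions in the
# comments are illustrative only.
BLOOM_LEVELS = {
    "Remembering":   1,  # e.g., "When did X happen?"
    "Understanding": 2,  # e.g., "How would you summarize X?"
    "Applying":      3,  # e.g., "How would you solve X using what you learned?"
    "Analyzing":     4,  # e.g., "How can you distinguish between X and Y?"
    "Evaluating":    5,  # e.g., "How would you prove or disprove X?"
    "Creating":      6,  # e.g., "What changes would you make to solve X?"
}

def bloom_score(level_label: str) -> int:
    """Convert a rated Bloom level into its numeric complexity score."""
    return BLOOM_LEVELS[level_label]

print(bloom_score("Evaluating"))  # -> 5
```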
Previous studies have applied the Bloom taxonomy to the evaluation of question complexity, assigning each question a score from one (simple) to six (complex) and thereby allowing quantitative analyses of question complexity (Oliver et al., 2004; Plack et al., 2007; Zheng et al., 2008). Several attempts at using LLMs to predict Bloom taxonomy scores have been made (Gani et al., 2023; Hwang et al., 2023). In one case, researchers automated the quality evaluation of biology and chemistry multiple-choice questions (Hwang et al., 2023). This was only partially successful, as model accuracy ranged from 25–90% depending on question type. The findings of Hwang et al. (2023) are further complicated by the fact that the human-rated Bloom scores came from a single rater, raising potential issues of rater subjectivity, and that the questions themselves were generated by GPT-3.5, which has been shown to produce questions of at least partially flawed quality (Grévisse, 2024). In contrast, Gani et al. (2023) developed a Bloom taxonomy-based exam question classification approach using an LLM and over 2,000 labeled multiple-choice exam questions as training data, achieving good accuracy (86%) compared to previous computational models. The study compared six embedding approaches to determine which is best suited for classifying examination questions according to the Bloom taxonomy. Their results showed that RoBERTa performed best, and the authors suggested that future work could test RoBERTa with larger datasets to evaluate its scalability and generalizability.
Critically, multiple-choice questions are usually closed-ended, single-solution tasks (SST; de Vink et al., 2021): they are scored dichotomously for correctness and typically require convergent thinking. Open-ended questions, by contrast, require divergent thinking and involve multiple-solution tasks (MST), which are usually evaluated in terms of fluency (number of correct solutions), flexibility (diversity of solutions), and originality (novelty of solutions). Previous research has indicated that creative thinking is more strongly related to MST than to SST performance (de Vink et al., 2021), that asking questions is a key trait of creativity and an integral part of the creative process, and that question complexity is closely related to creativity (Acar et al., 2023; Raz et al., 2023). There is thus a clear need to integrate divergent, open-ended question asking with larger datasets when developing new LLM-based approaches to automatically predict question complexity, highlighting the important role of creativity in question asking.
Question asking and creativity
Asking questions is both a key part of creativity and an important component of the creative process (Acar et al., 2023; Raz et al., 2023) that likely facilitates information-seeking behavior (Kenett et al., 2023). It has been shown to be part of the creative problem-solving process in children (Torrance, 1970). Almeida et al. (2011) have argued for the importance of critical thinking skills in higher education. They proposed that individual differences in creativity are directly related to question asking, such that students ask questions that are consistent with their creative ability. Almeida et al. (2011) implemented several teaching and learning strategies in Chemistry and Geology courses as a way of encouraging students’ questioning. The authors found that lower creativity was associated with closed, less complex questions that relate to simple facts and concepts, and higher creativity with more complex, specialized questions that reveal working hypotheses and applications of new knowledge (see also Acar et al., 2023).

Recently, Raz et al. (2023) explored the relation between open-ended question asking and creativity using the alternative questions task (AQT). The AQT requires participants to generate creative and unusual questions about common objects (e.g., pen, book, shoe), such as “who invented the first pencil?” or “what library has the most books?”. Responses were rated separately in terms of their creativity, using a 1 (not at all creative) to 5 (very creative) scoring method (Runco & Mraz, 1992; Silvia et al., 2008), and their complexity, according to the Bloom taxonomy. The authors observed good inter-rater agreement (Cronbach’s alpha > 0.7) for the subjectively rated Bloom taxonomy scores. They also found that question complexity and creativity were positively related: questions that were higher on the Bloom taxonomy (i.e., more complex) were also scored as more creative, and those that were less complex were scored as less creative. Thus, this study provided empirical evidence that open-ended question complexity and creativity are related, such that stronger creative abilities accompany stronger question asking abilities.
However, Raz et al. (2023) noted that subjective scoring of responses according to the Bloom taxonomy may suffer from the same limitations as subjectively rated creativity scores (Kaufman, 2019; Silvia et al., 2008), such as inconsistent rater agreement and rater fatigue. Thus, automating Bloom taxonomy scoring by means of computational approaches may help overcome these limitations, accelerating empirical research on question asking.
Semantic distance and creativity
Recent advances in natural language processing (NLP) tools, such as semantic distance methods, have allowed for the automated scoring of psychological tasks (e.g., Beaty & Johnson, 2021; Demszky et al., 2023; Rathje et al., 2023). This has made it possible to overcome the typical bottlenecks of human scoring, such as high labor costs (Kaufman & Baer, 2012; Kaufman et al., 2013; Barbot, 2018; Forthmann et al., 2017; Reiter-Palmon et al., 2019). Semantic distance reflects the remoteness between the meanings of two words. The metric rests on distributional semantics theory, which holds that the meaning of words can be derived from the contexts in which they appear (Firth, 1957). Words that appear in the same contexts will have similar meanings, so that the relationship between the meanings of two words can be extracted from their co-occurrence frequencies in text (e.g., books; Lenci, 2018). For example, the words “dog” and “cat” are semantically related and will often co-occur in text, while “dog” and “plane” are semantically unrelated and less likely to co-occur. Semantic distance leverages word embeddings (i.e., word vectors) to quantitatively represent and compare word meanings (Landauer et al., 1998). These embeddings can be generated in several ways, including latent semantic analysis (LSA; Deerwester et al., 1990) and computational models such as Word2Vec (Mikolov, Chen, et al., 2013) and GloVe (Pennington et al., 2014). Once embeddings are extracted, the cosine similarity between two word embeddings provides a measure of semantic similarity, and semantic distance is defined as its complement (1 - semantic similarity), indicating the dissimilarity between word pairs (Beaty & Johnson, 2021). The associative theory of creativity provides a framework that highlights the relevance of semantic distance measures for creativity assessment. Creative ideas are thought to arise from the combination of semantically “distant” concepts in memory, such that ideas containing semantically unrelated concepts will tend to be more original (Beaty & Kenett, 2023; Kenett, 2019; Mednick, 1962).
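As a minimal sketch of the computation just described, the following Python snippet derives semantic distance as one minus the cosine similarity between two word vectors; the toy three-dimensional vectors are illustrative assumptions standing in for real pre-trained embeddings (e.g., GloVe).

```python
# Minimal sketch of semantic distance between two words, given word embeddings.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Semantic distance as 1 minus cosine similarity (Beaty & Johnson, 2021)."""
    return 1.0 - cosine_similarity(u, v)

# Illustrative toy vectors: "dog" and "cat" point in similar directions,
# "dog" and "plane" do not.
dog   = np.array([0.9, 0.8, 0.1])
cat   = np.array([0.85, 0.75, 0.2])
plane = np.array([0.1, 0.2, 0.95])

print(semantic_distance(dog, cat))    # small distance -> related meanings
print(semantic_distance(dog, plane))  # larger distance -> unrelated meanings
```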
One way of calculating idea originality, termed the maximum associative distance (MAD) method, involves taking the maximal semantic distance between a given prompt, such as pencil, and all words given in response to it, i.e., possible uses (Yu et al., 2023). MAD scores have been found to robustly predict human-rated originality scores (r = .74) on a classic creativity task, the alternative uses task (AUT; Guilford, 1967), which involves the production of unusual uses for commonplace objects. MAD scores predicted human-rated AUT responses better than previous compositional approaches, which involve the addition or multiplication of semantic distances between the prompt and each word in a response (Beaty & Johnson, 2021). Further, MAD scores have been shown to correlate with human-rated Bloom taxonomy scores of AQT responses (r = .51; Raz et al., 2023).
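The MAD logic can be sketched as follows, assuming a dictionary of pre-trained word embeddings; the function names and the embedding lookup are hypothetical placeholders for illustration, not the scoring pipeline used in the cited work.

```python
# Sketch of maximum associative distance (MAD): the largest semantic distance
# between a prompt word and any word of a response, given word embeddings.
from typing import Dict, List
import numpy as np

def semantic_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Semantic distance as 1 minus cosine similarity."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mad_score(prompt: str, response_words: List[str],
              embeddings: Dict[str, np.ndarray]) -> float:
    """Return the maximal prompt-to-word distance over the response words
    that have an embedding."""
    prompt_vec = embeddings[prompt]
    distances = [semantic_distance(prompt_vec, embeddings[w])
                 for w in response_words if w in embeddings]
    return max(distances)

# e.g., mad_score("pencil", ["stir", "your", "coffee"], embeddings)
```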
A different approach to calculating semantic distance scores is divergent semantic integration (DSI; Johnson et al., 2023). This technique was developed for longer text responses, such as short stories, and involves averaging the semantic distance between all possible pairs of words in a response. In this way, DSI does not consider the prompt in its calculation, but rather captures the dissimilarity of concepts within a response. Given the extensive work done on validating semantic distance as a measure of originality (e.g., Patterson et al., 2023), both MAD and DSI scores can serve as good baselines against which to compare other automated creativity assessment tools.
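The averaging step of DSI can be sketched in the same way; this simplified illustration uses static word embeddings purely to show the pairwise-averaging logic described above and should not be read as a reimplementation of Johnson et al. (2023).

```python
# Sketch of divergent semantic integration (DSI): the average semantic
# distance over all pairs of words within a response (the prompt is not used).
from itertools import combinations
from typing import Dict, List
import numpy as np

def semantic_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Semantic distance as 1 minus cosine similarity."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dsi_score(response_words: List[str],
              embeddings: Dict[str, np.ndarray]) -> float:
    """Average pairwise semantic distance among the response words that
    have an embedding."""
    vectors = [embeddings[w] for w in response_words if w in embeddings]
    pair_distances = [semantic_distance(u, v)
                      for u, v in combinations(vectors, 2)]
    return float(np.mean(pair_distances))
```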
An additional notable advancement has come in the form of large language models (LLMs). When fine-tuned, these models have demonstrated superior performance to previous assessment tools, such as semantic distance, on a number of creative thinking tasks (Dumas et al., 2021; DiStefano et al., 2023; Luchini et al., 2023). Although several computational approaches exist for automating Bloom taxonomy scoring, such as semantic distance approaches or newer language model methods (Stevenson et al., 2022; Yu et al., 2023; Organisciak et al., 2023), recent research has shown that LLMs can go beyond semantic distance methods and that automated scoring of divergent thinking greatly improves when using an LLM as opposed to semantic distance scores (Organisciak et al., 2023).
Large Language Models (LLMs) in psychological research
LLMs are computational tools used for a variety of tasks involving language data (Vaswani et al., 2017). They are a class of deep neural networks that undergo pre-training on large amounts of text data for the purpose of understanding and generating language. These models have shown considerable success across many fields of psychology by measuring task performance nearly instantaneously (Luchini et al., 2023; Demszky et al., 2023; Hardy et al., 2023). Thus, LLMs unlock possibilities for scale and efficiency in psychological research and practice that were unimaginable in the past (Demszky et al., 2023). For example, LLMs can be used to generate experimental stimuli (Laverghetta & Licato, 2023), model word learning across the lifespan (Portelance et al., 2020), identify emotions in text (Zhang et al., 2023), predict personality traits from text (Peters & Matz, 2023), and predict the creativity of metaphors (DiStefano et al., 2023) and of responses to problem-solving tasks (Luchini et al., 2023; Dumas et al., 2023).
The wide variety of uses for LLMs can largely be attributed to their emergent abilities: capabilities that large models gain only after being exposed to vast amounts of textual data (Brown et al., 2020; Wei et al., 2022). LLMs are typically pre-trained through unsupervised learning, a training procedure that involves automatic pattern detection from unlabeled data, i.e., text that is not assigned any tag or number that the model has to predict. For LLMs, this takes the form of iterative word prediction problems, where the model is required to predict a missing word from its context, or vice versa (Jiang et al., 2020). In this way, LLMs acquire an understanding of language and are thus able to outperform previous NLP tools on tasks they were never trained on (Vaswani et al., 2017). Of note, the impressive capabilities of LLMs come at the cost of interpretability (Barredo Arrieta et al., 2020; Dale, 2021; Gunning et al., 2019; Kojima et al., 2022). The architecture of LLMs is large and complex, typically containing millions to billions of parameters (i.e., weights) that are adjusted during training and that determine the computations applied to the input. It thus becomes challenging, if not impossible, to determine how features of the input relate to the output.
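The masked word prediction objective described above can be probed directly with a pre-trained model. The sketch below uses the Hugging Face transformers fill-mask pipeline with roberta-base purely as an illustration of the pre-training task; the example sentence is an arbitrary assumption, and the snippet is unrelated to the scoring model developed in the present study.

```python
# Sketch of the masked word prediction task underlying LLM pre-training.
# Requires the Hugging Face `transformers` library (downloads roberta-base).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa marks the word to be predicted with the special <mask> token.
predictions = fill_mask("The student asked a very creative <mask>.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
```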
LLMs have recently demonstrated superior performance compared to previous NLP tools at predicting human creativity ratings of AUT responses (Organisciak et al., 2023; Stevenson et al., 2022). To achieve this performance, Organisciak et al. (2023) implemented a fine-tuning procedure, a further round of training involving supervised learning. Fine-tuning adjusts the weights determined during pre-training by exposing the model to many labeled examples (e.g., AUT responses and their human ratings). Organisciak et al. (2023) provided approximately 27,000 examples to the model and found that it robustly predicted the creativity of responses it had not seen before (r = .81). They further demonstrated that model performance remained strong even when the number of fine-tuning examples was reduced, showing that a minimum of roughly 6,000 examples was needed to achieve strong performance (r > .75). This is particularly relevant when considering the difficulty of gathering human-rated responses to a creative thinking task. Finally, Organisciak et al. (2023) showed that the fine-tuned model outperformed semantic distance scores, which correlated poorly with the human creativity ratings (r = .20).
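For illustration, a fine-tuning setup of this kind can be sketched with the transformers library as a single-output regression head on top of RoBERTa; the toy examples, hyperparameters, and data handling below are assumptions made for the sketch and do not reproduce the configuration used by Organisciak et al. (2023).

```python
# Sketch: fine-tuning RoBERTa to predict a continuous human rating per response.
# The two toy (text, rating) pairs are placeholders for a real rated dataset.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["Who invented the first pencil?", "What library has the most books?"]
ratings = [2.0, 3.0]  # illustrative human-rated scores

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression")

class RatingDataset(torch.utils.data.Dataset):
    """Wraps (text, rating) pairs as tokenized inputs with float labels."""
    def __init__(self, texts, ratings):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.ratings = ratings
    def __len__(self):
        return len(self.ratings)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.ratings[i], dtype=torch.float)
        return item

args = TrainingArguments(output_dir="rating-model", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=RatingDataset(texts, ratings)).train()
```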
The Present Study
The present study aimed to address the gap in the literature on creative question asking and the role of complexity by developing an LLM capable of scoring participants’ open-ended questions from a divergent creative question asking task according to the Bloom taxonomy. This was done to advance the availability, cost-effectiveness, and reliability of question complexity and creativity scoring, to highlight the advantages of LLMs in education and psychology, and to demonstrate their potential for studying how we ask creative questions. The model was trained on more than ten thousand human-rated responses to the alternative questions task (AQT), a creative question asking task introduced by Raz et al. (2023). Responses were questions asked about everyday objects, spanning a total of six items taken from the suggested items provided by Beaty et al. (2022). To evaluate model performance, its predictions were compared to three other scoring methods: elaboration (i.e., word count) and two semantic distance methods (MAD and DSI), which reliably predict human creativity ratings in divergent thinking tasks (Luchini et al., 2023).
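To make this comparison concrete, the following sketch evaluates hypothetical predicted scores against human-rated Bloom scores and the baseline measures via Pearson correlations; all numbers, questions, and variable names are illustrative placeholders, not data or results from the present study.

```python
# Sketch: correlating automated scores with human-rated Bloom complexity.
# All values below are illustrative placeholders.
from scipy.stats import pearsonr

questions = ["when was it made?", "who invented the first pencil?",
             "how would you prove this pen writes upside down?",
             "what changes would make this book obsolete?",
             "what color is it?", "how could a shoe redesign reduce waste?"]

human_scores = [1, 2, 5, 6, 1, 6]                  # human-rated Bloom scores
llm_scores   = [1.3, 2.2, 4.8, 5.5, 1.4, 5.7]      # fine-tuned model predictions
mad_scores   = [0.55, 0.61, 0.83, 0.88, 0.52, 0.85]
dsi_scores   = [0.50, 0.58, 0.80, 0.85, 0.48, 0.82]
elab_scores  = [len(q.split()) for q in questions]  # elaboration = word count

for name, scores in [("LLM", llm_scores), ("MAD", mad_scores),
                     ("DSI", dsi_scores), ("Elaboration", elab_scores)]:
    r, p = pearsonr(human_scores, scores)
    print(f"{name}: r = {r:.2f}, p = {p:.3f}")
```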