Resource Availability
Materials availability
All test items used in the study have been made available on the Open Science Framework (OSF). This test set includes both original items sourced from previous papers and new items generated for the purpose of testing ChatGPT.
Data and code availability
A dataset of ChatGPT responses has been made publicly available as of the date of publication and can be found in the same OSF repository as the analysis code. This dataset includes the text of all ChatGPT responses recorded in the study, along with the numerical score assigned to each statement. The data used for all analyses are included as separate CSV files, and all analyses were conducted in R 34. All information about the R environment and libraries, along with the code used for the analyses and figures in this study, is made available as part of the Supplementary Material as a Markdown file in the OSF repository. The link to the OSF repository is: https://osf.io/fwj6v/?view_only=0b3619922c9142e9a0653f258156f1a5.
Experimental Model Details
Two versions of OpenAI’s GPT were tested in April 2023 over a period of three weeks: version 3.5, which was the Default model at the time, and version 4, which was the state-of-the-art model with enhanced reasoning, creativity, and comprehension relative to previous models 35,36. Each test was delivered in a separate chat: GPT is capable of learning within a chat session, as it can remember both its own and the user’s previous messages to adapt its responses accordingly, but it does not retain this memory across new chats. As such, each new iteration of a test may be considered a blank slate with a new naive participant.
For each test we collected 15 sessions for each LLM. A session involved delivering all items of a single test within the same chat window. GPT-4 was subject to a 25-message limit per 3 hours, and so to minimise interference a single experimenter delivered all tests for GPT-4, while four other experimenters shared the duty of collecting responses from GPT-3.5.
For the human baseline data, participants were recruited online through the Prolific platform and the study was hosted on SoSci. We recruited native English speakers between the ages of 18 and 70 with no history of psychiatric conditions and, in particular, no history of dyslexia. Further demographic data were not collected. We aimed to collect around 50 participants per test and, due to errors on Prolific’s side, ended up with 51 participants in the False Belief, Faux Pas, and Irony comprehension tests, and 50 in the Hinting and Strange Stories tests. Following examination of the responses, we excluded two subjects in the Hinting condition whose responses indicated that they had used GPT to answer the questions, and we excluded one further participant from the Irony comprehension test who responded “Yes” to every question. The final human sample comprised 250 participants (Hinting: 48; False Belief: 51; Faux Pas: 51; Strange Stories: 50; Irony comprehension: 50). The research was approved by the local ethics committee (ASL 3 Genovese) and was carried out in accordance with the principles of the revised Helsinki Declaration. All participants provided informed consent through the online survey and received monetary compensation for their participation at a rate of £12/hr.
Method Details: Main Study
Theory of Mind tests
We selected a series of tests typically used in evaluating Theory of Mind capacity in human participants.
Hinting Task
The Hinting Task 13 assesses the understanding of indirect speech requests through 10 vignettes, presented sequentially, that depict everyday social interactions. Each vignette ends with a remark that can be interpreted as a hint. For example, “Rebecca’s birthday is approaching. She says to her Dad, ‘I love animals, especially dogs’. What does Rebecca really want her dad to do?”
A correct response identifies both the intended meaning of the remark and the action that it is attempting to elicit (e.g. “Rebecca is trying to indirectly tell her father that she wants a dog or something dog-themed as a birthday present”). In the original test, if the participant failed to answer the question fully the first time, they were prompted with additional questioning 13,37. In our adapted implementation, we removed this additional questioning and coded responses as a binary (1 = correct; 0 = incorrect) using the evaluation criteria listed in Gil et al. 37. Note that this change meant that borderline cases, where a response showed rational mentalising about the character’s mental state without explicitly articulating the indirectly requested action, were coded as failures rather than mixed successes; as such, our scores are more conservative estimates of hint comprehension than those in previous studies.
In addition to 10 original items sourced from Corcoran 13, we generated a further six new hinting vignettes, resulting in 16 items overall.
False Belief
False Belief tests assess the ability to infer that another person possesses knowledge that may differ from the participant’s own (true) knowledge of the world. These tests consist of vignettes that follow a particular structure: Character A and Character B are together, Character A deposits an item inside a hidden location (e.g. a box), Character A leaves, Character B moves the item to a second hidden location (e.g. a cupboard), and then Character A returns. The question asked to the participant is: when Character A returns, will they look for the item in the new location (where it truly is, matching the participant’s true belief), or the old location (where it was, matching Character A’s false belief)?
In addition to the False Belief condition, tests also use a True Belief control condition, where rather than moving the item that Character A hid, Character B moves a different item to a new location. This is important for interpreting failures of false belief attribution, as it ensures that any such failures are not simply due to a recency effect (reporting the last location mentioned in the story) but instead reflect a genuine failure of belief tracking.
We adapted four False/True Belief scenarios from the Sandbox Task used by Bernstein 14 and generated three new stories, each with False and True Belief versions. Two story lists were generated for this test such that each story only appeared once within a testing session, and alternated between False and True Belief depending on the session. In addition to the standard False/True Belief scenarios, two additional catch stories were tested that involved minor alterations to the story structure. The results of these items are not reported here as they go beyond the goals of the current study.
Faux Pas
The Faux Pas test 16 presents a context in which one character makes an utterance that is unintentionally offensive to the listener because the speaker does not know or does not remember some key piece of information. For example, “James bought Richard a toy aeroplane for his birthday. A few months later, they were playing with it and James accidentally dropped it. ‘Don’t worry,’ said Richard, ‘I never liked it anyway. Someone gave it to me for my birthday.’”
Following the presentation of the scenario, we presented four questions:
- In the story did someone say something that they should not have said? [Always the same question for every item. The correct answer is always Yes]
- What did they say that they should not have said? [Always the same question for every item. The correct answer changes for each item, e.g. “I never liked it anyway. Someone gave it to me for my birthday.”]
- What did James give Richard for his birthday? [Question changes for every item: tests for comprehension of the story]
- Did Richard remember James had given him the toy aeroplane for his birthday? [Question changes for every item: tests for awareness of the speaker’s false belief. The correct answer is always No]
These questions were asked at the same time as the story was presented. Under the original coding criteria, participants must answer all four questions correctly for their answer to be considered correct. However, in the current study we were primarily interested in the response to the final question, which tests whether the responder understood the speaker’s mental state. When examining the human baseline data, we noticed that several participants responded incorrectly to the first question due to an apparent unwillingness to attribute blame to the speaker (e.g. “No, he didn’t say anything wrong because he forgot”). In order to focus on the aspect of faux pas understanding that was most relevant to the current study, we restricted our coding to the last question only (1 = correct if the answer was No; 0 for anything else; see Supplementary Material §S1 for an alternative coding that follows the original criteria, as well as a recoding in which responses that mentioned the correct answer as a possible explanation without explicitly endorsing it were coded as correct).
As well as the 10 original items used in Baron-Cohen et al. 16, we generated five new items for this test, resulting in 15 items overall.
One of the original items used in the test battery turned out to be worded in a way that made it difficult to apply the intended coding criteria. The item read as follows:
All of the class took part in a story competition. Emma really wanted to win. Whilst she was away from school, the results of the competition were announced: Alice was the winner. The next day, Alice saw Emma and said "I'm sorry about your story." "What do you mean?" said Emma. "Oh nothing," said Alice.
The final question was:
Did Alice realize that Emma hadn't heard the results of the competition?
Given the wording of other items, it is clear that the intended implication of this question is whether Alice realised that Emma had not heard the results when she uttered the sentence, for which the answer is always No. However, an equally appropriate interpretation is whether Alice came to this realisation at any point in the story, in which case the answer is Yes. Both humans and LLMs provided answers that reflected this latter interpretation, which (for this item only) were coded as correct responses. The overall pattern of results remained consistent when this item was removed from analysis.
Strange Stories
The Strange Stories 17,18 offer a means of testing more advanced mentalising abilities such as reasoning about misdirection, manipulation, lying, and misunderstanding, as well as second- or higher-order mental states (e.g. A knows that B believes X…). The advanced abilities that these stories measure make them suitable for testing higher functioning children and adults. In this test, subjects are presented with a short vignette and are asked to explain why a character says or does something that is not literally true. For example,
During the war, the Red army captures a member of the Blue army. They want him to tell them where his army’s tanks are; they know they are either by the sea or in the mountains. They know that the prisoner will not want to tell them, he will want to save his army, and so he will certainly lie to them. The prisoner is very brave and very clever, he will not let them find his tanks. The tanks are really in the mountains. Now when the other side asks him where his tanks are, he says, ‘They are in the mountains.’ Why did the prisoner say that?
Each question comes with a specific set of coding criteria, and responses can be awarded 0, 1, or 2 points depending on how fully they explain the utterance and whether or not they do so in mentalistic terms 18. See Supplementary Material §S3 for a description of the frequency of partial successes.
In addition to the eight original mental stories, we generated four new stories, resulting in 12 items overall. The maximum number of points possible was 24, and individual session scores were converted to a proportional score for analysis.
The original Strange Stories also include a series of control stories, among them a Human Physical control condition in which the descriptions similarly involved people saying or doing something that had to be explained, but where the explanation was physically rather than socially determined. This condition (8 items) was included as a control in two testing sessions, in case poor performance on the mental questions was due to the complexity of the stories, but was not retained further as it was beyond the scope of the current study.
Irony
The ability to comprehend irony has been specifically mentioned in connection with AI and LLMs 2. Comprehending an ironic remark requires inferring the true meaning of an utterance (typically the opposite of what is said) and detecting the speaker’s mocking attitude.
Irony comprehension items were adapted from an eye-tracking study 19 in which participants read vignettes where a character made an ironic or non-ironic statement. Twelve items that had been used as comprehension checks in the original study were taken from these stimuli. Items were abbreviated so that they ended immediately after the ironic or non-ironic utterance:
Cheryl noticed there were no flowers by Lisa’s bed. ‘I see your boyfriend really cares about you being in hospital’, she exclaimed. Did Cheryl think Lisa’s boyfriend cared about Lisa?
Two story lists were generated for this test such that each story only appeared once within a testing session, and alternated between ironic and non-ironic depending on the session. Responses were coded as a simple binary (1 = correct; 0 = incorrect). Although the responses on this test were a straightforward Yes/No, during coding we noted some inconsistencies in the formulation of both GPT models’ responses: in response to the question “Did Cheryl think Lisa’s boyfriend cared about Lisa?” in the ironic condition, they might respond with “Yes, Cheryl did not think Lisa’s boyfriend cared.” Such internally contradictory responses, where the models gave a Yes or No that was incompatible with the explanation that followed, were coded based on whether or not the explanation showed appreciation of the irony. The linguistic failures of these models in generating a coherent answer are not of direct interest to the current study, as these failures (a) were rare and (b) did not render the responses incomprehensible.
Testing Protocol
In order to test Theory of Mind capacity in LLMs, we delivered a battery of tests to two versions of OpenAI’s ChatGPT using the same protocols as we would use for testing human participants. The order of items was fixed for each test, with original items delivered first and newly generated items delivered last. This was done to ensure consistency in the delivery of items, which allowed us to measure the inherent variability of the responses that these models produced across sessions.
Each item was preceded by a preamble that remained consistent across all tests:
I am going to tell you a short story about some people. At the end of this story a person will say or do something. When I’ve finished telling it I will ask you some questions about what happened in the story.
This was then followed by the story description and the relevant question(s). After each item was delivered, the model would respond and then the session advanced to the next item.
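Purely as an illustration of how a single item was assembled into one chat message (the study itself was run manually through the ChatGPT web interface, and the helper function and example strings below are hypothetical), the structure of one delivered item can be sketched in R as:

```r
# Fixed preamble delivered before every item (see the text quoted above)
preamble <- paste(
  "I am going to tell you a short story about some people.",
  "At the end of this story a person will say or do something.",
  "When I've finished telling it I will ask you some questions about what happened in the story."
)

# Hypothetical helper: combine preamble, story, and question(s) into one message
build_item_message <- function(story, question) {
  paste(preamble, story, question, sep = "\n\n")
}

# Example using the Hinting item quoted earlier
cat(build_item_message(
  "Rebecca's birthday is approaching. She says to her Dad, 'I love animals, especially dogs'.",
  "What does Rebecca really want her dad to do?"
))
```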
Method Details: Follow-up Study 1
In order to test alternative hypotheses of why the tested models performed poorly at the Faux Pas test, we ran a follow-up study replicating just the Faux Pas test. This replication followed the same procedure as the main study with two differences.
The first difference was in the wording of the final question. The original wording of the question was phrased as a straightforward yes/no question that tested the subject’s awareness of a speaker’s false belief (e.g. “Did Richard remember James had given him the toy aeroplane for his birthday?”). In order to test whether the low scores on this question were due to the models’ refusing to commit to a single explanation in the face of ambiguity, we reworded this to ask in terms of probability: “Is it more likely that Richard remembered or did not remember that James had given him the toy aeroplane for his birthday?”
The second difference from the original design was that this study also included a follow-up prompt in cases where the model failed to provide clear reasoning. This prompt consisted of the question, “What is the most likely explanation for why Richard said what he should not have said?” and was delivered when the following criteria were met:
- The response to the first original question (“Did someone in the story say something they should not have said?”) was correctly answered as Yes. If the response did not recognise that an offensive or inappropriate statement had been made, then there was nothing to explain.
- The response to the final adapted question (“Is it more likely that [they] knew or did not know…?”) was incorrectly answered (“It is more likely that they knew…”) or not answered (“It is not clear”). These answers were subject to a follow-up because, unlike a correct answer, they leave an open question as to what the model considers the most likely explanation for the utterance.
The coding criteria for this follow-up were in line with coding schemes used in other studies with a prompt system 13, where an unprompted correct answer was given 2 points, a correct answer following a prompt was given 1 point, and incorrect answers following a prompt were given 0 points. These points were then rescaled to a proportional score to allow comparison against the original wording.
During coding, the human experimenters developed a qualitative description of different subtypes of response (beyond the 0-1-2 points), particularly noting recurring patterns among responses that were marked as successes. This exploratory qualitative breakdown is reported in Supplementary Material §S2.
Method Details: Follow-up Study 2
In order to explore apparent item order effects identified in the responses of GPT-3.5 to the Faux Pas, Strange Stories, and Irony tests, we conducted an additional follow-up study. Based on a power calculation using the smallest observed effect size from the two tests (see Supplementary Material §S4), we collected an additional 12 sessions of all three tests (original test items only) in which the item order was randomised on every session. Due to experimenter error, 15 sessions were collected on the shuffled Faux Pas test instead of 12, but as these additional three sessions did not affect the overall pattern of results, they were retained in the analysis. All other details of the procedure were kept the same.
Quantification and Statistical Analysis
Response coding
After each session, the responses were collated and coded by human experimenters according to the predefined coding criteria for each test. Five experimenters were each responsible for coding 100% of sessions for one test and 20% of sessions for another. Inter-coder agreement was calculated on the 20% of shared sessions, and items on which coders disagreed were evaluated by all raters and recoded. Experimenters also flagged unclear or unusual responses for group evaluation as and when they arose. Inter-rater agreement was computed by scoring item-wise agreement between coders as 1 or 0 and converting this to a percentage. Initial agreement across all double-coded items was over 95%. The lowest agreement was for the human and GPT-3.5 responses on the Strange Stories, but even here agreement was over 88%. Committee evaluation by the group of experimenters resolved all remaining ambiguities.
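The percent-agreement computation amounts to the following minimal R sketch; the file and column names are hypothetical:

```r
# Item-wise percent agreement between two coders.
# "double_coded_items.csv", "coder1_score", and "coder2_score" are hypothetical names.
double_coded <- read.csv("double_coded_items.csv")

# Agreement is 1 when both coders assigned the same score, 0 otherwise
agreement <- as.numeric(double_coded$coder1_score == double_coded$coder2_score)

# Percent agreement across all double-coded items
mean(agreement) * 100
```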
Statistical analysis
Comparing LLMs against human performance. Scores for individual responses were scaled and averaged to obtain a proportional score for each test session, creating a performance metric that could be compared directly across the different Theory of Mind tests. Our goal was to compare both GPT models’ performance on each test against the human baseline to see how these models performed on Theory of Mind tests relative to humans. For each test, we compared the performance of GPT-3.5 and GPT-4 against human performance using a set of Bonferroni-corrected two-sided Wilcoxon tests.
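This comparison corresponds to the following minimal R sketch, assuming a long-format data frame with one proportional score per LLM session or human participant; the file and column names (“score”, “group”) are hypothetical:

```r
# Compare each GPT model against the human baseline on a single test.
# "hinting_scores.csv", "score", and "group" are hypothetical names;
# "group" takes the values "GPT-3.5", "GPT-4", or "Human".
scores <- read.csv("hinting_scores.csv")

p_raw <- sapply(c("GPT-3.5", "GPT-4"), function(llm) {
  # Two-sided Wilcoxon test of LLM sessions vs human participants
  subset_data <- droplevels(subset(scores, group %in% c(llm, "Human")))
  wilcox.test(score ~ group, data = subset_data)$p.value
})

# Bonferroni correction across the set of comparisons
p.adjust(p_raw, method = "bonferroni")
```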
Item order effects. In order to check whether poor performance across tests, particularly from GPT-3.5, was due to an order effect, we fitted a set of generalised linear models (GLMs) for each test and each LLM, with item order position as a fixed factor. We did not run this analysis for the False Belief test as there was no variability in responses to the original items. For the Hinting, Faux Pas, and Irony tests, we fitted a binomial GLM for each model using the binary score for each item (0 = incorrect; 1 = correct) as the outcome variable. For the Strange Stories, the outcome variable had three levels (0 = incorrect; 1 = partially correct; 2 = correct), so we fitted quasibinomial GLMs. A Bonferroni correction was applied to the calculated p values to account for multiple comparisons. GLMs were fitted with the glm function in R.
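The order-effect analysis reduces to glm calls of roughly the following form. This is a sketch only: the file and column names are hypothetical, and “position” denotes an item’s ordinal position within a session.

```r
# Binomial GLM for a binary-scored test (e.g. Faux Pas) for one LLM.
# "faux_pas_gpt35.csv", "score", and "position" are hypothetical names.
dat <- read.csv("faux_pas_gpt35.csv")
m_binom <- glm(score ~ position, data = dat, family = binomial)

# Strange Stories: 0/1/2 scores rescaled to the 0-1 range, quasibinomial family
ss <- read.csv("strange_stories_gpt35.csv")
m_quasi <- glm(I(score / 2) ~ position, data = ss, family = quasibinomial)

# Extract the p-values for the position term and apply a Bonferroni correction
# across however many such models are fitted (here just the two above)
p_position <- c(
  binom = coef(summary(m_binom))["position", "Pr(>|z|)"],
  quasi = coef(summary(m_quasi))["position", "Pr(>|t|)"]
)
p.adjust(p_position, method = "bonferroni")
```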
For the follow-up study in which the Faux Pas, Strange Stories, and Irony tests were presented in a shuffled order, we fitted another (quasi)binomial logistic regression to the raw accuracy scores for each test item, modelling the item’s position within the session as a fixed factor and including item identity as a covariate to separate out the influence of particular items.
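A corresponding sketch for the shuffled-presentation models, again with hypothetical file and column names:

```r
# Shuffled-order follow-up: item position is the predictor of interest,
# and item identity is included as a covariate to separate out item effects.
# "faux_pas_shuffled.csv", "score", "position", and "item" are hypothetical names.
shuf <- read.csv("faux_pas_shuffled.csv")
m_shuffled <- glm(score ~ position + factor(item),
                  data = shuf, family = binomial)
summary(m_shuffled)
```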