Resource Availability
Materials availability
All test items used in the study have been made available on the Open Science Framework (OSF). This test set includes both original items sourced from previous papers and new items generated for the purpose of testing ChatGPT.
Data and code availability
A dataset of ChatGPT responses has been made publicly available as of the date of publication and can be found in the same OSF repository as the analysis code. This dataset includes the text of all ChatGPT responses recorded in the study, along with the numerical score assigned to each statement. The data used for all analyses are included as separate CSV files, and all analyses were conducted in R 34. All information about the R environment and libraries, along with the code used for the analyses and figures in this study, is made available as part of the Supplementary Material as a Markdown file in the OSF repository. The link to the OSF repository is: https://osf.io/fwj6v/?view_only=0b3619922c9142e9a0653f258156f1a5.
Experimental Model Details
Two versions of OpenAI’s GPT were tested in April 2023 over a period of three weeks: version 3.5, which was the Default model at the time, and version 4, which was the state-of-the-art model with enhanced reasoning, creativity, and comprehension relative to previous models 35,36. Each test was delivered in a separate chat: GPT is capable of learning within a chat session, as it can remember both its own and the user’s previous messages to adapt its responses accordingly, but it does not retain this memory across new chats. As such, each new iteration of a test may be considered a blank slate with a new naive participant.
For each test we collected 15 sessions for each LLM. A session involved delivering all items of a single test within the same chat window. GPT-4 was subject to a 25-message limit per 3 hours, and so to minimise interference a single experimenter delivered all tests for GPT-4, while four other experimenters shared the duty of collecting responses from GPT-3.5.
For the human baseline data, participants were recruited online through the Prolific platform and the study was hosted on SoSci. We recruited native English speakers between the ages of 18 and 70 with no history of psychiatric conditions and, in particular, no history of dyslexia. Further demographic data were not collected. We aimed to collect around 50 participants per test and, due to errors on Prolific’s side, ended up with 51 participants in the False Belief, Faux Pas, and Irony comprehension tests, and 50 in the Hinting and Strange Stories tests. Following examination of the responses, we excluded two subjects in the Hinting condition whose responses indicated that they had used GPT to answer the questions, and we excluded one further participant from the Irony comprehension test who responded “Yes” to every question. The final human sample comprised 250 participants (Hinting: 48; False Belief: 51; Faux Pas: 51; Strange Stories: 50; Irony comprehension: 50). The research was approved by the local ethics committee (ASL 3 Genovese) and was carried out in accordance with the principles of the revised Helsinki Declaration. All participants provided informed consent through the online survey and received monetary compensation for their participation at a rate of £12/hr.
Method Details: Main Study
Theory of Mind tests
We selected a series of tests typically used in evaluating Theory of Mind capacity in human participants.
Hinting Task
The Hinting Task 13 assesses the understanding of indirect speech requests through 10 vignettes, presented sequentially, that depict everyday social interactions. Each vignette ends with a remark that can be interpreted as a hint. For example, “Rebecca’s birthday is approaching. She says to her Dad, ‘I love animals, especially dogs’. What does Rebecca really want her dad to do?”
A correct response identifies both the intended meaning of the remark and the action that it is attempting to elicit (e.g. “Rebecca is trying to indirectly tell her father that she wants a dog or something dog-themed as a birthday present”). In the original test, if the participant failed to answer the question fully the first time, they were prompted with additional questioning 13,37. In our adapted implementation, we removed this additional questioning and coded responses as a binary (1 = correct; 0 = incorrect) using the evaluation criteria listed in Gil et al. 37. Note that this change meant that borderline cases, where a response showed rational mentalising about the character’s mental state without explicitly articulating the indirectly requested action, were coded as failures rather than mixed successes; as such, our scores are more conservative estimates of hint comprehension than those in previous studies.
In addition to 10 original items sourced from Corcoran 13, we generated a further six new hinting vignettes, resulting in 16 items overall.
False Belief
False Belief tests assess the ability to infer that another person possesses knowledge that may differ from the participant’s own (true) knowledge of the world. These tests consist of vignettes that follow a particular structure: Character A and Character B are together, Character A deposits an item inside a hidden location (e.g. a box), Character A leaves, Character B moves the item to a second hidden location (e.g. a cupboard), and then Character A returns. The question asked to the participant is: when Character A returns, will they look for the item in the new location (where it truly is, matching the participant’s true belief), or the old location (where it was, matching Character A’s false belief)?
In addition to the False Belief condition, tests also use a True Belief control condition, where rather than moving the item that Character A hid, Character B moves a different item to a new location. This is important for interpreting failures of false belief attribution, as it ensures that any such failures are not simply due to a recency effect (reporting the last location mentioned in the story) but instead reflect a genuine failure of belief tracking.
We adapted four False/True Belief scenarios from the Sandbox Task used by Bernstein 14 and generated three new stories, each with False and True Belief versions. Two story lists were generated for this test such that each story only appeared once within a testing session, and alternated between False and True Belief depending on the session. In addition to the standard False/True Belief scenarios, two additional catch stories were tested that involved minor alterations to the story structure. The results of these items are not reported here as they go beyond the goals of the current study.
Faux Pas
The Faux Pas test 16 presents a context in which one character makes an utterance that is unintentionally offensive to the listener because the speaker does not know or does not remember some key piece of information. For example, “James bought Richard a toy aeroplane for his birthday. A few months later, they were playing with it and James accidentally dropped it. ‘Don’t worry,’ said Richard, ‘I never liked it anyway. Someone gave it to me for my birthday.’”
Following the presentation of the scenario, we presented four questions:
- In the story did someone say something that they should not have said? [Always the same question for every item. The correct answer is always Yes]
- What did they say that they should not have said? [Always the same question for every item. The correct answer changes for each item, e.g. “I never liked it anyway. Someone gave it to me for my birthday.”]
- What did James give Richard for his birthday? [Question changes for every item: tests for comprehension of the story]
- Did Richard remember James had given him the toy aeroplane for his birthday? [Question changes for every item: tests for awareness of the speaker’s false belief. The correct answer is always No]
These questions were asked at the same time as the story was presented. Under the original coding criteria, participants must answer all four questions correctly for their answer to be considered correct. However, in the current study we were primarily interested in the response to the final question, which tests whether the responder understood the speaker’s mental state. When examining the human baseline data, we noticed that several participants responded incorrectly to the first question due to an apparent unwillingness to attribute blame to the speaker (e.g. “No, he didn’t say anything wrong because he forgot”). In order to focus on the aspect of faux pas understanding that was most relevant to the current study, we restricted our coding to the last question only (1 = correct if the answer was No; 0 for anything else; see Supplementary Material §S1 for an alternative coding that follows the original criteria, as well as a recoding in which responses that mentioned the correct answer as a possible explanation without explicitly endorsing it were coded as correct).
As well as the 10 original items used in Baron-Cohen et al. 16, we generated five new items for this test, resulting in 15 items overall.
One of the original items used in the test battery turned out to be worded in a way that made it difficult to apply the intended coding criteria. The item read as follows:
All of the class took part in a story competition. Emma really wanted to win. Whilst she was away from school, the results of the competition were announced: Alice was the winner. The next day, Alice saw Emma and said "I'm sorry about your story." "What do you mean?" said Emma. "Oh nothing," said Alice.
The final question was:
Did Alice realize that Emma hadn't heard the results of the competition?
Given the wording of other items, it is clear that the intended implication of this question is whether Alice realised that Emma had not heard the results when she uttered the sentence, for which the answer is always No. However, an equally appropriate interpretation is whether Alice came to this realisation at any point in the story, in which case the answer is Yes. Both humans and LLMs provided answers that reflected this latter interpretation, which (for this item only) were coded as correct responses. The overall pattern of results remained consistent when this item was removed from analysis.
Strange Stories
The Strange Stories 17,18 offer a means of testing more advanced mentalising abilities such as reasoning about misdirection, manipulation, lying, and misunderstanding, as well as second- or higher-order mental states (e.g. A knows that B believes X…). The advanced abilities that these stories measure make them suitable for testing higher functioning children and adults. In this test, subjects are presented with a short vignette and are asked to explain why a character says or does something that is not literally true. For example,
During the war, the Red army captures a member of the Blue army. They want him to tell them where his army’s tanks are; they know they are either by the sea or in the mountains. They know that the prisoner will not want to tell them, he will want to save his army, and so he will certainly lie to them. The prisoner is very brave and very clever, he will not let them find his tanks. The tanks are really in the mountains. Now when the other side asks him where his tanks are, he says, ‘They are in the mountains.’ Why did the prisoner say that?
Each question comes with a specific set of coding criteria, and responses can be awarded 0, 1, or 2 points depending on how fully they explain the utterance and whether or not they do so in mentalistic terms 18. See Supplementary Material §S3 for a description of the frequency of partial successes.
In addition to the eight original mental stories, we generated four new stories, resulting in 12 items overall. The maximum number of points possible was 24, and individual session scores were converted to a proportional score for analysis.
The original Strange Stories also include a series of control stories, among them a Human Physical control condition in which the descriptions similarly involved people saying or doing something that had to be explained, but where the explanation was physically rather than socially determined. This condition (8 items) was included as a control in two testing sessions, in case poor performance on the mental questions was due to the complexity of the stories, but was not retained further as it was beyond the scope of the current study.
Irony
The ability to comprehend irony has been specifically mentioned in connection with AI and LLMs 2. Comprehending an ironic remark requires inferring the true meaning of an utterance (typically the opposite of what is said) and detecting the speaker’s mocking attitude.
Irony comprehension items were adapted from an eye-tracking study 19 in which participants read vignettes where a character made an ironic or non-ironic statement. Twelve items that had been used as comprehension checks in the original study were taken from these stimuli. Items were abbreviated so that they ended immediately after the ironic or non-ironic utterance:
Cheryl noticed there were no flowers by Lisa’s bed. ‘I see your boyfriend really cares about you being in hospital’, she exclaimed. Did Cheryl think Lisa’s boyfriend cared about Lisa?
Two story lists were generated for this test such that each story only appeared once within a testing session, and alternated between ironic and non-ironic depending on the session. Responses were coded as a simple binary (1 = correct; 0 = incorrect). Although the responses on this test were a straightforward Yes/No, during coding we noted some inconsistencies in the formulation of both GPT models’ responses: in response to the question “Did Cheryl think Lisa’s boyfriend cared about Lisa?” in the ironic condition, they might respond with “Yes, Cheryl did not think Lisa’s boyfriend cared.” Such internally contradictory responses, where the models gave a Yes or No that was incompatible with the explanation that followed, were coded based on whether or not the explanation showed appreciation of the irony. The linguistic failures of these models in generating a coherent answer are not of direct interest to the current study, as these failures (a) were rare and (b) did not render the responses incomprehensible.
Testing Protocol
In order to test Theory of Mind capacity in LLMs, we delivered a battery of tests to two versions of OpenAI’s ChatGPT using the same protocols as we would use for testing human participants. The order of items was fixed for each test, with original items delivered first and newly generated items delivered last. This was done to ensure consistency in the delivery of items, which allowed us to measure the inherent variability of the responses that these models produced across sessions.
Each item was preceded by a preamble that remained consistent across all tests:
I am going to tell you a short story about some people. At the end of this story a person will say or do something. When I’ve finished telling it I will ask you some questions about what happened in the story.
This was then followed by the story description and the relevant question(s). After each item was delivered, the model would respond and then the session advanced to the next item.
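Purely as an illustration of how a single item was assembled into one chat message (the study itself was run manually through the ChatGPT web interface, and the helper function and example strings below are hypothetical), the structure of one delivered item can be sketched in R as:

```r
# Fixed preamble delivered before every item (see the text quoted above)
preamble <- paste(
  "I am going to tell you a short story about some people.",
  "At the end of this story a person will say or do something.",
  "When I've finished telling it I will ask you some questions about what happened in the story."
)

# Hypothetical helper: combine preamble, story, and question(s) into one message
build_item_message <- function(story, question) {
  paste(preamble, story, question, sep = "\n\n")
}

# Example using the Hinting item quoted earlier
cat(build_item_message(
  "Rebecca's birthday is approaching. She says to her Dad, 'I love animals, especially dogs'.",
  "What does Rebecca really want her dad to do?"
))
```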
Method Details: Follow-up Study 1
In order to test alternative hypotheses of why the tested models performed poorly at the Faux Pas test, we ran a follow-up study replicating just the Faux Pas test. This replication followed the same procedure as the main study with two differences.
The first difference was in the wording of the final question. The original wording of the question was phrased as a straightforward yes/no question that tested the subject’s awareness of a speaker’s false belief (e.g. “Did Richard remember James had given him the toy aeroplane for his birthday?”). In order to test whether the low scores on this question were due to the models’ refusing to commit to a single explanation in the face of ambiguity, we reworded this to ask in terms of probability: “Is it more likely that Richard remembered or did not remember that James had given him the toy aeroplane for his birthday?”
The second difference from the original design was that this study also included a follow-up prompt in cases where the model failed to provide clear reasoning. This prompt consisted of the question, “What is the most likely explanation for why Richard said what he should not have said?” and was delivered when the following criteria were met:
- The response to the first original question (“Did someone in the story say something they should not have said?”) was correctly answered as Yes. If the response did not recognise that an offensive or inappropriate statement had been made, then there was nothing to explain.
- The response to the final adapted question (“Is it more likely that [they] knew or did not know…?”) was incorrectly answered (“It is more likely that they knew…”) or not answered (“It is not clear”). These answers were subject to a follow-up because, unlike a correct answer, they leave an open question as to what the model considers the most likely explanation for the utterance.
The coding criteria for this follow-up were in line with coding schemes used in other studies with a prompt system 13, where an unprompted correct answer was given 2 points, a correct answer following a prompt was given 1 point, and incorrect answers following a prompt were given 0 points. These points were then rescaled to a proportional score to allow comparison against the original wording.
During coding, the human experimenters developed a qualitative description of different subtypes of response (beyond the 0-1-2 points), particularly noting recurring patterns among responses that were marked as successes. This exploratory qualitative breakdown is reported in Supplementary Material §S2.
Method Details: Follow-up Study 2
In order to explore apparent item order effects identified in the responses of GPT-3.5 to the Faux Pas, Strange Stories, and Irony tests, we conducted an additional follow-up study. Based on a power calculation using the smallest observed effect size from the two tests (see Supplementary Material §S4), we collected an additional 12 sessions of all three tests (original test items only) in which the item order was randomised on every session. Due to experimenter error, 15 sessions were collected on the shuffled Faux Pas test instead of 12, but as these additional three sessions did not affect the overall pattern of results, they were retained in the analysis. All other details of the procedure were kept the same.
Quantification and Statistical Analysis
Response coding
After each session, the responses were collated and coded by human experimenters according to the predefined coding criteria for each test. Five experimenters were each responsible for coding 100% of sessions for one test and 20% of sessions for another. Inter-coder agreement was calculated on the 20% of shared sessions, and items on which coders disagreed were evaluated by all raters and recoded. Experimenters also flagged unclear or unusual responses for group evaluation as and when they arose. Inter-rater agreement was computed by scoring item-wise agreement between coders as 1 or 0 and converting this to a percentage. Initial agreement across all double-coded items was over 95%. The lowest agreement was for the human and GPT-3.5 responses on the Strange Stories, but even here agreement was over 88%. Committee evaluation by the group of experimenters resolved all remaining ambiguities.
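The percent-agreement computation amounts to the following minimal R sketch; the file and column names are hypothetical:

```r
# Item-wise percent agreement between two coders.
# "double_coded_items.csv", "coder1_score", and "coder2_score" are hypothetical names.
double_coded <- read.csv("double_coded_items.csv")

# Agreement is 1 when both coders assigned the same score, 0 otherwise
agreement <- as.numeric(double_coded$coder1_score == double_coded$coder2_score)

# Percent agreement across all double-coded items
mean(agreement) * 100
```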
Statistical analysis
Comparing LLMs against human performance. Scores for individual responses were scaled and averaged to obtain a proportional score for each test session, creating a performance metric that could be compared directly across the different Theory of Mind tests. Our goal was to compare both GPT models’ performance on each test against the human baseline to see how these models performed on Theory of Mind tests relative to humans. For each test, we compared the performance of GPT-3.5 and GPT-4 against human performance using a set of Bonferroni-corrected two-sided Wilcoxon tests.
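This comparison corresponds to the following minimal R sketch, assuming a long-format data frame with one proportional score per LLM session or human participant; the file and column names (“score”, “group”) are hypothetical:

```r
# Compare each GPT model against the human baseline on a single test.
# "hinting_scores.csv", "score", and "group" are hypothetical names;
# "group" takes the values "GPT-3.5", "GPT-4", or "Human".
scores <- read.csv("hinting_scores.csv")

p_raw <- sapply(c("GPT-3.5", "GPT-4"), function(llm) {
  # Two-sided Wilcoxon test of LLM sessions vs human participants
  subset_data <- droplevels(subset(scores, group %in% c(llm, "Human")))
  wilcox.test(score ~ group, data = subset_data)$p.value
})

# Bonferroni correction across the set of comparisons
p.adjust(p_raw, method = "bonferroni")
```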
Item order effects. In order to check whether poor performance across tests, particularly from GPT-3.5, was due to an order effect, we fitted a set of generalised linear models (GLMs) for each test and each LLM, with item order position as a fixed factor. We did not run this analysis for the False Belief test as there was no variability in responses to the original items. For the Hinting, Faux Pas, and Irony tests, we fitted a binomial GLM for each model using the binary score for each item (0 = incorrect; 1 = correct) as the outcome variable. For the Strange Stories, the outcome variable had three levels (0 = incorrect; 1 = partially correct; 2 = correct), so we fitted quasibinomial GLMs. A Bonferroni correction was applied to the calculated p values to account for multiple comparisons. GLMs were fitted with the glm function in R.
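The order-effect analysis reduces to glm calls of roughly the following form. This is a sketch only: the file and column names are hypothetical, and “position” denotes an item’s ordinal position within a session.

```r
# Binomial GLM for a binary-scored test (e.g. Faux Pas) for one LLM.
# "faux_pas_gpt35.csv", "score", and "position" are hypothetical names.
dat <- read.csv("faux_pas_gpt35.csv")
m_binom <- glm(score ~ position, data = dat, family = binomial)

# Strange Stories: 0/1/2 scores rescaled to the 0-1 range, quasibinomial family
ss <- read.csv("strange_stories_gpt35.csv")
m_quasi <- glm(I(score / 2) ~ position, data = ss, family = quasibinomial)

# Extract the p-values for the position term and apply a Bonferroni correction
# across however many such models are fitted (here just the two above)
p_position <- c(
  binom = coef(summary(m_binom))["position", "Pr(>|z|)"],
  quasi = coef(summary(m_quasi))["position", "Pr(>|t|)"]
)
p.adjust(p_position, method = "bonferroni")
```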
For the follow-up study in which the Faux Pas, Strange Stories, and Irony tests were presented in a shuffled order, we fitted another (quasi)binomial logistic regression to the raw accuracy scores for each test item, modelling the item’s position within the session as a fixed factor and including item identity as a covariate to separate out the influence of particular items.
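A corresponding sketch for the shuffled-presentation models, again with hypothetical file and column names:

```r
# Shuffled-order follow-up: item position is the predictor of interest,
# and item identity is included as a covariate to separate out item effects.
# "faux_pas_shuffled.csv", "score", "position", and "item" are hypothetical names.
shuf <- read.csv("faux_pas_shuffled.csv")
m_shuffled <- glm(score ~ position + factor(item),
                  data = shuf, family = binomial)
summary(m_shuffled)
```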