In this paper, we investigated the reasoning abilities of a broad array of large language models (LLMs), using tasks originally designed to highlight bounded rationality in humans. In parallel, we systematically compared the performance of these models with that of human participants, to assess their potential as models of human reasoning.
To ensure the interpretability of our results, we mitigated potential contamination issues by developing new variants of the classic tests. The importance of this step was demonstrated through an analysis of per-token probabilities for each question, which confirmed that our newly designed experimental items were less accurately predicted by all language models than the original test items, which, given their high predictability, were likely already present in the models' training corpora.
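As an illustration of this kind of contamination check, the sketch below estimates how predictable a given item is for a language model by averaging its per-token log-probabilities. It is a minimal sketch only: it uses an open model (GPT-2) and hypothetical item wordings purely for illustration, whereas the analysis reported here queried the OpenAI models themselves.

```python
# Minimal sketch (not the paper's actual pipeline): score how predictable a test
# item is for a language model via its mean per-token log-probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_token_logprob(text: str) -> float:
    """Average log-probability the model assigns to each token of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Predict token t from tokens < t: shift logits left, targets right.
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

classic_item = "A bat and a ball cost $1.10 in total..."          # classic CRT wording
novel_item = "A sandwich and a drink cost 11 euros in total..."   # hypothetical new variant

print(mean_token_logprob(classic_item), mean_token_logprob(novel_item))
# A markedly higher mean log-probability for the classic wording would be
# consistent with that item being present in the training corpus.
```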
To ensure the generalizability of our findings, we focused on two well-established cognitive tests: the Cognitive Reflection Test (CRT) and the Linda/Bill problems (designed to test the Conjunction Fallacy; CF;14,48,53). The CRT was originally designed to probe the tendency to produce intuitive responses associated with a "System 1" mode of reasoning (intuitive and impulsive decision-making), as opposed to "System 2" reasoning (reflective and deliberative reasoning16). The test items, often framed as mathematical riddles, were intentionally crafted to elicit intuitive (albeit incorrect) responses. The Linda/Bill scenarios, on the other hand, were initially used by Kahneman and Tversky to demonstrate the conjunction fallacy8. In our modified version, they consisted of a series of vignettes depicting hypothetical scenarios involving individuals' backgrounds, followed by simple questions regarding the likelihood of congruous and incongruous future scenarios. The presence of a fallacy was revealed when participants wrongly judged a conjunction of two events to be more probable than one of the events in isolation, as a result of the incongruous content of the "Linda" vignettes. The classical interpretation of this prominent probabilistic reasoning error rests on the notion that individuals rely on a "representativeness" heuristic48.
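For reference, the probability rule that the fallacy violates can be stated compactly (the bank-teller wording below is the classic Linda example, not one of our new items):

```latex
% Conjunction rule: a conjunction can never be more probable than either conjunct.
P(A \wedge B) \;=\; P(A)\,P(B \mid A) \;\le\; P(A)
% The fallacy consists in judging, e.g.,
% P(\text{bank teller} \wedge \text{feminist}) > P(\text{bank teller}).
```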
Our analysis encompassed several advanced models released by OpenAI: those belonging to the 'Da Vinci' family, as well as ChatGPT and GPT-4. This comprehensive approach enabled us to assess the developmental time course of reasoning performance across these models.
The results revealed that models belonging to the 'Da Vinci' (DV) family exhibited reasoning limitations typically associated with humans in these tasks. Specifically, they displayed a higher rate of intuitive responses than correct ones in the CRT. Critically, 'Da Vinci' models underperformed humans when presented with the same tasks. Nonetheless, it is important to consider that our human participants, recruited via a testing platform, may have outperformed the general population owing to their potential familiarity with psychometric testing. Similar findings were observed in the Linda/Bill problems, where DV models generally manifested a conjunction fallacy comparable to that observed in humans, while their average correct response rate remained lower than that of humans.
Furthermore, a seemingly counterintuitive result emerged within the DV family of models: reasoning biases surfaced or intensified in the more refined models. Although correct response rates increased from davinci-beta to davinci-003, intuitive response rates exhibited an even greater increase. In the Linda/Bill problems, accuracy levels remained relatively consistent and low, while the conjunction fallacy emerged only in the later models. Importantly, this trend reversed when we examined simple mathematical questions matched to the computational complexity of the items in our experimental tests. This observation suggests an intriguing dissociation between language-based and formal reasoning in LLMs26. Interestingly, analysis of human performance on these mathematically equivalent problems revealed that, especially for the calculation of joint probabilities, humans were much less proficient than LLMs, further suggesting that language-based and formal reasoning abilities do not necessarily progress hand in hand in humans and machines.
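As a hypothetical illustration of such a formally matched item (not one of our actual test questions), computing a joint probability reduces to multiplying the component probabilities when the events are independent:

```latex
% Hypothetical matched item: if P(A) = 0.8, P(B) = 0.1, and A, B are independent, then
P(A \wedge B) \;=\; P(A) \times P(B) \;=\; 0.8 \times 0.1 \;=\; 0.08 \;<\; P(A).
```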
One possible explanation for these puzzling findings is that the later models of the DV family differ from their predecessors by virtue of an additional stage of fine-tuning with reinforcement learning from human feedback44. During this process, the model's weights were revised according to a value function derived from human ratings. It is plausible that certain cognitive biases were introduced through this interaction and tuning process, as such biases are inherently present in the human mind5,8,15,49. We posit that this insight has significant implications for future research in machine learning54.
The analysis of behavioral patterns in the subsequent models, ChatGPT and GPT-4, revealed a distinct shift. Not only did these models surpass humans in accuracy, but they also demonstrated little (ChatGPT) to no (GPT-4) propensity for intuitive reasoning. Given the limited information provided by OpenAI, determining the exact cause of this substantial performance improvement is challenging. However, the available data suggest that an increase in size alone is unlikely to account for it55. Other potential sources of enhancement include a more deliberate choice or weighting of the training corpus, as well as the incorporation of a "hidden" prompting mechanism aimed at inducing a state of improved reasoning in the models. Albeit speculative, this last proposition is supported by the explicit presence of an additional prompting layer, at least in ChatGPT. Further research utilizing open-source models and training/tuning experiments will be necessary to better understand the features that mitigate reasoning errors in these models56.
Finally, we explored whether simple prompting strategies, such as in-context learning and chain-of-thought, could improve the performance of DV models (davinci-003 results are presented in the main text; see Supplementary Materials for the other DV models). These strategies indeed produced a significant performance enhancement. However, comparable manipulations in humans did not yield equivalent results. While providing examples led to an increase in performance, verbal prompts to employ more formal or mathematical forms of reasoning were ineffective. This discrepancy does not necessarily imply that humans are incapable of benefiting from formal reasoning; it may rather reflect a different disposition toward engaging in costly mental computations57. Namely, human participants were probably more inclined to comply with the example strategy, which only required passive reading, than with the reasoning strategy, which necessitated active engagement in formal reasoning. At a minimum, this finding indicates that direct translation of experimental set-ups between humans and machines may be problematic, and it points to an often overlooked but fundamental difference in the cognitive processes underlying human and LLM cognition: humans tend to consider (and avoid) costly computational processes, while LLMs do not58,59.
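To make the two manipulations concrete, the sketch below builds prompts of the two kinds tested on the DV models: an in-context example prepended to the question, and a chain-of-thought instruction appended to it. The wording and the riddle are hypothetical illustrations, not the exact prompts used in the study.

```python
# Illustrative sketch of the two prompting strategies (wording is hypothetical).

def in_context_prompt(question: str, solved_example: str) -> str:
    """In-context learning: prepend one worked example for the model to imitate."""
    return f"{solved_example}\n\n{question}\nAnswer:"

def chain_of_thought_prompt(question: str) -> str:
    """Chain-of-thought: instruct the model to reason explicitly before answering."""
    return f"{question}\nLet's think step by step, then give the final answer."

solved_example = (
    "Q: A pen and a notebook cost 6 euros in total. The notebook costs 4 euros "
    "more than the pen. How much does the pen cost?\n"
    "A: Let the pen cost x; then x + (x + 4) = 6, so x = 1. The pen costs 1 euro."
)
new_question = (
    "Q: A cap and a scarf cost 9 euros in total. The scarf costs 7 euros more "
    "than the cap. How much does the cap cost?"
)

print(in_context_prompt(new_question, solved_example))
print(chain_of_thought_prompt(new_question))
```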
In the final part of this discussion, we contextualize our results within the broader perspective of the interaction between psychological/behavioral research and AI research focused on large language models. As LLMs become increasingly complex and versatile, capable of being used in a "domain-general" manner to answer a wide range of questions (beyond specific tasks such as translation or grammar checking), machine intelligence researchers can and should turn to behavioral science to find suitable benchmark tasks22,60.
By doing so, researchers can leverage a century-long tradition of conceptual and methodological refinement of behavioral tools, compensating for the frequent lack of customized, ad hoc benchmarks for such domain-general systems. While machine learning benchmarks and cognitive-behavioral tasks are designed with different goals in mind (ML researchers seek challenging benchmark tasks, while cognitive scientists prioritize interpretable results), shared tools are crucial for assessing human-machine alignment at the cognitive level61–64. Whether alignment is desirable (e.g., machines aligned with moral principles) or not (e.g., machines free of human cognitive limitations; the alignment problem65) should be assessed on a case-by-case basis. Using cognitive-behavioral tasks specifically tailored to pinpoint precise cognitive processes holds heuristic value for machine intelligence researchers, particularly in the case of LLMs22,60. These models currently represent black boxes of complexity comparable to the human brain, requiring experimental investigation to understand their emergent behaviors, many of which have been discovered rather than predicted by design21,30.
Conversely, it is in cognitive scientists' very nature to look at advances in machine intelligence as a window onto natural cognition66–68. Following this line of reasoning, it can be argued that behavioral scientists can benefit from studying the behavior of LLMs. The idea that neural network models can serve as useful cognitive models is not new and is now widely accepted31. Emerging evidence suggests that deep neural networks provide accurate predictions not only of behavioral data, notably in learning, but also of related neural activity in fields such as vision and language processing26. There is therefore reasonable hope that transformer-based LLMs will be no different; indeed, recent studies have demonstrated that they successfully predict language-related neural activity32. Our study and other recent work raise the question of whether LLMs can be used as tools to understand the computational foundations of decision-making, heuristics, and biases35–37,47,69,70. Assessing the merit of this approach translates into asking two related questions: what are the sources of cognitive biases and limitations in LLMs, and what can they teach us about the origin of similar behaviors in humans?
We believe there are two possible sources of these emergent cognitive biases and limitations. The first possibility is that they arise from biases present in the training corpus. Human reasoning, steeped in bounded rationality and biased reasoning, translates into written language, which the model then captures as it is. This suggests the intriguing notion that a statistically driven model of language may serve as a "minimal" model of cognitive biases and heuristics. While LLMs may, in many domains, achieve good performance through computational mechanisms different from those of humans3, it is still possible that some biases emerge from similar mechanisms, such as an overreliance on statistical regularities or biased attentional capture, rather than from a radically wrong algorithm.
The second possibility is that biases are introduced through human feedback-based fine-tuning. In this perspective, a cognitively bounded human tuner introduces biases by upvoting wrong responses that they mistakenly believe to be correct. This interpretation is supported by the fact that, at least within the GPT-3 family, the propensity for expressing the conjunction fallacy and intuitive reasoning increased in the fine-tuned models. Of course, these two possibilities are not mutually exclusive and could act in synergy.
It is, on the other hand, challenging to understand why these biases and limitations tend to disappear in the latest models (ChatGPT and GPT-4), given the lack of openly shared information from OpenAI. It is possible that the increase in performance is not solely attributable to model size; other possibilities include a careful selection or weighting of the training corpus.
To better adjudicate between these possibilities, we suggest that future research include training, tuning, and prompting experiments evaluating the performance of open-source models, so as to disentangle the contributions of these strategies to the improved reasoning abilities observed in ChatGPT and GPT-456.
In any case, the significant variability observed between models in our study indicates that no definitive answer can be given to the question of whether LLMs exhibit cognitive limitations and biases. Our results also call for caution concerning the possibility of using LLMs as proxies for human participants38,39,71. Whereas previous studies focused on a single model, our study includes a larger array of LLMs belonging to the OpenAI GPT family35–37,47. In doing so, we have traced a nonlinear developmental trajectory of cognitive abilities of sorts, in which earlier models (DV) underperformed humans and newer ones surpassed them. Moreover, humans and machines responded differently to our prompting strategies. Further research investigating multiple models and tasks simultaneously would shed light on the sources of this variance.
In conclusion, our study compared cognitive limitations and biases in LLMs and humans using new variants of well-validated tests. Unlike previous studies, our analysis includes a broader range of LLMs from the OpenAI GPT family, enabling us to observe a developmental trace of cognitive abilities in which earlier models underperformed humans while newer models surpassed them in a number of key aspects. We were therefore unable to find, in the considered model space, a model capable of correctly mimicking the performance of our standard experimental sample. In addition, the differential response of humans and machines to our prompting strategies further highlights the potential and the complexities of comparing human and machine reasoning in the cognitive domain.