Summary of results
In our study, we showed that ChatGPT 4.0 screened and decided upon 15 306 abstracts, vastly outperforming the semi-automated screening tool Rayyan. When comparing manual title-abstract screening to ChatGPT 4.0 and 3.5, we observed consistent levels of accuracy (68%), precision (11%), negative predictive value (99%), specificity (67%), false negative rate (11%) and workload savings (64%). Sensitivity for ChatGPT 4.0 was high, at 88% and 89% in the two rating rounds. At a model temperature of 0.7, the interrater reliability between the two ChatGPT 4.0 rating rounds was substantial, while reliability between ChatGPT 4.0 and ChatGPT 3.5 was moderate. Both models demonstrated only slight reliability when compared with human researchers’ decisions. While ChatGPT 4.0 consistently used the same output format, ChatGPT 3.5 produced varying output patterns. The cost of deployment differed considerably: Rayyan was free of charge, whereas ChatGPT 3.5 cost $9.06 and ChatGPT 4.0 cost $505.72.
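For transparency, all of these metrics follow directly from the title-abstract confusion matrix. The sketch below uses hypothetical counts, chosen only to roughly reproduce the percentages reported above, together with the standard metric definitions; the workload-savings formula is our assumption of one common convention (the share of abstracts the tool excludes), not a definition taken from the study.

```python
# Standard screening metrics from a title-abstract confusion matrix.
# The counts are hypothetical, chosen to roughly reproduce the
# percentages reported above; they are not the study's raw data.
tp, fp, tn, fn = 120, 970, 1980, 15

total = tp + fp + tn + fn
accuracy    = (tp + tn) / total   # ~0.68
precision   = tp / (tp + fp)      # ~0.11: included abstracts that were truly relevant
sensitivity = tp / (tp + fn)      # ~0.89: relevant abstracts caught
specificity = tn / (tn + fp)      # ~0.67: irrelevant abstracts correctly excluded
npv         = tn / (tn + fn)      # ~0.99: excluded abstracts that were truly irrelevant
fnr         = fn / (fn + tp)      # ~0.11: relevant abstracts missed
# One common convention for workload savings: the share of abstracts
# the tool excludes, which humans then need not screen (~0.64 here).
workload_savings = (tn + fn) / total
```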
Comparison with semi-automated tools
Our study demonstrates that chatbots are a more feasible alternative for scoping review screening than current semi-automated tools, which require substantial human interaction to train and refine their models [10]. Despite a training set containing twice the number of articles recommended by the software producer, among them 50 articles to be included, Rayyan reached a decision on only four articles [27]. However, larger training samples and continuous updating of Rayyan’s predictions as human screeners advance might improve its performance [34].
In systematic review screening, Rayyan achieved a sensitivity of 78% and a proportion of missed references of 0.5% [35]. Semi-automated tools such as Abstrackr and Distiller AI have shown similar sensitivities, with specificities ranging from 72% to 95% and, for Distiller AI, a precision of 16% [34, 36]. Rayyan’s poor performance in this study highlights the need for alternative tools in scoping reviews. For systematic reviews with complex inclusion criteria or multiple research questions, current semi-automated tools perform worse, with some studies suggesting that their use be limited to reviews including only randomised controlled trials [34, 37, 38].
Additionally, efforts to decrease workload by using semi-automated tools often compromise recall (i.e., miss relevant articles) for three main reasons: (1) an unclear stopping point, (2) imbalanced datasets and (3) biased researchers [1, 12, 39]. Researchers use various metrics to determine when to discontinue manual screening, such as a certain number of consecutive articles excluded, a minimum prediction score, a pre-determined number of articles screened, or the time spent on screening [1, 12, 34, 40]. The optimal stopping point likely varies between reviews and can only be determined in retrospect [12, 14, 40]. A heuristic stopping point of 50% seems widely accepted across tools, with studies reporting that 95% of the abstracts to be included had been identified after screening 29.5–47.1% of all abstracts [7, 12]. However, even low-ranked articles have a non-zero relevance probability [12, 14]. Moreover, a highly imbalanced dataset, skewed towards exclusion, might negatively affect the training process and bias the tools towards exclusion [13, 37]. Lastly, ranking the articles according to relevance might influence researchers’ decisions, potentially causing complacency and a tendency to underestimate the importance of articles presented at a later stage [12].
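To make the stopping-point problem concrete, here is a minimal sketch of the first heuristic above, a rule that stops manual screening after a run of consecutive exclusions. The function name and the threshold value are illustrative, not recommendations from any of the cited tools.

```python
from typing import Iterable

def consecutive_exclusion_stop(decisions: Iterable[bool], threshold: int = 50) -> int:
    """Return the number of abstracts screened before an
    'N consecutive exclusions' rule halts manual screening.

    `decisions` is the relevance-ranked stream of screening
    decisions (True = include, False = exclude).
    """
    streak = 0
    for screened, included in enumerate(decisions, start=1):
        streak = 0 if included else streak + 1
        if streak >= threshold:
            return screened  # stop: threshold consecutive exclusions reached
    return -1  # rule never triggered: every abstract was screened
```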
Comparison with large language models
Chatbots such as ChatGPT offer abilities that are particularly advantageous for screening processes compared to semi-automated tools [12, 41]: (1) no need for prior training, (2) understanding and reasoning capabilities and (3) multi-language understanding. First, as a zero-shot model, ChatGPT requires no prior training by the end-user, nor are seed articles needed [11, 42]. Second, ChatGPT’s ability to analyse fuzzy or unstructured data on a semantic level enables it to discern, for example, whether ‘review’ refers to a literature review or a customer review [41]. As with human researchers, ChatGPT’s decisions can be backed up by an explanation, potentially highlighting errors in reasoning [11]. Lastly, ChatGPT’s multilingual understanding offers two advantages: it is trained on a large corpus of English data, giving it an advantage over human researchers who might not be native English speakers [10], and it enables researchers to extend their search to non-English research articles, which are currently underrepresented in reviews [43].
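To illustrate the zero-shot property, a screening call needs nothing beyond the criteria text and the abstract itself. The sketch below uses the openai Python SDK; the prompt wording and the function name are hypothetical, not the prompt used in our study.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def screen_abstract(title: str, abstract: str, criteria: str) -> str:
    """Zero-shot include/exclude decision with a one-sentence
    justification. No training data or seed articles required."""
    response = client.chat.completions.create(
        model="gpt-4",    # the study compared ChatGPT 4.0 and 3.5
        temperature=0.7,  # temperature used in the study
        messages=[
            {"role": "system",
             "content": ("You screen abstracts for a scoping review. "
                         "Answer INCLUDE or EXCLUDE, then give one "
                         "sentence of reasoning.")},
            {"role": "user",
             "content": (f"Inclusion criteria:\n{criteria}\n\n"
                         f"Title: {title}\n\nAbstract: {abstract}")},
        ],
    )
    return response.choices[0].message.content
```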
Although its performance metrics are slightly lower than those of semi-automated tools frequently used for systematic reviews, ChatGPT shows promising results: several studies have demonstrated its efficacy in systematic review screening [10, 11, 21, 34, 36]. A single study using ChatGPT 4.0 for scoping review screening reported an accuracy of 94%, a specificity of 94% and a sensitivity of 100% [11]. In our study, sensitivity was higher than specificity, suggesting that ChatGPT is effective at including relevant abstracts but less effective at excluding irrelevant ones.
Repeating the same prompt on different days to test the reliability of ChatGPT 4.0 resulted in a substantial interrater reliability of 0.76, consistent with findings in the scientific literature [28]. The slight interrater reliability between humans and ChatGPT in our study is lower than that reported for systematic reviews but consistent with other scoping reviews [10, 11]. This difference is likely due to scoping reviews having less clearly defined inclusion and exclusion criteria than systematic reviews; human researchers likewise demonstrate lower interrater reliability in scoping reviews than in systematic reviews [10, 11]. To foster trust in the tool, good alignment between the decisions of human researchers and ChatGPT is imperative [10].
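The agreement labels used here (slight, moderate, substantial) correspond to the Landis and Koch scale for Cohen’s kappa. Assuming kappa is the statistic in question, agreement between any two raters can be computed as in this toy sketch:

```python
from sklearn.metrics import cohen_kappa_score

# Two parallel decision vectors (True = include), e.g. the two
# ChatGPT 4.0 rating rounds or a human/chatbot pair. Toy data only.
rater_a = [True, False, False, True, False, False, True, False]
rater_b = [True, False, False, False, False, False, True, False]

kappa = cohen_kappa_score(rater_a, rater_b)
# Landis & Koch: <=0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate,
# 0.61-0.80 substantial, >0.80 almost perfect agreement.
print(f"kappa = {kappa:.2f}")
```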
Next steps
Different strategies for reducing screening workload have been proposed in the literature, such as using semi-automated tools or relying on a single reviewer. However, these approaches often affect recall: for instance, a single reviewer might miss around 13% of relevant studies [7]. ChatGPT-supported screening, with its rapid analysis of large datasets, opens up new possibilities for supporting human researchers while ensuring accurate screening and high recall [11, 44, 45]. Potential strategies include the following (sketched in code after the list):
- Sequential approach: This approach includes an initial screening round using ChatGPT to exploit its speed and scalability, with human researchers reviewing only the abstracts included by ChatGPT [10, 34, 41]. This strategy, based on ChatGPT’s high negative predictive value, could effectively limit the number of abstracts to be screened by researchers, possibly improving researchers’ concentration and motivation [40].
- Hybrid approach: A hybrid approach combines the decisions of one human researcher and the chatbot, with conflicts resolved by an additional researcher [2, 28]. This might balance the high sensitivity of chatbots with the high specificity of human raters [2, 10].
- Multiple chatbot voting rounds: In this approach, ChatGPT conducts multiple screening rounds. Abstracts are then included either if they were voted for inclusion at least once or if they reached a minimum number of inclusion votes [44]. This approach can also be combined with either the sequential or the hybrid approach.
Regardless of the chosen approach, the workload and cost savings may limit screening fatigue and enable researchers to use a broader search string, maximising sensitivity and improving recall [13, 15, 40, 41].
Several considerations are necessary when using ChatGPT. First, accurate outputs rely on complete, correct and unbiased inputs [21]; input data therefore needs to be meticulously prepared, including the manual addition of missing abstracts. Second, clear and specific instructions (prompts) improve the accuracy of the answers [46, 47]. Using the PCC scheme (Population-Concept-Context), frequently applied in scoping reviews, to provide inclusion criteria may be beneficial (see the sketch after this paragraph); initial trials with prompts using the PICOS scheme (Population-Intervention-Comparison-Outcome-Study design) for systematic review screening with ChatGPT demonstrated good results [10]. Providing more information, such as study type and year, might further enhance the chatbot’s performance [11]. Iterative engineering of the prompts based on these factors is necessary to achieve good human-chatbot interrater reliability [28]. Trials with different model parameters, such as temperature, which affects the randomness of the generated output, might also be beneficial [31]. Lastly, human oversight and trust are crucial. The uptake of ChatGPT might be slow, as researchers tend to be hesitant to use tools not yet widely accepted in the scientific community [38]. While automatic deduplication of references is an accepted standard procedure, large language models such as ChatGPT are perceived as a black box, both because of the complexity and lack of transparency of their output generation process and because system parameters, regarding the technology as well as the researchers’ inputs, are not commonly shared [41]. Continuous updates of the tool might yield different results, complicating replication efforts [10]. To gain researchers’ trust and acceptance in the scientific community, human oversight and a standardised, transparent approach are needed, alongside studies demonstrating the AI tool’s non-inferiority to human researchers in specific phases of a literature review [7, 11, 36, 38]. Furthermore, allowing researchers to set the AI tool’s decision threshold might reduce risk and increase trust [38].
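As an illustration of the second consideration, the inclusion criteria passed to a screening prompt (for example, the `criteria` argument in the earlier sketch) could be structured along the PCC scheme. The population, concept and context wording below is invented for illustration, not the criteria of our review.

```python
# Hypothetical PCC-structured criteria block for a screening prompt.
pcc_criteria = """\
Population: adult patients in primary care
Concept: use of chatbot-based decision support
Context: outpatient settings in high-income countries

Include an abstract only if it matches all three PCC elements."""
```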
Future research
To deepen our understanding of ChatGPT’s decisions, future research is needed to qualitatively analyse the explanations provided by ChatGPT and to compare its reasoning to that of human researchers. Special attention should be given to decisions where ChatGPT diverges from human judgement or from its own judgement in another rating round [10]. Additionally, we recommend prospectively evaluating the performance and workload savings of ChatGPT used alongside a single researcher, compared against a researcher pair, and investigating the different approaches outlined above [13].
Beyond its application in abstract screening, ChatGPT offers potential for implementation in various stages of the review process, including search strategy derivation, full-text screening and data extraction [5, 10, 21, 41]. ChatGPT could generate search terms and adapt them to different databases [21]. Currently, ChatGPT’s potential in full-text screening is limited by token restrictions on input [21, 41]. However, with increasing maximum token lengths, ChatGPT could become a viable tool for this phase as well [11]. Additionally, ChatGPT’s ability to understand context suggests its usefulness in data extraction [14].
Limitations
Key strengths of our study are the elicitation of reasons for ChatGPT’s decisions, the repetition of the ChatGPT 4.0 rating and the use of the current gold standard (final decisions of two independent researchers, with another researcher settling differences) as reference [34]. However, this study also has some limitations. First, despite constituting the gold standard, human decisions are not flawless, depending on reviewers’ expertise, experience and language proficiency [5, 10, 35]. Reviewers are furthermore trained to be over-inclusive, retrieving the full text even when minimally in doubt, as evidenced by a 70% exclusion rate during full-text screening, yet they still miss 3% of relevant studies [36, 44, 48]. Second, our results are based on a single scoping review with a well-defined scope. Further research is needed to investigate the generalisability of the results to scoping reviews in other disciplines and on less well-defined topics [10–12]. Lastly, due to practical constraints, we compared only one chatbot (but two models) and one semi-automated tool, whose core technology (a support vector machine) might not be the strongest currently available [13]. Comparing additional tools is advisable to elucidate the best approach.