Summary of results
In our study, we showed that ChatGPT 4.0 screened and decided upon 15 306 abstracts, vastly outperforming the semi-automated screening tool Rayyan. When comparing manual title-abstract screening to ChatGPT 4.0 and 3.5, we observed consistent levels of accuracy (68%), precision (11%), negative predictive value (99%), specificity (67%), false negative rate (11%) and workload savings (64%). Sensitivity for ChatGPT 4.0 was high, at 88% and 89% in the two rating rounds. At a model temperature of 0.7, the interrater reliability between the two ChatGPT 4.0 rating rounds was substantial, while reliability between ChatGPT 4.0 and ChatGPT 3.5 was moderate. Both models demonstrated only slight reliability when compared with human researchers’ decisions. While ChatGPT 4.0 consistently used the same output format, ChatGPT 3.5 produced varying output patterns. The cost of deployment differed considerably: Rayyan was free of charge, whereas ChatGPT 3.5 cost $9.06 and ChatGPT 4.0 cost $505.72.
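For transparency, all of these metrics follow directly from the title-abstract confusion matrix. The sketch below uses hypothetical counts, chosen only to roughly reproduce the percentages reported above, together with the standard metric definitions; the workload-savings formula is our assumption of one common convention (the share of abstracts the tool excludes), not a definition taken from the study.

```python
# Standard screening metrics from a title-abstract confusion matrix.
# The counts are hypothetical, chosen to roughly reproduce the
# percentages reported above; they are not the study's raw data.
tp, fp, tn, fn = 120, 970, 1980, 15

total = tp + fp + tn + fn
accuracy    = (tp + tn) / total   # ~0.68
precision   = tp / (tp + fp)      # ~0.11: included abstracts that were truly relevant
sensitivity = tp / (tp + fn)      # ~0.89: relevant abstracts caught
specificity = tn / (tn + fp)      # ~0.67: irrelevant abstracts correctly excluded
npv         = tn / (tn + fn)      # ~0.99: excluded abstracts that were truly irrelevant
fnr         = fn / (fn + tp)      # ~0.11: relevant abstracts missed
# One common convention for workload savings: the share of abstracts
# the tool excludes, which humans then need not screen (~0.64 here).
workload_savings = (tn + fn) / total
```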
Comparison with semi-automated tools
Our study demonstrates that chatbots are a more feasible alternative for scoping review screening than current semi-automated tools, which require substantial human interaction to train and refine their models [10]. Despite a training set containing twice the number of articles recommended by the software producer, among them 50 articles to be included, Rayyan reached a decision on only four articles [27]. However, larger training samples and continuous updating of Rayyan’s predictions as human screeners advance might improve its performance [34].
In systematic review screening, Rayyan achieved a sensitivity of 78% and a proportion of missed references of 0.5% [35]. Semi-automated tools such as Abstrackr and Distiller AI have shown similar sensitivities, with specificities ranging from 72% to 95% and, for Distiller AI, a precision of 16% [34, 36]. Rayyan’s poor performance in this study highlights the need for alternative tools in scoping reviews. For systematic reviews with complex inclusion criteria or multiple research questions, current semi-automated tools perform worse, with some studies suggesting that their use be limited to reviews including only randomised controlled trials [34, 37, 38].
Additionally, efforts to decrease workload by using semi-automated tools often compromise recall (i.e., miss relevant articles) for three main reasons: (1) an unclear stopping point, (2) imbalanced datasets and (3) biased researchers [1, 12, 39]. Researchers use various metrics to determine when to discontinue manual screening, such as a certain number of consecutive articles excluded, a minimum prediction score, a pre-determined number of articles screened, or the time spent on screening [1, 12, 34, 40]. The optimal stopping point likely varies between reviews and can only be determined in retrospect [12, 14, 40]. A heuristic stopping point of 50% seems widely accepted across tools, with studies reporting that 95% of the abstracts to be included had been identified after screening 29.5–47.1% of all abstracts [7, 12]. However, even low-ranked articles have a non-zero relevance probability [12, 14]. Moreover, a highly imbalanced dataset, skewed towards exclusion, might negatively affect the training process and bias the tools towards exclusion [13, 37]. Lastly, ranking the articles according to relevance might influence researchers’ decisions, potentially causing complacency and a tendency to underestimate the importance of articles presented at a later stage [12].
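To make the stopping-point problem concrete, here is a minimal sketch of the first heuristic above, a rule that stops manual screening after a run of consecutive exclusions. The function name and the threshold value are illustrative, not recommendations from any of the cited tools.

```python
from typing import Iterable

def consecutive_exclusion_stop(decisions: Iterable[bool], threshold: int = 50) -> int:
    """Return the number of abstracts screened before an
    'N consecutive exclusions' rule halts manual screening.

    `decisions` is the relevance-ranked stream of screening
    decisions (True = include, False = exclude).
    """
    streak = 0
    for screened, included in enumerate(decisions, start=1):
        streak = 0 if included else streak + 1
        if streak >= threshold:
            return screened  # stop: threshold consecutive exclusions reached
    return -1  # rule never triggered: every abstract was screened
```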
Comparison with large language models
Chatbots such as ChatGPT offer abilities that are particularly advantageous for screening processes compared to semi-automated tools [12, 41]: (1) no need for prior training, (2) understanding and reasoning capabilities and (3) multi-language understanding. First, as a zero-shot model, ChatGPT requires no prior training by the end-user, nor are seed articles needed [11, 42]. Second, ChatGPT’s ability to analyse fuzzy or unstructured data on a semantic level enables it to discern, for example, whether ‘review’ refers to a literature review or a customer review [41]. As with human researchers, ChatGPT’s decisions can be backed up by an explanation, potentially highlighting errors in reasoning [11]. Lastly, ChatGPT’s multilingual understanding offers two advantages: it is trained on a large corpus of English data, giving it an advantage over human researchers who might not be native English speakers [10], and it enables researchers to extend their search to non-English research articles, which are currently underrepresented in reviews [43].
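To illustrate the zero-shot property, a screening call needs nothing beyond the criteria text and the abstract itself. The sketch below uses the openai Python SDK; the prompt wording and the function name are hypothetical, not the prompt used in our study.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def screen_abstract(title: str, abstract: str, criteria: str) -> str:
    """Zero-shot include/exclude decision with a one-sentence
    justification. No training data or seed articles required."""
    response = client.chat.completions.create(
        model="gpt-4",    # the study compared ChatGPT 4.0 and 3.5
        temperature=0.7,  # temperature used in the study
        messages=[
            {"role": "system",
             "content": ("You screen abstracts for a scoping review. "
                         "Answer INCLUDE or EXCLUDE, then give one "
                         "sentence of reasoning.")},
            {"role": "user",
             "content": (f"Inclusion criteria:\n{criteria}\n\n"
                         f"Title: {title}\n\nAbstract: {abstract}")},
        ],
    )
    return response.choices[0].message.content
```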
Although its performance metrics are slightly lower than those of semi-automated tools frequently used for systematic reviews, ChatGPT shows promising results: several studies have demonstrated its efficacy in systematic review screening [10, 11, 21, 34, 36]. A single study using ChatGPT 4.0 for scoping review screening reported an accuracy of 94%, a specificity of 94% and a sensitivity of 100% [11]. In our study, sensitivity was higher than specificity, suggesting that ChatGPT is effective at including relevant abstracts but less effective at excluding irrelevant ones.
Repeating the same prompt on different days to test the reliability of ChatGPT 4.0 resulted in a substantial interrater reliability of 0.76, consistent with findings in the scientific literature [28]. The slight interrater reliability between humans and ChatGPT in our study is lower than that reported for systematic reviews but consistent with other scoping reviews [10, 11]. This difference is likely due to scoping reviews having less clearly defined inclusion and exclusion criteria than systematic reviews; human researchers likewise demonstrate lower interrater reliability in scoping reviews than in systematic reviews [10, 11]. To foster trust in the tool, good alignment between the decisions of human researchers and ChatGPT is imperative [10].
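The agreement labels used here (slight, moderate, substantial) correspond to the Landis and Koch scale for Cohen’s kappa. Assuming kappa is the statistic in question, agreement between any two raters can be computed as in this toy sketch:

```python
from sklearn.metrics import cohen_kappa_score

# Two parallel decision vectors (True = include), e.g. the two
# ChatGPT 4.0 rating rounds or a human/chatbot pair. Toy data only.
rater_a = [True, False, False, True, False, False, True, False]
rater_b = [True, False, False, False, False, False, True, False]

kappa = cohen_kappa_score(rater_a, rater_b)
# Landis & Koch: <=0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate,
# 0.61-0.80 substantial, >0.80 almost perfect agreement.
print(f"kappa = {kappa:.2f}")
```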
Next steps
Different strategies for reducing screening workload have been proposed in the literature, such as using semi-automated tools or relying on a single reviewer. However, these approaches often affect recall: for instance, a single reviewer might miss around 13% of relevant studies [7]. ChatGPT-supported screening, with its rapid analysis of large datasets, opens up new possibilities for supporting human researchers while ensuring accurate screening and high recall [11, 44, 45]. Potential strategies include the following (sketched in code after the list):
- Sequential approach: This approach includes an initial screening round using ChatGPT to exploit its speed and scalability, with human researchers reviewing only the abstracts included by ChatGPT [10, 34, 41]. This strategy, based on ChatGPT’s high negative predictive value, could effectively limit the number of abstracts to be screened by researchers, possibly improving researchers’ concentration and motivation [40].
- Hybrid approach: A hybrid approach combines the decisions of one human researcher and the chatbot, with conflicts resolved by an additional researcher [2, 28]. This might balance the high sensitivity of chatbots with the high specificity of human raters [2, 10].
- Multiple chatbot voting rounds: In this approach, ChatGPT conducts multiple screening rounds. Abstracts are then included either if they were voted for inclusion at least once or if they reached a minimum number of inclusion votes [44]. This approach can also be combined with either the sequential or the hybrid approach.
Regardless of the chosen approach, the workload and cost savings may limit screening fatigue and enable researchers to use a broader search string, maximising sensitivity and improving recall [13, 15, 40, 41].
Several considerations are necessary when using ChatGPT. First, accurate outputs rely on complete, correct and unbiased inputs [21]; input data therefore needs to be meticulously prepared, including the manual addition of missing abstracts. Second, clear and specific instructions (prompts) improve the accuracy of the answers [46, 47]. Using the PCC scheme (Population-Concept-Context), frequently applied in scoping reviews, to provide inclusion criteria may be beneficial (see the sketch after this paragraph); initial trials with prompts using the PICOS scheme (Population-Intervention-Comparison-Outcome-Study design) for systematic review screening with ChatGPT demonstrated good results [10]. Providing more information, such as study type and year, might further enhance the chatbot’s performance [11]. Iterative engineering of the prompts based on these factors is necessary to achieve good human-chatbot interrater reliability [28]. Trials with different model parameters, such as temperature, which affects the randomness of the generated output, might also be beneficial [31]. Lastly, human oversight and trust are crucial. The uptake of ChatGPT might be slow, as researchers tend to be hesitant to use tools not yet widely accepted in the scientific community [38]. While automatic deduplication of references is an accepted standard procedure, large language models such as ChatGPT are perceived as a black box, both because of the complexity and lack of transparency of their output generation process and because system parameters, regarding the technology as well as the researchers’ inputs, are not commonly shared [41]. Continuous updates of the tool might yield different results, complicating replication efforts [10]. To gain researchers’ trust and acceptance in the scientific community, human oversight and a standardised, transparent approach are needed, alongside studies demonstrating the AI tool’s non-inferiority to human researchers in specific phases of a literature review [7, 11, 36, 38]. Furthermore, allowing researchers to set the AI tool’s decision threshold might reduce risk and increase trust [38].
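As an illustration of the second consideration, the inclusion criteria passed to a screening prompt (for example, the `criteria` argument in the earlier sketch) could be structured along the PCC scheme. The population, concept and context wording below is invented for illustration, not the criteria of our review.

```python
# Hypothetical PCC-structured criteria block for a screening prompt.
pcc_criteria = """\
Population: adult patients in primary care
Concept: use of chatbot-based decision support
Context: outpatient settings in high-income countries

Include an abstract only if it matches all three PCC elements."""
```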
Future research
To deepen our understanding of ChatGPT’s decisions, future research is needed to qualitatively analyse the explanations provided by ChatGPT and to compare its reasoning to that of human researchers. Special attention should be given to decisions where ChatGPT diverges from human judgement or from its own judgement in another rating round [10]. Additionally, we recommend prospectively evaluating the performance and workload savings of ChatGPT used alongside a single researcher, compared against a researcher pair, and investigating the different approaches outlined above [13].
Beyond its application in abstract screening, ChatGPT offers potential for implementation in various stages of the review process, including search strategy derivation, full-text screening and data extraction [5, 10, 21, 41]. ChatGPT could generate search terms and adapt them to different databases [21]. Currently, ChatGPT’s potential in full-text screening is limited by token restrictions on input [21, 41]. However, with increasing maximum token lengths, ChatGPT could become a viable tool for this phase as well [11]. Additionally, ChatGPT’s ability to understand context suggests its usefulness in data extraction [14].
Limitations
Key strengths of our study are the elicitation of reasons for ChatGPT’s decisions, the repetition of the ChatGPT 4.0 rating and the use of the current gold standard (final decisions of two independent researchers, with another researcher settling differences) as reference [34]. However, this study also has some limitations. First, despite constituting the gold standard, human decisions are not flawless, depending on reviewers’ expertise, experience and language proficiency [5, 10, 35]. Reviewers are furthermore trained to be over-inclusive, retrieving the full text even when minimally in doubt, as evidenced by a 70% exclusion rate during full-text screening, yet they still miss 3% of relevant studies [36, 44, 48]. Second, our results are based on a single scoping review with a well-defined scope. Further research is needed to investigate the generalisability of the results to scoping reviews in other disciplines and on less well-defined topics [10–12]. Lastly, due to practical constraints, we compared only one chatbot (but two models) and one semi-automated tool, whose core technology (a support vector machine) might not be the strongest currently available [13]. Comparing additional tools is advisable to elucidate the best approach.