AI-based chatbots have recently emerged as accessible resources for providing medical information to patients.(5) These chatbots are built on natural language processing (NLP) and machine learning and offer human-like text responses. As chatbots grow increasingly popular, evaluating their accuracy is important to support both patient and physician decision making.
Despite their widespread use, evidence-based data evaluating the scientific accuracy of chatbots in answering patients' questions remain scarce. Lahat et al.(9) evaluated the performance of ChatGPT in answering patients' questions in the field of gastroenterology. Their results showed that ChatGPT provided accurate answers in some, but not all, cases: answers to questions about the treatment of specific medical conditions were the most accurate, whereas answers describing disease symptoms were the least accurate.
Our work evaluates the accuracy, comprehensiveness, and clarity of AI-based chatbots in addressing common patient queries within the field of ophthalmology.
Our results show that both ChatGPT and Bard can provide good, clear answers to patients' questions in clinical ophthalmology. This is consistent with previous studies, which found that chatbots are a promising diagnostic adjunct in ophthalmology but cannot yet replace professional ophthalmic evaluation.(10–12)
In the current study, ChatGPT received higher median ratings than Bard for accuracy (4.0 vs. 3.0), comprehensiveness (4.5 vs. 3.0), and clarity (5.0 vs. 4.0) in the experts' evaluations. These differences were statistically significant and indicate that the two models differ substantially in their ability to deliver accurate, comprehensive, and clear responses to ophthalmology queries, giving ChatGPT a relative advantage in these respects.
Other recent studies comparing Bard and ChatGPT likewise found that ChatGPT's answers were more accurate.(13, 14)
In our study, eight consultants from different ophthalmology subspecialties compared the answers. Both the number of experts and their subspecialty diversity are relatively high compared with previous studies.(13, 14)
Our study is not without limitations. Although blinded to the specific AI model, the experts' evaluations are inherently biased and affected by their own clinical knowledge and experience. Moreover, our conclusions are based on a specific set of questions and might differ had the questions been drafted differently. Other AI-based chatbots were not evaluated in this paper, and their accuracy in answering questions in clinical ophthalmology remains to be studied. In addition, we queried the models through their web interfaces; we therefore did not evaluate hyperparameter tuning or other advanced techniques such as retrieval-augmented generation (RAG) or fine-tuning. Nor did we explore prompt engineering, instead using a simple, straightforward prompt. However, the web interface replicates how patients typically interact with chatbots, which is the scenario our study aimed to simulate.
In conclusion, our study highlights the potential utility of chatbots, especially ChatGPT, as supplementary resources for addressing common patient ophthalmology inquiries. While these AI models show promise, the disparities in their performance underscore the need for ongoing refinement and optimization to align more closely with expert-level responses. Future research should focus on enhancing the comprehensiveness, accuracy, and clarity of AI-driven responses to meet the demands of clinical ophthalmology practice.