AI-based chatbots have recently emerged as accessible resources for providing medical information to patients.(5) These chatbots are built on natural language processing (NLP) and machine learning and offer human-like text responses. As chatbots grow increasingly popular, evaluating their accuracy is important to support both patient and physician decision making.
Despite their widespread use, evidence-based data evaluating the scientific accuracy of chatbots in answering patients' questions remain scarce. Lahat et al.(9) evaluated the performance of ChatGPT in answering patients' questions in the field of gastroenterology. Their results showed that ChatGPT provided accurate answers in some, but not all, cases: answers to questions about the treatment of specific medical conditions were the most accurate, whereas answers describing disease symptoms were the least accurate.
Our work evaluates the accuracy, comprehensiveness, and clarity of AI-based chatbots in addressing common patient queries within the field of ophthalmology.
Our results show that both ChatGPT and Bard can provide good, clear answers to patients' questions in clinical ophthalmology. This is consistent with previous studies, which found that chatbots are a promising diagnostic adjunct in ophthalmology but cannot yet replace professional ophthalmic evaluation.(10–12)
In the current study, ChatGPT received higher median ratings than Bard for accuracy (4.0 vs. 3.0), comprehensiveness (4.5 vs. 3.0), and clarity (5.0 vs. 4.0) in the experts' evaluations. These differences were statistically significant and indicate that the two models differ substantially in their ability to deliver accurate, comprehensive, and clear responses to ophthalmology queries, giving ChatGPT a relative advantage in these respects.
Other recent studies comparing Bard and ChatGPT likewise found that ChatGPT's answers were more accurate.(13, 14)
In our study, eight consultants from different ophthalmology subspecialties compared the answers. Both the number of experts and their subspecialty diversity are relatively high compared with previous studies.(13, 14)
Our study is not without limitations. Although blinded to the specific AI model, the experts' evaluations are inherently biased and affected by their own clinical knowledge and experience. Moreover, our conclusions are based on a specific set of questions and might differ had the questions been drafted differently. Other AI-based chatbots were not evaluated in this paper, and their accuracy in answering questions in clinical ophthalmology remains to be studied. In addition, we queried the models through their web interfaces; we therefore did not evaluate hyperparameter tuning or other advanced techniques such as retrieval-augmented generation (RAG) or fine-tuning. Nor did we explore prompt engineering, instead using a simple, straightforward prompt. However, the web interface replicates how patients typically interact with chatbots, which is the scenario our study aimed to simulate.
In conclusion, our study highlights the potential utility of chatbots, especially ChatGPT, as supplementary resources for addressing common patient ophthalmology inquiries. While these AI models show promise, the disparities in their performance underscore the need for ongoing refinement and optimization to align more closely with expert-level responses. Future research should focus on enhancing the comprehensiveness, accuracy, and clarity of AI-driven responses to meet the demands of clinical ophthalmology practice.