This study aimed to compare the performance of three LLMs in response to prostate cancer inquiries, and the results revealed notable variability across the criteria of accuracy, comprehensiveness, readability, and stability. Although the evaluation of overall accuracy showed no significant difference among the LLMs, ChatGPT demonstrated superiority in most contexts. This finding aligns with previous studies that reached a similar conclusion, showcasing the capability of LLMs to provide accurate, but not optimal, answers to prostate cancer patients. [12, 13] For the general knowledge questions, unlike Google Bard, which had poor levels of accuracy, ChatGPT exhibited remarkable performance, signifying its potential as a valuable tool for patient education. [12] Interestingly, in the context of treatment, all LLMs showed similar accuracy ranges, with ChatGPT Plus in the lead. The comparable percentages between ChatGPT and Bard in the context of therapy could be due to the focused nature of these inquiries, which require factual recall rather than inference. This aligns with a previous study that found Google Bard's diagnostic skills inferior to those of physicians, since diagnosis demands strong clinical reasoning and inferential abilities. [14] When it came to diagnosis, all LLMs had promising outcomes with no significant differences, raising the possibility of using LLMs to formulate approaches that aid physicians in their diagnoses. Lastly, similar to the previous domain, the screening and prevention domain also demonstrated the preeminence of ChatGPT Plus, with no significant overall differences among the three LLMs. This accords with the general finding of this study: ChatGPT is the superior model in its ability to provide accurate responses to patients.
Our study demonstrated a statistically significant difference between ChatGPT Free, ChatGPT Plus, and Google Bard in overall comprehensiveness. Lim et al. evaluated the performance of ChatGPT Free, ChatGPT Plus, and Google Bard in generating comprehensive responses and found no statistically significant difference among the three LLM chatbots when comparing comprehensiveness scores for common queries answered by all three. [15] Our study showed that ChatGPT Plus produced the highest number of comprehensive responses. On the other hand, Zhu et al. reported ChatGPT Free as the best-performing LLM, providing the highest proportion of comprehensive responses (95.45%). [16] Xie et al., who compared the comprehensiveness of clinical guidance provided to junior doctors by three LLMs (including ChatGPT Plus and Google Bard), found that ChatGPT Plus performed best in generating comprehensive responses. [17] This aligns with our finding that ChatGPT Plus was the highest-ranking LLM in generating comprehensive responses.
Google Bard provided more easily readable answers, achieving higher Flesch Reading Ease (FRE) and lower Flesch-Kincaid Grade Level (FKGL) scores and generating clear, straightforward sentences. This finding aligns with several studies illustrating a college reading level for ChatGPT's answers. [18, 19] For instance, Cocci et al. analyzed ChatGPT's responses to Urology case studies and found that ChatGPT achieved a college graduate reading level, with median FRE and FKGL scores of 18 and 15.8, respectively. [18] Additionally, ChatGPT performed adequately in providing educational materials on dermatological diseases, with a mean reading ease score of 46.94. [19]
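For context, both metrics are computed purely from sentence length and word length; the standard definitions (the widely used formulas, not values specific to the cited studies) are:

$$\mathrm{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$$

$$\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$

Higher FRE values indicate easier text (scores of 0-30 correspond roughly to a college-graduate reading level), whereas FKGL maps directly onto U.S. school grade levels, so lower values indicate more accessible text.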
Conversely, Kianian et al. observed a lower FKGL for ChatGPT's responses (6.3 ± 1.2) than for Bard's (10.5 ± 0.8) when the models were asked to generate educational information about uveitis. [20] ChatGPT also achieved an eighth-grade readability level when generating responses on radiology cases. [21] Moreover, Xie et al. evaluated the readability of ChatGPT, Bard, and BingAI in generating answers to complex clinical scenarios. Among the three LLMs, ChatGPT had the highest Flesch Reading Ease score; nonetheless, Bard was a close runner-up, and no significant difference was reported between the two. [17] In summary, although ChatGPT and Google Bard differ significantly in readability levels, both provide clear, understandable text at a grade level suitable for patients seeking knowledge on prostate cancer.
Almost all generated answers were stable, except for one question within the screening and prevention domain. Specifically, when asked, "Should I get screened for prostate cancer?", ChatGPT's first answer was less accurate than its second and third answers; it was therefore labeled "inconsistent" for this question. It is important to note that only ten questions were tested for stability and compared across the three LLMs, as their responses are generally stable. In future studies, all inquiries should be tested and objectively evaluated in terms of accuracy, comprehensiveness, and readability, and the extent of their stability should be determined.
AI chatbots have shown outstanding performance in providing precise, thorough information on prostate cancer. Nonetheless, even if AI can learn everything and anything about prostate cancer, it remains a purely objective source of knowledge, since it has never had the lived experience of treating such cases. This is captured by the Knowledge Argument, which holds that a complete physical description of a disease cannot replace the actual perceptual experience of treating it. There is a fundamental difference between knowing everything about prostate cancer and actually having the experience of treating patients and understanding their needs. Qualia is the philosophical term describing this subjective and personal knowledge gained from physician-patient interactions, the empathy evoked by witnessing patients' suffering, and the tactile feedback experienced during physical examination or surgery. [21] Since these qualia are inaccessible to AI, it cannot replace physicians in healthcare education.