This study aimed to compare the performance of three LLMs in response to prostate cancer inquiries, and the results revealed notable variability across the criteria of accuracy, comprehensiveness, readability, and stability. Although the evaluation of overall accuracy showed no significant difference among the LLMs, ChatGPT demonstrated superiority in most contexts. This finding aligns with previous studies that reached a similar conclusion, showcasing the capability of LLMs to provide accurate, but not optimal, answers to prostate cancer patients. [12, 13] For the general knowledge questions, unlike Google Bard, which had poor levels of accuracy, ChatGPT exhibited remarkable performance, signifying its potential as a valuable tool for patient education. [12] Interestingly, in the context of treatment, all LLMs showed similar accuracy ranges, with ChatGPT Plus in the lead. The comparable percentages between ChatGPT and Bard in the context of therapy could be due to the focused nature of these inquiries, which require factual recall rather than inference. This aligns with a previous study that found Google Bard's diagnostic skills inferior to those of physicians, since diagnosis demands strong clinical reasoning and inferential abilities. [14] When it came to diagnosis, all LLMs had promising outcomes with no significant differences, raising the possibility of using LLMs to formulate approaches that aid physicians in their diagnoses. Lastly, similar to the previous domain, the screening and prevention domain also demonstrated the preeminence of ChatGPT Plus, with no significant overall differences among the three LLMs. This accords with the general finding of this study: ChatGPT is the superior model in its ability to provide accurate responses to patients.
Our study demonstrated a statistically significant difference between ChatGPT Free, ChatGPT Plus, and Google Bard in overall comprehensiveness. Lim et al. evaluated the performance of ChatGPT Free, ChatGPT Plus, and Google Bard in generating comprehensive responses and found no statistically significant difference among the three LLM chatbots when comparing comprehensiveness scores for common queries answered by all three. [15] Our study showed that ChatGPT Plus produced the highest number of comprehensive responses. On the other hand, Zhu et al. reported ChatGPT Free as the best-performing LLM, providing the highest proportion of comprehensive responses (95.45%). [16] Xie et al., who compared the comprehensiveness of clinical guidance provided to junior doctors by three LLMs (including ChatGPT Plus and Google Bard), found that ChatGPT Plus performed best in generating comprehensive responses. [17] This aligns with our finding that ChatGPT Plus was the highest-ranking LLM in generating comprehensive responses.
Google Bard provided more easily readable answers, achieving higher Flesch Reading Ease (FRE) and lower Flesch-Kincaid Grade Level (FKGL) scores and generating clear, straightforward sentences. This finding aligns with several studies illustrating a college reading level for ChatGPT's answers. [18, 19] For instance, Cocci et al. analyzed ChatGPT's responses to Urology case studies and found that ChatGPT achieved a college graduate reading level, with median FRE and FKGL scores of 18 and 15.8, respectively. [18] Additionally, ChatGPT performed adequately in providing educational materials on dermatological diseases, with a mean reading ease score of 46.94. [19]
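For context, both metrics are computed purely from sentence length and word length; the standard definitions (the widely used formulas, not values specific to the cited studies) are:

$$\mathrm{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$$

$$\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$

Higher FRE values indicate easier text (scores of 0-30 correspond roughly to a college-graduate reading level), whereas FKGL maps directly onto U.S. school grade levels, so lower values indicate more accessible text.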
Conversely, Kianian et al. observed a lower FKGL for ChatGPT's responses (6.3 ± 1.2) than for Bard's (10.5 ± 0.8) when the models were asked to generate educational information about uveitis. [20] ChatGPT also achieved an eighth-grade readability level when generating responses on radiology cases. [21] Moreover, Xie et al. evaluated the readability of ChatGPT, Bard, and BingAI in generating answers to complex clinical scenarios. Among the three LLMs, ChatGPT had the highest Flesch Reading Ease score; nonetheless, Bard was a close runner-up, and no significant difference was reported between the two. [17] In summary, although ChatGPT and Google Bard differ significantly in readability levels, both provide clear, understandable text at a grade level suitable for patients seeking knowledge on prostate cancer.
Almost all generated answers were stable, except for one question within the screening and prevention domain. Specifically, when asked, "Should I get screened for prostate cancer?", ChatGPT's first answer was less accurate than its second and third answers; it was therefore labeled "inconsistent" for this question. It is important to note that only ten questions were tested for stability and compared across the three LLMs, as their responses are generally stable. In future studies, all inquiries should be tested and objectively evaluated in terms of accuracy, comprehensiveness, and readability, and the extent of their stability should be determined.
AI chatbots have shown outstanding performance in providing precise, thorough information on prostate cancer. Nonetheless, even if AI can learn everything and anything about prostate cancer, it remains a purely objective source of knowledge, since it has never had the lived experience of treating such cases. This is captured by the Knowledge Argument, which holds that a complete physical description of a disease cannot replace the actual perceptual experience of treating it. There is a fundamental difference between knowing everything about prostate cancer and actually having the experience of treating patients and understanding their needs. Qualia is the philosophical term describing this subjective and personal knowledge gained from physician-patient interactions, the empathy evoked by witnessing patients' suffering, and the tactile feedback experienced during physical examination or surgery. [21] Since these qualia are inaccessible to AI, it cannot replace physicians in healthcare education.