Multispecialty Analysis
ChatGPT-generated answers were initially evaluated based on 180 questions provided by 33 physicians from 17 specialties. As noted, each physician provided three descriptive and three binary questions at different difficulty levels (easy, medium, and hard), except for one author, who provided two descriptive question sets (Figure 1). An example of an easy-difficulty descriptive question was, “What are the first-line treatments for Stage IA mycosis fungoides?” An example of a medium-difficulty descriptive question was, “Which patients with well-differentiated thyroid cancer should receive postoperative radioactive iodine ablation?” An example of a hard-difficulty binary question was, “Can we start angiotensin receptor-neprilysin inhibitors immediately after discontinuing an angiotensin-converting enzyme inhibitor?” For additional example questions and answers, see Table 1.
Among the 180 ChatGPT-generated answers, the median accuracy score was 5 (mean 4.4, SD 1.7), and the median completeness score was 3 (mean 2.4, SD 0.7) (Table 2). Of these answers, 39.4% (n=71) were scored at the highest level of accuracy (accuracy score of 6) and 18.3% (n=33) were scored as nearly all correct (accuracy score of 5). Conversely, 8.3% (n=15) of answers were scored as completely incorrect (accuracy score of 1). Inaccurate answers, defined as those receiving accuracy scores of 2 or lower (n=36), were most commonly in response to physician-rated hard questions with either binary answers (n=8, 22%) or descriptive answers (n=7, 19%), but were distributed across all categories. The completeness of answers was also evaluated: 53.3% (n=96) were scored as comprehensive, 26.1% (n=47) as adequate, and 12.2% (n=22) as incomplete. Fifteen (8.3%) answers did not receive completeness ratings due to an accuracy score of 1 (completely incorrect). Accuracy and completeness were modestly correlated across all questions (Spearman’s r = 0.4, 95% CI 0.3 to 0.5, p < 0.01, alpha = 0.05).
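For illustration only, the following is a minimal sketch of how a Spearman rank correlation between paired accuracy and completeness ratings could be computed with SciPy; the score lists are hypothetical and are not the study data.

```python
# Minimal sketch (not the authors' analysis code): Spearman correlation between
# paired accuracy (1-6) and completeness (1-3) ratings, using hypothetical scores.
from scipy.stats import spearmanr

accuracy_scores = [6, 5, 4, 6, 2, 5, 3, 6]      # hypothetical accuracy ratings
completeness_scores = [3, 3, 2, 3, 1, 2, 2, 3]  # hypothetical completeness ratings

rho, p_value = spearmanr(accuracy_scores, completeness_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```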
Question Type and Difficulty Level
Across descriptive and binary questions combined, the median accuracy scores for easy, medium, and hard answers were 5 (mean 4.6, SD 1.7, IQR 3), 5 (mean 4.3, SD 1.7, IQR 3), and 5 (mean 4.2, SD 1.8, IQR 3.8), respectively, with no significant difference between difficulty levels (Kruskal-Wallis p = 0.4). The corresponding median completeness scores were 3 (mean 2.6, SD 0.7, IQR 1), 3 (mean 2.4, SD 0.7, IQR 1), and 2.5 (mean 2.4, SD 0.7, IQR 1), respectively, again with no difference in completeness by difficulty (Kruskal-Wallis p = 0.3).
Descriptive and binary questions were also analyzed separately to assess ChatGPT’s performance on each question type. The median accuracy score of descriptive questions (n=93) was 5 (mean 4.3, SD 1.7, IQR 3), and the median accuracy score of binary questions (n=87) was also 5 (mean 4.5, SD 1.7, IQR 3), with no significant difference between groups (Mann-Whitney U p = 0.3) (Table 2). Among descriptive questions, the median accuracy scores for easy, medium, and hard questions were 5 (mean 4.9, SD 1.5, IQR 3), 5 (mean 4.4, SD 1.9, IQR 3), and 5 (mean 4.1, SD 1.8, IQR 3), respectively (Kruskal-Wallis p = 0.7) (Table 2, Figure 2A).
Among binary questions, the median accuracy scores for easy, medium, and hard answers were 6 (mean 4.9, SD 1.8, IQR 1), 4 (mean 4.3, SD 1.6, IQR 3), and 5 (mean 4.2, SD 1.8, IQR 4), respectively, with no statistically significant difference (Kruskal-Wallis p = 0.1) (Table 2, Figure 2B). Overall, these results suggested no major differences in the accuracy or completeness of ChatGPT-generated answers for descriptive or binary questions across levels of difficulty.
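As an illustrative sketch (not the study’s analysis code), the nonparametric comparisons reported above could be computed with SciPy as follows; the difficulty-level and question-type score lists are hypothetical.

```python
# Minimal sketch: Kruskal-Wallis test across difficulty levels and Mann-Whitney U
# test between question types, using hypothetical accuracy scores.
from scipy.stats import kruskal, mannwhitneyu

easy = [6, 5, 6, 4, 5, 3]    # hypothetical accuracy scores, easy questions
medium = [5, 4, 6, 3, 5, 2]  # hypothetical accuracy scores, medium questions
hard = [5, 3, 6, 2, 4, 5]    # hypothetical accuracy scores, hard questions

# Compare accuracy across the three difficulty levels
h_stat, p_difficulty = kruskal(easy, medium, hard)

descriptive = [5, 4, 6, 3, 5, 6]  # hypothetical scores, descriptive questions
binary = [6, 5, 4, 5, 3, 6]       # hypothetical scores, binary questions

# Compare accuracy between descriptive and binary questions
u_stat, p_type = mannwhitneyu(descriptive, binary, alternative="two-sided")

print(f"Kruskal-Wallis p = {p_difficulty:.2f}, Mann-Whitney U p = {p_type:.2f}")
```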
Internal Validation: Re-scored Analysis
Of the 36 inaccurate answers that received a score of 2 or lower on the accuracy scale, 34 were re-scored by physicians to evaluate the reproducibility of answers over time (Table 3). Scores generally improved: 26 answers improved, 7 remained the same, and 1 decreased in accuracy. The median accuracy score for the original answers was 2 (mean 1.6, SD 0.5, IQR 1) compared with 4 (mean 3.9, SD 1.8, IQR 3.3) for the re-scored answers (Wilcoxon signed-rank p < 0.01) (Table S3, Figure 2C). The re-scored answers were generated by ChatGPT 8 to 17 days after the initial answers were generated.
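A minimal sketch of this paired comparison between original and re-scored ratings, assuming hypothetical paired scores for the re-graded answers, might look as follows; it is illustrative and not the study’s analysis code.

```python
# Minimal sketch: Wilcoxon signed-rank test on paired original vs re-scored
# accuracy ratings, using hypothetical paired scores.
from scipy.stats import wilcoxon

original = [2, 1, 2, 1, 2, 2, 1, 2]  # hypothetical original accuracy scores (all <= 2)
rescored = [4, 3, 5, 2, 6, 4, 3, 5]  # hypothetical re-scored accuracy scores

stat, p_value = wilcoxon(original, rescored)
print(f"Wilcoxon signed-rank p = {p_value:.3f}")
```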
Melanoma and Immunotherapy Analysis
To further assess performance and judge inter-rater variability, two physicians (DBJ and LEW) independently assessed additional questions on melanoma diagnosis and treatment, as well as cancer immunotherapy use, drawn from guidelines existing before 2021. Among 44 AI-generated answers, the median accuracy score was 6 (mean 5.2, SD 1.3, IQR 1), and the median completeness score was 3 (mean 2.6, SD 0.8, IQR 0.5) (Table 2). The median accuracy scores of descriptive and binary questions were 6 (mean 5.1, SD 1.5, IQR 1) and 6 (mean 5.4, SD 1.2, IQR 1), respectively (Mann-Whitney U p = 0.7). Across descriptive and binary questions combined, the median accuracy scores for easy, medium, and hard answers were 6 (mean 5.9, SD 0.3, IQR 0), 5.5 (mean 4.8, SD 1.7, IQR 2.1), and 5.8 (mean 5.3, SD 1.1, IQR 1), respectively, a statistically significant difference across difficulty levels (Kruskal-Wallis p = 0.046). There was fair inter-rater agreement for accuracy (kappa = 0.3, SE 0.1, 95% CI 0.1 to 0.6) and moderate agreement for completeness (kappa = 0.5, SE 0.2, 95% CI 0.2 to 0.8) (Table S4). When the 6 accuracy categories were condensed into 3 subgroups (1-2, 3-4, 5-6), inter-rater agreement for accuracy was moderate (kappa = 0.5, SE 0.2, 95% CI 0.2 to 0.8).
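For illustration, Cohen’s kappa for two raters, with and without condensing the six accuracy categories into three subgroups, could be computed as sketched below; the ratings are hypothetical and the use of scikit-learn is an assumption, not a description of the study’s software.

```python
# Minimal sketch: inter-rater agreement via Cohen's kappa on hypothetical ratings,
# before and after condensing accuracy scores 1-2, 3-4, 5-6 into three subgroups.
from sklearn.metrics import cohen_kappa_score

rater_1 = [6, 5, 4, 6, 2, 5, 3, 6, 1, 4]  # hypothetical accuracy scores, rater 1
rater_2 = [6, 4, 4, 5, 3, 5, 2, 6, 2, 5]  # hypothetical accuracy scores, rater 2

kappa_full = cohen_kappa_score(rater_1, rater_2)

def condense(score: int) -> int:
    """Map accuracy categories 1-2 -> 1, 3-4 -> 2, 5-6 -> 3."""
    return (score + 1) // 2

kappa_condensed = cohen_kappa_score([condense(s) for s in rater_1],
                                    [condense(s) for s in rater_2])

print(f"kappa (6 categories) = {kappa_full:.2f}, "
      f"kappa (3 subgroups) = {kappa_condensed:.2f}")
```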
Common Conditions Analysis
To assess performance further on general questions widely pertinent across practitioners, the same two physicians (LEW and DBJ) generated and graded questions related to ten common medical conditions (Table S2). Among 60 AI-generated answers, the median accuracy score was 6 (mean 5.7, SD 0.7, IQR 0.3), and the median completeness score was 3 (mean 2.8, SD 0.5, IQR 0). The median accuracy score of descriptive questions was 6 (mean 5.6, SD 0.6, IQR 0.5), and the median accuracy score of binary questions was 6 (mean 5.8, SD 0.8, IQR 0.1) (Mann-Whitney U p = 0.1). Across descriptive and binary questions combined, the median accuracy scores for easy, medium, and hard answers were 6 (mean 5.9, SD 0.4, IQR 0), 6 (mean 5.6, SD 1.0, IQR 0.5), and 6 (mean 5.6, SD 0.1, IQR 0.5), respectively (Kruskal-Wallis p = 0.07). There was fair inter-rater agreement for accuracy (kappa = 0.4, SE 0.1, 95% CI 0.1 to 0.6) and slight agreement for completeness (kappa = 0.2, SE 0.1, 95% CI 0.01 to 0.4) (Table S5). When the 6 accuracy categories were grouped into 3 subgroups (1-2, 3-4, 5-6), inter-rater agreement for accuracy was moderate (kappa = 0.5, SE 0.2, 95% CI 0.03 to 0.9).
Total Analysis
Among all AI-generated answers (n=284) from the three datasets (not including re-graded answers), the median accuracy score was 5.5 (mean 4.8, SD 1.6, IQR 2), and the median completeness score was 3 (mean 2.5, SD 0.7, IQR 1) (Table 2). The median accuracy of all descriptive questions was 5 (mean 4.7, SD 1.6, IQR 2.6), and the median accuracy of binary questions was 6 (mean 4.9, SD 1.6, IQR 2) (Mann-Whitney U p = 0.07). Among descriptive questions, the median accuracy scores for easy, medium, and hard questions were 5.25 (mean 4.8, SD 1.5, IQR 3), 5.5 (mean 4.7, SD 1.7, IQR 2.8), and 5 (mean 4.5, SD 1.6, IQR 2.4), respectively (Kruskal-Wallis p = 0.4) (Figure 2D). Among binary questions, the median accuracy scores for easy, medium, and hard questions were 6 (mean 5.3, SD 1.5, IQR 1), 5.5 (mean 4.6, SD 1.6, IQR 2.6), and 5.5 (mean 4.8, SD 1.6, IQR 2), respectively, a statistically significant difference among difficulty levels (Kruskal-Wallis p = 0.03) (Figure 2E).
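As a final illustrative sketch, the summary statistics reported throughout (median, mean, SD, and IQR) could be computed with NumPy as follows, using a hypothetical pooled score list rather than the study data.

```python
# Minimal sketch: median, mean, sample SD, and IQR for a hypothetical pooled
# set of accuracy scores (not the study data).
import numpy as np

scores = np.array([6, 5, 6, 4, 5, 3, 6, 2, 5, 6])  # hypothetical pooled accuracy scores

median = np.median(scores)
mean = scores.mean()
sd = scores.std(ddof=1)                     # sample standard deviation
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1                               # interquartile range

print(f"median {median:.1f}, mean {mean:.1f}, SD {sd:.1f}, IQR {iqr:.1f}")
```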