1. Main Results
All three models (GPT-4.0, Ernie Bot 4.0, and GPT-4o) passed the exam, with accuracies exceeding both the national average score (58.14%) and the pass mark (60%). GPT-4.0 answered 434 of 600 questions correctly, an accuracy of 72.33%, while Ernie Bot 4.0 and GPT-4o each answered 503 questions correctly, an accuracy of 83.83%. Statistical analysis showed that Ernie Bot 4.0 and GPT-4o performed significantly better than GPT-4.0 (p < 0.0001), with no significant difference between Ernie Bot 4.0 and GPT-4o. For details, see Table 1 and Figure 1.
Table 1: Overall performance of each model.
| Model | Correct Answers (of 600) | Accuracy (%) |
| --- | --- | --- |
| GPT-4.0 | 434 | 72.33 |
| Ernie Bot 4.0 | 503 | 83.83 |
| GPT-4o | 503 | 83.83 |
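The pairwise comparisons of overall accuracy can be reproduced from the counts in Table 1. The paper does not name the test used for this comparison, so the chi-square test on a 2×2 correct/incorrect table below is a sketch of one standard choice, not necessarily the authors' method:

```python
from scipy.stats import chi2_contingency

# Correct-answer counts out of 600 questions (Table 1)
gpt4, ernie, gpt4o = 434, 503, 503
TOTAL = 600

def compare(correct_a, correct_b, n=TOTAL):
    """p-value for a 2x2 chi-square test (Yates-corrected) on two accuracy counts."""
    table = [[correct_a, n - correct_a],
             [correct_b, n - correct_b]]
    chi2, p, dof, expected = chi2_contingency(table)
    return p

p_gpt4_vs_ernie = compare(gpt4, ernie)    # far below 0.0001
p_ernie_vs_gpt4o = compare(ernie, gpt4o)  # identical counts, so no difference (p = 1)
```

Run this way, the GPT-4.0 vs Ernie Bot 4.0 comparison lands well below the reported p < 0.0001 threshold, and the Ernie Bot 4.0 vs GPT-4o comparison is, as expected for identical counts, not significant.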
Scoring standards for the Chinese Medical Licensing Examination award one point per correct answer, with no points for incorrect responses. According to the official score report, the average score for human candidates was 348.84, with the passing threshold set at 360 points; a score of 360 places a candidate at the 57th percentile of human examinees. Assuming the score distribution is normal, GPT-4.0 likely surpassed 91.08% of human candidates, while Ernie Bot 4.0 and GPT-4o potentially exceeded 99.26%.
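The percentile estimates follow from the normality assumption: the reported mean (348.84) together with the fact that 360 points sits at the 57th percentile pins down the standard deviation, after which each model's score (one point per correct answer) converts to a percentile. A sketch of that arithmetic, using SciPy's normal CDF and quantile function (the tooling choice is an assumption):

```python
from scipy.stats import norm

mean, pass_score, pass_percentile = 348.84, 360, 0.57

# Infer the standard deviation from the pass mark's percentile:
# (360 - 348.84) / sigma = Phi^{-1}(0.57)
sigma = (pass_score - mean) / norm.ppf(pass_percentile)

# One point per correct answer, so a model's score equals its correct count
pct_gpt4 = norm.cdf((434 - mean) / sigma)         # ~0.9108, matching 91.08%
pct_ernie_gpt4o = norm.cdf((503 - mean) / sigma)  # ~0.9926, matching 99.26%
```

Both figures reproduce the percentiles quoted above to two decimal places.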
2. Performance by Major Subject
We calculated each model's accuracy in Basic Medical Comprehensive, Clinical Medical Comprehensive, and Humanities and Preventive Medicine Comprehensive, and used independent-sample t-tests to assess the statistical significance of the differences in accuracy between models. The results, illustrated in Figure 2, are as follows:
Basic Medical Comprehensive: GPT-4.0 achieved an accuracy of 72.58%, Ernie Bot 4.0 85.48%, and GPT-4o 93.55%. GPT-4o was significantly superior to GPT-4.0 (p < 0.05); there was no significant difference between GPT-4.0 and Ernie Bot 4.0.
Clinical Medical Comprehensive: GPT-4.0 achieved an accuracy of 71.99%, Ernie Bot 4.0 83.40%, and GPT-4o 83.20%. GPT-4.0 was significantly lower than both Ernie Bot 4.0 and GPT-4o (p < 0.05); there was no significant difference between Ernie Bot 4.0 and GPT-4o.
Humanities and Preventive Medicine Comprehensive: GPT-4.0 achieved an accuracy of 75.00%, Ernie Bot 4.0 85.71%, and GPT-4o 78.57%, with no significant differences among the three models.
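The subject-level t-tests above can be reproduced by treating each question as a 0/1 outcome and comparing the two outcome vectors. The sketch below does this for Basic Medical Comprehensive; the correct-answer counts (45, 53, and 58 of 62 questions) are back-calculated from the reported accuracies and should be treated as an assumption, as is the use of SciPy:

```python
import numpy as np
from scipy.stats import ttest_ind

def outcomes(correct, total):
    """Per-question 0/1 correctness vector."""
    return np.array([1] * correct + [0] * (total - correct))

# Basic Medical Comprehensive: counts back-calculated from the reported
# accuracies (72.58%, 85.48%, 93.55%), assuming 62 questions
n = 62
gpt4, ernie, gpt4o = outcomes(45, n), outcomes(53, n), outcomes(58, n)

p_gpt4_vs_gpt4o = ttest_ind(gpt4, gpt4o).pvalue  # significant (p < 0.05)
p_gpt4_vs_ernie = ttest_ind(gpt4, ernie).pvalue  # not significant (p > 0.05)
```

Under these back-calculated counts, the GPT-4.0 vs GPT-4o difference is significant while GPT-4.0 vs Ernie Bot 4.0 is not, matching the pattern reported for this subject.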
These results suggest that Ernie Bot 4.0 and GPT-4o generally outperform GPT-4.0, especially in the Basic Medical and Clinical Medical subjects. Ernie Bot 4.0 shows balanced performance across subjects, reflecting its stability and its advantage in processing Chinese-language material. GPT-4o excels particularly in Basic Medical Comprehensive, demonstrating its capability with broad scientific knowledge. In Humanities and Preventive Medicine Comprehensive, which primarily involves law and computation, no significant differences were observed between the models.
3. Performance by Specific Subject
The accuracy of the three models (GPT-4.0, Ernie Bot 4.0, and GPT-4o) was further analyzed across the specific subjects within each major category. The detailed performance for each specific subject is summarized in Table 2.
Table 2: Accuracy rates of GPT-4.0, Ernie Bot 4.0, and GPT-4o across specific subjects.
| Specific Subject | GPT-4.0 (Correct/Total, %) | Ernie Bot 4.0 (Correct/Total, %) | GPT-4o (Correct/Total, %) |
| --- | --- | --- | --- |
| Anatomy (20) | 16/20 (80.00%) | 20/20 (100.00%) | 20/20 (100.00%) |
| Cardiovascular System (24) | 18/24 (75.00%) | 20/24 (83.33%) | 22/24 (91.67%) |
| Digestive System (26) | 18/26 (69.23%) | 22/26 (84.62%) | 19/26 (73.08%) |
| Female Reproductive System (22) | 15/22 (68.18%) | 19/22 (86.36%) | 18/22 (81.82%) |
| Health Regulations (13) | 5/13 (38.46%) | 11/13 (84.62%) | 8/13 (61.54%) |
| Hematology (27) | 21/27 (77.78%) | 23/27 (85.19%) | 24/27 (88.89%) |
| Infectious Diseases and Sexually Transmitted Diseases (15) | 10/15 (66.67%) | 13/15 (86.67%) | 11/15 (73.33%) |
| Medical Ethics (14) | 12/14 (85.71%) | 13/14 (92.86%) | 11/14 (78.57%) |
| Medical Immunology (6) | 6/6 (100.00%) | 6/6 (100.00%) | 6/6 (100.00%) |
| Medical Microbiology (2) | 0/2 (0.00%) | 2/2 (100.00%) | 0/2 (0.00%) |
| Medical Psychology (3) | 2/3 (66.67%) | 3/3 (100.00%) | 2/3 (66.67%) |
| Metabolic and Endocrine System (20) | 17/20 (85.00%) | 19/20 (95.00%) | 19/20 (95.00%) |
| Musculoskeletal System (25) | 20/25 (80.00%) | 20/25 (80.00%) | 23/25 (92.00%) |
| Nervous and Mental System (25) | 21/25 (84.00%) | 23/25 (92.00%) | 20/25 (80.00%) |
| Pathology (6) | 6/6 (100.00%) | 6/6 (100.00%) | 6/6 (100.00%) |
| Pathophysiology (19) | 13/19 (68.42%) | 18/19 (94.74%) | 18/19 (94.74%) |
| Pediatric Diseases (21) | 12/21 (57.14%) | 18/21 (85.71%) | 18/21 (85.71%) |
| Pharmacology (16) | 13/16 (81.25%) | 12/16 (75.00%) | 15/16 (93.75%) |
| Physiology (8) | 8/8 (100.00%) | 7/8 (87.50%) | 8/8 (100.00%) |
| Preventive Medicine (25) | 22/25 (88.00%) | 19/25 (76.00%) | 22/25 (88.00%) |
| Respiratory System (24) | 19/24 (79.17%) | 19/24 (79.17%) | 21/24 (87.50%) |
| Rheumatic and Immune Diseases (13) | 10/13 (76.92%) | 13/13 (100.00%) | 13/13 (100.00%) |
| Surgery (35) | 25/35 (71.43%) | 28/35 (80.00%) | 27/35 (77.14%) |
| Urology (18) | 11/18 (61.11%) | 14/18 (77.78%) | 13/18 (72.22%) |
Overall, Ernie Bot 4.0 and GPT-4o outperform GPT-4.0, showing higher accuracy across many subjects. For instance, in the female reproductive system, digestive system, and urology subjects, the accuracy rates of Ernie Bot 4.0 and GPT-4o are notably higher than those of GPT-4.0.
In terms of performance by specific subjects:
Anatomy: All models performed well, with Ernie Bot 4.0 and GPT-4o achieving 100% accuracy and GPT-4.0 reaching 80%.
Cardiovascular System: GPT-4o had the highest accuracy at 91.67%, followed by Ernie Bot 4.0 at 83.33% and GPT-4.0 at 75.00%.
Female Reproductive System: Ernie Bot 4.0 and GPT-4o clearly outperformed GPT-4.0, with accuracies of 86.36% and 81.82% respectively, versus 68.18% for GPT-4.0.
Health Regulations: Ernie Bot 4.0 performed best at 84.62%, GPT-4o scored 61.54%, and GPT-4.0 had the lowest rate at 38.46%.
Pediatric Diseases: Ernie Bot 4.0 and GPT-4o achieved the same accuracy of 85.71%, while GPT-4.0 reached only 57.14%.
The box plot (Figure 3) visualizes the distribution of accuracy rates for each model (GPT-4.0, Ernie Bot 4.0, and GPT-4o) across all specific subjects.
From Figure 3, it is evident that Ernie Bot 4.0 and GPT-4o show greater consistency and stability in accuracy than GPT-4.0. Ernie Bot 4.0 performs more evenly across subjects, while GPT-4o reaches higher accuracy rates, indicating strong overall performance.
A detailed analysis of the three models' performance across medical disciplines shows that Ernie Bot 4.0 and GPT-4o generally outperform GPT-4.0 in most subjects, suggesting that, in the context of the Medical Licensing Examination, they may offer greater practical value and reliability.
4. Performance by Question Type
We compared the performance of GPT-4.0, Ernie Bot 4.0, and GPT-4o across the different question types, calculating their accuracy rates and the pairwise p-values between models. The specific results are shown in Table 3.
Table 3: Accuracy rates and pairwise statistical comparisons by question type.
| Question Type | GPT-4.0 (Correct/Total, %) | Ernie Bot 4.0 (Correct/Total, %) | GPT-4o (Correct/Total, %) | GPT-4.0 vs Ernie Bot 4.0 p-value | GPT-4.0 vs GPT-4o p-value | Ernie Bot 4.0 vs GPT-4o p-value |
| --- | --- | --- | --- | --- | --- | --- |
| A1 | 152/207 (73.43%) | 179/207 (86.47%) | 175/207 (84.54%) | 0.00 | 0.01 | 0.68 |
| A2 | 174/235 (74.04%) | 188/235 (80.00%) | 189/235 (80.43%) | 0.15 | 0.12 | 1.00 |
| A3/A4 | 70/99 (70.71%) | 81/99 (81.82%) | 86/99 (86.87%) | 0.09 | 0.01 | 0.43 |
| B1 | 38/59 (64.41%) | 55/59 (93.22%) | 53/59 (89.83%) | 0.00 | 0.00 | 0.74 |
Type A1: Both Ernie Bot 4.0 (86.47%, 179/207) and GPT-4o (84.54%, 175/207) significantly outperformed GPT-4.0 (73.43%, 152/207), with p-values of 0.00 and 0.01 respectively; there was no significant difference between Ernie Bot 4.0 and GPT-4o (p = 0.68).
Type A2: GPT-4o (80.43%, 189/235) and Ernie Bot 4.0 (80.00%, 188/235) performed similarly, while GPT-4.0 reached 74.04% (174/235); none of the pairwise differences was statistically significant (p > 0.05).
Type A3/A4: GPT-4o performed best at 86.87% (86/99), significantly better than GPT-4.0 at 70.71% (70/99, p = 0.01) but not significantly different from Ernie Bot 4.0 at 81.82% (81/99, p = 0.43).
Type B1: Ernie Bot 4.0 performed best at 93.22% (55/59), and both it and GPT-4o (89.83%, 53/59) significantly outperformed GPT-4.0 at 64.41% (38/59), each with a p-value of 0.00; there was no significant difference between Ernie Bot 4.0 and GPT-4o (p = 0.74).
Ernie Bot 4.0 excels across most question types, particularly Types A1 and B1, while GPT-4o performs best on Type A3/A4; GPT-4.0 trails on every type. These results indicate that Ernie Bot 4.0 and GPT-4o substantially outperform GPT-4.0 across question types, underscoring their superior performance in the context of the medical licensing examination.