1. Main Results
All three models (GPT-4.0, Ernie Bot 4.0, and GPT-4o) passed the exam, with accuracies exceeding both the national average score (58.14%) and the pass mark (60%). GPT-4.0 answered 434 of 600 questions correctly, an accuracy of 72.33%, while Ernie Bot 4.0 and GPT-4o each answered 503 questions correctly, an accuracy of 83.83%. Statistical analysis showed that Ernie Bot 4.0 and GPT-4o performed significantly better than GPT-4.0 (p < 0.0001), with no significant difference between Ernie Bot 4.0 and GPT-4o. For details, see Table 1 and Figure 1.
Table 1: Overall performance of each model.
| Model | Correct Answers (of 600) | Accuracy (%) |
| --- | --- | --- |
| GPT-4.0 | 434 | 72.33 |
| Ernie Bot 4.0 | 503 | 83.83 |
| GPT-4o | 503 | 83.83 |
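The pairwise comparisons of overall accuracy can be reproduced from the counts in Table 1. The paper does not name the test used for this comparison, so the chi-square test on a 2×2 correct/incorrect table below is a sketch of one standard choice, not necessarily the authors' method:

```python
from scipy.stats import chi2_contingency

# Correct-answer counts out of 600 questions (Table 1)
gpt4, ernie, gpt4o = 434, 503, 503
TOTAL = 600

def compare(correct_a, correct_b, n=TOTAL):
    """p-value for a 2x2 chi-square test (Yates-corrected) on two accuracy counts."""
    table = [[correct_a, n - correct_a],
             [correct_b, n - correct_b]]
    chi2, p, dof, expected = chi2_contingency(table)
    return p

p_gpt4_vs_ernie = compare(gpt4, ernie)    # far below 0.0001
p_ernie_vs_gpt4o = compare(ernie, gpt4o)  # identical counts, so no difference (p = 1)
```

Run this way, the GPT-4.0 vs Ernie Bot 4.0 comparison lands well below the reported p < 0.0001 threshold, and the Ernie Bot 4.0 vs GPT-4o comparison is, as expected for identical counts, not significant.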
Scoring standards for the Chinese Medical Licensing Examination award one point per correct answer, with no points for incorrect responses. According to the official score report, the average score for human candidates was 348.84, with the passing threshold set at 360 points; a score of 360 places a candidate at the 57th percentile of human examinees. Assuming the score distribution is normal, GPT-4.0 likely surpassed 91.08% of human candidates, while Ernie Bot 4.0 and GPT-4o potentially exceeded 99.26%.
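The percentile estimates follow from the normality assumption: the reported mean (348.84) together with the fact that 360 points sits at the 57th percentile pins down the standard deviation, after which each model's score (one point per correct answer) converts to a percentile. A sketch of that arithmetic, using SciPy's normal CDF and quantile function (the tooling choice is an assumption):

```python
from scipy.stats import norm

mean, pass_score, pass_percentile = 348.84, 360, 0.57

# Infer the standard deviation from the pass mark's percentile:
# (360 - 348.84) / sigma = Phi^{-1}(0.57)
sigma = (pass_score - mean) / norm.ppf(pass_percentile)

# One point per correct answer, so a model's score equals its correct count
pct_gpt4 = norm.cdf((434 - mean) / sigma)         # ~0.9108, matching 91.08%
pct_ernie_gpt4o = norm.cdf((503 - mean) / sigma)  # ~0.9926, matching 99.26%
```

Both figures reproduce the percentiles quoted above to two decimal places.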
2. Performance by Major Subject
We calculated each model's accuracy in Basic Medical Comprehensive, Clinical Medical Comprehensive, and Humanities and Preventive Medicine Comprehensive, and used independent-sample t-tests to assess the statistical significance of the differences in accuracy between models. The results, illustrated in Figure 2, are as follows:
Basic Medical Comprehensive: GPT-4.0 achieved an accuracy of 72.58%, Ernie Bot 4.0 85.48%, and GPT-4o 93.55%. GPT-4o was significantly superior to GPT-4.0 (p < 0.05); there was no significant difference between GPT-4.0 and Ernie Bot 4.0.
Clinical Medical Comprehensive: GPT-4.0 achieved an accuracy of 71.99%, Ernie Bot 4.0 83.40%, and GPT-4o 83.20%. GPT-4.0 was significantly lower than both Ernie Bot 4.0 and GPT-4o (p < 0.05); there was no significant difference between Ernie Bot 4.0 and GPT-4o.
Humanities and Preventive Medicine Comprehensive: GPT-4.0 achieved an accuracy of 75.00%, Ernie Bot 4.0 85.71%, and GPT-4o 78.57%, with no significant differences among the three models.
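The subject-level t-tests above can be reproduced by treating each question as a 0/1 outcome and comparing the two outcome vectors. The sketch below does this for Basic Medical Comprehensive; the correct-answer counts (45, 53, and 58 of 62 questions) are back-calculated from the reported accuracies and should be treated as an assumption, as is the use of SciPy:

```python
import numpy as np
from scipy.stats import ttest_ind

def outcomes(correct, total):
    """Per-question 0/1 correctness vector."""
    return np.array([1] * correct + [0] * (total - correct))

# Basic Medical Comprehensive: counts back-calculated from the reported
# accuracies (72.58%, 85.48%, 93.55%), assuming 62 questions
n = 62
gpt4, ernie, gpt4o = outcomes(45, n), outcomes(53, n), outcomes(58, n)

p_gpt4_vs_gpt4o = ttest_ind(gpt4, gpt4o).pvalue  # significant (p < 0.05)
p_gpt4_vs_ernie = ttest_ind(gpt4, ernie).pvalue  # not significant (p > 0.05)
```

Under these back-calculated counts, the GPT-4.0 vs GPT-4o difference is significant while GPT-4.0 vs Ernie Bot 4.0 is not, matching the pattern reported for this subject.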
These results suggest that Ernie Bot 4.0 and GPT-4o generally outperform GPT-4.0, especially in the Basic Medical and Clinical Medical subjects. Ernie Bot 4.0 shows balanced performance across subjects, reflecting its stability and its advantage in processing Chinese-language material. GPT-4o excels particularly in Basic Medical Comprehensive, demonstrating its capability with broad scientific knowledge. In Humanities and Preventive Medicine Comprehensive, which primarily involves law and computation, no significant differences were observed between the models.
3. Performance by Specific Subject
The accuracy of the three models (GPT-4.0, Ernie Bot 4.0, and GPT-4o) was further analyzed across the specific subjects within each major category. The detailed performance for each specific subject is summarized in Table 2.
Table 2: Accuracy rates of GPT-4.0, Ernie Bot 4.0, and GPT-4o across specific subjects.
| Specific Subject | GPT-4.0 (Correct/Total, %) | Ernie Bot 4.0 (Correct/Total, %) | GPT-4o (Correct/Total, %) |
| --- | --- | --- | --- |
| Anatomy (20) | 16/20 (80.00%) | 20/20 (100.00%) | 20/20 (100.00%) |
| Cardiovascular System (24) | 18/24 (75.00%) | 20/24 (83.33%) | 22/24 (91.67%) |
| Digestive System (26) | 18/26 (69.23%) | 22/26 (84.62%) | 19/26 (73.08%) |
| Female Reproductive System (22) | 15/22 (68.18%) | 19/22 (86.36%) | 18/22 (81.82%) |
| Health Regulations (13) | 5/13 (38.46%) | 11/13 (84.62%) | 8/13 (61.54%) |
| Hematology (27) | 21/27 (77.78%) | 23/27 (85.19%) | 24/27 (88.89%) |
| Infectious Diseases and Sexually Transmitted Diseases (15) | 10/15 (66.67%) | 13/15 (86.67%) | 11/15 (73.33%) |
| Medical Ethics (14) | 12/14 (85.71%) | 13/14 (92.86%) | 11/14 (78.57%) |
| Medical Immunology (6) | 6/6 (100.00%) | 6/6 (100.00%) | 6/6 (100.00%) |
| Medical Microbiology (2) | 0/2 (0.00%) | 2/2 (100.00%) | 0/2 (0.00%) |
| Medical Psychology (3) | 2/3 (66.67%) | 3/3 (100.00%) | 2/3 (66.67%) |
| Metabolic and Endocrine System (20) | 17/20 (85.00%) | 19/20 (95.00%) | 19/20 (95.00%) |
| Musculoskeletal System (25) | 20/25 (80.00%) | 20/25 (80.00%) | 23/25 (92.00%) |
| Nervous and Mental System (25) | 21/25 (84.00%) | 23/25 (92.00%) | 20/25 (80.00%) |
| Pathology (6) | 6/6 (100.00%) | 6/6 (100.00%) | 6/6 (100.00%) |
| Pathophysiology (19) | 13/19 (68.42%) | 18/19 (94.74%) | 18/19 (94.74%) |
| Pediatric Diseases (21) | 12/21 (57.14%) | 18/21 (85.71%) | 18/21 (85.71%) |
| Pharmacology (16) | 13/16 (81.25%) | 12/16 (75.00%) | 15/16 (93.75%) |
| Physiology (8) | 8/8 (100.00%) | 7/8 (87.50%) | 8/8 (100.00%) |
| Preventive Medicine (25) | 22/25 (88.00%) | 19/25 (76.00%) | 22/25 (88.00%) |
| Respiratory System (24) | 19/24 (79.17%) | 19/24 (79.17%) | 21/24 (87.50%) |
| Rheumatic and Immune Diseases (13) | 10/13 (76.92%) | 13/13 (100.00%) | 13/13 (100.00%) |
| Surgery (35) | 25/35 (71.43%) | 28/35 (80.00%) | 27/35 (77.14%) |
| Urology (18) | 11/18 (61.11%) | 14/18 (77.78%) | 13/18 (72.22%) |
Overall, Ernie Bot 4.0 and GPT-4o outperform GPT-4.0, showing higher accuracy across many subjects. For instance, in the female reproductive system, digestive system, and urology subjects, the accuracy rates of Ernie Bot 4.0 and GPT-4o are notably higher than those of GPT-4.0.
In terms of performance by specific subjects:
Anatomy: All models performed well, with Ernie Bot 4.0 and GPT-4o achieving 100% accuracy and GPT-4.0 reaching 80%.
Cardiovascular System: GPT-4o had the highest accuracy at 91.67%, followed by Ernie Bot 4.0 at 83.33% and GPT-4.0 at 75.00%.
Female Reproductive System: Ernie Bot 4.0 and GPT-4o clearly outperformed GPT-4.0, with accuracies of 86.36% and 81.82% respectively, versus 68.18% for GPT-4.0.
Health Regulations: Ernie Bot 4.0 performed best at 84.62%, GPT-4o scored 61.54%, and GPT-4.0 had the lowest rate at 38.46%.
Pediatric Diseases: Ernie Bot 4.0 and GPT-4o achieved the same accuracy of 85.71%, while GPT-4.0 reached only 57.14%.
The box plot (Figure 3) visualizes the distribution of accuracy rates for each model (GPT-4.0, Ernie Bot 4.0, and GPT-4o) across all specific subjects.
From Figure 3, it is evident that Ernie Bot 4.0 and GPT-4o show greater consistency and stability in accuracy than GPT-4.0. Ernie Bot 4.0 performs more evenly across subjects, while GPT-4o reaches higher accuracy rates, indicating strong overall performance.
A detailed analysis of the three models' performance across medical disciplines shows that Ernie Bot 4.0 and GPT-4o generally outperform GPT-4.0 in most subjects, suggesting that, in the context of the Medical Licensing Examination, they may offer greater practical value and reliability.
4. Performance by Question Type
We compared the performance of GPT-4.0, Ernie Bot 4.0, and GPT-4o across the different question types, calculating their accuracy rates and the pairwise p-values between models. The specific results are shown in Table 3.
Table 3: Accuracy rates and pairwise statistical comparisons by question type.
| Question Type | GPT-4.0 (Correct/Total, %) | Ernie Bot 4.0 (Correct/Total, %) | GPT-4o (Correct/Total, %) | GPT-4.0 vs Ernie Bot 4.0 p-value | GPT-4.0 vs GPT-4o p-value | Ernie Bot 4.0 vs GPT-4o p-value |
| --- | --- | --- | --- | --- | --- | --- |
| A1 | 152/207 (73.43%) | 179/207 (86.47%) | 175/207 (84.54%) | 0.00 | 0.01 | 0.68 |
| A2 | 174/235 (74.04%) | 188/235 (80.00%) | 189/235 (80.43%) | 0.15 | 0.12 | 1.00 |
| A3/A4 | 70/99 (70.71%) | 81/99 (81.82%) | 86/99 (86.87%) | 0.09 | 0.01 | 0.43 |
| B1 | 38/59 (64.41%) | 55/59 (93.22%) | 53/59 (89.83%) | 0.00 | 0.00 | 0.74 |
Type A1: Both Ernie Bot 4.0 (86.47%, 179/207) and GPT-4o (84.54%, 175/207) significantly outperformed GPT-4.0 (73.43%, 152/207), with p-values of 0.00 and 0.01 respectively; there was no significant difference between Ernie Bot 4.0 and GPT-4o (p = 0.68).
Type A2: GPT-4o (80.43%, 189/235) and Ernie Bot 4.0 (80.00%, 188/235) performed similarly, while GPT-4.0 reached 74.04% (174/235); none of the pairwise differences was statistically significant (p > 0.05).
Type A3/A4: GPT-4o performed best at 86.87% (86/99), significantly better than GPT-4.0 at 70.71% (70/99, p = 0.01) but not significantly different from Ernie Bot 4.0 at 81.82% (81/99, p = 0.43).
Type B1: Ernie Bot 4.0 performed best at 93.22% (55/59), and both it and GPT-4o (89.83%, 53/59) significantly outperformed GPT-4.0 at 64.41% (38/59), each with a p-value of 0.00; there was no significant difference between Ernie Bot 4.0 and GPT-4o (p = 0.74).
Ernie Bot 4.0 excels across most question types, particularly Types A1 and B1, while GPT-4o performs best on Type A3/A4; GPT-4.0 trails on every type. These results indicate that Ernie Bot 4.0 and GPT-4o substantially outperform GPT-4.0 across question types, underscoring their superior performance in the context of the medical licensing examination.