Principal Findings
Our results indicated that Ernie Bot performed best in terms of both score and accuracy, followed by ChatGLM. Both exceeded the passing score of the postgraduate examination, demonstrating TCM expertise above the passing threshold. The performance of SparkDesk and GPT-4 was unsatisfactory. These findings suggest that Ernie Bot and ChatGLM have the potential to be applied in TCM and could serve various roles in the future.
Differences in LLM accuracy across input languages may partly stem from the linguistic composition of their training data, as LLMs often favor languages that are better represented in that data21. ChatGPT performs best with English input, underscoring the influence of input language on accuracy22. Peng et al. (2024)23 found that ChatGPT achieved only moderate accuracy when answering open-ended medical questions in Chinese, with an accuracy rate of 31.5%. Given this disparity between English and Chinese inputs, ChatGPT clearly requires further improvement before it can reliably handle medical questions posed in Chinese. Likewise, GPT-4 did not meet expectations in this study, probably because it is a general-domain model with relatively limited training data in TCM.
The significant differences in the performance of Ernie Bot, ChatGLM, and GPT-4 across subjects suggest that none of them is yet effective as a comprehensive TCM learning tool. All four LLMs answered questions on the medical humanistic spirit well, possibly because such questions require neither specialized medical knowledge nor clinical experience. In TCM education, students first learn foundational subjects before moving on to clinical subjects, and the ability to address clinical questions depends in part on mastering these fundamentals and applying them correctly. Ernie Bot achieved higher accuracy on questions in Pharmacology (a foundational subject) and Internal Medicine of TCM (a clinical subject), while ChatGLM and GPT-4 performed well in Internal Medicine of TCM. The strong performance of the LLMs in Internal Medicine of TCM was particularly striking. This may be because its questions were mostly presented as cases, a format in which LLMs are more proficient at extracting and processing the information contained in the question. Subject-level accuracy thus illustrates the suitability of different models for distinct domains, providing valuable guidance for users seeking a model tailored to their specific requirements24.
The qualitative analysis of responses examined two aspects. First, regarding logical reasoning, when the prompt explicitly asked the LLMs to explain their answers, Ernie Bot, ChatGLM, and GPT-4 produced more context-focused responses that better reflected a deductive reasoning process, whereas 56.4% of SparkDesk's answers lacked logical reasoning. Across models, responses that showed logical reasoning were more often correct than those that did not, suggesting an urgent need for SparkDesk to improve its deductive reasoning in order to raise its accuracy. Second, regarding the use of internal and external information, Ernie Bot, ChatGLM, and GPT-4 incorporated internal information in over 95% of their explanations, and over 60% integrated external information. However, correct answers did not cite external information significantly more often than incorrect ones, suggesting that although the LLMs could connect questions to additional knowledge, this did not substantially help them choose the correct answer.
The causes of incorrect responses were categorized into three groups: logical errors, information errors, and a combination of both. General-domain LLMs are mainly trained on widely accessible public datasets, which may lack training data covering professional knowledge in specialized domains25. When confronted with problems requiring domain-specific knowledge, general-domain LLMs are prone to hallucination, often producing factual fabrications26. This closely aligns with the findings of the present study, in which information errors were the most prevalent cause. Hence, prioritizing the selection and filtering of pre-training corpora to gather high-quality TCM knowledge data is crucial. Moreover, given the intricate terminology of TCM, the proficiency of LLMs in conducting expert analyses of TCM questions could be improved through careful selection of instruction-tuning data.
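As an illustration of the kind of corpus selection and filtering described above, the following minimal Python sketch scores documents by the density of TCM seed terms and keeps those above a threshold. The term list, threshold, and file layout are assumptions made for illustration; this is not the pipeline of any model evaluated in this study.

```python
# Minimal sketch of keyword-density filtering for assembling a TCM
# pre-training corpus. The seed terms, threshold, and JSONL layout are
# illustrative assumptions, not the pipeline of any evaluated model.
import json

TCM_SEED_TERMS = ["气血", "阴阳", "方剂", "辨证", "经络", "本草"]  # hypothetical

def tcm_density(text: str) -> float:
    """Fraction of characters covered by TCM seed terms (crude relevance proxy)."""
    covered = sum(text.count(term) * len(term) for term in TCM_SEED_TERMS)
    return covered / max(len(text), 1)

def filter_corpus(in_path: str, out_path: str, threshold: float = 0.01) -> int:
    """Keep JSONL documents whose TCM term density meets the threshold."""
    kept = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            if tcm_density(doc.get("text", "")) >= threshold:
                fout.write(json.dumps(doc, ensure_ascii=False) + "\n")
                kept += 1
    return kept

if __name__ == "__main__":
    print(filter_corpus("corpus.jsonl", "tcm_corpus.jsonl"), "documents kept")
```

In practice, such a keyword filter would serve only as a first pass; quality classifiers and expert review of the retained documents would follow.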
In this study, the evaluation of the four LLMs' performance on the postgraduate examination revealed that Ernie Bot and ChatGLM were able to pass the exam. Furthermore, over 95% of responses demonstrated logical reasoning and use of internal information, and the majority also presented external information, indicating that in most cases the models could justify their responses with reasoning and contextual information. Ernie Bot and ChatGLM could process and comprehend natural-language inputs, leveraging their extensive training data to deliver personalized, well-explained responses. Their superior performance and accuracy imply greater proficiency in addressing TCM exam questions, offering preliminary evidence that they could be integrated into TCM applications and showcasing their potential as TCM support tools. The knowledge accuracy and interpretive ability of SparkDesk and GPT-4 need further improvement.
Potential Application of LLMs in TCM
Several application-scenario studies have demonstrated the application potential of LLMs. Take medical records: for every hour a physician spends with patients, nearly two additional hours go to electronic medical records and documentation27. ChatGPT could help write and organize medical records28, significantly enhancing the productivity of medical staff. During consultations, LLMs could rapidly and accurately translate complex medical terminology, narrowing the communication gap between patients and doctors29. LLMs could also serve as medical assistants for patients, helping with tasks such as scheduling appointments, arranging checkups, coordinating treatments, and managing health information30. In clinical specialties such as radiology31 and psychiatry32, LLMs could support clinical care by optimizing workflow or improving diagnostic accuracy and efficiency. Nonetheless, LLMs have several limitations: even a basic consultation cannot adequately replace direct interaction between an experienced doctor and a patient, and the same caution applies to matters of medical ethics and law27. Related studies in TCM remain limited. Because biomedicine and TCM both belong to the field of medicine, insights from one carry a certain degree of transferability to the other, so the potential applications of LLMs in TCM can be envisioned by drawing on research into LLMs in biomedicine.
The logic and contextual information demonstrated by the LLMs in most of their answers highlight their potential as valuable tools for medical education and learning support18. In the future, LLMs could serve as "virtual assistants" for TCM education33. In medical education, multiple-choice questions are the primary examination format34, and LLMs can generate questions and quizzes for both practice and in-class assessment18, as sketched below. Appropriately chosen LLMs can offer explanations and feedback for each question and clarify ambiguous or complex concepts that students may encounter. Using such a tool, students can ask about specific medical concepts, diagnoses, or treatments and receive precise, personalized responses that enhance their learning and build their knowledge base12. In addition, LLMs can edit basic medical reports, helping students identify areas for improvement35. They can serve as real-time learning tools for students and give teachers an efficient way to assess students' skills and pinpoint areas of weakness. Importantly, however, users must exercise caution: all responses and information provided by LLMs should be validated to ensure the accuracy of the TCM knowledge being disseminated. LLMs are paving the way for a new era of teaching and learning, pushing the boundaries for educators and students alike. With continued training and refinement, their usefulness in medical education and examinations will expand, possibly revolutionizing how we approach exam preparation and lifelong learning15.
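As a concrete sketch of the question-generation use case, the snippet below asks an LLM for one practice question with an explanation through an OpenAI-compatible chat API. The model name, prompt wording, and topic are placeholders chosen for illustration, and, as stressed above, any generated item would need validation by TCM experts before classroom use.

```python
# Illustrative sketch: drafting a TCM practice question with an explanation
# via an OpenAI-compatible chat API (openai>=1.0). The model name and prompt
# are placeholders; generated items must be validated by TCM experts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write one multiple-choice question on the TCM pattern of qi deficiency "
    "for postgraduate exam practice. Give options A-E, mark the correct "
    "answer, and add a short explanation of the underlying reasoning."
)

def generate_question(topic_prompt: str) -> str:
    """Request a single exam-style question and return the raw model output."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a TCM education assistant."},
            {"role": "user", "content": topic_prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_question(PROMPT))  # always review output before use
```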
In general, LLMs offer a diverse array of applications in both TCM clinical practice and medical education. Professionals in relevant fields should stay informed about advancements in AI technology and contemplate its potential applications.
We propose further evaluation of LLMs. First, a broader range of question formats, such as open-ended questions, could be employed to gain a more comprehensive understanding of their real-world applicability. Second, evaluating the models on a larger question pool would reduce the statistical uncertainty of the resulting performance estimates (quantified in the note below). Third, refining the subjects under assessment (e.g., by sub-discipline) would yield more detailed insight into each model's capabilities, enabling informed speculation about its application in specific medical scenarios. A more thorough evaluation would position LLMs more precisely within medical practice, encouraging their improvement and rapid iteration. With training on a larger dataset of high-quality Chinese medical information (such as medical literature and electronic medical records), a Chinese LLM like Ernie Bot could achieve even greater performance, significantly improving its applicability in TCM.
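To make the benefit of a larger question pool concrete, each model's score can be treated as a binomial proportion; this is an illustrative back-of-the-envelope calculation, not an analysis performed in this study. The standard error of an estimated accuracy p̂ over n questions is

```latex
% Binomial standard error of an estimated accuracy
% (illustrative note, not an analysis from this study).
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},
\qquad
\text{95\% CI} \approx \hat{p} \pm 1.96\,\mathrm{SE}(\hat{p})
```

For example, at p̂ = 0.6, a 100-question exam gives SE ≈ 0.049 (a 95% confidence interval of roughly ±10 percentage points), whereas 400 questions halve it to ≈ 0.024, making comparisons between models considerably more reliable.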
Limitations
This study had the following limitations. First, the postgraduate examination comprehensively evaluated the basic competence of TCM students but did not include a more detailed specialist assessment; moreover, the exam consisted exclusively of objective questions and included no subjective questions. Second, this study examined LLM performance only on cross-sectional data, whereas exam content differs from year to year, with corresponding changes in passing scores.