Medicine is a fundamental human pursuit, and language serves as a vital medium for communication and information exchange among clinicians, medical researchers, patients, and other stakeholders [1]. Large Language Models (LLMs), exemplified by ChatGPT, use language as the medium for human-computer interaction, and their ability to conduct accurate and natural conversations has rapidly attracted attention [2]. Notably, LLMs are gradually reshaping the healthcare landscape, prompting researchers to explore the opportunities and challenges inherent in this emerging technology [3–6] and to seek ways to apply it in a standardized and rational manner across clinical practice, medical education, medical research, and the various medical specialties [7].
Common NLP tasks in the medical field [5, 8] include clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference, medical question answering, and medical text generation. Compared with traditional NLP methods that rely on rules and hand-crafted features, LLMs use deep learning techniques with stronger generalization, model the nature of human language more faithfully, and are being applied to an ever-wider range of NLP tasks [9]. Similarly, LLMs have shown strong potential across multiple medical tasks [1, 4], including summarizing clinical findings, supporting clinical decision-making, and even serving as chatbots that answer patients' questions about their specific data and concerns.
However, healthcare is an intricate and high-risk professional domain, which demands medical specialization, safety, empathy, and the utmost precision from a model [10, 11]. This is especially true of the doctor-patient relationship: doctors must assume the roles of both technical expert and supportive counselor, communicating information in a way that helps patients understand, control, and cope with overwhelming emotions and anxiety [12]. Current medical large language models have achieved some success, for example ChatDoctor [13] and BenTsao [14], which enhance the self-directed knowledge-retrieval capability of medical LLMs to provide well-informed medical advice. While ChatDoctor and BenTsao have shown promise in medical question answering in English and Chinese, respectively, there is still room to improve the performance of existing large models. First, a medical large model is urgently needed that enhances the knowledge base and response accuracy available to doctors while also meeting the specific humanistic-care requirements of the medical field. Second, the model must possess cross-lingual comprehension to handle the cross-language inquiries that arise in real-world medical scenarios. Finally, to improve the overall efficiency of healthcare, the model should be capable of handling multiple tasks, not only common medical question answering.
Constructing specialized domain datasets is a crucial strategy for enhancing the performance of large-scale models. Datasets such as MedQA (USMLE) [15] and MedMCQA [16] contribute general medical knowledge, particularly in the context of medical licensing exams. PubMedQA [17] and MMLU [18] concentrate on biomedical science and professional medical knowledge, while MedicationQA [19], LiveQA [20], and HealthSearchQA [1] cover the general medical knowledge sought by consumers. By incorporating real medical dialogue data, these foundational datasets have in turn given rise to comprehensive medical Q&A datasets such as MultiMedQA and MedDialog [21].
A common approach is to train language models on medical-specific structured or unstructured data. Examples include KEBLM [22], which draws on structured knowledge from UMLS and PubChem, as well as models such as PubMedBERT [23] and BioGPT [24]. While these models have proven effective in tasks such as medical question answering and medical text summarization, they are smaller in scale and scope than large language models such as GPT-3 and LLaMA. As large language models continue to scale up, they are expected to improve performance significantly. For instance, Yang et al. [25] harnessed over 90 billion words of text, including more than 82 billion words of de-identified clinical text, to develop a large-scale clinical language model named GatorTron, which achieved a performance improvement of 9.5% over BioBERT. In a complementary direction, Lievin et al. [26] applied chain-of-thought prompting to large language models to enhance their reasoning capabilities in medical question answering.
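To make this prompting strategy concrete, the following minimal sketch shows how a zero-shot chain-of-thought prompt can be issued for a medical multiple-choice question. It is our own illustration, not the setup of [26]; the model name, prompt wording, and example question are assumptions.

```python
# Minimal illustration of zero-shot chain-of-thought prompting for medical
# QA. Our own sketch, not the setup of [26]; the model name, prompt
# wording, and example question are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A 55-year-old man presents with crushing substernal chest pain "
    "radiating to the left arm. Which initial test is most appropriate?\n"
    "(A) Chest X-ray  (B) 12-lead ECG  (C) Abdominal ultrasound  (D) Spirometry"
)

# The cue "Let's think step by step" elicits an explicit reasoning chain
# before the final answer.
prompt = f"Question: {question}\nAnswer: Let's think step by step."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; any chat model works
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # deterministic decoding for evaluation
)
print(response.choices[0].message.content)
```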
ChatDoctor, a leading AI model in the current medical domain, is trained on a dataset of 100,000 doctor-patient dialogues sourced from online medical consultation platforms. However, it focuses primarily on English and currently cannot provide users with concise medical text summaries. BenTsao, a Chinese medical large language model, integrates structured and unstructured medical knowledge from the Chinese Medical Knowledge Graph (CMeKG) and undergoes supervised fine-tuning. Comparative analysis against LLaMA, Alpaca, and ChatGLM shows that BenTsao generates responses grounded in more reliable medical knowledge. However, its medical knowledge is limited to the field of liver cancer, so it offers only restricted support for general symptoms or diseases in medical interactions.
In contrast to the studies above, this work proposes MulMed, which places greater emphasis on multitasking capability: summarizing complex medical texts, addressing patient inquiries, engaging in medical question-answering dialogues, exhibiting cross-lingual proficiency, and offering more comprehensive coverage of medical knowledge. Specifically, as shown in Fig. 1, this work makes the following original contributions:
- We establish a dataset called MulMedData, which integrates more than 300,000 diverse and intricate samples from different sources. Based on this dataset, we present a two-step fine-tuning framework that equips the model with multi-task and cross-lingual capabilities and demonstrates excellent generalization on two benchmark test sets (a minimal sketch of this two-step process appears after this list).
- We introduce an instruction prompt design that aligns LLMs with specialized medical terminology and context. The fine-tuned model demonstrates a degree of human empathy, particularly in doctor-patient consultations.
- We propose a medical ethics framework to help evaluate the feasibility of medical model applications, covering information security, etiological explainability, and user guidance.
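As referenced in the first contribution, the sketch below illustrates one plausible realization of the two-step fine-tuning and instruction prompt design using the HuggingFace stack. The foundation model, file names, template wording, and hyperparameters are illustrative assumptions, since the framework above does not prescribe a specific toolkit.

```python
# Illustrative two-step fine-tuning loop with an instruction prompt
# template. Our own sketch: the foundation model, file names, template
# wording, and hyperparameters are assumptions, not the paper's exact
# configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "meta-llama/Llama-2-7b-hf"  # hypothetical foundation model
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Instruction template aligning the model with the medical context
# (the empathetic opening line mirrors the prompt-design goal above).
TEMPLATE = ("You are an empathetic medical assistant.\n"
            "### Instruction:\n{instruction}\n### Response:\n{response}")

def tokenize(batch):
    texts = [TEMPLATE.format(instruction=i, response=r)
             for i, r in zip(batch["instruction"], batch["response"])]
    return tokenizer(texts, truncation=True, max_length=1024)

# mlm=False makes the collator copy input_ids into labels (causal LM).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Step 1: broad bilingual medical Q&A; step 2: task-specific data such
# as medical text summarization. Each stage resumes from the previous
# stage's weights because `model` is updated in place.
for stage, data_file in enumerate(["qa_pairs.json", "summaries.json"], 1):
    ds = load_dataset("json", data_files=data_file)["train"].map(
        tokenize, batched=True, remove_columns=["instruction", "response"])
    Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"mulmed-stage{stage}",
                               num_train_epochs=3,
                               per_device_train_batch_size=4),
        train_dataset=ds,
        data_collator=collator,
    ).train()
```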
Figure 1 Overview of our contributions. We curated MulMedData, a large-scale, multi-source dataset that encompasses medical Q&A and text summarization in both Chinese and English. On MulMedData, we performed a two-step fine-tuning process based on the foundation model and developed our model, MulMed, through prompt design. MulMed surpassed state-of-the-art results on iCliniq and haodf, and also performed commendably in the evaluation under our proposed SIH ethical framework for medical applications. Additionally, the prompt design has imbued our model's responses with greater compassion and humanistic care.