Medicine is a fundamental human pursuit, and language serves as a vital medium for communication and information exchange among clinicians, medical researchers, patients, and other stakeholders [1]. Large Language Models (LLMs), exemplified by ChatGPT, use language as the medium for human-computer interaction, and their ability to conduct accurate and natural conversations has rapidly attracted attention [2]. Notably, LLMs are gradually reshaping the healthcare landscape, prompting researchers to explore the opportunities and challenges inherent in this emerging technology [3–6] and to seek ways to apply it in a standardized and rational manner across clinical practice, medical education, medical research, and the various medical specialties [7].
Common NLP tasks in the medical field [5, 8] include clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference, medical question answering, and medical text generation. Compared with traditional NLP methods that rely on rules and hand-crafted features, LLMs use deep learning techniques with stronger generalization, model the nature of human language more faithfully, and are being applied to an ever-wider range of NLP tasks [9]. Similarly, LLMs have shown strong potential across multiple medical tasks [1, 4], including summarizing clinical findings, supporting clinical decision-making, and even serving as chatbots that answer patients' questions about their specific data and concerns.
However, healthcare is an intricate and high-risk professional domain, which demands medical specialization, safety, empathy, and the utmost precision from a model [10, 11]. This is especially true of the doctor-patient relationship: doctors must assume the roles of both technical expert and supportive counselor, communicating information in a way that helps patients understand, control, and cope with overwhelming emotions and anxiety [12]. Current medical large language models have achieved some success, for example ChatDoctor [13] and BenTsao [14], which enhance the self-directed knowledge-retrieval capability of medical LLMs to provide well-informed medical advice. While ChatDoctor and BenTsao have shown promise in medical question answering in English and Chinese, respectively, there is still room to improve the performance of existing large models. First, a medical large model is urgently needed that enhances the knowledge base and response accuracy available to doctors while also meeting the specific humanistic-care requirements of the medical field. Second, the model must possess cross-lingual comprehension to handle the cross-language inquiries that arise in real-world medical scenarios. Finally, to improve the overall efficiency of healthcare, the model should be capable of handling multiple tasks, not only common medical question answering.
Constructing specialized domain datasets is a crucial strategy for enhancing the performance of large-scale models. Datasets such as MedQA (USMLE) [15] and MedMCQA [16] contribute general medical knowledge, particularly in the context of medical licensing exams. PubMedQA [17] and MMLU [18] concentrate on biomedical science and professional medical knowledge, while MedicationQA [19], LiveQA [20], and HealthSearchQA [1] cover the general medical knowledge sought by consumers. By incorporating real medical dialogue data, these foundational datasets have in turn given rise to comprehensive medical Q&A datasets such as MultiMedQA and MedDialog [21].
A common approach is to train language models on medical-specific structured or unstructured data. Examples include KEBLM [22], which draws on structured knowledge from UMLS and PubChem, as well as models such as PubMedBERT [23] and BioGPT [24]. While these models have proven effective in tasks such as medical question answering and medical text summarization, they are smaller in scale and scope than large language models such as GPT-3 and LLaMA. As large language models continue to scale up, they are expected to improve performance significantly. For instance, Yang et al. [25] harnessed over 90 billion words of text, including more than 82 billion words of de-identified clinical text, to develop a large-scale clinical language model named GatorTron, which achieved a performance improvement of 9.5% over BioBERT. In a complementary direction, Lievin et al. [26] applied chain-of-thought prompting to large language models to enhance their reasoning capabilities in medical question answering.
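To make this prompting strategy concrete, the following minimal sketch shows how a zero-shot chain-of-thought prompt can be issued for a medical multiple-choice question. It is our own illustration, not the setup of [26]; the model name, prompt wording, and example question are assumptions.

```python
# Minimal illustration of zero-shot chain-of-thought prompting for medical
# QA. Our own sketch, not the setup of [26]; the model name, prompt
# wording, and example question are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A 55-year-old man presents with crushing substernal chest pain "
    "radiating to the left arm. Which initial test is most appropriate?\n"
    "(A) Chest X-ray  (B) 12-lead ECG  (C) Abdominal ultrasound  (D) Spirometry"
)

# The cue "Let's think step by step" elicits an explicit reasoning chain
# before the final answer.
prompt = f"Question: {question}\nAnswer: Let's think step by step."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; any chat model works
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # deterministic decoding for evaluation
)
print(response.choices[0].message.content)
```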
ChatDoctor, a leading AI model in the current medical domain, is trained on a dataset of 100,000 doctor-patient dialogues sourced from online medical consultation platforms. However, it focuses primarily on English and currently cannot provide users with concise medical text summaries. BenTsao, a Chinese medical large language model, integrates structured and unstructured medical knowledge from the Chinese Medical Knowledge Graph (CMeKG) and undergoes supervised fine-tuning. Comparative analysis against LLaMA, Alpaca, and ChatGLM shows that BenTsao generates responses grounded in more reliable medical knowledge. However, its medical knowledge is limited to the field of liver cancer, so it offers only restricted support for general symptoms or diseases in medical interactions.
In contrast to the studies above, this work proposes MulMed, which places greater emphasis on multitasking capability: summarizing complex medical texts, addressing patient inquiries, engaging in medical question-answering dialogues, exhibiting cross-lingual proficiency, and offering more comprehensive coverage of medical knowledge. Specifically, as shown in Fig. 1, this work makes the following original contributions:
- We establish a dataset called MulMedData, which integrates more than 300,000 diverse and intricate samples from different sources. Based on this dataset, we present a two-step fine-tuning framework that equips the model with multi-task and cross-lingual capabilities and demonstrates excellent generalization on two benchmark test sets (a minimal sketch of this two-step process appears after this list).
- We introduce an instruction prompt design that aligns LLMs with specialized medical terminology and context. The fine-tuned model demonstrates a degree of human empathy, particularly in doctor-patient consultations.
- We propose a medical ethics framework to help evaluate the feasibility of medical model applications, covering information security, etiological explainability, and user guidance.
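As referenced in the first contribution, the sketch below illustrates one plausible realization of the two-step fine-tuning and instruction prompt design using the HuggingFace stack. The foundation model, file names, template wording, and hyperparameters are illustrative assumptions, since the framework above does not prescribe a specific toolkit.

```python
# Illustrative two-step fine-tuning loop with an instruction prompt
# template. Our own sketch: the foundation model, file names, template
# wording, and hyperparameters are assumptions, not the paper's exact
# configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "meta-llama/Llama-2-7b-hf"  # hypothetical foundation model
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Instruction template aligning the model with the medical context
# (the empathetic opening line mirrors the prompt-design goal above).
TEMPLATE = ("You are an empathetic medical assistant.\n"
            "### Instruction:\n{instruction}\n### Response:\n{response}")

def tokenize(batch):
    texts = [TEMPLATE.format(instruction=i, response=r)
             for i, r in zip(batch["instruction"], batch["response"])]
    return tokenizer(texts, truncation=True, max_length=1024)

# mlm=False makes the collator copy input_ids into labels (causal LM).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Step 1: broad bilingual medical Q&A; step 2: task-specific data such
# as medical text summarization. Each stage resumes from the previous
# stage's weights because `model` is updated in place.
for stage, data_file in enumerate(["qa_pairs.json", "summaries.json"], 1):
    ds = load_dataset("json", data_files=data_file)["train"].map(
        tokenize, batched=True, remove_columns=["instruction", "response"])
    Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"mulmed-stage{stage}",
                               num_train_epochs=3,
                               per_device_train_batch_size=4),
        train_dataset=ds,
        data_collator=collator,
    ).train()
```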
Figure 1 Overview of our contributions. We curated MulMedData, a large-scale, multi-source dataset that encompasses medical Q&A and text summarization in both Chinese and English. On MulMedData, we performed a two-step fine-tuning process based on the foundation model and developed our model, MulMed, through prompt design. MulMed surpassed state-of-the-art results on iCliniq and haodf, and also performed commendably in the evaluation under our proposed SIH ethical framework for medical applications. Additionally, the prompt design has imbued our model's responses with greater compassion and humanistic care.