To evaluate the effectiveness of ChatGPT and BKGs in biomedicine, we conducted a comprehensive comparative analysis. We first assessed their performance on drug-related and dietary supplement (DS)-related question answering. Next, we evaluated their capacity for novel biomedical knowledge discovery, e.g., drug and DS repurposing. Last, we assessed the comprehensiveness of the biomedical knowledge they provided. Specifically, we investigated ChatGPT's ability to generate accurate and relevant responses to drug-related and DS-related queries and its potential for knowledge discovery by identifying hidden patterns and relationships.
2.1 Compared Methods
The integrated Dietary Supplements Knowledge Base (iDISK)24 is an encompassing knowledge graph comprising a diverse range of dietary supplements, including vitamins, herbs, minerals, and other relevant entities. iDISK has been meticulously standardized and integrated from multiple widely used and authoritative dietary supplement resources, namely the Natural Medicines Comprehensive Database (NMCD)25, Memorial Sloan Kettering Cancer Center (MSKCC)26, the Dietary Supplement Label Database (DSLD)27, and the Natural Health Products Database (NHP), which comprises the Natural Health Product Ingredients Database28 and the Licensed Natural Health Products Database29. This integrated knowledge base incorporates various attributes and relationships that provide comprehensive information about each dietary supplement, including its inclusion as an ingredient in specific products and its potential interactions with medications. In this study, iDISK serves as the primary BKG for DS-related exploration tasks.
The integrative Biomedical Knowledge Hub (iBKH)30 was developed by harmonizing and integrating information from a diverse range of biomedical resources, incorporating data from 18 highly regarded and carefully curated sources. The current iteration of iBKH contains over 2.2 million entities representing 11 distinct entity types, connected by 45 types of relations spanning 18 categories. In this study, iBKH serves as the primary Biomedical Knowledge Graph (BKG) for drug-related exploration tasks.
ChatGPT31, developed by OpenAI, is an advanced conversational AI model built on the GPT (Generative Pre-trained Transformer) architecture4. It is designed to generate human-like responses to text-based inputs and has garnered significant attention for its language generation and natural language processing capabilities. ChatGPT 3.5 and 4.0 are the latest versions of the model, differing mainly in model scale: GPT-4 is larger than GPT-3.5, with more parameters and computational capacity, allowing it to handle more complex tasks and language patterns. As a result, ChatGPT 4 offers better language understanding, enhanced conversational abilities, and broader potential applications than ChatGPT 3.5. In this study, we evaluate the performance of ChatGPT, specifically the GPT-3.5 and GPT-4.0 versions, on question-answering, biomedical knowledge discovery, and reasoning tasks within the biomedical domain.
2.2 Performance Evaluation of ChatGPT and BKG in Question-Answering
We used the question-answering (Q&A) dataset (including question titles and contents) from the "Alternative Medicine" sub-category in Yahoo! Answers32. The questions were grouped into categories such as Adverse Effects, Background, Contraindication, Effectiveness, Indication, Interaction, Safety, Uncertain, Unclassified, and Usage. We randomly selected 5 questions from each group, resulting in a total of 50 questions.
Q&A based on ChatGPT. To collect responses from ChatGPT, we input the questions as prompts and recorded the generated answers.
Q&A based on BKGs. The dataset used to evaluate query performance on existing biomedical knowledge was sourced from the "Alternative Medicine" sub-category of Yahoo! Answers32. Accordingly, iDISK24 was employed to retrieve the pertinent answers to the given questions. First, we identified the unique identifier of the subject and its corresponding relationship based on the question description. Next, we linked the object identifiers to the identifiers of relevant supplements, ingredients, therapeutic effects, and/or adverse effects using the iDISK Relationship Table, depending on the specific question at hand. Lastly, we retrieved the names of the relevant concepts and transformed the findings into natural language to answer the original query. For instance, to address the question "What are the side effects for panax ginseng?", we first located the concept ID of panax ginseng within iDISK and then identified the relationship mentioned in the question, in this case "has_adverse_reaction," within the relationship table. We then retrieved the entities associated with panax ginseng through the "has_adverse_reaction" relation and translated these records into natural language to formulate the final answer. A visual representation of the overall query process is depicted in Fig. 1.
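The three-step query procedure above (resolve the subject's concept ID, filter the relationship table by relation, verbalize the retrieved objects) can be sketched as follows. The concept IDs, table layout, and example reactions here are illustrative placeholders, not iDISK's actual schema or content:

```python
# Sketch of the BKG query pipeline (hypothetical IDs and table layout;
# iDISK's real schema and concept identifiers differ).
CONCEPTS = {"DC0001": "panax ginseng", "DC0002": "insomnia", "DC0003": "headache"}
RELATIONSHIPS = [  # (subject_id, relation, object_id)
    ("DC0001", "has_adverse_reaction", "DC0002"),
    ("DC0001", "has_adverse_reaction", "DC0003"),
]

def query_bkg(subject_name: str, relation: str) -> str:
    """Resolve the subject's concept ID, filter the relationship table,
    and verbalize the retrieved objects as a natural-language answer."""
    subj_id = next(cid for cid, name in CONCEPTS.items() if name == subject_name)
    objects = [CONCEPTS[o] for s, r, o in RELATIONSHIPS
               if s == subj_id and r == relation]
    return f"Reported adverse reactions of {subject_name}: {', '.join(objects)}."

print(query_bkg("panax ginseng", "has_adverse_reaction"))
```

In the actual study the relationship table is queried within the iDISK database rather than from in-memory dictionaries; the control flow is the same.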
Q&A performance evaluation. To evaluate the responses, we followed the LiveQA Track guidelines33 and assigned judgment scores on a scale ranging from 0 to 3. Two experts with medical backgrounds performed the manual scoring. A score of 0 indicates an incorrect response (poor or unreadable response), 1 indicates an incorrect but related answer (fair response), 2 denotes a correct but incomplete response (good response), and 3 indicates a correct and complete answer (excellent response). Based on this scale, we calculated two metrics. First, we computed the average score, which evaluates the first retrieved answer for each test question33,34. Second, we measured the \(succ@i+\) metric, defined as the ratio of the number of questions with a score \(\ge i\) (we considered \(i\) ranging from 1 to 3) to the total number of questions. For example, \(succ@1+\) is the percentage of questions that were answered by the conversational agent (CA) with at least a fair grade33. To assess the statistical differences in the performance of the three systems (ChatGPT 4.0, ChatGPT 3.5, and iDISK), we used the t-test for normally distributed data or the Mann-Whitney U test for non-normally distributed data. Q-Q plots were used to assess the normality of the data. The analysis was conducted using R 1.1 with the package "car"35.
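The two metrics reduce to simple computations over the 0-3 judgment scores. A minimal sketch, using a hypothetical score vector rather than the study's actual judgments:

```python
# Evaluation metrics over 0-3 judgment scores (illustrative data only).
def average_score(scores):
    """Mean judgment score over all test questions."""
    return sum(scores) / len(scores)

def succ_at_least(scores, i):
    """succ@i+: fraction of questions judged with a score >= i."""
    return sum(s >= i for s in scores) / len(scores)

scores = [3, 2, 0, 1, 3, 2, 2, 1, 0, 3]  # hypothetical judgments for 10 questions
print(average_score(scores))      # → 1.7
print(succ_at_least(scores, 1))   # succ@1+ (at least fair) → 0.8
print(succ_at_least(scores, 3))   # succ@3+ (excellent only) → 0.3
```

With two annotators, the same functions can be applied to each rater's scores separately or to their averaged scores, depending on the aggregation convention.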
2.3 Performance Evaluation of ChatGPT and BKG in Knowledge Discovery
To compare the knowledge discovery capabilities of ChatGPT and BKGs, we devised a prediction scenario that emulates the task of drug and DS repurposing for Alzheimer's disease (AD).
AD drug/DS repurposing based on ChatGPT. The task was to prompt ChatGPT to suggest drugs or DSs that are not presently utilized for the treatment or prevention of AD but possess the potential to be employed in such capacities. Each prompt was repeated 10 times, and we collected all the results returned by ChatGPT. The specifically crafted prompts included:
- Please provide the approved drugs that are not currently used to treat Alzheimer's disease but are potentially available for the treatment of AD. And please give your rationale. (Drug)
- Please provide which dietary supplements have the potential to treat/prevent Alzheimer's disease. And please give your rationale. (DS)
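The repeated-prompting protocol (10 runs per prompt, pooling all suggested candidates) can be sketched as below. `ask_model` is a stand-in for the actual ChatGPT call and its response parsing, which are not shown; the candidate pool is purely illustrative:

```python
import random

# Sketch of the repeated-prompting protocol: each prompt is issued 10 times
# and all suggested candidates are pooled. `ask_model` stubs the real
# ChatGPT API call with samples from a fixed hypothetical pool.
def ask_model(prompt: str, seed: int) -> list:
    rng = random.Random(seed)
    pool = ["metformin", "lithium", "losartan", "nilotinib"]
    return rng.sample(pool, k=2)  # stub: two suggestions per run

def collect_candidates(prompt: str, repeats: int = 10) -> set:
    suggested = set()
    for run in range(repeats):
        suggested.update(ask_model(prompt, seed=run))
    return suggested

candidates = collect_candidates("Please provide the approved drugs ...")
print(sorted(candidates))
```

Pooling across runs captures the run-to-run variability of generated answers; each pooled candidate is then checked against the verification criteria described next.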
We examined the answers generated by ChatGPT to determine whether they met the following criteria: (1) whether they were already present in existing BKGs (specifically, iBKH30 for drugs and ADInt36 for DSs); (2) whether they were documented in clinical trials; and (3) whether they were supported by existing literature.
AD drug/DS repurposing based on BKG. Building upon our previous research30,36, we employed knowledge graph embedding (KGE) algorithms to compute machine-readable embedding vectors for entities and relations within the BKGs (iBKH and ADInt) while preserving the graph structure. We then leveraged these learned embedding vectors to conduct link prediction, i.e., predicting potential relations between pairs of entities, and generated suggested drug and DS candidates for AD. This approach identifies relationships absent from the existing BKGs, enabling the exploration of novel therapeutic possibilities in the context of AD.
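To make the link-prediction step concrete, here is a toy TransE-style scoring sketch: candidates are ranked by the plausibility of the unseen triple (drug, treats, AD). The 2-d embeddings and entity names are invented for illustration; the study used embeddings trained on iBKH/ADInt, and the specific KGE algorithm here (TransE) is one common choice, not necessarily the one used:

```python
import math

# Toy TransE-style link prediction (hypothetical 2-d embeddings; real
# embeddings are learned from iBKH/ADInt by a KGE algorithm).
ENT = {"drug_A": [0.9, 0.1], "drug_B": [0.1, 0.8], "AD": [1.0, 0.6]}
REL = {"treats": [0.1, 0.5]}

def transe_score(head, rel, tail):
    """TransE plausibility: -||h + r - t||; higher means more plausible."""
    h, r, t = ENT[head], REL[rel], ENT[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Rank candidate drugs by predicted plausibility of (drug, treats, AD)
ranked = sorted(["drug_A", "drug_B"],
                key=lambda d: transe_score(d, "treats", "AD"), reverse=True)
print(ranked)  # → ['drug_A', 'drug_B']
```

Triples that score highly but do not appear in the BKG become the suggested repurposing candidates.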
2.4 Performance Evaluation of ChatGPT and BKG in Knowledge Reasoning
To assess the comprehensiveness of ChatGPT's knowledge base, we further examined its capability to establish associations between the proposed drug and DS candidates and AD. In our previous studies, we investigated potential pharmaceuticals and DSs for the treatment or prevention of AD using link prediction techniques30,36. Building upon these findings, our objective was to evaluate ChatGPT's knowledge base by examining the associations it provides between these hypothetical drug/DS candidates and AD, as well as the corresponding references it offers to support these hypotheses. To accomplish this, we formulated scenario-based inquiries as follows:
- Please show the association/linkage (direct link or indirect link) between [Tested Drug] and Alzheimer's disease (AD) in a structured way (like a triplet). And please provide the reference for your finding.
- Please show the association/linkage (direct link or indirect link) between [Tested DS] and Alzheimer's disease (AD) in a structured way (like a triplet). And please provide the reference for your finding.
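The structured (triplet) format requested from ChatGPT can be represented as a chain of (head, relation, tail) triples, where an indirect link is a multi-hop chain through an intermediate concept. The entities and relations below are illustrative placeholders, not verified findings ([Tested Drug] is left as in the prompt):

```python
from typing import NamedTuple

# Triplet representation of a (possibly indirect) drug-disease association.
class Triplet(NamedTuple):
    head: str
    relation: str
    tail: str

# Hypothetical indirect link: drug -> mechanism -> disease
chain = [
    Triplet("[Tested Drug]", "modulates", "amyloid-beta clearance"),
    Triplet("amyloid-beta clearance", "is_implicated_in", "Alzheimer's disease"),
]
for t in chain:
    print(f"({t.head}, {t.relation}, {t.tail})")
```

A direct link would be a single triplet such as ([Tested Drug], treats, Alzheimer's disease); ChatGPT's returned chains and references were then checked against the BKG and the literature.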