Study Design
This comparative study evaluated the knowledge and diagnostic capabilities of GPT-3.5, GPT-4, and MAC for 150 rare diseases using real-world clinical case reports from the MEDLINE database. Each case was structured into two scenarios, an initial presentation and a complete presentation, representing different stages of patient care. Figure 1 shows the flowchart of this study.
Data Collection
Selection of diseases to be studied
This study involved 150 rare diseases selected from a pool of over 7,000 across 33 types in the Orphanet Database, a comprehensive resource co-funded by the European Commission13.
Owing to the uneven distribution of rare diseases across types, a normalized weighted random sampling method was used for selection to ensure balanced representation. The sampling weights were based on the disease count in each type and moderated by a natural logarithm transformation14, 15.
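For illustration, one plausible reading of this sampling scheme is sketched below in Python; the type names, counts, and seed are hypothetical, not the study's data.

```python
import numpy as np

# Hypothetical disease counts per Orphanet type (illustrative values only).
type_counts = {"neurological": 1200, "metabolic": 900, "dermatological": 150}

# Log-transform the counts to dampen the influence of large types,
# then normalize so the weights sum to 1.
names = list(type_counts)
weights = np.log([type_counts[n] for n in names])
weights = weights / weights.sum()

rng = np.random.default_rng(seed=42)
# Draw a type in proportion to its normalized log-weight; a disease is then
# chosen at random within the sampled type.
sampled_type = rng.choice(names, p=weights)
```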
Search for clinical case reports
After the diseases were selected for investigation, clinical case reports published after January 2022 were identified from the MEDLINE database. The search was conducted by one investigator and reviewed by a second investigator.
Inclusion and exclusion criteria
Clinical case reports were included if they 1) presented a complete clinical picture of a real patient diagnosed with a rare disease, including demographics, symptoms, medical history, and the diagnostic tests performed, and 2) were published in English. Case reports were excluded if they 1) lacked the information required to make a diagnosis, 2) were not published in English, 3) were animal studies, 4) contained factual errors that would influence the diagnosis, or 5) reported diseases other than those targeted by the literature search.
Manual screening
Two blinded investigators independently screened the search results against the predefined criteria. The first investigator selected case reports for testing, and the second investigator then re-screened the selection. Any disagreements were resolved through group discussion.
For each disease, the search results were screened until an eligible case report was identified. If no suitable report was found, a new random sample was drawn within the same disease category to select a different disease.
Data Preparation
Data Extraction
One investigator manually extracted data from each clinical case report, and the extraction was subsequently reviewed by a specialist doctor. The extracted information included patient demographics, clinical presentation, medical history, physical examination findings, and the results of diagnostic tests (e.g., genetic tests, biopsies, radiographic examinations), along with the final diagnosis.
Data Curation
Final and possible differential diagnoses from the original texts were removed from the inputs sent to the LLMs. Patient information was presented in two scenarios: initial and complete presentations, each representing a different stage in the diagnostic process.
Initial Presentation: This scenario simulates the first clinical encounter, focusing on the LLM's ability to suggest probable diagnoses and further tests from initial information such as demographics, clinical presentation, physical examination, medical history, and routine test results.

Complete Presentation: This scenario simulates a fully informed diagnostic work-up, evaluating the LLM's capacity to reach a final diagnosis from comprehensive data, including all initial information plus the results of additional diagnostic tests. Supplementary File 1 provides an example of patient information.
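For illustration, a single case under this scheme could be organized as two nested records; the field names and values below are hypothetical and do not reproduce Supplementary File 1.

```python
# Hypothetical representation of one case; all field names and values are
# illustrative, not taken from Supplementary File 1.
case = {
    "initial_presentation": {
        "demographics": "34-year-old woman",
        "clinical_presentation": "progressive proximal muscle weakness",
        "medical_history": "unremarkable",
        "physical_examination": "reduced deep tendon reflexes",
        "routine_tests": "elevated creatine kinase",
    },
    "complete_presentation": {
        # All initial information, plus results of additional diagnostic tests.
        "additional_tests": "muscle biopsy, genetic panel, MRI findings",
    },
}
```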
Model testing
Model selection
GPT-3.5-turbo and GPT-4, which are widely tested for medical applications, were selected as the base models for testing.
Multi-agent conversation
The MAC framework, designed to diagnose rare diseases and generate disease-specific knowledge (Fig. 2), was built on the AutoGen framework using GPT-4. AutoGen is a novel framework that facilitates multi-agent collaboration using LLMs9. This setup simulated a medical team consultation with three doctor agents and one supervising agent. The doctor agents collected information on the patient's condition, engaged in medical reasoning, and shared opinions in joint discussions. The supervising agent oversaw these conversations, challenged the doctors' findings and perspectives, and facilitated a consensus. The final output was derived through multiple rounds of collaborative discussion.
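As an illustration of how such a team could be wired together, the following is a minimal sketch using the open-source pyautogen API; the agent names, system messages, and round limit are our assumptions, not the study's exact configuration.

```python
import autogen  # requires the pyautogen package

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "..."}]}

# Three doctor agents plus one supervising agent; system messages are
# illustrative, not the prompts used in the study.
doctors = [
    autogen.AssistantAgent(
        name=f"doctor_{i}",
        system_message="You are a physician. Reason about the case and share a diagnosis.",
        llm_config=llm_config,
    )
    for i in range(1, 4)
]
supervisor = autogen.AssistantAgent(
    name="supervisor",
    system_message="Challenge the doctors' reasoning and drive the team to a consensus diagnosis.",
    llm_config=llm_config,
)

# A group chat lets the agents hold multiple rounds of joint discussion.
group_chat = autogen.GroupChat(agents=doctors + [supervisor], messages=[], max_round=12)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

# The user proxy feeds in the patient information and collects the output.
user = autogen.UserProxyAgent(
    name="user", human_input_mode="NEVER", code_execution_config=False
)
user.initiate_chat(manager, message="Patient information: ...")
```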
Prompt engineering
While the input-output (IO) prompt technique was adopted for testing the base models GPT-3.5 and GPT-4, three other prompt techniques were also tested to evaluate whether prompt engineering enhanced diagnostic performance: zero-shot chain of thought (COT), tree of thoughts (TOT), and reflection of thoughts (ROT). ROT, developed in our previous study, enables models to retrospectively adjust their initial outputs, thereby potentially enhancing the overall quality and accuracy of the output16.
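For reference, the IO and zero-shot COT conditions typically differ only in the instruction wrapped around the case text; the templates below are a hypothetical sketch (the study's exact prompts are not reproduced here), with ROT shown as a simplified reflect-and-revise second pass.

```python
# Illustrative prompt templates; wording is ours, not the study's.
IO_PROMPT = (
    "Patient information:\n{case}\n\n"
    "Provide the most likely diagnosis and several possible diagnoses."
)

# Zero-shot COT appends an explicit reasoning trigger to the IO prompt.
COT_PROMPT = IO_PROMPT + "\n\nLet's think step by step."

# ROT (reference 16) adds a reflection pass over the model's first answer;
# this single revise step is our simplified reading of the technique.
ROT_REFLECT = (
    "Here is your previous answer:\n{draft}\n\n"
    "Review it critically and revise the diagnoses if needed."
)
```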
Generating Disease Specific Knowledge
GPT-3.5, GPT-4, and MAC were assessed for their knowledge of each rare disease covered in the study, including disease definition, epidemiology, clinical description, etiology, diagnostic methods, differential diagnosis, antenatal diagnosis, genetic counseling, management and treatment, and prognosis.
Generating Diagnosis and Recommended Tests
For the initial presentation, the LLMs were tasked with generating one most likely diagnosis, several possible diagnoses, and further diagnostic tests. For the complete presentation, the LLMs were tasked with generating one most likely diagnosis and several possible diagnoses.
Performance Evaluation
Performance was evaluated through panel discussions among three doctors who were blinded to the models and reviewed the outputs in randomized order.
Disease knowledge evaluation
The knowledge generated by the LLMs was evaluated using a Likert scale. As described by a previous study, the evaluation metrics included inaccurate or inappropriate content, omissions, potentially harmful content, and bias17.
Diagnostic ability evaluation
The most likely diagnosis was considered accurate if it matched the exact diagnosis. The possible diagnoses were considered accurate if they included the exact diagnosis. Recommended tests were rated as helpful or unhelpful in reaching the correct diagnosis.
The most likely diagnosis and possible diagnoses were also rated using the scale described by Bond et al.18: 5 for the exact diagnosis, 4 for a very close diagnosis, 3 for a closely related and potentially helpful diagnosis, 2 for a related diagnosis unlikely to help, and 0 for an unrelated diagnosis. Further diagnostic tests were assessed on their helpfulness using a five-point Likert scale ranging from 1 (strongly agree) to 5 (strongly disagree).
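For score aggregation, the Bond et al. scale can be encoded as a simple lookup; this is a sketch, and the judgment labels are ours (note that the scale as described assigns no score of 1).

```python
# Mapping from a panel judgment label to the Bond et al. score.
BOND_SCALE = {
    "exact": 5,
    "very_close": 4,
    "closely_related_helpful": 3,
    "related_unlikely_to_help": 2,
    "unrelated": 0,
}

def score_diagnosis(judgment: str) -> int:
    """Return the Bond score for a panel judgment label."""
    return BOND_SCALE[judgment]
```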
Statistical Analysis
Statistical analyses were performed using SPSS version 25 (IBM, Armonk, NY, USA) and GraphPad Prism version 8 (GraphPad Software, San Diego, CA, USA). Continuous variables are presented as means and standard deviations, and the Shapiro–Wilk test was used to check for normality. Depending on the distribution, a one-way ANOVA or a Kruskal–Wallis test was applied. Categorical data were expressed as counts and rates and compared using the chi-square test.
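Although the analyses were run in SPSS and Prism, the decision logic is equivalent to the following SciPy sketch; the group sizes and values are illustrative, not the study's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative per-model scores for three groups (e.g., GPT-3.5, GPT-4, MAC).
groups = [rng.normal(loc=m, scale=1.0, size=150) for m in (2.5, 3.2, 3.8)]

# Shapiro-Wilk normality check on each group.
normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)

# One-way ANOVA if all groups look normal, otherwise Kruskal-Wallis.
if normal:
    stat, p = stats.f_oneway(*groups)
else:
    stat, p = stats.kruskal(*groups)

# Chi-square test on a hypothetical 2x3 contingency table of
# correct/incorrect diagnosis counts per model.
table = np.array([[90, 110, 125], [60, 40, 25]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
```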