2.1 Large Language Model
ChatGPT, developed by OpenAI (San Francisco, CA), is an AI chatbot launched in November 2022. Derived from GPT-3, one of the Generative Pre-trained Transformer models, it is designed to generate targeted, well-crafted, conversational, and easily understandable responses to user prompts through interactive dialogue8. To enhance ChatGPT's capabilities, the initial models were refined with a combination of supervised and reinforcement learning methods. Notably, part of ChatGPT's training involved Reinforcement Learning from Human Feedback (RLHF), in which a reward model is trained on human feedback in the form of model responses ranked by quality9,10; this reward model then guides fine-tuning of the language model with Proximal Policy Optimization (PPO). Because ChatGPT is trained on general rather than domain-specific data, it is versatile across many fields, although it can be further specialized. The release of an updated version built on GPT-4 in March 2023 marked a significant advance, producing more human-like responses that some have interpreted as early signs of general intelligence11.
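For context, the standard RLHF formulation can be summarized as follows (a general sketch of the widely used approach, not necessarily the exact objective used to train ChatGPT): a reward model r_theta is first fitted to human preference rankings, and the language model is then fine-tuned against that reward with PPO's clipped objective.

```latex
% Reward model: fit r_theta to human preferences, where y_w is the response
% ranked above y_l for the same prompt x (Bradley-Terry-style loss).
\mathcal{L}_{\mathrm{RM}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \bigl[ \log \sigma\bigl( r_\theta(x, y_w) - r_\theta(x, y_l) \bigr) \bigr]

% PPO fine-tuning: maximize the clipped surrogate objective, where r_t(\phi)
% is the probability ratio between the updated and previous policy and
% \hat{A}_t is the advantage estimated from the learned reward.
L^{\mathrm{CLIP}}(\phi) = \mathbb{E}_t \Bigl[ \min \bigl( r_t(\phi)\,\hat{A}_t,\;
  \mathrm{clip}\bigl(r_t(\phi),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t \bigr) \Bigr],
\qquad
r_t(\phi) = \frac{\pi_\phi(a_t \mid s_t)}{\pi_{\phi_{\mathrm{old}}}(a_t \mid s_t)}
```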
2.2 Data Source
Our investigation assembled a comprehensive compilation of 986 questions related to childhood cancer, sourced from authoritative websites and leading institutions. These included the Children's Oncology Group (COG), the American Cancer Society, the American Society of Pediatric Hematology/Oncology, the International Society of Pediatric Oncology (SIOP), the Children's Cancer Research Fund, and the MedlinePlus Medical Encyclopedia. We also extracted questions from the websites of major hospitals, including the Mayo Clinic, Cincinnati Children's Hospital, and Boston Children's Hospital. To broaden the diversity of our collection, we included questions of concern to parents of children undergoing cancer treatment posted on major Chinese medical online consultation platforms such as Ding Xiang Doctor, Chun Yu Doctor, You Lai Doctor, and Wei Yi. For inclusivity and thoroughness, we also incorporated questions pertinent to pediatric oncology from the National Qualification Examination for Medical Practitioners, accredited by the Chinese Medical Doctor Association. The 986 questions cover many aspects of childhood cancer, including prevention, diagnosis, treatment, rehabilitation, and psychology, and served as the sample pool for this study. This curated selection spans a wide range of childhood cancers, such as leukemia, brain and nervous system tumors, neuroblastoma, and nephroblastoma.
Throughout this selection process, we rigorously applied predefined criteria to exclude questions: 1) among questions with similar or identical meanings, only one was retained for analysis; 2) questions that were subjective or whose answers vary by individual, such as "What are the chances of my child's cancer coming back?", were excluded; 3) questions with ambiguous meanings, for example, "How does cancer affect a child's body?", were excluded; 4) non-medical questions related to the illness, such as "What are some childhood cancer support groups?", were excluded; 5) when several questions elicited essentially the same answer, we typically retained only one, for example, "What is the gold standard for diagnosing childhood cancer?" and "Which tests are typically required for a definitive diagnosis of childhood tumors?", both of which are answered by pathological biopsy; given how common such questions are and how uniform the responses, including too many of them could lead to an overestimation of the model's accuracy; 6) questions not applicable to pediatric cancer patients, such as "What are the ethical considerations in pediatric cancer research?", were excluded. Following the screening process, our study ultimately included a total of 150 questions. Figure 1 presents a flowchart detailing the selection of questions pertaining to childhood cancer.
The questions were systematically categorized into distinct groups based on their subjects: Basic Knowledge, Diagnosis, Treatment, Prevention, and Humanistic Care and Emotional Support. Grammatical adjustments were made to specific questions to enhance their clarity and precision.
This study was not classified as Human Subjects Research and was therefore exempt from Institutional Review Board approval.
2.3 Response Generation
In this study, we employed ChatGPT Plus, a subscription-based service running the GPT-4 architecture released on March 14, 2023. Each question was entered independently through the "New Chat" function. To assess the reproducibility of ChatGPT's responses, and consistent with prior research8,12, each question was submitted twice. To accurately reflect the context of online consultations for children with cancer and their families, we entered the questions in their original form without introducing additional prompts. To reduce contextual bias, each question was posed as a distinct, standalone prompt using the same function. In keeping with blinded methodology, the tasks of collecting and entering questions were divided among several researchers, two of whom were designated specifically to enter questions into GPT-4. All inquiries were conducted in English.
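The questions were entered by hand through the ChatGPT Plus web interface; a programmatic analogue of the same procedure, shown only for illustration (the use of the OpenAI Python client and the ask_twice helper are our assumptions, not part of the study workflow), might look like this:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

def ask_twice(question: str, model: str = "gpt-4") -> tuple[str, str]:
    """Submit the same question in two independent conversations,
    mirroring the study's use of the "New Chat" function (no shared context,
    no prompting beyond the question itself)."""
    answers = []
    for _ in range(2):
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],  # question in its original form
        )
        answers.append(completion.choices[0].message.content)
    return answers[0], answers[1]
```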
2.4 Grading System
Two pediatric oncologists from Children's National Regional Medical Center, each with extensive experience in academic practice, independently evaluated the responses generated by ChatGPT. The assessment used a four-tier grading system: 1) correct and comprehensive, requiring no additional information from a pediatric oncologist; 2) correct but incomplete, requiring supplementary information; 3) partially correct and partially incorrect; 4) completely incorrect. Each response was graded independently by both assessors. Reproducibility was evaluated from the consistency between the two responses to the same question: if the responses were similar, only the first response from ChatGPT was graded; if they diverged, each was graded independently, and a lack of reproducibility was noted when their grades differed. When the two answers to the same question received different grades, a third reviewer, a senior pediatric oncologist and nationally recognized clinical key specialist leader with over thirty years of experience, adopted the less favorable rating (the higher numerical grade) as the final evaluation. For instance, if the answers were rated 2 (correct but incomplete) and 3 (partially correct and partially incorrect), the final evaluation would be 3. This approach was chosen because a poor response is more likely to mislead patients and their families and therefore carries greater significance for our research. Discrepancies in accuracy and reproducibility were resolved through a blinded review conducted by this senior pediatric oncologist. The reviewers were blind to the origin of the responses, ensuring a double-blind assessment, and the sequence of responses was randomized to reduce bias.
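The adjudication rule amounts to taking the worse of the two grades on the four-tier scale, as in the following minimal sketch (the function name and the use of Python are ours, for illustration only):

```python
GRADE_LABELS = {
    1: "correct and comprehensive",
    2: "correct but incomplete",
    3: "partially correct and partially incorrect",
    4: "completely incorrect",
}

def final_grade(grade_first: int, grade_second: int) -> int:
    """When two grades for the same question diverge, adopt the less favorable
    one (the higher number), since a poorer response is more likely to mislead
    patients and their families."""
    return max(grade_first, grade_second)

# Example from the text: grades 2 and 3 -> final evaluation 3.
assert final_grade(2, 3) == 3
```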
2.5 Humanistic Care and Emotional Support
In our study, we analyzed inquiries from pediatric cancer patients and their families on online consultation platforms. We observed that these inquiries sought not only information about cancer care but also emotional support. To evaluate ChatGPT's ability to provide humanistic care and emotional support comparable to that of a skilled physician, we tailored ten questions based on the specific needs of these users. We then assessed ChatGPT across three critical dimensions: demonstration of empathy, effectiveness of communication, and provision of emotional support, all according to predefined criteria (see Table 1).
Table 1
Assessment criteria for Humanistic Care and Emotional Support.
Criteria | Level A | Level B | Level C
Demonstration of empathya | Fully embodied | Present but insufficient | Completely absent
Effective communicationb | Fully embodied | Present but insufficient | Completely absent
Emotional supportc | Fully embodied | Present but insufficient | Completely absent
a: Demonstrates an understanding of the patient's or family's emotions and situation when answering the question.
b: Explains the answer in precise, clear, and plain language without excessive technical terms.
c: Uses supportive or encouraging language when answering the question, such as "You'll be fine" and "It will get better tomorrow."
2.6 Evaluation of Quality
The DISCERN instrument, previously employed to assess the reliability and quality of online health information, is structured into three sections13,14. The first section, comprising eight questions, evaluates the reliability of a publication; the second, containing seven questions, assesses the quality of information on treatment options; and the third addresses the overall quality of the publication as a source of treatment-related information. In this study, because not all questions pertained to treatment, the DISCERN scale was modified to evaluate only the reliability of ChatGPT responses. The resulting modified DISCERN (mDISCERN) scale incorporates only the first section (see Supplementary Table 1). Each mDISCERN item is scored 1 point for 'no', 2–4 points for a partial answer, and 5 points for 'yes'. Total scores below 40% (8–15 points) were classified as poor, 40–79% (16–31 points) as fair, and 80% or above (32–40 points) as good13,15. For questions that elicited two different responses, we scored each response independently and used the average of the two scores.
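The scoring arithmetic can be summarized as follows (a minimal sketch of the rules described above; the function names are illustrative only):

```python
def mdiscern_total(item_scores: list[int]) -> int:
    """Sum the eight reliability items of the mDISCERN scale
    (each scored 1 = 'no', 2-4 = partial, 5 = 'yes'), giving a total of 8-40."""
    assert len(item_scores) == 8 and all(1 <= s <= 5 for s in item_scores)
    return sum(item_scores)

def mdiscern_rating(total: float) -> str:
    """Classify a total score: 8-15 poor, 16-31 fair, 32-40 good."""
    if total <= 15:
        return "poor"
    if total <= 31:
        return "fair"
    return "good"

def question_score(first: list[int], second: list[int]) -> float:
    """When a question yields two different responses, score each independently
    and take the average of the two totals."""
    return (mdiscern_total(first) + mdiscern_total(second)) / 2
```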
2.7 Statistical Analysis
The distribution of grades among responses was quantified and reported as percentages. To assess the reproducibility of ChatGPT's responses, we compared the pair of responses generated for each question. Reproducibility was judged by the consistency of grading categories between the two responses; when the responses to the same question differed, each was graded independently. For this analysis, responses were grouped into two categories: grades 1 and 2 versus grades 3 and 4. A pair whose grades fell into different categories was considered significantly different, indicating a lack of reproducibility. All statistical analyses were conducted in Microsoft Excel (version 16.69.1).
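Although the tallies were performed in Excel, the two computations (grade distribution and category-based reproducibility) amount to the following (an illustrative sketch; variable and function names are ours):

```python
from collections import Counter

def grade_distribution(grades: list[int]) -> dict[int, float]:
    """Percentage of responses receiving each grade on the four-tier scale."""
    counts = Counter(grades)
    return {g: 100 * counts.get(g, 0) / len(grades) for g in (1, 2, 3, 4)}

def reproducible(grade_first: int, grade_second: int) -> bool:
    """Two responses to the same question are reproducible when their grades
    fall in the same broad category: grades 1-2 versus grades 3-4."""
    return (grade_first <= 2) == (grade_second <= 2)
```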