2.1 Large Language Model
ChatGPT, developed by OpenAI (San Francisco, CA), is an AI chatbot launched in November 2022. Derived from GPT-3, one of the Generative Pre-trained Transformer models, it is designed to generate targeted, well-crafted, conversational, and easily understandable responses to user prompts through interactive dialogue8. To enhance ChatGPT's capabilities, the initial models were refined with a combination of supervised and reinforcement learning methods. Notably, part of ChatGPT's training involved Reinforcement Learning from Human Feedback (RLHF), in which a reward model is trained on human feedback in the form of model responses ranked by quality9,10; this reward model then guides fine-tuning of the language model with Proximal Policy Optimization (PPO). Because ChatGPT is trained on general rather than domain-specific data, it is versatile across many fields, although it can be further specialized. The release of an updated version built on GPT-4 in March 2023 marked a significant advance, producing more human-like responses that some have interpreted as early signs of general intelligence11.
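For context, the standard RLHF formulation can be summarized as follows (a general sketch of the widely used approach, not necessarily the exact objective used to train ChatGPT): a reward model r_theta is first fitted to human preference rankings, and the language model is then fine-tuned against that reward with PPO's clipped objective.

```latex
% Reward model: fit r_theta to human preferences, where y_w is the response
% ranked above y_l for the same prompt x (Bradley-Terry-style loss).
\mathcal{L}_{\mathrm{RM}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \bigl[ \log \sigma\bigl( r_\theta(x, y_w) - r_\theta(x, y_l) \bigr) \bigr]

% PPO fine-tuning: maximize the clipped surrogate objective, where r_t(\phi)
% is the probability ratio between the updated and previous policy and
% \hat{A}_t is the advantage estimated from the learned reward.
L^{\mathrm{CLIP}}(\phi) = \mathbb{E}_t \Bigl[ \min \bigl( r_t(\phi)\,\hat{A}_t,\;
  \mathrm{clip}\bigl(r_t(\phi),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t \bigr) \Bigr],
\qquad
r_t(\phi) = \frac{\pi_\phi(a_t \mid s_t)}{\pi_{\phi_{\mathrm{old}}}(a_t \mid s_t)}
```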
2.2 Data Source
Our investigation assembled a comprehensive compilation of 986 questions related to childhood cancer, sourced from authoritative websites and leading institutions. These included the Children's Oncology Group (COG), the American Cancer Society, the American Society of Pediatric Hematology/Oncology, the International Society of Pediatric Oncology (SIOP), the Children's Cancer Research Fund, and the MedlinePlus Medical Encyclopedia. We also extracted questions from the websites of major hospitals, including the Mayo Clinic, Cincinnati Children's Hospital, and Boston Children's Hospital. To broaden the diversity of our collection, we included questions of concern to parents of children undergoing cancer treatment posted on major Chinese medical online consultation platforms such as Ding Xiang Doctor, Chun Yu Doctor, You Lai Doctor, and Wei Yi. For inclusivity and thoroughness, we also incorporated questions pertinent to pediatric oncology from the National Qualification Examination for Medical Practitioners, accredited by the Chinese Medical Doctor Association. The 986 questions cover many aspects of childhood cancer, including prevention, diagnosis, treatment, rehabilitation, and psychology, and served as the sample pool for this study. This curated selection spans a wide range of childhood cancers, such as leukemia, brain and nervous system tumors, neuroblastoma, and nephroblastoma.
Throughout this selection process, we rigorously applied predefined criteria to exclude questions: 1) among questions with similar or identical meanings, only one was retained for analysis; 2) questions that were subjective or whose answers vary by individual, such as "What are the chances of my child's cancer coming back?", were excluded; 3) questions with ambiguous meanings, for example, "How does cancer affect a child's body?", were excluded; 4) non-medical questions related to the illness, such as "What are some childhood cancer support groups?", were excluded; 5) when several questions elicited essentially the same answer, we typically retained only one, for example, "What is the gold standard for diagnosing childhood cancer?" and "Which tests are typically required for a definitive diagnosis of childhood tumors?", both of which are answered by pathological biopsy; given how common such questions are and how uniform the responses, including too many of them could lead to an overestimation of the model's accuracy; 6) questions not applicable to pediatric cancer patients, such as "What are the ethical considerations in pediatric cancer research?", were excluded. Following the screening process, our study ultimately included a total of 150 questions. Figure 1 presents a flowchart detailing the selection of questions pertaining to childhood cancer.
The questions were systematically categorized into distinct groups based on their subjects: Basic Knowledge, Diagnosis, Treatment, Prevention, and Humanistic Care and Emotional Support. Grammatical adjustments were made to specific questions to enhance their clarity and precision.
This study was not classified as Human Subjects Research and was therefore exempt from Institutional Review Board approval.
2.3 Response Generation
In this study, we employed ChatGPT Plus, a subscription-based service running the GPT-4 architecture released on March 14, 2023. Each question was entered independently through the "New Chat" function. To assess the reproducibility of ChatGPT's responses, and consistent with prior research8,12, each question was submitted twice. To accurately reflect the context of online consultations for children with cancer and their families, we entered the questions in their original form without introducing additional prompts. To reduce contextual bias, each question was posed as a distinct, standalone prompt using the same function. In keeping with blinded methodology, the tasks of collecting and entering questions were divided among several researchers, two of whom were designated specifically to enter questions into GPT-4. All inquiries were conducted in English.
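The questions were entered by hand through the ChatGPT Plus web interface; a programmatic analogue of the same procedure, shown only for illustration (the use of the OpenAI Python client and the ask_twice helper are our assumptions, not part of the study workflow), might look like this:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

def ask_twice(question: str, model: str = "gpt-4") -> tuple[str, str]:
    """Submit the same question in two independent conversations,
    mirroring the study's use of the "New Chat" function (no shared context,
    no prompting beyond the question itself)."""
    answers = []
    for _ in range(2):
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],  # question in its original form
        )
        answers.append(completion.choices[0].message.content)
    return answers[0], answers[1]
```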
2.4 Grading System
Two pediatric oncologists from Children's National Regional Medical Center, each with extensive experience in academic practice, independently evaluated the responses generated by ChatGPT. The assessment used a four-tier grading system: 1) correct and comprehensive, requiring no additional information from a pediatric oncologist; 2) correct but incomplete, requiring supplementary information; 3) partially correct and partially incorrect; 4) completely incorrect. Each response was graded independently by both assessors. Reproducibility was evaluated from the consistency between the two responses to the same question: if the responses were similar, only the first response from ChatGPT was graded; if they diverged, each was graded independently, and a lack of reproducibility was noted when their grades differed. When the two answers to the same question received different grades, a third reviewer, a senior pediatric oncologist and nationally recognized clinical key specialist leader with over thirty years of experience, adopted the less favorable rating (the higher numerical grade) as the final evaluation. For instance, if the answers were rated 2 (correct but incomplete) and 3 (partially correct and partially incorrect), the final evaluation would be 3. This approach was chosen because a poor response is more likely to mislead patients and their families and therefore carries greater significance for our research. Discrepancies in accuracy and reproducibility were resolved through a blinded review conducted by this senior pediatric oncologist. The reviewers were blind to the origin of the responses, ensuring a double-blind assessment, and the sequence of responses was randomized to reduce bias.
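The adjudication rule amounts to taking the worse of the two grades on the four-tier scale, as in the following minimal sketch (the function name and the use of Python are ours, for illustration only):

```python
GRADE_LABELS = {
    1: "correct and comprehensive",
    2: "correct but incomplete",
    3: "partially correct and partially incorrect",
    4: "completely incorrect",
}

def final_grade(grade_first: int, grade_second: int) -> int:
    """When two grades for the same question diverge, adopt the less favorable
    one (the higher number), since a poorer response is more likely to mislead
    patients and their families."""
    return max(grade_first, grade_second)

# Example from the text: grades 2 and 3 -> final evaluation 3.
assert final_grade(2, 3) == 3
```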
2.5 Humanistic Care and Emotional Support
In our study, we analyzed inquiries from pediatric cancer patients and their families on online consultation platforms. We observed that these inquiries sought not only information about cancer care but also emotional support. To evaluate ChatGPT's ability to provide humanistic care and emotional support comparable to that of a skilled physician, we tailored ten questions based on the specific needs of these users. We then assessed ChatGPT across three critical dimensions: demonstration of empathy, effectiveness of communication, and provision of emotional support, all according to predefined criteria (see Table 1).
Table 1
Assessment criteria for Humanistic Care and Emotional Support.
Criteria | Level A | Level B | Level C
Demonstration of empathya | Fully embodied | Present but insufficient | Completely absent
Effective communicationb | Fully embodied | Present but insufficient | Completely absent
Emotional supportc | Fully embodied | Present but insufficient | Completely absent
a: Demonstrates an understanding of the patient's or family's emotions and situation when answering the question.
b: Explains the answer in precise, clear, and plain language without excessive technical terms.
c: Uses supportive or encouraging language when answering the question, such as "You'll be fine" and "It will get better tomorrow."
2.6 Evaluation of Quality
The DISCERN instrument, previously employed to assess the reliability and quality of online health information, is structured into three sections13,14. The first section, comprising eight questions, evaluates the reliability of a publication; the second, containing seven questions, assesses the quality of information on treatment options; and the third addresses the overall quality of the publication as a source of treatment-related information. In this study, because not all questions pertained to treatment, the DISCERN scale was modified to evaluate only the reliability of ChatGPT responses. The resulting modified DISCERN (mDISCERN) scale incorporates only the first section (see Supplementary Table 1). Each mDISCERN item is scored 1 point for 'no', 2–4 points for a partial answer, and 5 points for 'yes'. Total scores below 40% (8–15 points) were classified as poor, 40–79% (16–31 points) as fair, and 80% or above (32–40 points) as good13,15. For questions that elicited two different responses, we scored each response independently and used the average of the two scores.
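The scoring arithmetic can be summarized as follows (a minimal sketch of the rules described above; the function names are illustrative only):

```python
def mdiscern_total(item_scores: list[int]) -> int:
    """Sum the eight reliability items of the mDISCERN scale
    (each scored 1 = 'no', 2-4 = partial, 5 = 'yes'), giving a total of 8-40."""
    assert len(item_scores) == 8 and all(1 <= s <= 5 for s in item_scores)
    return sum(item_scores)

def mdiscern_rating(total: float) -> str:
    """Classify a total score: 8-15 poor, 16-31 fair, 32-40 good."""
    if total <= 15:
        return "poor"
    if total <= 31:
        return "fair"
    return "good"

def question_score(first: list[int], second: list[int]) -> float:
    """When a question yields two different responses, score each independently
    and take the average of the two totals."""
    return (mdiscern_total(first) + mdiscern_total(second)) / 2
```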
2.7 Statistical Analysis
The distribution of grades among responses was quantified and reported as percentages. To assess the reproducibility of ChatGPT's responses, we compared the pair of responses generated for each question. Reproducibility was judged by the consistency of grading categories between the two responses; when the responses to the same question differed, each was graded independently. For this analysis, responses were grouped into two categories: grades 1 and 2 versus grades 3 and 4. A pair whose grades fell into different categories was considered significantly different, indicating a lack of reproducibility. All statistical analyses were conducted in Microsoft Excel (version 16.69.1).
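Although the tallies were performed in Excel, the two computations (grade distribution and category-based reproducibility) amount to the following (an illustrative sketch; variable and function names are ours):

```python
from collections import Counter

def grade_distribution(grades: list[int]) -> dict[int, float]:
    """Percentage of responses receiving each grade on the four-tier scale."""
    counts = Counter(grades)
    return {g: 100 * counts.get(g, 0) / len(grades) for g in (1, 2, 3, 4)}

def reproducible(grade_first: int, grade_second: int) -> bool:
    """Two responses to the same question are reproducible when their grades
    fall in the same broad category: grades 1-2 versus grades 3-4."""
    return (grade_first <= 2) == (grade_second <= 2)
```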