The outcomes of the current study, in which the quality, reliability, readability, and originality of the space maintainer information provided by ChatGPT-3.5 and ChatGPT-4 were assessed, revealed that both tools had similar mean values for the assessed parameters. ChatGPT-3.5 demonstrated outstanding quality and ChatGPT-4 demonstrated good quality, with median values of 5 and 4, respectively. Both tools also demonstrated high reliability.
In today's rapidly developing and changing world, information circulates continuously, and individuals can access any information that meets their curiosity with a single click. People previously tended to seek information on the internet and social media; however, with the explosive growth of natural language processing (NLP) and large language models (LLMs), interest has shifted toward AI-based chatbots. ChatGPT, one of the best known of these models, has gained phenomenal popularity, and according to our literature review, the features of the space maintainer-related information it provides, including reliability and quality, had not been assessed previously. Accordingly, the current study aimed to assess the quality, reliability, and other features of the space maintainer-related information provided by ChatGPT-3.5 and to compare these outcomes with those of the upgraded, paid version, ChatGPT-4.
In a similar previous study, Duran et al. [6] assessed the quality, reliability, readability, and originality of ChatGPT-generated information on cleft lip and palate. ChatGPT-4 was found to be a source with high reliability and good quality based on reliability and GQS analyses. FRES results indicated that the readability of the text created by this tool was 'difficult'. Plagiarism checks revealed acceptable levels of similarity [6]. Yurdakurban et al. [18] conducted a similar study in which they assessed the quality, reliability, readability, and originality of information provided by different AI chatbots (Open Evidence, MediSearch, ChatGPT-4) on the subject of orthognathic surgery. All of the assessed chatbots demonstrated high reliability and good quality. The SMOG readability index revealed that the provided information requires a college-level education or higher, and ChatGPT showed the highest originality [18]. These outcomes are consistent with the findings of the current study, which showed high reliability, good quality, and low similarity levels.
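For reference, the readability bands cited in these studies derive from the standard formulas of the two indices; the forms below are the commonly published definitions and are assumed here rather than taken from the cited papers.

\[
\text{FRES} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\]
\[
\text{SMOG grade} = 1.0430\sqrt{\text{polysyllable count} \times \frac{30}{\text{sentence count}}} + 3.1291
\]

On the FRES scale, scores between 30 and 50 are conventionally interpreted as 'difficult' and best understood by readers with college-level education, which aligns with the ratings reported above.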
Buldur and Sezer [19] previously conducted a study in which they directed the frequently asked questions on the use of fluoride in dentistry, as determined by the ADA, to ChatGPT and compared the answers with those of the ADA. The outcomes revealed that the answers provided by ChatGPT were more detailed and scientific, and the authors reported that ChatGPT was reliable and sufficient on the subject of fluoride in dentistry [19]. These outcomes highlight the benefits of ChatGPT and also support the results of the current study.
Rokhshad et al. [20] conducted a study in which they directed 30 true-false questions to pediatric dentists, general dentists, dental students, and different chatbots, including ChatGPT-3.5. Pediatric dentists generated significantly more accurate answers than the other clinicians and the chatbots. The success of the groups in answering the questions, in descending order, was as follows: pediatric dentists, general dentists, chatbots, and students. Among the chatbots, ChatGPT-3.5 showed more acceptable consistency [20].
Ahmed et al. [21] also assessed the quality of dental caries-related multiple-choice questions generated by ChatGPT and Bard. They concluded that these tools could generate questions related to dental caries at the cognitive level of knowledge, with Bard displaying a higher cognitive level and generating more absolute terms on the related subject [21].
In another similar study, Abu Arqub et al. [22] assessed the accuracy of the answers provided by ChatGPT on the subject of orthodontic clear aligners. They reported that 58% of the answers to the queries were objectively true and 15% were false. They concluded that the overall accuracy of the answers was suboptimal and that the software has a limited ability to offer correct and up-to-date information on the searched subject; they also warned about the false claims provided by the tool [22]. Haita et al. [23] analysed clinical scenarios in interceptive orthodontics by directing 21 open-ended questions comprising various clinical cases to ChatGPT. Although the tool showed a good ability to generate answers to difficult clinical cases, they proposed that ChatGPT still cannot be regarded as sophisticated and is not intelligent enough to replace the mind of a human being [23]. Giannakopoulos et al. [24] examined the performance of LLMs in answering clinically relevant questions in different disciplines of dentistry. The outcomes revealed that, although ChatGPT-4 was more successful than ChatGPT-3.5, all chatbots exhibited inaccuracies and outdated content; the answers lacked reference sources, and irrelevant information was also detected [24]. The outcomes of these studies contradict the outcomes of the current study; however, this inconsistency might be related to the variations in the assessment tools used in the current study.
Since the available information on the use of space maintainers, their types, indications, and clinical applications is limited, it was not possible to verify the questions asked. The questions were directed to the assessed chatbots only once, so the consistency of the answers provided by the same tool at different times could not be evaluated. Furthermore, the answers of chatbots other than ChatGPT were not studied within this project. These issues can be listed as the limitations of the current study. Conducting further studies that eliminate these limitations would be a useful step before making any commitment to recruiting AI-based chatbots as advisors on health-related subjects specific to pediatric space maintainers.