In this study, we investigated a crucial aspect of AI model utility in the acquisition of health information: testing the hypothesis that a language disparity exists in AI model performance. Specifically, the study was conducted in the context of infectious diseases, which represent a significant global health burden. Such an investigation is timely and relevant, as AI models are increasingly accessed by lay individuals for health information.
The key finding of this study was the overall lower performance of the tested AI models in Arabic compared to English. The overall Arabic performance of the AI models on infectious disease queries could be labeled as “above average,” as opposed to the “excellent” performance in English. Additionally, the differences in performance between the two languages were statistically significant for ChatGPT-3.5 and Bard. Another important observation was the uniformly excellent performance of the four AI models in English. This consistency highlights the effectiveness of these models in English in the context of infectious disease queries. Moreover, a consistent pattern of superior performance in English extended across all five tested infectious disease topics. However, notable variability in Arabic performance was evident, particularly in handling topics related to HIV/AIDS, TB, and COVID-19.
The disparity in AI model performance across languages may be attributed to the varying quality of the AI training datasets [41]. Prior research characterizing such disparities in AI model performance across languages remains limited despite its timeliness and significance [42]. Even fewer studies have compared the AI content generated for the same queries in multiple languages.
Several studies have assessed AI model performance in non-English languages with variable results, despite an overall trend of below-par performance in non-English languages. For example, Taira et al. tested ChatGPT performance on the Japanese National Nursing Examination in Japanese over five consecutive years [43]. Despite approaching the passing threshold in four years and passing the 2019 exam, the results indicated the relative weakness of ChatGPT in Japanese [43]. Nevertheless, attributing this result to language limitations alone is challenging, given the superior performance of ChatGPT-4 in Japanese compared to medical residents on the Japanese General Medicine In-Training Examination, as reported by Watari et al. [44]. That study also exposed ChatGPT-4 limitations in test aspects requiring empathy, professionalism, and contextual understanding [44].
In a study by Guigue et al., ChatGPT limitations in French were evident, with only one-third of questions correctly answered on a French medical school entrance examination, mirroring its performance on an obstetrics and gynecology exam [45]. Additionally, ChatGPT performed worse than students on family medicine questions in Dutch [46]. Conversely, on the Polish Medical Final Examination, ChatGPT demonstrated similar effectiveness in both English and Polish, with marginally higher accuracy in English for ChatGPT-3.5 [47]. In Portuguese, ChatGPT-4 displayed satisfactory results on the 2022 Brazilian National Examination for Medical Degree Revalidation [48].
In the context of the Arabic language, and in line with our findings, Samaan et al. showed less accurate performance of ChatGPT in Arabic compared to English on cirrhosis-related questions [49]. In a non-medical context, Banimelhem and Amayreh showed that ChatGPT performance as an English-to-Arabic machine translation tool was suboptimal [50]. In a comprehensive study, Khondaker et al. revealed that smaller, Arabic-fine-tuned models consistently outperformed ChatGPT, indicating significant room for improvement in multilingual capabilities, particularly in Arabic dialects [51]. In the current study, our results suggested that the pattern of lower performance in Arabic extends to all tested AI models, although the differences did not reach statistical significance for Bing and Bard.
The use of the CLEAR tool in this study was crucial for pinpointing specific areas for improvement in each language. Specifically, the study findings revealed that in both the GPT-3.5 and GPT-4 models, appropriateness in Arabic lagged behind that in English. This highlights key areas for enhancement in Arabic, including the need to resolve ambiguities in the generated content and to organize the content in a more effective style. Additionally, the accuracy issues observed in ChatGPT-3.5 and Bard highlight the need for content verification, particularly for health-related queries, as well as the necessity of acknowledging the potential for inaccuracies in these models (e.g., through clear flagging of potential inaccuracies within the generated responses).
Based on the study findings, we recommend that AI developers, particularly at OpenAI, prioritize cultural and linguistic diversity, especially in health-related content. It is important to acknowledge and address these disparities to ensure equitable and accurate health information dissemination. Further research is needed to confirm whether such language disparities are prevalent in other languages as well, which would reinforce the need for more inclusive and diverse AI training datasets. This study calls for collaborative efforts from AI developers, researchers, and healthcare professionals to enhance the performance of AI models for a broader, more inclusive outreach. As AI continues to be integrated into healthcare information dissemination, addressing these linguistic and cultural disparities is crucial for advancing global health equity.
Finally, it is important to interpret the results of this study in light of several limitations. First, the limited number of queries tested on each model, albeit sufficient to reveal potential disparities, might limit the generalizability of the findings. Second, the assignment of CLEAR scores may vary if assessed by different raters. To mitigate potential measurement bias, this study employed key answers derived from credible sources as an objective benchmark before CLEAR scoring of the AI-generated content. Third, the study did not account for the various Arabic dialects, focusing only on Standard Arabic. Future research could expand on this particular issue in light of previous evidence showing potential variability in dialectal performance [51]. Finally, future studies could benefit from including a broader range of queries, extending beyond infectious disease topics, to achieve a more comprehensive understanding of AI performance in diverse health and linguistic contexts.