In this study, we investigated a crucial aspect of AI model utility in the acquisition of health information: testing the hypothesis that a language disparity exists in AI model performance. Specifically, the study was conducted in the context of infectious diseases, which represent a significant global health burden. Such an investigation is timely and relevant, as AI models are increasingly accessed by lay individuals for health information.
The key finding of this study was the overall lower performance of the tested AI models in Arabic compared to English. The overall Arabic performance of the AI models on infectious disease queries could be labeled as “above average,” as opposed to the “excellent” performance in English. Additionally, the differences in performance between the two languages were statistically significant for ChatGPT-3.5 and Bard. Another important observation was the uniformly excellent performance of the four AI models in English. This consistency highlights the effectiveness of these models in English in the context of infectious disease queries. Moreover, a consistent pattern of superior performance in English extended across all five tested infectious disease topics. However, notable variability in Arabic performance was evident, particularly in handling topics related to HIV/AIDS, TB, and COVID-19.
The disparity in AI model performance across languages may be attributed to the varying quality of the AI training datasets [41]. Prior research characterizing such disparities in AI model performance across languages remains limited despite its timeliness and significance [42]. Even fewer studies have compared the AI content generated for the same queries in multiple languages.
Several studies have assessed AI model performance in non-English languages with variable results, despite an overall trend of below-par performance in non-English languages. For example, Taira et al. tested ChatGPT performance on the Japanese National Nursing Examination in Japanese over five consecutive years [43]. Despite approaching the passing threshold in four years and passing the 2019 exam, the results indicated the relative weakness of ChatGPT in Japanese [43]. Nevertheless, attributing this result to language limitations alone is challenging, given the superior performance of ChatGPT-4 in Japanese compared to medical residents on the Japanese General Medicine In-Training Examination, as reported by Watari et al. [44]. That study also exposed ChatGPT-4 limitations in test aspects requiring empathy, professionalism, and contextual understanding [44].
In a study by Guigue et al., ChatGPT limitations in French were evident, with only one-third of questions correctly answered on a French medical school entrance examination, mirroring its performance on an obstetrics and gynecology exam [45]. Additionally, ChatGPT performed worse than students on family medicine questions in Dutch [46]. Conversely, on the Polish Medical Final Examination, ChatGPT demonstrated similar effectiveness in both English and Polish, with marginally higher accuracy in English for ChatGPT-3.5 [47]. In Portuguese, ChatGPT-4 displayed satisfactory results on the 2022 Brazilian National Examination for Medical Degree Revalidation [48].
In the context of the Arabic language, and in line with our findings, Samaan et al. showed less accurate performance of ChatGPT in Arabic compared to English on cirrhosis-related questions [49]. In a non-medical context, Banimelhem and Amayreh showed that ChatGPT performance as an English-to-Arabic machine translation tool was suboptimal [50]. In a comprehensive study, Khondaker et al. revealed that smaller, Arabic-fine-tuned models consistently outperformed ChatGPT, indicating significant room for improvement in multilingual capabilities, particularly in Arabic dialects [51]. In the current study, our results suggested that the pattern of lower performance in Arabic extends to all tested AI models, although the differences did not reach statistical significance for Bing and Bard.
The use of the CLEAR tool in this study was crucial for pinpointing specific areas for improvement in each language. Specifically, the study findings revealed that in both the GPT-3.5 and GPT-4 models, appropriateness in Arabic lagged behind that in English. This highlights key areas for enhancement in Arabic, including the need to resolve ambiguities in the generated content and to organize the content in a more effective style. Additionally, the accuracy issues observed in ChatGPT-3.5 and Bard highlight the need for content verification, particularly for health-related queries, as well as the necessity of acknowledging the potential for inaccuracies in these models (e.g., through clear flagging of potential inaccuracies within the generated responses).
Based on the study findings, we recommend that AI developers, particularly at OpenAI, prioritize cultural and linguistic diversity, especially in health-related content. It is important to acknowledge and address these disparities to ensure equitable and accurate health information dissemination. Further research is needed to confirm whether such language disparities are prevalent in other languages as well, which would reinforce the need for more inclusive and diverse AI training datasets. This study calls for collaborative efforts from AI developers, researchers, and healthcare professionals to enhance the performance of AI models for a broader, more inclusive outreach. As AI continues to be integrated into healthcare information dissemination, addressing these linguistic and cultural disparities is crucial for advancing global health equity.
Finally, it is important to interpret the results of this study in light of several limitations. First, the limited number of queries tested on each model, albeit sufficient to reveal potential disparities, might limit the generalizability of the findings. Second, the assignment of CLEAR scores may vary if assessed by different raters. To mitigate potential measurement bias, this study employed key answers derived from credible sources as an objective benchmark before CLEAR scoring of the AI-generated content. Third, the study did not account for the various Arabic dialects, focusing only on Standard Arabic. Future research could expand on this particular issue in light of previous evidence showing potential variability in dialectal performance [51]. Finally, future studies could benefit from including a broader range of queries, extending beyond infectious disease topics, to achieve a more comprehensive understanding of AI performance in diverse health and linguistic contexts.