In summary, we found that the patient images created by the Bing Image Generator and Meta Imagine often did not accurately represent disease-specific demographic characteristics. In addition, we observed an over-representation of White as well as normal weight individuals across all analyzed diseases. Female individuals as well as Asian, BAA, HL, NHPI, and AIAN individuals were depicted as having higher weight than male and White individuals, respectively. Such inaccuracies raise concerns about the role of AI in amplifying misconceptions in healthcare14, given the technology's large number of users and use cases3–9.
We found that images generated by both Bing and Meta displayed a broad range of demographic inaccuracies. This was most striking for Meta's depictions of patients with prostate cancer, hemophilia B, premenstrual syndrome, and eclampsia, for which both female and male individuals were shown. Images generated by Bing likewise displayed substantial inaccuracies, especially regarding race/ethnicity, which was accurately depicted for only two diseases.
Presumably, these inaccuracies stem from the composition of the training data of the generative AI models. Such models are typically trained on large non-medical datasets consisting of publicly available images from the internet, databases such as ImageNet and Common Objects in Context, and other sources6. Datasets of this size are necessary to produce photorealistic images. However, because they do not contain large numbers of images of actual patients, information on disease-specific demographic characteristics as well as important risk factors is missing. Thus, the models' ability to generate accurate images of these patients and their diseases is limited.
Another factor influencing the quality of the output is the use of bias mitigation strategies in the code of the algorithms, which, for example, aim to counteract known biases in the training data. As shown previously, such strategies can result in an over-correction of biases25. It can therefore be speculated that Meta's depictions of both female and male patients in images of sex-specific diseases were influenced by such code-based adaptations. In fact, there was no over-representation of either sex in the images of the two text-to-image generators. This may be interpreted as a positive sign, as sex/gender biases have been a common phenomenon in generative AI algorithms12,14. On the other hand, achieving accurate demographic representation through bias mitigation strategies alone appears challenging, and representative training data may be necessary.
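To illustrate how such a strategy could over-correct, consider the following minimal sketch of a prompt-augmentation approach. It is purely hypothetical: the mitigation code of Bing and Meta is proprietary, and the function and descriptor list below are our own illustrative assumptions, not their actual implementation.

```python
import random

# Hypothetical sex descriptors injected to diversify outputs; real
# systems' mitigation code is proprietary and may differ entirely.
SEX_TERMS = ["female", "male"]

def augment_prompt(disease: str) -> str:
    """Naively prepend a randomly chosen sex descriptor to every prompt.

    Because the descriptor is sampled uniformly regardless of the
    disease named in the prompt, roughly half of all images for a
    sex-specific disease (e.g., prostate cancer) would depict the
    wrong sex, an over-correction of the kind speculated above.
    """
    sex = random.choice(SEX_TERMS)
    return f"photo of a {sex} patient with {disease}"

# Example: about 50% of these prompts request a female patient with
# prostate cancer.
for _ in range(4):
    print(augment_prompt("prostate cancer"))
```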
We also found examples of insufficient representation in our data. We detected a bias toward White individuals in both generators. Interestingly, this bias was stronger in Bing than in Meta, which may be another sign of stricter bias mitigation in Meta. A similar over-representation of White individuals has previously been reported in a study on AI-generated images of healthcare professionals14. In addition, we detected a bias toward normal weight in both generators: individuals with overweight in particular were under-represented, which may be caused by a similar under-representation in the training data. However, these results need to be interpreted cautiously, as the images did not depict the entire body.
In both text-to-image generators, the depicted females were younger than the males, which may represent a bias of the two algorithms, potentially veering towards gender stereotypes. However, the real-world epidemiology is complex. While females generally have a higher life expectancy than males26, some studies suggest an earlier onset in females for some of the diseases included in our analyses, e.g., depression27 or COVID-1928, whereas others suggest an earlier symptom onset in males for other diseases, e.g., schizophrenia29 or type 2 diabetes30, or no known sex differences, e.g., in multiple sclerosis31 or malaria32. Additional research addresses sex differences in the age at diagnosis as opposed to the age at disease onset29. Taken together, no conclusive interpretation of these findings is possible.
In addition, females were depicted as having higher weight than males. In the general population, females tend to have a slightly higher BMI33 and body-fat percentage34 than males; whether this is also the case for the diseases included in this article has yet to be investigated. Furthermore, Asian, BAA, HL, NHPI, and AIAN individuals combined were depicted as having higher weight than White individuals, which is inaccurate despite global shifts in the distribution of under- and overweight33,35. Thus, our data revealed not only an under-representation of Asian, BAA, HL, NHPI, and AIAN individuals combined but also a tendency to portray them as having higher weight than White individuals.
AI-generated images are not yet ready to be used in the medical context without caution. Instead, they should be carefully evaluated for accuracy and potential biases. Such biases include but are not limited to the over-representation of White individuals, male sex (although not observed in our sample), and normal weight. Importantly, inaccuracies and biases can be corrected manually, e.g., by selecting specific images so that the final sample has an appropriate distribution of the most important demographic characteristics. However, this requires careful research on these characteristics in the real world. Given our findings, scientific and non-scientific publications should disclose whether patient images are real or AI-generated.
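A minimal sketch of such a manual correction step is shown below. The target shares, group labels, and helper function are illustrative assumptions rather than values from our study; in practice, the target distribution would have to come from disease-specific epidemiological data, and the group labels from human rating of the generated images.

```python
import random
from collections import Counter

# Hypothetical target race/ethnicity shares for one disease; these
# placeholder values do not come from our study or any real source.
TARGET_SHARES = {"White": 0.60, "BAA": 0.20, "HL": 0.15, "Asian": 0.05}

def select_images(candidates, n):
    """Draw n images whose rated group matches TARGET_SHARES.

    `candidates` is a list of (image_id, rated_group) tuples obtained
    by human rating of a larger pool of generated images. Rounding of
    the per-group quotas may make the result slightly smaller than n.
    """
    quota = {group: round(share * n) for group, share in TARGET_SHARES.items()}
    pool = list(candidates)
    random.shuffle(pool)
    selected = []
    for image_id, group in pool:
        if quota.get(group, 0) > 0:
            selected.append(image_id)
            quota[group] -= 1
    return selected

# Toy usage: rate a pool of 200 generated images, then select 20 so
# that the final sample approximates the target distribution.
pool = [(f"img_{i}", random.choice(list(TARGET_SHARES))) for i in range(200)]
chosen = set(select_images(pool, 20))
print(Counter(group for image_id, group in pool if image_id in chosen))
```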
There are limitations to our study. Firstly, the rating process is inherently limited, as demographic characteristics can only be approximated from images. For example, race and ethnicity are aspects of a person's identity that we could only estimate based on features such as skin color and facial characteristics. Moreover, all raters identified as White, which may have biased the ratings despite the measures taken to ensure their accuracy. It is also difficult to estimate the weight category from pictures of faces alone. Furthermore, biological sex can only be determined definitively by chromosomal analysis, and our ratings do not reflect gender identity. Secondly, our comparisons to real-world epidemiological data are limited by the availability and quality of those data. Thirdly, the field of generative AI is rapidly evolving; our findings are therefore only a snapshot of the features and capabilities of these algorithms in February 2024. Fourthly, there are differences between the outputs of the Bing and Meta algorithms, so our findings may not extend to other text-to-image generators.
Taken together, images of patients generated by two commonly used text-to-image generators did not accurately display fundamental demographic characteristics such as sex, age, and race/ethnicity. In addition, we observed an over-representation of White as well as normal weight individuals. Consequently, the use of AI-generated patient images requires caution, and future models should focus on ensuring adequate demographic representation of patient groups across the world.