Key results
To our knowledge, this is the first work to investigate the capabilities of ChatGPT and GPT-4 on PALS core cases in the hypothetical scenario that laypersons would use the chatbot for support until EMS arrive.
However, our results clearly show that ChatGPT/GPT-4 was not consistent in activating the correct emergency response (hypothesis 1) or in advising correct first aid actions (hypothesis 3). Therefore, the only hypothesis we could confirm was DIAGNOSIS (hypothesis 2), as ChatGPT/GPT-4 mostly provided the correct diagnosis.
Additional analyses showed that, when combining all parameters, the model failed to provide safe support for first aiders in nearly half the cases, especially the shock cases (see Table 2). Despite our high expectations, the use of current versions of ChatGPT/GPT-4 in medical emergencies must be questioned with regard to patient safety.
Recent research on health apps (preceding LLMs) has shown a wide range of quality regarding validity and accuracy, with many apps failing quality standards [7, 31]. Some apps are useful in paediatric emergencies [32], but they would have to be downloaded before or during the incident. Other apps mainly target health care providers, focus on specific aspects of resuscitation [33–35], or function as educational tools [36]. Concerning LLMs, the first preprints report satisfactory accuracy of ChatGPT in answering multidisciplinary medical questions [22, 37] and a capability to pass older BLS, PALS and ACLS exams [18].
Lay rescuers, however, face different barriers in emergencies. Concerning cardiopulmonary arrest, resuscitation attempt rates and competencies are still far below desirable levels [38], and resuscitation attempts may overwhelm bystanders. Furthermore, health literacy among parents is low [39–41], underscoring the demand for education on cardiac arrest prevention as well as on emergency management itself, including improved advice by EMS operators.
Yet even emergency call centre operators, telephone-triage nurses, and general practitioners show limited accuracy in emergencies, below 95% and in part even below 60% sensitivity or specificity [42–44]. Additionally, they may be biased by language barriers [45]. This puts our 95% performance expectation for ChatGPT/GPT-4, which was not met in our study, into perspective. Nonetheless, the LLMs showed better, and thus promising, results compared with human competence in emergency situations, which might be further optimized through future research, development, and training of LLMs.
Limitations
Studying LLMs such as ChatGPT or GPT-4 presents several challenges, partly due to the opacity of these models and the limited knowledge of their specific training data. For example, when using standard scenarios (e.g., based on PALS) to investigate diagnostic and triage capabilities, it is difficult to discern whether the models perform beyond mere “memorization” of correct answers, as they have likely been trained on these specific vignettes. To mitigate this shortcoming, we modified the wording of the standard scenarios [46].
Furthermore, the output generated by LLMs is highly sensitive to the input prompts, and different prompting strategies can significantly impact the models’ abilities and performance. Complex language models have the capacity for in-context learning, e.g., through demonstrations of a few training examples of the relevant task [47]. The chain-of-thought technique, which demonstrates step-by-step reasoning in the prompts [48], has been suggested as a promising strategy for complex medical or health-related questions [49]. In our study, instead of providing ChatGPT/GPT-4 with input-output training examples or chain-of-thought demonstrations, we intentionally used a simple (“zero-shot”) prompting strategy that we believe imitates laypersons’ interaction with the chatbot. Hence, the prompts contained only a case description and the questions “What is the diagnosis?” and “What can I do?”, as sketched below. It remains an area for further research to test different prompting strategies and their impact on diagnostic and triage accuracy.
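To illustrate, the following is a minimal sketch of such a zero-shot prompt, using the OpenAI Python client purely for demonstration; the client call, model identifier, and example case description are assumptions for illustration and do not reproduce the study’s actual setup or vignettes.

```python
# Minimal zero-shot prompting sketch (illustrative assumptions only).
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Hypothetical lay description; the study's vignettes are not reproduced here.
case_description = (
    "My 2-year-old woke up tonight with a harsh, loud cough and noisy breathing."
)

# Zero-shot prompt: only the case description and the two study questions,
# with no in-context examples and no chain-of-thought demonstrations.
prompt = f"{case_description}\nWhat is the diagnosis?\nWhat can I do?"

response = client.chat.completions.create(
    model="gpt-4",  # assumed model identifier
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

A few-shot or chain-of-thought variant would differ only in the prompt: worked input-output examples or step-by-step reasoning demonstrations would be prepended before the case description.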
Our study faces further limitations: First, there is no comparison group of lay persons, first responders, EMS operators, or other medical staff who were asked the same questions. However, this was not the aim of our study, and it leaves opportunities for future work on human–LLM inter-rater reliability, accuracy, and hybrid human/AI cooperation on emergency calls.
Second, selection bias might still be present. As ChatGPT/GPT-4 produces answer structures that differ between entries, six iterations may be insufficient, indicating the need to analyse more iterations and differing prompts.
Third, we used English cases created by B2/C1-level speakers rather than native speakers. However, different language proficiencies (e.g., due to migration) are realistic, so the wording was left as is. Further evaluation of different language proficiencies and barriers, and of the ability of ChatGPT/GPT-4 to deal with them, presents an opportunity for future research [45].
Fourth, some cases included typical phrases for illnesses, e.g., “barking cough” in the croup case, pathognomonic “clues” that lay persons consulting ChatGPT/GPT-4 would probably not mention. Consequently, we shortened the vignettes and eliminated these clues. After this terminology modification, ChatGPT/GPT-4 identified croup in 50% of iterations. For future research, such cues should be reduced further, or lay persons should describe the simulated cases to be evaluated.
Further, we used European standards to determine patient safety and correct EMS activation. As ChatGPT is accessible worldwide through the internet, the expected answers may differ regionally, especially in regions without access to EMS. For example, in case 14 we rated transport to a hospital as incorrect because spine immobilization was missed, but in rural areas without EMS this may be the patient’s only chance of survival. Consequently, LLMs should be evaluated in a manner that is sensitive to the context of local and national medical resources and recommendations.