This study systematically reviewed the characteristics of prospective studies that applied AI to emergency department triage systems and synthesized their effects.
Six types of triage scales were applied in the studies included in this review; among them, the ESI was the most common, used in three studies [A4, A5, A6]. The ESI is a five-level classification scale, ranging from level 1 (requiring immediate resuscitation) to level 5 (least urgent), that was developed in the United States in 1999 and is used in an increasing number of countries [14]. Except for one study that classified the ETS into four levels [A8], most studies used a 5-level triage scale. This likely reflects previous studies showing that 5-level triage has higher reliability and validity than the 3-level triage used in the past [45, 46], and the fact that the triage scales developed and used in each country accordingly follow a 5-level structure.
Most of the included studies used the full 5-level scale; however, two studies trained a predictive model on a specific target level, which has the advantage of classifying patients at that level accurately and independently [A3, A4]. In one of these studies, patients at levels 1 and 2, considered severe and life-threatening, were excluded because AI-based intervention was judged unsafe and unfeasible [A3]. This highlights a limitation of currently developed AI-based triage. Supporting this, two of the eight studies included in this review reported model performance or classification prediction for each triage level. In the study by Karlafti et al. [A5], levels 1 and 2 were the factors that lowered model performance, and in the study by Farahmand et al. [A4], level 2 showed the lowest classification prediction. When triaging emergency patients, first-impression risk assessment, in which the overall patient condition is grasped within the first 3–5 seconds, is very important [47]. Currently developed AI-based triage identifies and classifies patients using only recorded data, whereas nurses can assess first-impression risk through face-to-face observation. Because much information can be obtained only by observing a patient's appearance, patients at levels 1 and 2, who have severe and/or life-threatening conditions, remain a limitation of AI-based triage.
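To make such level-specific weaknesses visible, performance should be reported per triage level rather than as a single overall accuracy. The following is a minimal sketch (hypothetical labels and predictions, not data from the reviewed studies) of how per-level recall and an undertriage/overtriage breakdown can be computed with scikit-learn:

```python
# Minimal sketch with hypothetical labels: per-level recall and an
# undertriage/overtriage breakdown instead of a single overall accuracy.
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical gold-standard triage levels (1-5) and model predictions
y_true = [1, 2, 3, 3, 4, 5, 2, 3, 4, 5, 1, 3]
y_pred = [2, 3, 3, 3, 4, 5, 2, 3, 4, 4, 2, 3]

# Recall per level shows how reliably each acuity level is recognised
print(classification_report(y_true, y_pred, labels=[1, 2, 3, 4, 5], zero_division=0))

# Rows are true levels, columns are predicted levels: entries above the diagonal
# are undertriage (a less urgent level assigned), entries below are overtriage.
print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5]))
```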
According to the 2021 South Korean Emergency Medical Statistical Yearbook, only 7.1% of patients visiting the emergency department had high-severity KTAS levels 1 and 2, whereas 92.9% had KTAS levels 3–5, which has become a main cause of emergency department overcrowding [6]. In particular, in the current triage system, classification accuracy is reduced by the ambiguity of middle levels such as level 3 [48], and some patients classified as non-emergency face problems of undertriage and overtriage [49, 50]. In addition, previous studies have shown that a triage system using an automated algorithm predicts hospitalization for ESI level 3–5 patients better than nurses do, suggesting that AI-based triage targeting level 3–5 patients is needed to increase classification accuracy [51]. AI-based triage targeting a specific level is expected to make triage nurses' decision-making more accurate and faster, thereby reducing the LOS in the emergency department and improving patient flow.
The term "AI delivery method" generally refers to an approach for spreading and providing services to users. This delivery method is set by developers to implement AI effectively rather than simply spreading algorithmic models when designing the best possible medical delivery system for a given problem in the medical field [52]. The AI delivery methods used in our study can be divided into standalone applications and cloud-based networks. Stand-alone applications have the advantage of better data privacy than cloud-based networks, as all data remain on the local device without an Internet connection and respond faster owing to the lack of network latency. However, maintenance and updating must be performed for each device, so scalability is poor and there is a risk of inconsistency. Therefore, only one study was used in this systematic review. Cloud-based networks are scalable, easy to maintain and update, and do not require much storage space on the device used; therefore, they were used in seven out of the eight studies. However, a stable Internet connection is essential, and when unstable, delays may occur. Because of the nature of cloud-based services, privacy-related problems may occur compared to standalone applications [53, 54]. Therefore, the delivery method suitable for effectively implementing AI-based programs in the future may vary depending on the available resources and user requirements.
In our meta-analysis, seven studies applied machine learning and one applied fuzzy logic. Machine learning has the advantage of being easy to implement because it is designed as a series of algorithms from which computers continuously learn [27]; however, in this systematic review, model performance varied from 60% to 99%. In contrast, the one study that predicted classification using fuzzy logic reported that the fuzzy CLIP model correctly classified 615 of 616 cases (99%), compared with 72–95% for WEKA, a machine learning toolkit [A6]. A previous study that applied fuzzy logic to an ESI classification system using retrospective data also showed 99% accuracy. Compared with machine learning methods, fuzzy logic facilitates the processing of ambiguous variables, thereby increasing classification accuracy and predictive power; unlike machine learning, which requires repeated training, it classified quickly and could support emergency department triage nurses' decision-making [55]. However, it would be premature to conclude that fuzzy logic is superior simply because it showed higher accuracy than machine learning in this review. Notably, costs and benefits may vary depending on the sample used, the available resources, and the available expertise [56]. Based on these results, the two AI approaches may be used in a complementary manner to solve complex problems efficiently.
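The sketch below is a deliberately simplified illustration of the fuzzy logic idea (illustrative cut-offs only, not the fuzzy CLIP system used in [A6]): borderline vital signs receive graded memberships instead of hard thresholds, and simple rules combine them without any training step:

```python
# Simplified fuzzy-rule illustration with made-up cut-offs; no training is needed.
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_urgency(heart_rate, systolic_bp):
    # Graded degrees of membership rather than hard thresholds
    hr_high = tri(heart_rate, 100, 140, 180)   # tachycardia
    bp_low = tri(systolic_bp, 60, 80, 100)     # hypotension
    # Rule combination: the patient is urgent if either condition is strongly present
    urgency = max(hr_high, bp_low)
    return "level 1-2 (urgent)" if urgency > 0.6 else "level 3-5 (less urgent)"

print(fuzzy_urgency(heart_rate=135, systolic_bp=85))  # borderline values are still graded
```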
Two studies in this systematic review combined multiple machine learning models. When machine learning was first introduced, research focused on verifying the performance of single models; recently, various algorithms have been combined and trained repeatedly, and better-performing models have been constructed and applied in clinical environments [57]. For example, the trauma hybrid-suite entry algorithm (THETA), developed to classify patients with severe trauma, combines six algorithms (Bayesian ridge regression, linear regression, multilayer perceptron, clustering, support vector machine, and XGBoost) into a single model; it showed 2–3 times higher performance than existing algorithms, making its predictions more robust [58]. In this review, the ensemble model that combined the predictions of the first-generation models performed approximately 3% better than the individual first-generation models [A4]. However, one study was unable to confirm model performance and classification prediction accuracy [A2]. That study collected categorical information, such as chief complaints and medical history, using natural language processing based on voice recognition. Unlike continuous variables, such categorical information can be expressed in many different Korean words, which makes it difficult to process with natural language processing [59] and may have limited the performance and classification prediction that could be evaluated quantitatively.
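A minimal sketch of this ensemble idea is shown below, using synthetic data in place of real triage features (an illustration of stacking in general, not the specific model of [A4]): several first-generation learners are trained, and a second-stage model combines their predictions:

```python
# Stacking illustration with synthetic data in place of real triage features.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for triage features (vital signs, age, complaint category)
# and 5-level labels
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=5, random_state=0)

base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("mlp", MLPClassifier(max_iter=1000, random_state=0)),
]
ensemble = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),  # combines the base predictions
    cv=5,
)

print(cross_val_score(ensemble, X, y, cv=5, scoring="accuracy").mean())
```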
In this review, classification accuracy differed between studies using the same type of machine learning model, and it also differed between model types applied to the same sample. The two studies using deep learning showed a large gap in classification prediction, 33.9% [A3] versus 84.6% [A5], reflecting the different algorithms used, such as a Bayesian network and a feedforward neural network. In the Bayesian network model, 33.9% of classifications exactly matched the gold standard level and 57.1% were overtriaged, so 91% of patients were classified at the gold standard level or more conservatively. In addition, the study using a decision tree reported a very high classification accuracy of 99.9% [A1], whereas the first-generation model using only a decision tree within the ensemble study reached approximately 70%. Thus, accuracy differed even for the same decision tree model, and for the same sample it differed depending on the type of machine learning model used [A4]. Therefore, as in all intervention studies, the same program can yield different results in different samples, and even with the same sample, different machine learning methods can show different levels of accuracy [60].
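The following sketch (again with synthetic data standing in for a triage sample) illustrates this point: with an identical sample, different model families can reach quite different accuracies, so reported figures are not directly comparable across studies without knowing both the model type and the data:

```python
# Same synthetic sample, several model families: the achievable accuracy differs.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=5, random_state=1)

models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "naive Bayes": GaussianNB(),
    "feedforward NN": MLPClassifier(max_iter=1000, random_state=1),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")  # identical data, different accuracy per model family
```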
Thus, no AI model in this review showed perfect performance or prediction accuracy. For technical refinement, continuously monitored data should be fed back into the algorithms to optimize them, and follow-up studies should be conducted to achieve higher performance and better classification prediction. In addition, because triage is directly linked to patient safety, it is important to develop new models for accurate prediction and to identify the optimal model through continuous comparison with previous studies before applying it to actual patients.
Most of the studies in this systematic review focused on classification accuracy and prediction, but some reported other noteworthy outcomes, such as reductions in classification time, comparisons of emergency classifications, risk factors for classification errors, treatment, and prognosis. Cho et al. [A2] created a real-time medical record input assistance system by applying natural language processing and a speech-to-text algorithm to convert raw voice data generated in the clinical field into text. The researchers compared the time required to triage emergency patients using the AI system, which the nurse operated through a Bluetooth microphone during triage, with the time required using only the existing handwritten input method. The classification time using the AI system was 27 s shorter than the manual classification time. Although the record completion rate was 81.84% and the record reproduction accuracy was only 50% in the study by Cho et al. [A2], if the completeness of the AI model improves, it could contribute to clinical use and to reducing classification time as a classification aid for nurses. A study by Cotte et al. [A3] examined the effectiveness of AI-based triage by focusing on undertriage and overtriage and the potentially threatening situations that result from them. In addition, that study did not rely only on comparison with nurses' manual classification; it also included a retrospective evaluation of the classification results by medical staff, which helped offset the subjective judgment involved in using manual classification as the gold standard. Furthermore, by examining the variables to be considered in AI-based triage [A8] and those to be considered when evaluating predictive effects [A7], continuous efforts should be made to address the limitations of the programs developed thus far.
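As a purely illustrative sketch of the record input assistance concept (keyword matching on an already transcribed utterance, not the actual system of Cho et al. [A2]), even simple rules can pre-fill structured triage fields that a nurse would otherwise type by hand:

```python
# Illustrative keyword matching on an already transcribed utterance; the keyword
# list and field names are invented for this example.
import re

KEYWORDS = {
    "chest pain": "chief_complaint",
    "shortness of breath": "chief_complaint",
    "diabetes": "medical_history",
    "hypertension": "medical_history",
}

def extract_fields(transcript: str) -> dict:
    """Map recognised keywords in the transcript to structured triage record fields."""
    record = {"chief_complaint": [], "medical_history": []}
    for phrase, field in KEYWORDS.items():
        if re.search(rf"\b{re.escape(phrase)}\b", transcript.lower()):
            record[field].append(phrase)
    return record

print(extract_fields("65-year-old with chest pain, known hypertension and diabetes"))
```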
The quality appraisal of the eight studies included in this systematic review revealed that statistical methods and study sizes were somewhat insufficiently described. In three studies, the description of the statistical analysis did not mention missing value treatment, confounding factor control, or sensitivity analysis. Although all studies reported the number of samples, four did not fully describe the basis for calculating the study size. Depending on how they are handled, missing values and confounding factors threaten internal validity, which can introduce errors or biases and affect the research results [61, 62]. In the future, when designing a study to verify the effect of AI-based triage, an analysis method that accounts for or controls missing values and confounding factors should be considered and reported as an important part of observational studies to present correct results. In addition, when planning a study, the basis for sampling, grounded in statistical reasoning and similar previous studies, should be specified in detail. STROBE was first drafted in 2004 and its fourth edition was revised in 2007; it is currently used as a reporting guideline for observational studies [43]. However, unlike other quality appraisal tools, it has not been revised since. Because some of its detailed items are not suitable for evaluation or set somewhat strict standards, the guideline needs to be revised and supplemented through adjustment and discussion for proper application.
This study is meaningful in that it compared the effects of various AI models in prospective studies, providing baseline data that can support nurses' triage work in emergency departments. AI cannot replace nurses, because nurses' interactions with patients may contain important information that is difficult to enter into a system [51]. However, identifying the predictive performance of the various models by model type can serve as a supportive means of clinical decision-making for triage in emergency departments where multiple emergencies occur simultaneously. In addition, more rigorous interdisciplinary studies should examine the effectiveness of these models in improving patient health and other outcomes before they are applied in real-world clinical settings. For such interdisciplinary research, an environment in which nurses can develop, use, and evaluate AI models is necessary. Support and intervention at the public level, such as from local governments, should come first so that AI-related subjects can be integrated into the nursing curriculum and nursing students have opportunities to learn new knowledge and skills. If the optimal algorithm is verified and the technology is standardized, it will become a powerful tool to support triage nurses' decision-making in overcrowded emergency rooms. Furthermore, accurate and rapid triage can reduce undertriage and overtriage and improve emergency room flow. Ultimately, we hope that this work will become a resource that positively affects patients' treatment outcomes.
This study had several limitations. First, the predictive accuracy of the AI models was measured against the judgment of nurses and doctors as the gold standard. When evaluating the performance of an AI model, comparison with a gold standard is essential. Using the judgments of nurses and doctors as the gold standard raises concerns about rater bias, because the subjective opinion of the rater affects the classification result [22–24]. In this review, only one study addressed this vulnerability of the gold standard by adding the results of specialists; all others compared against the judgments of nurses and doctors, which may vary between practitioners. Although the comparators varied (trained nurses, emergency department nurses, triage nurses, professional triage nurses, and doctors with specialized training and extensive experience), inter-rater reliability was measured in only one study. To reduce the subjectivity of the raters and strengthen the gold standard, future studies should measure inter-rater reliability, and a certain level of qualification, such as the basic education and experience required of a classifier, should be applied. In addition, follow-up studies should secure the reliability of the gold standard by using clearly defined empirical criteria that are not open to interpretation or subjectivity, such as length of stay in the emergency department, intensive care unit admission rate, and in-hospital mortality [63]. Second, the selected studies were all conducted with a prospective observational design, and the results cannot be generalized to all emergency department environments because of the limited number of analyzed studies and the varied national settings in which they were conducted. Therefore, high-quality experimental studies applied to clinical practice should be conducted in the future through verification of various models, identification of factors that can increase classification accuracy, and comparative analysis with the prior literature.
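The inter-rater reliability check called for above can be quantified, for example, with a weighted Cohen's kappa; the sketch below uses hypothetical ratings from two raters on a 5-level scale:

```python
# Hypothetical ratings from two raters on the same patients; a weighted kappa
# penalises two-level disagreements more than one-level disagreements, which
# matters for ordinal scales such as 5-level triage.
from sklearn.metrics import cohen_kappa_score

nurse_levels  = [1, 3, 3, 2, 4, 5, 3, 2, 4, 3]   # triage nurse's assigned levels
doctor_levels = [1, 3, 2, 2, 4, 5, 3, 3, 4, 3]   # physician's levels for the same patients

kappa = cohen_kappa_score(nurse_levels, doctor_levels, weights="quadratic")
print(f"weighted kappa: {kappa:.2f}")
```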