We evaluated the ability of DEWS to predict IHCA in patients admitted to general wards in a large multicenter cohort. For all three key questions (predictive performance for IHCA, alarm performance, and timeliness), DEWS was superior to MEWS. In both cohorts, DEWS predicted IHCA within 24 hours of vital sign observation better than MEWS: its AUROCs were 14.0% and 15.2% higher, and its AUPRCs 300% and 240% higher, than those of MEWS, respectively. Alarms are a very sensitive issue for RRS teams because they are ultimately tied to the team’s workload. In the external validation cohort, the alarm rate of DEWS was 44.2% that of MEWS at a cut-off score of 3, 37.0% at a cut-off score of 4, and 48.7% at a cut-off score of 5. In other words, DEWS produced fewer than half as many alarms as MEWS at each of these cut-offs. The third key question was the timeliness of the prediction. At every time point from 24 hours to 30 minutes before the event, DEWS detected more IHCA cases than MEWS, which gives RRSs more time to assess and respond to deteriorating patients. Taken together, better prediction, fewer alarms, and earlier detection indicate that DEWS has the potential to be an effective alternative to conventional early warning systems as a screening tool.
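The alarm-rate comparison can be illustrated with a minimal sketch. It assumes the alarm rate is the fraction of vital sign observations whose score meets or exceeds a cut-off, and that the two scores are compared at cut-offs giving the same sensitivity. The function names, the toy data, and the 0-to-1 scale of the second score are hypothetical and are not taken from the study.

```python
from typing import Sequence


def alarm_rate(scores: Sequence[float], cutoff: float) -> float:
    """Fraction of observations whose score meets or exceeds the cut-off."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s >= cutoff for s in scores) / len(scores)


def sensitivity(scores: Sequence[float], labels: Sequence[int], cutoff: float) -> float:
    """Fraction of event observations (label 1) that trigger an alarm."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("labels must contain at least one event")
    return sum(s >= cutoff for s in positives) / len(positives)


# Hypothetical toy data: 10 observations, 2 of them followed by an event.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
score_a = [1, 2, 3, 3, 4, 1, 5, 3, 4, 5]  # integer track-and-trigger style score
score_b = [0.1, 0.2, 0.1, 0.3, 0.6, 0.1, 0.2, 0.1, 0.8, 0.9]  # model-style score

# Cut-offs chosen (by assumption) so that both scores catch every event.
sens_a = sensitivity(score_a, labels, cutoff=3)
sens_b = sensitivity(score_b, labels, cutoff=0.5)

rate_a = alarm_rate(score_a, cutoff=3)
rate_b = alarm_rate(score_b, cutoff=0.5)

# Alarm rate of score B expressed as a fraction of the alarm rate of score A.
relative_rate = rate_b / rate_a
print(f"sensitivity A={sens_a:.2f}, B={sens_b:.2f}")
print(f"alarm rate A={rate_a:.2f}, B={rate_b:.2f}, B/A={relative_rate:.3f}")
```

In this toy example both scores detect the same events, but the second raises alarms on fewer observations, which is the sense in which a relative alarm rate below 100% translates into a lower workload.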
Various studies have attempted to predict mortality in critically ill patients (i.e., those in ICUs) using machine learning (ML) or DL [30–34]. ICUs, in particular, have extensive databases built from continuous vital sign monitoring and large numbers of diagnostic tests, including laboratory tests, imaging tests, microbiologic reports, medical history panels, patient demographics, ordered fluids, drugs, transfusions, etc. These large databases make the ICU a suitable setting in which to conduct artificial intelligence (AI)-based studies. Most AI-based ICU studies have examined the prediction of mortality or major events (such as hypotension, sepsis, and readmission), and algorithm-based prediction has generally achieved better performance than conventional prognostic systems [35, 36]. Furthermore, a study using reinforcement learning in sepsis patients showed the potential to solve a complex medical problem and suggest individualized and clinically interpretable treatment strategies for sepsis [37].
However, few studies have focused on deteriorating patients admitted to general wards. In 2016, Churpek et al. [38] showed that an ML (i.e., random forest) algorithm (AUROC 0.80, 95% confidence interval (CI) [0.80–0.80]) predicted clinical deterioration more accurately than MEWS (AUROC 0.70, 95% CI [0.70–0.70]) in general ward patients. That study used an ML method to develop the prediction algorithm. Both ML and DL methods analyze data through self-learning to solve a task or problem. ML requires feature engineering, whereas DL does not; rather, DL tries to learn representations of the raw data at multiple levels of abstraction by itself, which is essentially why DL methods can achieve higher accuracy than ML methods [15]. Rajkomar et al. demonstrated the effectiveness of DL models in a wide variety of predictive problems and settings [39]. However, that study did not focus on general ward patients and sudden CA but rather on the entire length of stay, including the general ward and the ICU. The outcomes of interest were inpatient mortality, readmission, length of stay, and discharge diagnoses. Thus, to the best of our knowledge, our study is the first to apply DL to detect deteriorating patients in general wards in a large multicenter cohort.
The strength of DEWS is that it uses a limited number of basic vital signs as predictor variables. In this validation study, DEWS used only five basic vital signs: SBP, DBP, HR, RR and BT. The two previous AI-based studies [38, 39] in general ward patients used a variety of predictor variables, including demographics, vital signs, laboratory values, etc. Prediction models with more variables may have better predictive performance, but models with many variables face significant limitations in scalability and applicability. The predictor variables used in DEWS are basic, essential vital signs that are almost always checked in admitted patients and therefore rarely have missing data. Consequently, DEWS can be applied worldwide without any difficulties in technical implementation. Additionally, a DL-based algorithm enables each institution to take a tailored approach by adding one or two main variables depending on the specific features of the hospital [40].
Five hospitals in South Korea participated in this validation study. The hospitals differ considerably in location, size, patient population and operating policies. The two hospitals involved in the internal validation have approximately 300 beds each; one is a cardiovascular-specific hospital, and the other is a community general hospital. The hospitals in the external validation have more than 900 to 1000 beds, and all three are tertiary teaching hospitals, each affiliated with a different medical university. Since the original DL model was developed and trained on data from the two hospitals with 300 beds, the results of the external validation cohort are important in terms of generalization. DEWS achieved higher performance in the external validation cohort (AUROC 0.905, 95% CI [0.901–0.910]) than in the internal validation cohort (AUROC 0.860, 95% CI [0.832–0.888]), which suggests that DEWS is robust across multiple hospitals.
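For readers who want to reproduce this kind of cohort-level summary, the following is a minimal sketch of an AUROC with a 95% CI. The AUROC is computed as the probability that a randomly chosen event observation scores higher than a randomly chosen non-event observation, with ties counted as one half. The percentile bootstrap used for the interval is an assumption made for illustration; the text above does not state how the reported CIs were obtained. All names and data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as P(score of an event > score of a non-event), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    labels: Sequence[int],
    scores: Sequence[float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the AUROC (an assumed method, see lead-in)."""
    auroc(labels, scores)  # validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples that happen to contain only one class.
        if 0 < sum(ys) < n:
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical toy cohort: 8 observations, 3 events.
toy_labels = [0, 0, 0, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]

point = auroc(toy_labels, toy_scores)
ci_low, ci_high = bootstrap_ci(toy_labels, toy_scores, n_boot=500, seed=42)
print(f"AUROC {point:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

The pairwise loop is quadratic in the number of observations, so for cohorts of realistic size a rank-based implementation from an established statistics library would be the practical choice. The interval also narrows as the cohort grows, which is consistent with the tighter CI reported for the larger external cohort.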
Our study has several limitations. First, DL is known as a “black-box” method, as it tries to find the relationship between the training data and the labels rather than creating rules using domain knowledge. Although most DEWS alarms can be interpreted in clinical practice through patient review, there may be some cases in which RRS staff do not know the exact reason for the alarm. In such cases, RRS staff would need a certain amount of time to react. However, in this study, DEWS reduced the number of false alarms and increased the sensitivity at the same time. Thus, the rapid response team can spare enough time to verify the alarms, and the staff can infer the reason intuitively. Second, we considered only the first CA for each patient admission, although second and third CAs are also important. Nonetheless, the first CA is the highest priority because the rapid response team focuses on patients after a CA. Last, this study was performed retrospectively. To apply DEWS in clinical practice as an alternative to other triggering score systems in RRSs, a well-designed prospective clinical trial is necessary.