Three databases were available: MIMIC-III (46,428 ICU patients and 61,051 ICU admissions), MIMIC-IV (50,048 ICU patients and 69,619 ICU admissions), and the eICU Collaborative Research Database (177,863 ICU patients and 626,858 ICU admissions). A total of 621,189 sequential delirium assessment records were available. The final cohort comprised 78,365 patients, of whom 22,159 (28.28%) had positive delirium records. The demographic characteristics and pertinent outcomes of the cohort are described in supplemental file 4. Briefly, patients with delirium differed significantly from patients without delirium in terms of age, ethnicity, admission type, ICU type, vital signs, laboratory tests, hospital characteristics, and time to delirium assessment.
Approximately 5% of patients had no missing candidate variables. Overall, approximately 20.0% of values in the data were missing. We eliminated variables for which less than 40% of the real data were available, including bands (9.1%), bicarbonate (26.0%), base deficit (23.7%), FiO2 (19.1%), and sedation score (15.0%). The amounts of missing data are detailed in supplemental file 3. Among the remaining covariates, the proportion of missing values was approximately 20.0%.
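The exclusion rule described above can be illustrated with a minimal sketch. It assumes missing values are represented as `None`; the function names and the toy table are hypothetical, and only the 40% availability threshold comes from the text.

```python
from typing import Dict, List, Optional, Tuple

# Threshold stated in the text: drop variables with < 40% of values available.
MIN_AVAILABLE = 0.40

Column = List[Optional[float]]


def availability(column: Column) -> float:
    """Return the fraction of non-missing (non-None) entries in a column."""
    if not column:
        return 0.0
    return sum(v is not None for v in column) / len(column)


def drop_sparse_variables(
    table: Dict[str, Column], min_available: float = MIN_AVAILABLE
) -> Tuple[Dict[str, Column], Dict[str, float]]:
    """Split a table into kept columns and dropped columns (with their availability)."""
    kept: Dict[str, Column] = {}
    dropped: Dict[str, float] = {}
    for name, column in table.items():
        frac = availability(column)
        if frac < min_available:
            dropped[name] = frac
        else:
            kept[name] = column
    return kept, dropped


# Hypothetical toy data, not taken from the study.
toy = {
    "hemoglobin": [12.1, None, 10.4, 9.8, 11.0],  # 80% available -> kept
    "bands": [None, None, None, None, 3.0],       # 20% available -> dropped
}
kept, dropped = drop_sparse_variables(toy)
print(sorted(kept), dropped)
```

Returning the availability of each dropped variable makes it straightforward to report the excluded variables alongside their percentages, as the paragraph above does.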
Predictor selection
A correlation analysis was performed using the statistically significant factors identified above. In the heat map of the correlation coefficient matrix, the ten features most negatively correlated with the outcome were the verbal response score, pain score, albumin, eye opening score, motor response score, hemoglobin, hematocrit, Glasgow Coma Scale score, region, and platelets; the ten features most positively correlated with the outcome were the APACHE-III score, mechanical ventilation, length of ICU stay, benzodiazepines, opioid analgesics, respiratory failure, alpha2-adrenergic receptor agonists, loop diuretics, blood urea nitrogen, and teaching (supplemental file 5). In addition, strong correlations were found between many features. For example, the correlation coefficient between alanine transaminase (ALT) and aspartate aminotransferase (AST) reached 0.85; therefore, it was necessary to reduce redundant features.
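As an illustration of how highly correlated feature pairs such as ALT and AST might be flagged, the following sketch computes Pearson correlation coefficients and lists pairs above a cutoff. The cutoff of 0.8, the function names, and the toy values are assumptions made for the example; the text does not state which threshold or software was used.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def redundant_pairs(
    features: Dict[str, List[float]], threshold: float = 0.8  # assumed cutoff
) -> List[Tuple[str, str, float]]:
    """Return feature pairs whose absolute correlation meets the threshold."""
    flagged = []
    for a, b in combinations(sorted(features), 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical toy values, not taken from the study.
toy = {
    "ALT": [20.0, 35.0, 50.0, 80.0, 120.0],
    "AST": [25.0, 30.0, 60.0, 85.0, 110.0],
    "age": [70.0, 45.0, 62.0, 58.0, 49.0],
}
pairs = redundant_pairs(toy)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

In this toy table only the ALT and AST pair is flagged, which mirrors the kind of redundancy that motivates the feature reduction step.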
The Kaiser-Meyer-Olkin test gave a value of 0.7, and Bartlett's test of sphericity showed a significance level of P<0.001, indicating that the data were suitable for factor analysis. Factor analysis, together with inspection of the scree plot of eigenvalues and the loading matrix, revealed that the seventeen principal components and eight factors were the most predictive (supplemental files 6-7); for example, the correlation between hemoglobin and the second main factor reached 0.84. Considering the accuracy and practicability of the CV models, clinical experience and actual comparisons were combined to select seventeen features representing the eight principal component factors, namely, age, APACHE-III score, diastolic blood pressure (DBP), hemoglobin, urine volume, AST, blood urea nitrogen (BUN), verbal response score, hypertension, diabetes, mechanical ventilation, opioid analgesic use, alpha2-adrenergic receptor agonist use, adrenergic agonist use, number of beds, and length of ICU stay.
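For readers unfamiliar with Bartlett's test of sphericity, the standard test statistic is $\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln|R|$ with $p(p-1)/2$ degrees of freedom, where $R$ is the $p \times p$ correlation matrix and $n$ is the number of observations. The sketch below computes this statistic in plain Python. The example matrix and sample size are hypothetical, and the P value (obtained from the chi-square distribution) and the Kaiser-Meyer-Olkin measure are not computed here.

```python
from math import log
from typing import List, Tuple

Matrix = List[List[float]]


def determinant(matrix: Matrix) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            det = -det  # a row swap flips the sign
        det *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return det


def bartlett_sphericity(corr: Matrix, n_obs: int) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for Bartlett's test."""
    p = len(corr)
    det = determinant(corr)
    if det <= 0.0:
        raise ValueError("correlation matrix must be positive definite")
    chi2 = -(n_obs - 1 - (2 * p + 5) / 6) * log(det)
    df = p * (p - 1) // 2
    return chi2, df


# Hypothetical 3x3 correlation matrix and sample size, not from the study.
corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
chi2, df = bartlett_sphericity(corr, n_obs=100)
print(f"chi2 = {chi2:.2f}, df = {df}")
```

A large statistic relative to its degrees of freedom indicates that the correlation matrix differs from the identity matrix, which is the condition under which factor analysis is meaningful.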
In addition, we selected risk factors that can be used to build RP models that do not require laboratory test results. These risk factors included age, admission type, DBP, urine volume, verbal response score, mechanical ventilation, opioid analgesic use, alpha2-adrenergic receptor agonist use, adrenergic agonist use, number of beds, and length of ICU stay. We believe that these variables provide a good representation of delirium risk factors. They are also easy to obtain in clinical settings and less prone to missing values.
Comparison of models
The performance of the E-PRE-DELIRIC model and the 18 prediction models is presented in Table 1. The E-PRE-DELIRIC model had an AUC of 0.77, a Youden index of 0.38, an RR of 6.42, a PPV of 0.85, an NPV of 0.54, an accuracy of 0.76, an F1 score of 0.56, a PLR of 1.82, and an NLR of 0.28.
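Most of the metrics reported above derive from a 2x2 confusion matrix. The following sketch shows the standard definitions; the function name and the example counts are hypothetical and do not reproduce the study's results. RR is left out because the text does not state how it was computed.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,       # Youden index
        "ppv": tp / (tp + fp),                         # positive predictive value
        "npv": tn / (tn + fn),                         # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2 * tp / (2 * tp + fp + fn),             # harmonic mean of PPV and sensitivity
        "plr": sensitivity / (1 - specificity),        # positive likelihood ratio
        "nlr": (1 - sensitivity) / specificity,        # negative likelihood ratio
    }


# Hypothetical counts for illustration only.
m = diagnostic_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

The sketch assumes every denominator is nonzero; a classifier that predicts a single class for all patients would need explicit handling.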
The AUC was higher for the AF models, which used the all features set (AUC range across algorithms, 0.77–0.93), than for the CV models, which used the selected features set (AUC range, 0.77–0.88), and the RP models, which used the fast features set (AUC range, 0.75–0.87).
The RF AF model had the highest AUC (0.93), Youden index (0.70), and RR (32) and was considered the best-performing model on comprehensive evaluation. The XGBoost AF model had a slightly lower AUC of 0.92, a Youden index of 0.66, and an RR of 31. Similarly, the best-performing CV models were the RF model and the XGBoost model, which had AUCs of 0.88 and 0.86, respectively. Among the RP models, the KNN model had the highest AUC (0.87) and RR (14.52); the NB model had the lowest AUC (0.75) and RR (5.82).
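Because the comparison above rests mainly on the AUC, a short sketch of one common way to compute it may be useful. It uses the pairwise (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The function name and the toy labels and scores are hypothetical; the text does not specify how the AUCs were calculated.

```python
from typing import List


def auc_pairwise(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy predictions: 1 = delirium, 0 = no delirium.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = auc_pairwise(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of cases, so for cohorts of the size analyzed here a rank-based implementation would be preferable; the sketch favors clarity over speed.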
Interpretation and evaluation of the machine learning model
Random forest models can provide measures of variable importance, thus providing some insight into the factors with the greatest influence on the predictions. The ten most highly ranked variables in this model were length of ICU stay, verbal response score, APACHE-III score, urine volume, hemoglobin, alanine transaminase, blood urea nitrogen, diastolic blood pressure, age, and mechanical ventilation (Figure 1A). According to the SHAP analysis, length of ICU stay, APACHE-III score, alanine transaminase, hypertension, and blood urea nitrogen correlated positively with the outcome and were the top five risk factors; verbal response score, urine volume, hemoglobin, diastolic blood pressure, and alpha2-adrenergic receptor agonist use correlated negatively with the outcome and were the top five protective factors (Figure 1B).