Dataset used in this study
We used the eRI (https://www.usa.philips.com/healthcare/solutions/enterprise-telehealth/eri) database, an open database of anonymized data donated by >400 member institutions, including demographic information, vital signs measurements, laboratory test results, drug information, procedural information, fluid balance reports, hospital length of stay, and in-hospital mortality. The eRI contains data on 200,859 ICU admissions with more than 100 variables. In this study, we carefully selected 58 clinically important variables to construct a model that is easy to interpret clinically. The selected variables are summarized in Table S1 (see Additional file 1). Briefly, the variables were as follows: (I) laboratory measurements, including white blood cell count, hematocrit, bilirubin, creatinine, sodium, albumin, blood urea nitrogen, glucose, arterial pH, fraction of inspired oxygen, arterial oxygen pressure, and arterial carbon dioxide pressure. (II) Routinely charted data, including temperature, respiratory rate, heart rate, mean arterial blood pressure, urine output, and Glasgow Coma Scale (eye, motor, and verbal response scores). (III) Information recorded at the time of ICU admission, including age, gender, height, weight, time from hospitalization to ICU admission, hospital type, hospital bed count, hospital ID, and diagnosis names at ICU admission. (IV) Comorbidities, including myocardial infarction within 6 months, diabetes, hepatic failure, dialysis, immunosuppressive disease, lymphoma, leukemia, metastatic cancer, cirrhosis, acquired immune deficiency syndrome, and history of intubation and mechanical ventilation, together with interventions required at admission, including catheter intervention for myocardial infarction, coronary artery bypass grafting with or without internal thoracic artery grafts, and use of thrombolytics.
(V) The APACHE scoring system, including not only the APACHE score but also actual ICU and in-hospital mortality, predicted ICU and in-hospital mortality, length of ICU stay, length of hospital stay, and ventilation duration. Among these variables, diagnosis names at ICU admission were used to define the sepsis and non-sepsis groups (see "Definition of sepsis and non-sepsis groups" in the Methods section for details), actual ICU and in-hospital mortality were used as response variables to construct the models, ICU and in-hospital mortality predicted by APACHE IV or IVa were used as benchmarks for our model, and the remaining 53 variables were used as explanatory variables for machine learning.
Definition of sepsis and non-sepsis groups
Inclusion criteria for the sepsis group were as follows: (I) extraction by diagnosis names at ICU admission, namely "Sepsis, cutaneous/soft tissue," "Sepsis, GI," "Sepsis, gynecologic," "Sepsis, other," "Sepsis, pulmonary," "Sepsis, renal/UTI (including bladder)," and "Sepsis, unknown;" (II) selection by documentation of prognosis; and (III) exclusion of cases with any missing data in Acute Physiology Score (APS)-related variables or prognosis information. Inclusion criteria for the non-sepsis group were almost identical: the complementary subset of (I) was first selected, and then (II) and (III) were applied to the resulting subset. Thus, 4,226 and 23,170 cases were defined as the sepsis and non-sepsis groups, respectively. To reproduce our results and access the patient lists, follow the Jupyter notebook at https://github.com/tatsumashoji/ICU/1_the_sepsis_group_and_non_sepsis_group.ipynb.
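The three inclusion steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names (`diagnosis`, `hospital_mortality`) and the APS variable list passed in are assumptions, not the real eRI schema.

```python
import pandas as pd

# Diagnosis strings that define the sepsis group (taken from the text).
SEPSIS_DIAGNOSES = {
    "Sepsis, cutaneous/soft tissue",
    "Sepsis, GI",
    "Sepsis, gynecologic",
    "Sepsis, other",
    "Sepsis, pulmonary",
    "Sepsis, renal/UTI (including bladder)",
    "Sepsis, unknown",
}

def split_sepsis_groups(df: pd.DataFrame, aps_cols: list):
    """Split ICU admissions into sepsis and non-sepsis groups.

    `diagnosis` and `hospital_mortality` are hypothetical column names.
    """
    # (I) select by admission diagnosis; the complement is the non-sepsis pool
    is_sepsis = df["diagnosis"].isin(SEPSIS_DIAGNOSES)

    def clean(sub: pd.DataFrame) -> pd.DataFrame:
        # (II) keep only cases with a documented prognosis,
        # (III) drop cases missing any APS-related variable
        return sub.dropna(subset=["hospital_mortality"]).dropna(subset=aps_cols)

    return clean(df[is_sepsis]), clean(df[~is_sepsis])
```

Applying the same (II) and (III) filters to the complement of (I), as the text describes, keeps the two cohorts directly comparable.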
Subgrouping based on missing data
Subgroups were defined according to the diagram shown in Fig. S1 (see Additional file 2). Briefly, (I) for each pattern of 52 out of the 53 explanatory variables, we generated the list of patients with no missing data in those 52 variables, yielding 53 lists. Then, (II) the list of intermediate size (neither too small nor too large) among the 53 lists was defined as subgroup #1. For the other subgroups, we repeated (I) and (II) with the remaining patients. To reproduce this subgrouping, follow the Jupyter notebooks at https://github.com/tatsumashoji/ICU/2_subgrouping_sepsis.ipynb for the sepsis group and https://github.com/tatsumashoji/ICU/3_subgrouping_non_sepsis.ipynb for the non-sepsis group.
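One round of steps (I) and (II) might look like the sketch below, which assumes the explanatory variables sit in a pandas DataFrame and takes the candidate list of median size as a stand-in for "neither too small nor too large"; the actual selection rule is in the linked notebooks.

```python
import pandas as pd

def missingness_subgroup(df: pd.DataFrame):
    """One round of subgrouping: leave out each variable in turn,
    list patients complete on the rest, keep the median-sized list."""
    candidates = []
    for col in df.columns:
        # (I) patients with no missing data in all variables except `col`
        complete_idx = df.drop(columns=[col]).dropna().index
        candidates.append((col, complete_idx))
    # (II) pick the list of intermediate (median) size
    candidates.sort(key=lambda c: len(c[1]))
    dropped_col, subgroup_idx = candidates[len(candidates) // 2]
    return dropped_col, subgroup_idx
```

Repeating this on the patients not yet assigned yields subgroup #2, #3, and so on, as in Fig. S1.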
Generation and performance of our model
To construct the model for each group, we used the random forest classifier implemented in scikit-learn (0.24.1) [23]. Briefly, we first selected 80% of the data of each group as a training dataset, such that the ratio of "ALIVE" to "EXPIRED" cases was the same in the training and test datasets. After the hyperparameters for the random forest were determined by grid search, the actual model was generated, and the mean and standard deviation of accuracy were checked through 5-fold cross-validation (see Tables S2-S5 in Additional file 1). Finally, patient mortality in the test dataset was predicted by the generated model and compared with predictions from APACHE IV or IVa by drawing receiver operating characteristic (ROC) curves and calculating the area under the ROC curve (AUROC). Confidence intervals for the AUROC were calculated as described previously [24]. For calibration plots, we used the module "sklearn.calibration.calibration_curve." All results in Fig. 2 can be reproduced by running https://github.com/tatsumashoji/ICU/4_sepsis_prediction.ipynb.
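The pipeline described above can be sketched with scikit-learn as follows. The hyperparameter grid here is illustrative only (the grids actually searched are in Tables S2-S5), and the function names are ours, not the notebook's.

```python
from sklearn.calibration import calibration_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

def fit_and_evaluate(X, y, param_grid=None, seed=0):
    """Sketch of the modeling pipeline: stratified 80/20 split, grid search,
    5-fold CV accuracy, test-set AUROC, and a calibration curve."""
    # 80/20 split, stratified so the ALIVE/EXPIRED ratio matches in both sets
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    # hyperparameter selection by grid search (grid here is illustrative)
    grid = param_grid or {"n_estimators": [50, 100], "max_depth": [None, 5]}
    search = GridSearchCV(RandomForestClassifier(random_state=seed), grid, cv=5)
    search.fit(X_tr, y_tr)
    model = search.best_estimator_
    # mean and SD of accuracy via 5-fold cross-validation
    scores = cross_val_score(model, X_tr, y_tr, cv=5)
    # predicted mortality on the held-out test set, summarized as AUROC
    proba = model.predict_proba(X_te)[:, 1]
    auroc = roc_auc_score(y_te, proba)
    # calibration plot data, as in sklearn.calibration.calibration_curve
    frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=5)
    return model, scores, auroc, (frac_pos, mean_pred)
```

The benchmark comparison would then plot this ROC curve alongside one built from the APACHE IV/IVa predicted mortalities on the same test patients.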
Imputation of missing values
For imputation of missing values, we used the multivariate imputation algorithm implemented in "sklearn.impute.IterativeImputer" (scikit-learn 0.24.1), which uses the entire set of available feature dimensions to estimate missing values. To reproduce the results shown in Fig. 3, follow the Jupyter notebook at https://github.com/tatsumashoji/ICU/5_imputation.ipynb.
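A minimal example of this imputer on toy data (the notebook applies it to the clinical variables instead; note that `IterativeImputer` is still experimental in scikit-learn and requires an explicit enabling import):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Each feature with missing values is modeled as a function of the other
# features, iterating round-robin until the estimates stabilize.
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 10.0]])
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
# the missing entry is estimated from the roughly linear relation
# between the two columns (second column ≈ 2 × first column)
```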