Participant characteristics
The selection of participants for the study is described in detail in Fig. 1. A total of 507 patients who met all inclusion criteria were enrolled between 30 June 2019 and 30 June 2020. The incidence of delirium reached 28% in our study, underscoring the need for an accurate predictive model for patients with VHD. To ensure data accuracy, expert clinicians reviewed detailed medical records to collect clinical information on diagnosis, surgical procedure, left ventricular ejection fraction (LVEF), duration of anesthesia, duration of CPB, abnormal laboratory results, and the development of delirium.
Following the grouping methods of previous studies, the enrolled cases were randomly divided into a training group (80%, N = 405) and a validation group (20%, N = 103). The baseline characteristics of all cases are described in Table 1. Data are presented as mean and standard deviation (SD) for continuous variables and as percentages for dichotomous variables. Missing data were handled with multiple imputation by chained equations[42]. To assess educational attainment, we converted the multi-categorical variable into an ordinal score. According to this admission scoring system, educational level was divided into three categories: junior high school education and below (scored 0); high school education or undergraduate degree (scored 1); postgraduate degree and above (scored 2). The average education score across the overall sample was 0.5, indicating a low level of education among our participants. Some researchers regard a low educational level as a risk factor for the development of delirium, owing to a lack of mentally stimulating activity and insufficient cognitive reserve[43, 44].
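The grouping and scoring steps above can be sketched with pandas and scikit-learn. This is a minimal illustration on toy data, not the study's pipeline: the category names, column names, and cohort are invented, and stratifying on the outcome is one common way to keep the delirium proportion consistent between the training and validation groups.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical mapping of educational attainment to the ordinal score
# described in the text (category names are illustrative).
EDU_SCORE = {
    "junior_high_or_below": 0,
    "high_school_or_undergraduate": 1,
    "postgraduate_or_above": 2,
}

def split_cohort(df, label_col="delirium", seed=0):
    """80/20 split; stratifying on the outcome keeps the delirium
    proportion nearly identical in the training and validation sets."""
    train, valid = train_test_split(
        df, test_size=0.20, random_state=seed, stratify=df[label_col]
    )
    return train, valid

# Toy cohort: 100 rows, 28 delirium cases (mirroring the 28% incidence).
df = pd.DataFrame({
    "education": ["junior_high_or_below"] * 60
                 + ["high_school_or_undergraduate"] * 35
                 + ["postgraduate_or_above"] * 5,
    "delirium": [1] * 28 + [0] * 72,
})
df["edu_score"] = df["education"].map(EDU_SCORE)  # categorical -> ordinal
train, valid = split_cohort(df)
```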
The characteristics of our participants reflect both the surgical procedure and the pathogenetic features of VHD. For example, our participants were predominantly female (59%), and the average age (55.7 years) was below 70 years, as VHD is more common in females and middle-aged individuals; this differs from other models that do not differentiate by primary disease[15]. The average cardiopulmonary bypass (CPB) duration (157.3 vs. 198.34 minutes), aortic cross-clamping duration (98.0 vs. 114.86 minutes), and anesthesia duration (259.0 vs. 476.91 minutes) were much shorter in this study than in a previous POD study in patients with type A aortic dissection (AAD)[45]. In addition to conventional postoperative laboratory indicators, the postoperative use of IABP/ECMO within the delirium assessment period was also included in this study; IABP/ECMO is associated with hemodynamic instability and internal-environment disturbances, which may contribute to the development of delirium[46, 47, 48]. We also included the pain score assessed with the Numerical Rating Scale (NRS) as a predictor of delirium[49], since poor pain management caused by inadequate analgesia or excessive sedation may trigger delirium after surgery. The average pain score of 2.2 points in this study indicates mild postoperative pain (no interference with sleep).
Model development
Table 1 and Table 2 describe the 32 complete and 20 selected characteristics, respectively, covering preoperative, intraoperative, and postoperative information. The training and validation sets were randomly selected, and the distribution of delirium was balanced between them, indicating that any variation between the two data sets reflects chance rather than the grouping itself. In both the full feature set and the simple feature set, the proportion of delirium remained consistent (with minor variation due to rounding) across the full sample (28%), training sample (28%), and validation sample (28%).
The specific development process is shown in Fig. 2. For the training group, seven classical machine learning algorithms (Logistic Regression[35, 36], Support Vector Machine[37], K-nearest Neighbors[38], Naïve Bayes (GaussianNB)[39], Perceptron[40], Decision Tree Classifier[39], Random Forest Classifier[41]) were used to develop prediction models for delirium under both the full feature set and the simple feature set. These models were then evaluated on the validation group to assess their performance.
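The seven-algorithm training step can be sketched with scikit-learn. The data here are a synthetic stand-in (`make_classification` with a ~28% positive class), and all hyperparameters are library defaults rather than the study's settings; the "training score" and "test score" are mean accuracy on the training and validation splits, scaled to percent as in Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the clinical data: ~72% negative / ~28% positive.
X, y = make_classification(n_samples=508, n_features=32,
                           weights=[0.72], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "K-nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes (GaussianNB)": GaussianNB(),
    "Perceptron": Perceptron(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Mean accuracy on each split, expressed in percent.
    scores[name] = (100 * model.score(X_tr, y_tr),
                    100 * model.score(X_va, y_va))
```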
Model performance
The validation results of the prediction models are summarized in Table 3, where the seven models developed with different machine learning algorithms are ranked by predictive performance. We used the test score and the area under the receiver operating characteristic curve (AUC) to assess the stability and accuracy of the delirium prediction models. Briefly, the higher a model's test score on a held-out dataset (internal or external validation), the better its expected predictive performance. With the full feature set, the seven models ranked from highest to lowest test score were Random Forest Classifier, Logistic Regression, Support Vector Machine, K-nearest Neighbors, Naïve Bayes (GaussianNB), Decision Tree Classifier, and Perceptron. With the selected feature set, the order was Random Forest Classifier, Support Vector Machine, Logistic Regression, Naïve Bayes (GaussianNB), Decision Tree Classifier, K-nearest Neighbors, and Perceptron. Most algorithms performed better with the full feature set than with the simple one, suggesting that models using the full feature set had greater potential for predicting delirium cases. In addition, the Random Forest Classifier had the highest training and test scores regardless of the feature set used, indicating excellent potential for predicting delirium cases.
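The ranking itself is a simple sort on validation accuracy. The sketch below hard-codes the Table 3 full-feature-set test scores purely for illustration:

```python
# Test scores (validation accuracy, %) from Table 3, full feature set.
test_scores = {
    "Random Forest": 82.35,
    "Logistic Regression": 79.41,
    "Support Vector Machine": 74.51,
    "K-nearest Neighbors": 71.57,
    "Naive Bayes (GaussianNB)": 69.61,
    "Decision Tree": 69.61,
    "Perceptron": 58.82,
}
# Sort model names from highest to lowest test score.
ranking = sorted(test_scores, key=test_scores.get, reverse=True)
```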
Table 3
Validation results of the prediction models under the full feature set and the simple feature set

| Full feature set (n = 39) | Training score | Test score | AUC |
| --- | --- | --- | --- |
| Random Forest | 100 | 82.35 | 0.86 |
| Logistic Regression | 85.68 | 79.41 | 0.73 |
| Support Vector Machine | 81.98 | 74.51 | 0.78 |
| K-nearest Neighbors | 81.48 | 71.57 | 0.62 |
| Naïve Bayes | 70.12 | 69.61 | 0.61 |
| Decision Tree | 70.01 | 69.61 | 0.76 |
| Perceptron | 67.90 | 58.82 | 0.65 |

| Simple feature set (n = 20) | Training score | Test score | AUC |
| --- | --- | --- | --- |
| Random Forest | 100 | 75.49 | 0.76 |
| Support Vector Machine | 79.51 | 73.53 | 0.64 |
| Logistic Regression | 74.57 | 72.55 | 0.64 |
| Naïve Bayes | 69.88 | 71.57 | 0.51 |
| Decision Tree | 68.88 | 71.57 | 0.72 |
| K-nearest Neighbors | 79.75 | 69.61 | 0.50 |
| Perceptron | 62.22 | 57.84 | 0.31 |
Receiver operating characteristic (ROC) curves for the prediction models, generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, are shown in Fig. 3A (full feature set) and Fig. 3B (simple feature set)[50]. Overall, the ROC curves in Fig. 3A are closer to the upper left corner and lie to the left of the main diagonal, indicating a greater improvement and more robust prediction for most algorithms with the full feature set. The area under the ROC curve (AUC) is a widely used criterion for evaluating the quality of classifiers (predictive models). Among all the predictive approaches, the highest AUC (0.86) was observed for the random forest classifier with the full feature set. Typically, an AUC above 0.85 indicates superior predictive value; the random forest classifier can therefore predict cases of delirium comparatively well. Even with the simple feature set (Fig. 3B), the random forest classifier (AUC = 0.76) shows relatively good predictive value compared to the other classifiers.

The decision tree classifier, a simpler learner than the random forest classifier, is prone to overfitting during training. Although tightening the algorithm's restrictions (for example, limiting tree depth) can partially reduce overfitting, the utility of features may be sacrificed at the same time[51]. The decision tree classifier had an AUC of 0.76 with the full feature set and 0.72 with the simple feature set, indicating relatively stable performance in delirium prediction. The support vector machine, a supervised learning method, can achieve good results with small sample sizes owing to its generalization ability[52, 53]. Under the full feature set, its performance (AUC = 0.78) is slightly below that of the random forest classifier.
Notably, its AUC under the simple feature set drops to 0.64, indicating relatively unstable performance. As for classical logistic regression, the method mainly used in previous clinical studies[54, 55], it had an AUC of 0.73 under the full feature set and 0.64 under the simple feature set, which is mediocre for model building. Furthermore, the AUCs of K-nearest Neighbors (0.50), Naïve Bayes (0.51), and Perceptron (0.31) under the simple feature set are at or below 0.5, suggesting prediction accuracy equivalent to (or worse than) random guessing and thus no predictive value. Even with the full feature set, these classifiers achieved AUCs of only 0.61 to 0.65, illustrating their limited predictive ability.
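The ROC/AUC evaluation described above can be reproduced with scikit-learn's `roc_curve` and `roc_auc_score`, as in this sketch (synthetic data and a default random forest stand in for the study's validation set and tuned models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cohort (~28% positive class).
X, y = make_classification(n_samples=508, n_features=32,
                           weights=[0.72], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Class-1 probabilities provide the continuous score that is thresholded
# to trace out the ROC curve (TPR vs. FPR).
proba = clf.predict_proba(X_va)[:, 1]
fpr, tpr, thresholds = roc_curve(y_va, proba)
auc = roc_auc_score(y_va, proba)  # area under the ROC curve
```

Plotting `tpr` against `fpr` yields the curves shown in Fig. 3; the closer a curve hugs the upper left corner, the larger its AUC.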
In conclusion, the random forest classifier provided the best combination of accuracy and stability among the prediction models evaluated.