This register-based cohort study is part of a research programme approved by the Health Research Authority (HRA) and Health and Care Research Wales (HCRW) (IRAS ID: 278171); the requirement for individual patient consent was waived.
Dataset and Patient Population
The study was performed using the National Adult Cardiac Surgery Audit (NACSA) dataset, which comprises UK adult cardiac surgery data prospectively collected by NACSA. Patients under the age of 18 were excluded, as were congenital cases, transplants, mechanical support device insertions and records missing information on mortality. Rather than only examining the dataset across one institution as previously reported,[50] this analysis was performed using data from all NHS cardiac surgery hospital sites across the UK and a selection of private hospitals from 1 January 1996 to 31 March 2019.
Missing and erroneously inputted data were cleaned according to the National Adult Cardiac Surgery Audit Registry Data Pre-processing recommendations; the detailed methodology for data pre-processing and the handling of missing data has been outlined in a previous study.[51] The overall percentage of missing baseline information was very low (1.7%). In general, where a variable value was missing, it was assumed to be at the baseline level, i.e., no risk factor was present. Missing categorical or dichotomous variables were imputed with the mode, while missing continuous variables were imputed with the median. A total of 647,726 patients from 45 hospitals were included in this analysis following the removal of 4,244 (0.65%) patients with missing information on mortality.
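The mode/median imputation rule above can be sketched as follows (a minimal illustration; the field names and values are hypothetical toy data, not actual NACSA variables):

```python
import pandas as pd

def impute_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the imputation rule: categorical/dichotomous gaps take
    the mode, numeric gaps take the median."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object or out[col].nunique(dropna=True) <= 2:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        else:
            out[col] = out[col].fillna(out[col].median())
    return out

# Hypothetical toy records with one categorical and one continuous gap
df = pd.DataFrame({"diabetes": ["yes", None, "no", "no"],
                   "age": [70.0, 64.0, None, 58.0]})
clean = impute_baseline(df)
```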
The dataset was split into three cohorts: Training, 65% (n = 420,639; 1996–2011; Supplementary Materials, Table S1); Update, 24% (n = 157,196; 2012–2016; Table S2); and Holdout, 11% (n = 69,891; 2017–2019; Table S3). The primary outcomes were discrimination, calibration, clinical utility and overall accuracy of the different models in predicting in-hospital mortality risk following cardiac surgery.
Baseline statistical analysis
Numerical variables were summarised as mean and standard deviation or median and interquartile range and compared using t-tests or Mann–Whitney U tests. Categorical variables were tabulated as frequencies and percentages and compared using χ² tests. Scikit-learn v0.23.1 and Keras v2.4.0 were used to develop the models and to evaluate their discrimination capabilities. Statistical analyses were conducted using STATA-MP version 17 and R v4.0.2.[52] ANOVA assumptions were checked using the R rstatix package.
Preprocessing and linkage
A common identifier across both variable categories was created to ensure linkage. Data rows were then randomised using seed number 7 for reproducibility. Data standardisation was performed by subtracting the variable mean and dividing by the standard deviation.[53]
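The standardisation step is a column-wise z-score transform, which can be sketched as follows (a minimal illustration on synthetic data; the seed-7 generator mirrors the row-shuffling seed mentioned above):

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Z-score each column: subtract the column mean, divide by the column SD."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

rng = np.random.default_rng(7)  # seed 7, as used for the row shuffle
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # synthetic continuous data
Z = standardize(X)
```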
Geometric Approach to Ensemble learning and Evaluation
The geometric mean is defined as \(g\left(x,y\right)= \sqrt{xy}\). Since \(\log\sqrt{xy}=(\log x+\log y)/2\), the geometric mean can be interpreted as the antilog of the arithmetic mean of log-transformed data. The geometric mean adjusts better to outliers and small samples than the arithmetic mean,[54] and, unlike the median, does not discard all data except the middle element. As we expect the base learners of a small ensemble to have skewed distributions of predicted probabilities and evaluation scores, we selected the geometric mean as the function for 1) ensembling the base learners' predicted probabilities; and 2) ensembling the set of M metrics used to evaluate the models.
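The antilog identity can be verified numerically (a minimal illustration with arbitrary positive values):

```python
import math

def geometric_mean(values):
    """Geometric mean computed as the antilog of the arithmetic mean of the
    log-transformed values (requires strictly positive inputs)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# For two values this reduces to sqrt(x * y):
g = geometric_mean([0.04, 0.25])  # sqrt(0.04 * 0.25) = 0.1
```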
Ensemble modelling
We used six statistical algorithms to generate mortality predictions: Logistic Regression (LR), Neural Network (NN),[53] Random Forest (RF),[55] Weighted Support Vector Machine (SVM),[56] XGBoost[57] and Bayesian Update (a Bayesian modelling approach to recalibrating and updating coefficients).[50,58] Each algorithm was 'trained' using two different sets of variables: those of LogES and those of ES II (Table S4.1). Parameters for the ML models are listed in Table S4.2. The models based on LogES variables were further 'updated' using the data partitions explained above, resulting in 12 base learner models.
Ensemble models were created using either homogeneous or heterogeneous techniques. Homogeneous ensembles use the same algorithm to generate different models/predictions from different samples of the base data (e.g. XGBoost based on ES II and XGBoost based on LogES variables); these are the logES-ESII-P models. Heterogeneous ensembles use different algorithms on the same base data (e.g. XGBoost, LR, NN, RF, etc., all trained on ES II variables); these are the logES-O, ESII-O and logES-ESII-A models. Nine ensemble models were created; a more detailed explanation can be found in Table S5 and the corresponding Supplementary Material notes.
All models were evaluated using the Holdout dataset from the years 2017–2019.[41] The geometric average was used for all soft-voting transformations to combine the probability distributions of the base learners into one ensemble distribution.[59] Details of the base learner model specifications are provided in Supplementary Materials, Section 2.
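The geometric-mean soft-voting step can be sketched as follows (a minimal illustration with hypothetical base learner probabilities; the actual pipeline used trained Scikit-learn/Keras models):

```python
import numpy as np

def geometric_soft_vote(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Combine base learner mortality probabilities (rows = models,
    columns = patients) by taking the per-patient geometric mean."""
    clipped = np.clip(probs, eps, 1.0)       # guard against log(0)
    return np.exp(np.log(clipped).mean(axis=0))

# Hypothetical predicted probabilities from three base learners, four patients
p = np.array([[0.02, 0.10, 0.50, 0.90],
              [0.03, 0.08, 0.40, 0.85],
              [0.04, 0.12, 0.45, 0.95]])
ensemble = geometric_soft_vote(p)
```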
Assessment of model performance
The models’ performance was measured across four broad parameters, combined in a consensus metric approach as described later in this section[60]:
- Discrimination: AUC[61] and F1 score[46]
- Calibration: 1 − ECE[62]
- Overall accuracy:[60] 1 − Brier score[63]
- Clinical utility: net benefit analysis[16]
The Area Under the Curve (AUC) performance of all variant models was evaluated, and ROC curves were plotted.[61] As a sensitivity analysis, we excluded the true negative rate from the performance evaluation by calculating the F1 score.[46] The decision curve net benefit index was used to test clinical benefit.[16] 1 − Expected Calibration Error (ECE) was used to determine calibration performance, with higher values being better.[62] The adjusted Brier score (1 − Brier) was used without the normalisation term,[63] with higher values indicating higher overall accuracy.
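A binned ECE can be sketched as follows (an illustrative implementation under a common equal-width binning assumption; the study's exact binning scheme is not specified here):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Sketch of ECE: bin predictions by probability, then take the
    prevalence-weighted mean |observed event rate - mean predicted
    probability| across bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if i == n_bins - 1:                       # last bin includes 1.0
            mask = (y_prob >= lo) & (y_prob <= hi)
        else:
            mask = (y_prob >= lo) & (y_prob < hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# A perfectly calibrated toy example scores 1.0 on the 1 - ECE scale
y = np.array([0, 1, 0, 1])
p = np.array([0.5, 0.5, 0.5, 0.5])
score = 1.0 - expected_calibration_error(y, p)
```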
We evaluated the following comparisons:
- Bayesian updated base learners using LogES and ES II vs. Original LogES and ES II scores
- ROC AUC of logES-ESII-P Ensemble models.
- logES-O, ESII-O against logES-ESII-A models.
- The logES-ESII-A Ensemble against logES-ESII-P ensemble models.
To determine the best model in terms of both discrimination and calibration, we took a consensus approach using the geometric average[37,53,59] of AUC, F1,[46] decision curve net benefit (treated + untreated), 1 − ECE and 1 − Brier, referred to as the clinical effectiveness metric (CEM). The arithmetic mean of the net benefit values was taken prior to CEM calculation because the geometric mean assumes no negative values. 1,000 bootstrap samples were taken for all metrics.
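The CEM computation for a single bootstrap sample can be sketched as follows (the metric values are hypothetical; the treated/untreated net benefit pair is averaged arithmetically first, as described above):

```python
import numpy as np

def clinical_effectiveness_metric(auc, f1, net_benefit, ece, brier):
    """Sketch of the consensus CEM: geometric mean of AUC, F1,
    (arithmetic-mean) net benefit, 1 - ECE and 1 - Brier.
    All components must be positive for the geometric mean to exist."""
    nb = np.mean(net_benefit)  # arithmetic mean first: NB can be negative
    components = np.array([auc, f1, nb, 1.0 - ece, 1.0 - brier])
    return float(np.exp(np.log(components).mean()))

# Hypothetical metric values for one bootstrap sample
cem = clinical_effectiveness_metric(auc=0.85, f1=0.40,
                                    net_benefit=[0.03, 0.05],
                                    ece=0.02, brier=0.08)
```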
For comparison group 1), we generated forest plots comparing the Bayesian updated LogES base learner coefficients against the original LogES coefficients, using coefficients from 1996–2011 as priors and updating them with data from 2012–2016. Markov Chain Monte Carlo (MCMC) was used for the Bayesian update process. The same plot was generated comparing the Bayesian updated ES II base learner against the original ES II coefficient model; owing to the late adoption of ES II, its MCMC coefficients were obtained using data from 2012–2016 only. Three chains of JAGS MCMC were applied, each with 1,000 iterations and a burn-in of 200. The thinning interval was set to 10 and the deviance information criterion (DIC) was set to False. Plots were generated using R version 4.0.2 with the tidyverse and ggforestplot packages.
Comparison 2) above was analysed using ROC-AUC and 95% CIs. CEM was evaluated for all models in comparison groups 3) and 4) above. Using the bootstrap samples, comparison 3) was tested using one-way ANOVA and Bonferroni-corrected multiple pairwise paired t-tests. The logES-ESII-A results were compared against each of the logES-ESII-P models using repeated measures one-way ANOVA and Bonferroni-corrected multiple pairwise paired t-tests (comparison 4); this was followed by Dunnett's correction for multiple comparisons, with logES-ESII-A as the control. ANOVA assumptions regarding outliers were checked, and normality was checked using the Shapiro–Wilk test.[64] A sensitivity analysis of the individual metrics comprising the CEM was conducted for comparison 4).
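The Bonferroni-corrected pairwise paired t-tests over bootstrap CEM samples can be sketched as follows (model names and values are hypothetical; the ANOVA and Dunnett steps are omitted):

```python
import itertools
import numpy as np
from scipy import stats

def bonferroni_pairwise_paired_t(samples: dict) -> dict:
    """Sketch: pairwise paired t-tests across models' bootstrap CEM samples,
    with Bonferroni correction (p multiplied by the number of comparisons)."""
    pairs = list(itertools.combinations(samples, 2))
    m = len(pairs)
    adjusted = {}
    for a, b in pairs:
        t, p = stats.ttest_rel(samples[a], samples[b])
        adjusted[(a, b)] = min(p * m, 1.0)
    return adjusted

# Hypothetical bootstrap CEM distributions for three ensemble models
rng = np.random.default_rng(0)
cem = {"logES-O": rng.normal(0.700, 0.01, 1000),
       "ESII-O": rng.normal(0.705, 0.01, 1000),
       "logES-ESII-A": rng.normal(0.720, 0.01, 1000)}
adj = bonferroni_pairwise_paired_t(cem)
```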
SHAP
We also applied SHAP (SHapley Additive exPlanations) to the highest performing model to investigate which variables contributed most to mortality risk prediction on the Holdout set.[65] This method provides both high accuracy and consistency in explaining which variables are important.[66] SHAP was used to examine the overall importance ranking of variables and was applied to specific variables for interaction analysis. Importance was reported either in log-odds or as absolute importance magnitude.
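SHAP attributions are the Shapley values of a game over feature coalitions. As a self-contained illustration of that definition (the analysis itself used the shap library's efficient estimators; the predictor and values below are a toy example), a brute-force sketch:

```python
import itertools
import math

def shapley_values(predict, x, baseline):
    """Brute-force Shapley attributions for one instance: each feature's
    average marginal contribution over all coalitions, with absent features
    set to a baseline value. Exponential cost -- for illustration only;
    shap's TreeExplainer computes this efficiently for tree ensembles."""
    n = len(x)
    phi = [0.0] * n
    features = list(range(n))
    for i in features:
        others = [j for j in features if j != i]
        for k in range(len(others) + 1):
            for S in itertools.combinations(others, k):
                w = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                     / math.factorial(n))
                with_i = [x[j] if (j in S or j == i) else baseline[j]
                          for j in features]
                without_i = [x[j] if j in S else baseline[j] for j in features]
                phi[i] += w * (predict(with_i) - predict(without_i))
    return phi

# Toy additive risk score: the per-feature contributions are recovered exactly
predict = lambda v: 0.3 * v[0] + 0.7 * v[1]
phi = shapley_values(predict, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

For an additive model the attributions equal the individual terms, and in general they sum to the difference between the prediction at x and at the baseline (the efficiency property).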