This register-based cohort study is part of a research programme approved by the Health Research Authority (HRA) and Health and Care Research Wales (HCRW) (IRAS ID: 278171); the requirement for individual patient consent was waived.
Dataset and Patient Population
The study was performed using the National Adult Cardiac Surgery Audit (NACSA) dataset, which comprises UK adult cardiac surgery data prospectively collected by NACSA. Patients under the age of 18 were excluded, as were congenital cases, transplants, mechanical support device insertions and records missing information on mortality. Rather than only examining the dataset across one institution as previously reported,[50] this analysis was performed using data from all NHS cardiac surgery hospital sites across the UK and a selection of private hospitals from 1 January 1996 to 31 March 2019.
Missing and erroneously inputted data were cleaned according to the National Adult Cardiac Surgery Audit Registry Data Pre-processing recommendations; the detailed methodology for data pre-processing and the handling of missing data has been outlined in a previous study.[51] The overall percentage of missing baseline information was very low (1.7%). In general, where a variable value was missing, it was assumed to be at the baseline level, i.e., no risk factor was present. Missing categorical or dichotomous variables were imputed with the mode, while missing continuous variables were imputed with the median. A total of 647,726 patients from 45 hospitals were included in this analysis following the removal of 4,244 (0.65%) patients with missing information on mortality.
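The mode/median imputation rule above can be sketched as follows (a minimal illustration; the field names and values are hypothetical toy data, not actual NACSA variables):

```python
import pandas as pd

def impute_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the imputation rule: categorical/dichotomous gaps take
    the mode, numeric gaps take the median."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object or out[col].nunique(dropna=True) <= 2:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        else:
            out[col] = out[col].fillna(out[col].median())
    return out

# Hypothetical toy records with one categorical and one continuous gap
df = pd.DataFrame({"diabetes": ["yes", None, "no", "no"],
                   "age": [70.0, 64.0, None, 58.0]})
clean = impute_baseline(df)
```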
The dataset was split into three cohorts: Training, 65% (n = 420,639; 1996–2011; Supplementary Materials, Table S1); Update, 24% (n = 157,196; 2012–2016; Table S2); and Holdout, 11% (n = 69,891; 2017–2019; Table S3). The primary outcomes were discrimination, calibration, clinical utility and overall accuracy of the different models in predicting in-hospital mortality risk following cardiac surgery.
Baseline statistical analysis
Numerical variables were summarised as mean and standard deviation or median and interquartile range and compared using t-tests or Mann–Whitney U tests. Categorical variables were tabulated as frequencies and percentages and compared using χ² tests. Scikit-learn v0.23.1 and Keras v2.4.0 were used to develop the models and to evaluate their discrimination capabilities. Statistical analyses were conducted using STATA-MP version 17 and R v4.0.2.[52] ANOVA assumptions were checked using the R rstatix package.
Preprocessing and linkage
A common identifier across both variable categories was created to ensure linkage. Data rows were then randomised using seed number 7 for reproducibility. Data standardisation was performed by subtracting the variable mean and dividing by the standard deviation.[53]
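The standardisation step is a column-wise z-score transform, which can be sketched as follows (a minimal illustration on synthetic data; the seed-7 generator mirrors the row-shuffling seed mentioned above):

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Z-score each column: subtract the column mean, divide by the column SD."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

rng = np.random.default_rng(7)  # seed 7, as used for the row shuffle
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # synthetic continuous data
Z = standardize(X)
```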
Geometric Approach to Ensemble learning and Evaluation
The geometric mean is defined as \(g\left(x,y\right)= \sqrt{xy}\). Since \(\log\sqrt{xy}=(\log x+\log y)/2\), the geometric mean can be interpreted as the antilog of the arithmetic mean of log-transformed data. The geometric mean adjusts better to outliers and small samples than the arithmetic mean,[54] and, unlike the median, does not discard all data except the middle element. As we expect the base learners of a small ensemble to have skewed distributions of predicted probabilities and evaluation scores, we selected the geometric mean as the function for 1) ensembling the base learners' predicted probabilities; and 2) ensembling the set of M metrics used to evaluate the models.
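The antilog identity can be verified numerically (a minimal illustration with arbitrary positive values):

```python
import math

def geometric_mean(values):
    """Geometric mean computed as the antilog of the arithmetic mean of the
    log-transformed values (requires strictly positive inputs)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# For two values this reduces to sqrt(x * y):
g = geometric_mean([0.04, 0.25])  # sqrt(0.04 * 0.25) = 0.1
```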
Ensemble modelling
We used six statistical algorithms to generate mortality predictions: Logistic Regression (LR), Neural Network (NN),[53] Random Forest (RF),[55] Weighted Support Vector Machine (SVM),[56] XGBoost[57] and Bayesian Update (a Bayesian modelling approach to recalibrating and updating coefficients).[50,58] Each algorithm was 'trained' using two different sets of variables: those of LogES and those of ES II (Table S4.1). Parameters for the ML models are listed in Table S4.2. The models based on LogES variables were further 'updated' using the data partitions explained above, resulting in 12 base learner models.
Ensemble models were created using either homogeneous or heterogeneous techniques. Homogeneous ensembles use the same algorithm to generate different models/predictions from different samples of the base data (e.g. XGBoost based on ES II and XGBoost based on LogES variables); these are the logES-ESII-P models. Heterogeneous ensembles use different algorithms on the same base data (e.g. XGBoost, LR, NN, RF, etc., all trained on ES II variables); these are the logES-O, ESII-O and logES-ESII-A models. Nine ensemble models were created; a more detailed explanation can be found in Table S5 and the corresponding Supplementary Material notes.
All models were evaluated using the Holdout dataset from the years 2017–2019.[41] The geometric average was used for all soft-voting transformations to combine the probability distributions of the base learners into one ensemble distribution.[59] Details of the base learner model specifications are provided in Supplementary Materials, Section 2.
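The geometric-mean soft-voting step can be sketched as follows (a minimal illustration with hypothetical base learner probabilities; the actual pipeline used trained Scikit-learn/Keras models):

```python
import numpy as np

def geometric_soft_vote(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Combine base learner mortality probabilities (rows = models,
    columns = patients) by taking the per-patient geometric mean."""
    clipped = np.clip(probs, eps, 1.0)       # guard against log(0)
    return np.exp(np.log(clipped).mean(axis=0))

# Hypothetical predicted probabilities from three base learners, four patients
p = np.array([[0.02, 0.10, 0.50, 0.90],
              [0.03, 0.08, 0.40, 0.85],
              [0.04, 0.12, 0.45, 0.95]])
ensemble = geometric_soft_vote(p)
```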
Assessment of model performance
The models’ performance was measured across four broad parameters, combined in a consensus metric approach as described later in this section[60]:
- Discrimination: AUC[61] and F1 score[46]
- Calibration: 1 − ECE[62]
- Overall accuracy:[60] 1 − Brier score[63]
- Clinical utility: net benefit analysis[16]
The Area Under the Curve (AUC) performance of all variant models was evaluated, and ROC curves were plotted.[61] As a sensitivity analysis, we excluded the true negative rate from the performance evaluation by calculating the F1 score.[46] The decision curve net benefit index was used to test clinical benefit.[16] 1 − Expected Calibration Error (ECE) was used to determine calibration performance, with higher values being better.[62] The adjusted Brier score (1 − Brier) was used without the normalisation term,[63] with higher values indicating higher overall accuracy.
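A binned ECE can be sketched as follows (an illustrative implementation under a common equal-width binning assumption; the study's exact binning scheme is not specified here):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Sketch of ECE: bin predictions by probability, then take the
    prevalence-weighted mean |observed event rate - mean predicted
    probability| across bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if i == n_bins - 1:                       # last bin includes 1.0
            mask = (y_prob >= lo) & (y_prob <= hi)
        else:
            mask = (y_prob >= lo) & (y_prob < hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# A perfectly calibrated toy example scores 1.0 on the 1 - ECE scale
y = np.array([0, 1, 0, 1])
p = np.array([0.5, 0.5, 0.5, 0.5])
score = 1.0 - expected_calibration_error(y, p)
```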
We evaluated the following comparisons:
- Bayesian updated base learners using LogES and ES II vs. Original LogES and ES II scores
- ROC AUC of logES-ESII-P Ensemble models.
- logES-O, ESII-O against logES-ESII-A models.
- The logES-ESII-A Ensemble against logES-ESII-P ensemble models.
To determine the best model in terms of both discrimination and calibration, we took a consensus approach using the geometric average[37,53,59] of AUC, F1,[46] decision curve net benefit (treated + untreated), 1 − ECE and 1 − Brier, referred to as the clinical effectiveness metric (CEM). The arithmetic mean of the net benefit values was taken prior to CEM calculation because the geometric mean assumes no negative values. 1,000 bootstrap samples were taken for all metrics.
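The CEM computation for a single bootstrap sample can be sketched as follows (the metric values are hypothetical; the treated/untreated net benefit pair is averaged arithmetically first, as described above):

```python
import numpy as np

def clinical_effectiveness_metric(auc, f1, net_benefit, ece, brier):
    """Sketch of the consensus CEM: geometric mean of AUC, F1,
    (arithmetic-mean) net benefit, 1 - ECE and 1 - Brier.
    All components must be positive for the geometric mean to exist."""
    nb = np.mean(net_benefit)  # arithmetic mean first: NB can be negative
    components = np.array([auc, f1, nb, 1.0 - ece, 1.0 - brier])
    return float(np.exp(np.log(components).mean()))

# Hypothetical metric values for one bootstrap sample
cem = clinical_effectiveness_metric(auc=0.85, f1=0.40,
                                    net_benefit=[0.03, 0.05],
                                    ece=0.02, brier=0.08)
```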
For comparison group 1), we generated forest plots comparing the Bayesian updated LogES base learner coefficients against the original LogES coefficients, using coefficients from 1996–2011 as priors and updating them with data from 2012–2016. Markov Chain Monte Carlo (MCMC) was used for the Bayesian update process. The same plot was generated comparing the Bayesian updated ES II base learner against the original ES II coefficient model; owing to the late adoption of ES II, its MCMC coefficients were obtained using data from 2012–2016 only. Three chains of JAGS MCMC were applied, each with 1,000 iterations and a burn-in of 200. The thinning interval was set to 10 and the deviance information criterion (DIC) was set to False. Plots were generated using R version 4.0.2 with the tidyverse and ggforestplot packages.
Comparison 2) above was analysed using ROC-AUC and 95% CIs. CEM was evaluated for all models in comparison groups 3) and 4) above. Using the bootstrap samples, comparison 3) was tested using one-way ANOVA and Bonferroni-corrected multiple pairwise paired t-tests. The logES-ESII-A results were compared against each of the logES-ESII-P models using repeated measures one-way ANOVA and Bonferroni-corrected multiple pairwise paired t-tests (comparison 4); this was followed by Dunnett's correction for multiple comparisons, with logES-ESII-A as the control. ANOVA assumptions regarding outliers were checked, and normality was checked using the Shapiro–Wilk test.[64] A sensitivity analysis of the individual metrics comprising the CEM was conducted for comparison 4).
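The Bonferroni-corrected pairwise paired t-tests over bootstrap CEM samples can be sketched as follows (model names and values are hypothetical; the ANOVA and Dunnett steps are omitted):

```python
import itertools
import numpy as np
from scipy import stats

def bonferroni_pairwise_paired_t(samples: dict) -> dict:
    """Sketch: pairwise paired t-tests across models' bootstrap CEM samples,
    with Bonferroni correction (p multiplied by the number of comparisons)."""
    pairs = list(itertools.combinations(samples, 2))
    m = len(pairs)
    adjusted = {}
    for a, b in pairs:
        t, p = stats.ttest_rel(samples[a], samples[b])
        adjusted[(a, b)] = min(p * m, 1.0)
    return adjusted

# Hypothetical bootstrap CEM distributions for three ensemble models
rng = np.random.default_rng(0)
cem = {"logES-O": rng.normal(0.700, 0.01, 1000),
       "ESII-O": rng.normal(0.705, 0.01, 1000),
       "logES-ESII-A": rng.normal(0.720, 0.01, 1000)}
adj = bonferroni_pairwise_paired_t(cem)
```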
SHAP
We also applied SHAP (SHapley Additive exPlanations) to the highest performing model to investigate which variables contributed most to mortality risk prediction on the Holdout set.[65] This method provides both high accuracy and consistency in explaining which variables are important.[66] SHAP was used to examine the overall importance ranking of variables and was applied to specific variables for interaction analysis. Importance was reported either in log-odds or as absolute importance magnitude.
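SHAP attributions are the Shapley values of a game over feature coalitions. As a self-contained illustration of that definition (the analysis itself used the shap library's efficient estimators; the predictor and values below are a toy example), a brute-force sketch:

```python
import itertools
import math

def shapley_values(predict, x, baseline):
    """Brute-force Shapley attributions for one instance: each feature's
    average marginal contribution over all coalitions, with absent features
    set to a baseline value. Exponential cost -- for illustration only;
    shap's TreeExplainer computes this efficiently for tree ensembles."""
    n = len(x)
    phi = [0.0] * n
    features = list(range(n))
    for i in features:
        others = [j for j in features if j != i]
        for k in range(len(others) + 1):
            for S in itertools.combinations(others, k):
                w = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                     / math.factorial(n))
                with_i = [x[j] if (j in S or j == i) else baseline[j]
                          for j in features]
                without_i = [x[j] if j in S else baseline[j] for j in features]
                phi[i] += w * (predict(with_i) - predict(without_i))
    return phi

# Toy additive risk score: the per-feature contributions are recovered exactly
predict = lambda v: 0.3 * v[0] + 0.7 * v[1]
phi = shapley_values(predict, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

For an additive model the attributions equal the individual terms, and in general they sum to the difference between the prediction at x and at the baseline (the efficiency property).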