Patient population
This retrospective study included all fetuses diagnosed with CHD from January 2012 to December 2021 at a single tertiary centre in Canada (Sunnybrook Health Sciences Centre). The hospital performs approximately 20,000 prenatal ultrasounds and up to 500 fetal echocardiograms per year. Peri- and postnatal outcomes were collected from Sunnybrook as well as from the affiliated obstetric and neonatal centres (Mount Sinai Hospital, Michael Garron Hospital and The Hospital for Sick Children, all in Toronto), depending on the ultimate location of delivery and postnatal care. The study was approved by the Research Ethics Board of all participating institutions as well as by Johns Hopkins University, where the analysis was performed. The requirement for individual patient consent was waived given the retrospective study design.
Clinical characteristics and neonatal outcomes (Table 1) were collected through chart review. The severity of congenital heart disease was defined according to the Hoffman criteria as mild, moderate or severe (See Appendix A) [19].
Machine learning algorithms were then developed to predict the following outcomes of interest: 1) in utero demise/stillbirth or death within 72 hours of birth despite planned active care; 2) need for high-level neonatal care (delivery at a tertiary care hospital, prostaglandins, neonatal intensive care or other intensive care admission, mechanical ventilation, or neonatal surgical or catheter intervention within 30 days of life); and 3) favourable postnatal outcome, defined as survival without severe developmental delay at last follow-up, as extracted from the patient's chart.
Predictive features and clinical outcomes
The feature set consisted of 70 potential predictors. Of these, 62 were included in all three models and comprised demographics, comorbidities, medical management, and fetal structural findings, including cardiac anatomy, from the fetal echocardiogram. An additional 7 predictors, including labour induction, mode of delivery, sex, gestational age at birth, birth weight and Apgar score, were used in the models predicting the need for high-acuity neonatal care and favourable outcome (69 predictors total for these two models). Finally, the model predicting favourable (versus adverse) postnatal outcome also included information on postnatal cardiac intervention (surgical or catheter-based) in addition to the 69 variables listed above, for a total of 70 predictors, as illustrated in the sketch below.
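As an illustration only, the three predictor sets can be thought of as nested lists; the variable names below are placeholders and do not reproduce the study's actual 70-variable list.

```python
# Illustrative sketch of the three nested predictor sets (placeholder names only).
CORE_PREDICTORS = ["maternal_age", "chd_severity"]               # stands in for the 62 shared predictors
PERINATAL_PREDICTORS = ["labour_induction", "mode_of_delivery",
                        "sex", "gestational_age_at_birth",
                        "birth_weight", "apgar_score"]           # stands in for the 7 perinatal predictors
POSTNATAL_INTERVENTION = ["postnatal_cardiac_intervention"]

FEATURES_DEMISE = CORE_PREDICTORS                                            # 62 predictors
FEATURES_HIGH_ACUITY = CORE_PREDICTORS + PERINATAL_PREDICTORS                # 69 predictors
FEATURES_FAVOURABLE_OUTCOME = FEATURES_HIGH_ACUITY + POSTNATAL_INTERVENTION  # 70 predictors
```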
Data preprocessing
Missing values imputation
For every ML model we generated an analysis dataset restricted to patients with a recorded value for the corresponding outcome. Three separate analysis datasets were therefore constructed, each matched to its outcome and associated set of predictors. To address missing information within each dataset, we employed a predictive imputation method that exploits similarity between patients [20]. An iterative imputation algorithm was run for up to 50 cycles; in each cycle, a decision tree regressor was fit to the observed data to model relationships between the predictors and estimate the missing measurements.
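A minimal sketch of this step is given below, assuming scikit-learn's IterativeImputer with a DecisionTreeRegressor as the estimator; the paper does not name the specific implementation, so the library choice and parameter settings should be read as assumptions.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

def impute_predictors(predictors: pd.DataFrame) -> pd.DataFrame:
    """Iteratively impute missing predictor values (up to 50 cycles).

    Assumes predictors are numerically coded at this stage.
    """
    imputer = IterativeImputer(
        estimator=DecisionTreeRegressor(random_state=0),  # learns relationships between predictors
        max_iter=50,                                      # up to 50 imputation cycles
        random_state=0,
    )
    imputed = imputer.fit_transform(predictors)
    return pd.DataFrame(imputed, columns=predictors.columns, index=predictors.index)
```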
After imputation of missing values in the three analysis datasets, predictor variables with more than two categories were transformed using one-hot encoding.
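The encoding step could look like the sketch below, using pandas.get_dummies; the column names and categories are hypothetical examples, not the study's actual variables.

```python
import pandas as pd

# Toy analysis dataset standing in for one of the three imputed datasets.
analysis_df = pd.DataFrame({
    "mode_of_delivery": ["vaginal", "caesarean_planned", "caesarean_emergency"],
    "birth_weight_kg": [3.1, 2.8, 3.4],
})

# One-hot encode predictors with more than two categories; binary and
# continuous predictors are left unchanged.
encoded = pd.get_dummies(analysis_df, columns=["mode_of_delivery"])
print(encoded.head())
```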
Tree-based machine learning model induction and evaluation
The XGBoost tree-based ML algorithm [21] was applied to each of these datasets. XGBoost classifies patients into one of two outcome groups and can capture non-linear relationships between the predictors and each outcome. Model performance was evaluated using the area under the receiver-operating characteristic curve (AUC). The XGBoost algorithm underwent hyperparameter tuning [22] using 5-fold cross-validation (CV) with Bayesian optimization [23] over a predefined search space to identify the combination of parameters that maximized the AUC.
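The sketch below shows one way to set up such a search, assuming scikit-optimize's BayesSearchCV wrapped around an XGBClassifier; the optimization library, search ranges and number of iterations are illustrative assumptions rather than the study's exact configuration.

```python
import numpy as np
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier

def tune_xgboost(X: np.ndarray, y: np.ndarray) -> BayesSearchCV:
    """Bayesian hyperparameter search for XGBoost, maximizing AUC under 5-fold CV."""
    search_space = {                      # illustrative ranges, not the study's exact grid
        "max_depth": Integer(2, 8),
        "learning_rate": Real(0.01, 0.3, prior="log-uniform"),
        "n_estimators": Integer(50, 500),
        "subsample": Real(0.5, 1.0),
        "colsample_bytree": Real(0.5, 1.0),
    }
    search = BayesSearchCV(
        estimator=XGBClassifier(eval_metric="logloss"),
        search_spaces=search_space,
        scoring="roc_auc",                # optimize the AUC
        cv=5,                             # 5-fold cross-validation
        n_iter=50,                        # Bayesian optimization iterations (illustrative)
        random_state=0,
    )
    search.fit(X, y)                      # best parameters available in search.best_params_
    return search
```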
We employed the SHAP (SHapley Additive exPlanations) method [25] to gain insight into the influence of individual features on the model's predictions. SHAP values were calculated for each predictor across all patients, and the impact of each feature on the model's log-odds prediction was illustrated with a beeswarm plot. Features with larger absolute SHAP values contribute more to the model's prediction and are displayed further from the centre, regardless of whether they increase or decrease the predicted outcome.
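A minimal sketch of this analysis with the shap package follows; the synthetic data and fitted classifier are placeholders standing in for the study's trained models and analysis datasets.

```python
import numpy as np
import shap
from xgboost import XGBClassifier

# Toy data standing in for one analysis dataset (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = XGBClassifier(eval_metric="logloss").fit(X, y)

explainer = shap.TreeExplainer(model)   # exact SHAP values for tree ensembles
shap_values = explainer(X)              # per-patient, per-feature contributions (log-odds scale)
shap.plots.beeswarm(shap_values)        # features ordered by mean |SHAP value|
```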
To estimate the 95% confidence interval (CI) for the AUC, bootstrapping was performed with 500 resamples per fold across the 5-fold CV, yielding a total of 2500 bootstrap samples per model. The CI was then derived from the standard error of the distribution of bootstrapped AUC values. All analyses were implemented in Python version 3.9.12.
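The sketch below illustrates this procedure for a single fold and the subsequent normal-approximation interval; the function names are hypothetical and the use of a 1.96 standard-error interval is an assumption consistent with the description above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_aucs(y_true: np.ndarray, y_prob: np.ndarray,
                   n_boot: int = 500, seed: int = 0) -> np.ndarray:
    """Bootstrapped AUC values for one fold's held-out predictions.

    Assumes y_true contains both outcome classes.
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)               # resample patients with replacement
        if len(np.unique(y_true[idx])) < 2:       # AUC requires both classes in the resample
            continue
        aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    return np.asarray(aucs)

# Pooling the 5 folds (500 resamples each) gives 2500 bootstrap AUCs per model;
# the 95% CI then uses the standard error of that pooled distribution.
def auc_ci_95(pooled_aucs: np.ndarray, auc_point_estimate: float) -> tuple:
    se = pooled_aucs.std(ddof=1)
    return auc_point_estimate - 1.96 * se, auc_point_estimate + 1.96 * se
```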