The protocol of this study has been peer reviewed and published[10]. Briefly, three machine learning models (survival tree[11], random survival forest[12] and survival support vector machine[13]) and one traditional regression model (Cox regression[14]) of time-to-event (survival time) outcomes were generated. This study is reported in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guideline[15].
Study cohort
The data source was the Australia and New Zealand Dialysis and Transplant Registry (ANZDATA)[16], which collects and reports the incidence, prevalence and outcomes of dialysis and kidney transplant patients across Australia and New Zealand. The dataset contained donor and recipient characteristics of 7,365 deceased donor transplants conducted in Australia from January 1st, 2007 to December 31st, 2017.
Outcome
The primary outcome was time to graft failure starting from the transplantation date. Patients who died with a functioning graft were included and were right censored at their death date. Patients with a functioning graft at the end of the study period were right censored on December 31st, 2017. Sixty-five patients (0.9%) were lost to follow-up and were right censored at their last known follow-up date.
Independent variables
Our aim was to develop a risk index for use in pre-transplant decision making, hence we used only variables that are available before transplantation and reported in ANZDATA across all patient groups. In total, 67 candidate independent variables, covering both recipient and donor characteristics, were identified[17].
Model development
Model development was a sequential process with the following five steps: data preparation, splitting the data set into training and validation datasets, variable selection, model training, and model evaluation (Figure 1).
Step 1: Data preparation
Prior to model development, data were processed in three ways: treating missing values, dummy coding categorical variables and scaling the continuous variables. The dataset had nearly 500,000 data points (7,365 patients × 67 independent variables), of which 2.5% were missing; most variables (64%) had less than 1% missing. Multiple imputation with random hot deck and Classification and Regression Trees (CART), using the R package ‘simputation’[18] and the full dataset of 7,365 patients, was applied to 14 categorical and 17 continuous variables with missing values. Based on expert opinion, missing values for the remaining 13 categorical variables were assigned to a separate “missing” category to avoid the data being lost.
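The two missing-data strategies above can be illustrated with a minimal Python sketch (the study itself used the R package ‘simputation’; the variable names here are hypothetical toy examples, not ANZDATA fields):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def hot_deck_impute(s: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Random hot deck: replace each missing value with a random
    draw from the observed (donor) values of the same variable."""
    out = s.copy()
    donors = s.dropna().to_numpy()
    mask = s.isna()
    out[mask] = rng.choice(donors, size=int(mask.sum()))
    return out

# toy data with hypothetical variable names
df = pd.DataFrame({
    "blood_group": ["A", "B", None, "O", None, "A"],      # imputed by hot deck
    "smoking":     ["no", None, "yes", None, "no", "no"], # kept as "missing"
})
df["blood_group"] = hot_deck_impute(df["blood_group"], rng)
df["smoking"] = df["smoking"].fillna("missing")  # explicit "missing" category
```

The hot deck draw preserves the observed distribution of each variable, while the explicit “missing” category keeps patients with unrecorded values in the analysis.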
Numerical independent variables were normalized using min-max scaling, which converts them to a common scale and simplifies comparisons between variables[19]. Categorical variables were dummy coded into 0/1 indicator variables. After dummy coding, the total number of independent variables was 98.
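The scaling and dummy-coding steps can be sketched as follows (a toy illustration with hypothetical variable names, not the study's actual pipeline):

```python
import pandas as pd

df = pd.DataFrame({
    "donor_age": [18.0, 45.0, 72.0],   # continuous variable
    "blood_group": ["A", "B", "O"],    # categorical variable
})

# min-max scaling: x' = (x - min) / (max - min), so every
# continuous variable lives on the same [0, 1] scale
age = df["donor_age"]
df["donor_age"] = (age - age.min()) / (age.max() - age.min())

# dummy coding: one 0/1 indicator column per category
df = pd.get_dummies(df, columns=["blood_group"])
```

Each categorical variable with k levels expands into k indicator columns, which is how 67 original variables became 98 after dummy coding.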
Step 2: Training and validation data
The dataset was randomly divided into two parts: a training dataset and a validation dataset. The training set, used to train the four predictive models, contained 70% of the data (n=5,156). The validation set (n=2,209) was used to robustly test the predictive power of each model. Having a separate validation set provided more realistic estimates of the models’ prediction accuracy and helped avoid over-fitting.
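A random 70/30 split of this kind can be sketched as below (illustrative only; the seed and the exact rounding rule are assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the split is reproducible
n = 7365                          # patients in the full dataset
idx = rng.permutation(n)          # shuffle patient indices once

n_train = round(0.70 * n)         # roughly the paper's 5,156 training patients
train_idx, valid_idx = idx[:n_train], idx[n_train:]
```

Because the permutation is drawn once, every patient lands in exactly one of the two sets, so the validation set never leaks into training.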
Step 3: Variable selection
An important step in the model building process is selecting a parsimonious set of predictor variables from the large set of available independent variables (n=98). Including too many independent variables risks over-fitting, in turn reducing predictive power[20].
Three methods were used to select the independent variables:
1. Expert opinion: Three experienced nephrologists reviewed the potential set of independent variables and indicated whether the variable had clinical significance. An agreement of at least two experts was considered adequate to include a variable into the model.
2. Principal component analysis[21] reduces the dimensionality of a dataset by transforming it to a smaller number of principal components based on the correlations between variables. This set of components ideally retains the majority of the variance and so does not lose information but does so using fewer variables. We used the number of principal components that retained 90% of the original variance.
3. Elastic net trades off model fit and complexity to find a parsimonious model. It applies a penalty to avoid over-fitting that mixes the Ridge (L2) and Lasso (L1) penalties, and uses cross-validation to find the ideal trade-off between the two[22]. The L1/L2 mix and penalty strength which produced the lowest mean squared error during cross-validation were used to fit the elastic net model.
These individual variable selection methods were applied alone and also in all possible combinations, e.g., expert opinion followed by elastic net. Therefore, a total of seven variable selection methods were used to generate seven different sets of independent variables.
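The two statistical selection methods can be illustrated on synthetic data; this is a minimal sketch assuming scikit-learn (the software used by the study for these steps is not stated in the text):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                  # 20 candidate predictors
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# PCA: keep the smallest number of components whose cumulative
# explained variance is at least 90% of the original variance
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X)

# elastic net: cross-validate over the L1/L2 mix (l1_ratio) and the
# penalty strength; variables with zero coefficients are dropped
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
selected = np.flatnonzero(enet.coef_)           # surviving variables
```

Note the two methods reduce dimensionality differently: PCA keeps transformed components of all variables, whereas the elastic net keeps a subset of the original variables.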
Step 4: Model training
We used four approaches to model time to the primary event, i.e. the survival outcome.
Cox proportional hazards regression[14]. This semi-parametric model is widely used to explore the relationship between survival outcomes and independent variables. After modelling the selected independent variables, the number of variables was further reduced by retaining only those that were statistically significant (p<0.05). This made the model more parsimonious and also improved its predictive power.
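As a minimal illustration of how a Cox model is fitted, the sketch below maximises the log partial likelihood for a single covariate by Newton-Raphson on simulated data. This is a from-scratch toy, not the study's multivariable fit, and all names and the simulation setup are hypothetical:

```python
import numpy as np

def cox_fit_1d(time, event, x, n_iter=20):
    """Newton-Raphson fit of a one-covariate Cox model (Breslow ties).
    Maximises the log partial likelihood
        l(b) = sum_{i: event} [ b*x_i - log( sum_{j: t_j >= t_i} exp(b*x_j) ) ].
    """
    b = 0.0
    for _ in range(n_iter):
        grad = hess = 0.0
        for i in np.flatnonzero(event):
            xr = x[time >= time[i]]           # risk set at time t_i
            w = np.exp(b * xr)
            m1 = (w * xr).sum() / w.sum()     # weighted mean of x in risk set
            m2 = (w * xr**2).sum() / w.sum()
            grad += x[i] - m1                 # score contribution
            hess -= m2 - m1**2                # minus the weighted variance
        b -= grad / hess                      # Newton step
    return b

# simulate: binary covariate with true hazard ratio 2 (log HR ~ 0.693)
rng = np.random.default_rng(3)
n = 600
x = rng.integers(0, 2, size=n).astype(float)
t = rng.exponential(1.0 / np.exp(np.log(2.0) * x))
c = rng.exponential(2.0, size=n)              # independent censoring times
time, event = np.minimum(t, c), t <= c
b_hat = cox_fit_1d(time, event, x)            # should be near log(2)
```

The partial likelihood only involves the ordering of event times and the risk sets, which is what makes the model semi-parametric: no baseline hazard is estimated during fitting.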
Survival Tree[11]. A survival tree is a tree-like structure where leaves represent the outcome, i.e. graft failure (1) or no graft failure (0), and branches are splits on independent variables that influence the timing of the outcome. The complexity parameter was set to 0.00001 and the following two hyper-parameters were tuned until the optimal tree was created: the minimum number of samples that must exist in a node for a split to be attempted, and the number of competitor splits retained in the output.
Random survival forest (RSF)[12]. RSF is an ensemble method in which numerous unpruned survival trees are developed via bootstrap aggregation[23, 24]. The ‘variable importance’ mode was set to “permutation” and the splitting rule to “log-rank”. The hyper-parameters (the number of variables considered for splitting at each node, the number of trees and the minimum node size) were tuned to achieve the lowest out-of-bag prediction error. ‘Variable importance’, a variable selection algorithm widely used with RSF, was used to avoid over-fitting and to reduce the prediction error[25].
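The bootstrap aggregation underlying RSF, and the origin of the out-of-bag error used for tuning, can be illustrated without any survival machinery (a toy sketch, not the study's RSF implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_trees = 5000, 200

# each tree in the forest is grown on a bootstrap sample (n draws with
# replacement); patients never drawn for that tree are its "out-of-bag"
# cases, whose predictions give an honest estimate of prediction error
oob_fractions = []
for _ in range(n_trees):
    sample = rng.integers(0, n, size=n)      # bootstrap indices for one tree
    n_oob = n - np.unique(sample).size       # patients left out of this tree
    oob_fractions.append(n_oob / n)

mean_oob = float(np.mean(oob_fractions))
# theory: P(a given patient is out-of-bag) = (1 - 1/n)^n -> 1/e ~ 0.368
```

Because roughly a third of patients are out-of-bag for any given tree, every patient gets out-of-bag predictions from many trees, which is why the out-of-bag error can be used for hyper-parameter tuning without a separate hold-out set.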
Survival support vector machine[13]. This method uses hyperplanes to separate classes of observations, using a linear kernel function for linearly separable data or, for example, a polynomial kernel for non-linearly separable data[26, 27]. Based on model performance, all support vector machine models were fitted using a linear kernel function with a ‘regression’ type survival support vector machine model.
The seven sets of independent variables were used to train and validate the four predictive models giving 28 results: seven variable selection methods × four predictive models. The predicted outcome for each of the four models was an index on an interval scale, which we label the Kidney Transplant Risk Index.
Step 5: Evaluating the models
We evaluated the models using methods proposed by Royston and Altman[28]. Model performance was evaluated using two metrics: discrimination and calibration. An index with good discrimination assigns higher risk scores to higher risk patients and vice versa. Calibration measures prediction accuracy: it compares the predicted survival from the index with the survival observed in the data[29]. For our study objective, discrimination is more important than calibration, as our aim is to provide a guide to decision making that identifies relatively high and low risk patients[28]. Therefore the best model was chosen using the concordance index (C-index)[30], which evaluates the discriminative ability of a model. The C-index is defined as the fraction of comparable pairs of patients in which the patient with the longer survival time also has the lower predicted risk score. The C-index ranges from zero to one, with higher values indicating better discrimination and 0.5 indicating discrimination no better than chance.
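The C-index definition can be computed directly from its description; a minimal Python sketch of Harrell's C (counting tied risk scores as 0.5, an assumption beyond the text):

```python
import numpy as np

def concordance_index(time, event, risk):
    """Fraction of comparable pairs where the patient with the shorter
    survival time has the higher predicted risk (ties in risk count 0.5)."""
    num = den = 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored patient cannot be the shorter member of a pair
        for j in range(n):
            if time[i] < time[j]:        # comparable pair: i fails before j
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

# perfect discrimination: predicted risk reverses the ordering of time
t = np.array([5.0, 3.0, 8.0, 1.0])
e = np.array([1, 1, 1, 1], dtype=bool)
assert concordance_index(t, e, -t) == 1.0   # every pair correctly ordered
assert concordance_index(t, e, t) == 0.0    # every pair inverted
```

Only pairs in which the earlier time is an observed event are comparable, which is how the C-index accommodates right censoring.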