This study demonstrates that ML can be used to develop highly accurate, sensitive, and specific models to predict RT-benefit in early-stage BC patients. We present a high-performance SVM model (93.41% accuracy, 91.09% sensitivity, and 78.95% specificity) that can predict RT-benefit in early-stage BC patients independent of subtype. Here, RT-benefit was defined as relapse-free status following surgery and RT. The accuracy of this model (93.41%) represents an improvement over that of the best previously reported model (80%) in predicting RT-benefit (18). This model used an SVM algorithm and expression values of a set of 977 genes, referred to as the Wilcoxon-Cox set, to predict RT-benefit. This study also presents a novel genomic feature selection approach that reduced the number of genes from genome-wide gene expression data by 96% using Wilcoxon RS followed by Cox PH. This feature selection method contrasts with previous studies that selected genes with known functions from the literature or in vitro experiments to build RT-benefit models (18–23). To our knowledge, this is the first study to apply ML algorithms to predicting RT-benefit with consideration of all genes in the human genome.
The preliminary challenge in model development was finding a publicly available dataset with complete clinical and gene expression data. The dataset also needed to be balanced in the outcome variable as unbalanced datasets can lead to a misleadingly high prediction accuracy. The dataset also needed to be sufficiently large as small datasets can lead to model overfitting (47) and lack of precision (48) for some ML algorithms. The METABRIC dataset was chosen because of its large cohort size (2,509 patients), balance in outcome (approx. 40% of patients had a recurrence), extensive longitudinal follow-up data (approx. 30 years) and availability of genome-wide gene expression data (24,368 genes).
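The effect of outcome imbalance on accuracy can be illustrated with a minimal sketch that computes the accuracy of a trivial classifier that always predicts the more common class. The function name and the 10% scenario are illustrative assumptions; only the approximately 40% recurrence rate is taken from the text above.

```python
def majority_baseline_accuracy(positive_fraction: float) -> float:
    """Accuracy of a classifier that always predicts the more common class."""
    if not 0.0 <= positive_fraction <= 1.0:
        raise ValueError("positive_fraction must be between 0 and 1")
    return max(positive_fraction, 1.0 - positive_fraction)


# Roughly 40% of patients with a recurrence, as in the cohort described above.
balanced_like = majority_baseline_accuracy(0.40)

# Hypothetical, heavily unbalanced cohort with 10% recurrence (assumption).
unbalanced = majority_baseline_accuracy(0.10)

print(f"baseline at 40% recurrence: {balanced_like:.2f}")
print(f"baseline at 10% recurrence: {unbalanced:.2f}")
```

Under these assumptions, a model trained on the unbalanced cohort would need to exceed a much higher accuracy before it could be said to have learned anything beyond the class frequencies.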
There was significant variability in the follow-up time for patients (range 0.21 – 29.25 years). This meant that some patients were right censored, i.e. the patient’s follow-up ended before relapse occurred. While Cox PH models account for right censoring, classification ML algorithms do not. Two strategies were used to address right censoring in the data: 1) in dataset one, patients who were lost to follow-up were assumed to have had no relapse; 2) in dataset two, the cohort was limited to patients who were followed for at least 15 years, or had a recurrence or death within 15 years, whichever came first. The assumption in the first method allowed all the data for the 463 patients to be retained for training. However, it had the disadvantage of treating patients with an unknown relapse status as relapse-free, which may or may not be true for each patient. The second method made no assumption about the patient’s relapse status as it limited the cohort to only patients who had complete follow-up. The disadvantage of this method is that it led to the loss of 46% of the data, which would have introduced some degree of exclusion bias into the dataset. The majority of the analysis here used dataset one, due to its larger sample size. Dataset two was used for comparison in order to determine whether the ML methodology used would apply across datasets with different limitations.
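The two strategies for handling right censoring can be sketched as follows. This is a minimal illustration and not the code used in the study: the `Patient` record layout, its field names, and the toy cohort are assumptions, and `time_years` is assumed to hold the time to the first event (relapse or death) or to last contact. Only the 15-year horizon is taken from the text.

```python
from dataclasses import dataclass

HORIZON_YEARS = 15.0  # follow-up horizon stated in the text


@dataclass
class Patient:
    # Hypothetical record layout (assumption, not the METABRIC schema).
    time_years: float  # years to first event (relapse/death) or to last contact
    relapsed: bool     # True if a relapse was recorded
    died: bool         # True if a death was recorded


def label_dataset_one(p: Patient) -> int:
    """Strategy 1: a patient without a recorded relapse is labelled relapse-free."""
    return 1 if p.relapsed else 0


def in_dataset_two(p: Patient) -> bool:
    """Strategy 2: keep patients with an event, or with at least 15 years of follow-up.

    Because time_years is the time to the first event or last contact, an event
    recorded before the horizon and follow-up reaching the horizon both qualify.
    """
    return p.relapsed or p.died or p.time_years >= HORIZON_YEARS


cohort = [
    Patient(20.0, relapsed=False, died=False),  # complete follow-up, no relapse
    Patient(3.5, relapsed=True, died=False),    # relapse within the horizon
    Patient(6.0, relapsed=False, died=False),   # censored early
    Patient(8.0, relapsed=False, died=True),    # death within the horizon
]

labels_one = [label_dataset_one(p) for p in cohort]
kept_in_two = [in_dataset_two(p) for p in cohort]
print(labels_one, kept_in_two)
```

In this toy cohort the early-censored patient is retained and labelled relapse-free under the first strategy but excluded under the second, which is the trade-off between retained sample size and label certainty described above.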
Controls on BC type, stage, chemotherapy, surgery and RT status were implemented to define the clinical population, which was early-stage BC patients who were treated with BCS and RT. Previous work modelling RT-benefit also controlled for BC subtype by building separate models for ER+ and ER- patients (30). The rationale was that ER+ and ER- tumours are distinct in their gene expression profiles, which are associated with the differences in outcomes observed by subtype (30). This study did not control for BC subtype for two reasons: first, such controls would create a more homogeneous patient group, eliminating key differences in expression profiles that the ML algorithm can utilize to make a classification; and second, a model that works irrespective of BC subtype would be easier to implement than two distinct models. For these same reasons, the cohort was not limited to patients who had hormone therapy (HT). HT is recommended for BC patients who are HR+; therefore, limiting the cohort to patients who either did or did not have HT would also limit the dataset to patients of either the HR+ or HR- subtype. The SVM model with the Wilcoxon-Cox set of 977 genes was shown to have high accuracy in both ER+ and ER- patients, demonstrating that the model is independent of subtype.
A key challenge in this study was to reduce the number of genes, as the use of too many features, such as the entire set of 24,368 genes, can result in an overfitted model for some ML algorithms (49). To achieve this, a novel filter feature selection approach was developed that used Wilcoxon RS followed by Cox PH, which reduced the number of genes by 96% to the Wilcoxon-Cox set of 977 genes. Wilcoxon RS was previously used to determine differentially expressed genes (DEGs) in BC datasets (29, 30). The application of Wilcoxon RS followed by Cox PH for genomic feature selection has not been reported. This novel approach reduced the dataset substantially by selecting a set of DEGs that also affected recurrence risk. This approach also performed better than selecting known genes of biological relevance for training: when 64 radiogenes were used for training, the resulting SVM model had a poor accuracy of 54.61%. Therefore, considering all genes in a hypothesis-independent manner appears to be a better approach than selecting known genes for training.
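The first stage of such a filter can be sketched in plain Python as a per-gene Wilcoxon rank-sum test. This is a minimal illustration under stated assumptions, not the study's implementation: the two groups are assumed to be relapsed and relapse-free patients, the p value uses the normal approximation without a tie correction, the function names and toy expression values are hypothetical, and the second stage (a Cox PH model fitted on the surviving genes) is not shown.

```python
import math
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_p_value(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided rank-sum p value (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def wilcoxon_filter(expression: Dict[str, List[float]],
                    relapsed: List[bool],
                    alpha: float = 0.05) -> List[str]:
    """Keep genes whose expression differs between the two outcome groups."""
    kept = []
    for gene, values in expression.items():
        pos = [v for v, r in zip(values, relapsed) if r]
        neg = [v for v, r in zip(values, relapsed) if not r]
        if rank_sum_p_value(pos, neg) < alpha:
            kept.append(gene)
    return kept


# Toy data (hypothetical): six relapsed patients followed by six relapse-free.
relapsed = [True] * 6 + [False] * 6
expression = {
    "GENE_A": [7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6],   # clearly shifted
    "GENE_B": [1, 3, 5, 7, 9, 11, 2, 4, 6, 8, 10, 12],   # interleaved
}
selected = wilcoxon_filter(expression, relapsed)
print(selected)
```

In practice a library routine such as `scipy.stats.ranksums` would normally replace the hand-written test; the plain-Python version is used here only to keep the sketch self-contained.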
A clear relationship between model accuracy and the number of genes selected using Wilcoxon RS and Cox PH was observed. When smaller gene sets of the top 1000, 500, 100 and 50 genes with the lowest p values were used for training, there was an overall decline in accuracy across all eight ML algorithms. Therefore, a larger number of genes was needed for higher accuracy. There was also a relationship between the significance threshold of 0.05 for gene selection and model accuracy. When gene sets with a p value greater than 0.05 were selected, there was also a decline in accuracy for both the SVM and NN algorithms. The lowest performance (~55%) was seen when non-significant genes with a p value greater than 0.05 were randomly selected. These results demonstrate the importance of considering the significance threshold in genomic feature selection using Wilcoxon RS and Cox PH.
The top four models presented use SVM or NN with either the Wilcoxon set of 1,596 genes or the Wilcoxon-Cox set of 977 genes. Given that SVM and NN are the most consistently used algorithms in BC prediction research, this result further corroborates the utility and consistent performance of these models (25). The consistently high accuracy of SVM suggests that the genomic features selected by Wilcoxon RS and Cox PH separate the classes sufficiently well in high-dimensional space to determine an optimal hyperplane with a large margin. It also suggests that this feature selection approach was able to reduce noise in the feature space and overlap between classes. SVM with a radial or polynomial kernel function was also investigated; however, this did not improve accuracy (data not shown). Therefore, a linear hyperplane was sufficient for this problem.
The lower performance of the majority of ML algorithms chosen (RF, DT, XGBoost, KNN, NB and LR) may be attributed to their underlying assumptions or their inability to model complex relationships. For example, NB and LR assume independence among predictors. This assumption would not hold with gene expression data where the expression pattern of one gene is often directly or indirectly dependent on the expression of another. LR is also generally not able to model complex relationships and is traditionally used to model a linearly separable classification problem. KNN is known to underperform with high dimensional data where all the vectors are almost equidistant making it difficult to determine clusters using distance metrics. DTs are also known to underperform as single trees are unstable and tend to overfit the data.
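The near-equidistance of points in high-dimensional space, noted above for KNN, can be illustrated with a small simulation. This is a sketch on uniformly random points, not on gene expression data; the function name, the number of points, and the dimensions compared are assumptions chosen for illustration.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from one random query point
    to n_points random points drawn uniformly from the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


low_dim = relative_contrast(dim=2)
high_dim = relative_contrast(dim=1000)
print(f"relative contrast in 2 dimensions:    {low_dim:.2f}")
print(f"relative contrast in 1000 dimensions: {high_dim:.2f}")
```

As the dimension grows, the nearest and farthest neighbours become almost equally distant from the query, so the relative contrast shrinks and a distance-based vote carries less information.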
Addition of the subtype variable to the Wilcoxon-Cox set did not improve accuracy of the SVM model and decreased accuracy of the NN model. For the SVM model, the subtype variable was not a support vector and therefore did not influence the position of the linear hyperplane separating those who did have a recurrence from those who did not. In summary, subtype was an unnecessary feature for the models presented.
It is significant that the ML models demonstrated better prediction accuracies for RT-treated patients compared to RT-untreated patients. The top four models (SVM977, SVM1596, NN977, NN1596) all performed poorly when applied to RT-untreated patients, with prediction accuracies of 50–60%. Notably, patients in the RT-untreated cohort had larger tumours, were more likely to have had a mastectomy, and were more likely to have no lymph nodes examined as positive. Therefore, biological differences between the tumours of patients in the RT-treated and untreated cohorts likely resulted in differences in gene expression profiles between the cohorts, which subsequently impacted the SVM model performance. A similar trend of poor accuracy (64.02%) was also observed when the SVM977 model was tested on data for chemotherapy-treated patients. Taken together, these results are promising in supporting the validity of the SVM977 model in predicting relapse in early-stage, surgery- and RT-treated, chemotherapy-untreated BC patients. Future work would involve further controlling for treatment factors such as the type of surgery, and controlling for the extent of disease progression by selecting patients with no lymph node metastasis in the training cohort.
Comparison of the four models with the highest accuracy (SVM977, SVM1596, NN977 and NN1596) revealed small differences in AUROC values (1-2%), and even smaller differences in computational time (<1 second) that would not be noticeable to the end-user. Therefore, a model was not chosen based on these characteristics. Sensitivity, the proportion of patients with a recurrence who were correctly identified, was considered more important than specificity, the proportion of patients without a recurrence who were correctly identified. That is, it is more important to correctly predict recurrence in RT-treated patients, as they can be given the opportunity for RT-intensification or sensitization as a clinical intervention to reduce the risk of recurrence. An RT boost has been shown to significantly reduce the risk of LRR but with an increased risk of moderate to severe fibrosis (50). Patients who are correctly identified as having no recurrence (true negatives) can continue with standard of care or have RT omission. Careful consideration of false positives is needed as these patients would be overtreated. Thus, the RT treatment course would require a risk-benefit discussion between the treating radiation oncologist and the patient. In summary, SVM977 was selected as the best model because it had the highest sensitivity among all models.
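For reference, the three performance measures compared above can be computed from a confusion matrix as in the following sketch, taking recurrence as the positive class as in the paragraph above. The function name and the counts are hypothetical and are not results from this study.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy from confusion-matrix counts.

    tp: recurrences predicted as recurrence
    fn: recurrences predicted as no recurrence
    tn: non-recurrences predicted as no recurrence
    fp: non-recurrences predicted as recurrence
    """
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=45, fn=5, tn=30, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this illustration the 20 false positives correspond to patients who would be considered for intensified treatment without needing it, which is the overtreatment concern raised above.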
Characterization of the Wilcoxon-Cox 977 gene set using GSEA revealed that many of these genes are involved in the cell cycle and cell division and operate in the nucleoplasm. This was expected, as it is well known that uncontrolled cell division is a hallmark of cancer (51). Further, previous work in BC cell lines found that a set of 51 genes whose expression levels were correlated with radiosensitivity was enriched for genes involved in cell cycle arrest (24). This is also consistent with research that has shown that RT-resistance mechanisms are involved in repopulation and redistribution of cells to the more radioresistant G1 and S phases of the cell cycle (10, 11). The 977 gene set was also enriched (6.2-fold) with radiogenes, which further demonstrated that the feature selection approach was able to select for known genes of biological relevance. These results suggest that it is likely the compounded effect of several hundred genes in highly interconnected networks involved in cell division and redistribution of cells in the cell cycle that drives recurrence after RT.
When dataset two was used to develop a model to predict RT-benefit, the SVM model had approximately 7-9% lower accuracy, 20% lower sensitivity, but 15-18% higher specificity than when dataset one was used for training. This change in performance is likely due to the smaller training dataset used (limited to patients who had complete 15-year follow-up), which is also reflected in the wider confidence interval. However, the overall performance profile of this model was good, demonstrating that the methodology used for ML model development was valid using both datasets. Wilcoxon RS followed by Cox PH selected a set of 1,044 genes in dataset two, of which 316 genes overlapped with the 977 gene set. Therefore, the genes selected for training using the proposed feature selection methodology are not fixed and depend on the patients in the cohort used. Given the genomic heterogeneity that has been shown to occur between and within BC subtypes (31, 52), and between different ancestral populations (53), it would be expected that the gene sets selected would vary with the cohort used.
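The degree of overlap between the two gene sets can be quantified from the counts reported above (977 genes, 1,044 genes, 316 shared). The sketch below is illustrative; the function name and the choice of the Jaccard index as a summary are assumptions and were not part of the reported analysis.

```python
from typing import Dict


def overlap_summary(size_a: int, size_b: int, shared: int) -> Dict[str, float]:
    """Fraction of each set that is shared, plus the Jaccard index."""
    if shared > min(size_a, size_b):
        raise ValueError("shared cannot exceed the size of either set")
    union = size_a + size_b - shared
    return {
        "fraction_of_a": shared / size_a,
        "fraction_of_b": shared / size_b,
        "jaccard": shared / union,
    }


# Counts taken from the text: dataset one (977), dataset two (1,044), shared (316).
summary = overlap_summary(size_a=977, size_b=1044, shared=316)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Roughly a third of each gene set is shared, which is consistent with the observation that the selected genes depend on the cohort used.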
This study had some limitations. First, the outcome used was relapse-free status. A more direct outcome for measuring RT-benefit would be ipsilateral LRR, which was unavailable in the METABRIC dataset. Therefore, this study could not differentiate patients who had a recurrence of the same primary BC from those who had a new primary. No information was available on the status of resection margins, which is a known factor affecting recurrence risk. Further, information on RT fields and dosages was not available to determine whether the RT given was a commonly used dosage. This study also could not limit the cohort to patients who had a lumpectomy, as the majority of patients had a mastectomy (~80%) while few had a lumpectomy (~20%). This is likely because the METABRIC cohort consists of patients who were diagnosed between 1977 and 2005, and since then there has been a shift toward breast conservation for early-stage BC patients (4). This study was also unable to test the SVM977 model in another BC cohort, as the BC datasets available for public use were not adequately clinically annotated or sufficiently large for ML training. Further, inconsistent gene naming conventions resulted in an inability to select the Wilcoxon-Cox set of 977 genes in other datasets. Future work would involve the application of the methodology used here to another BC cohort, preferably in the setting of a prospective randomized controlled trial as the gold standard (54).