We used Kaplan-Meier plots to estimate the survival distributions associated with the prognostic values of the genes from the combined genes approach (Fig. 18). All plotted genes have a P value (log-rank test) lower than 0.05, indicating a statistically significant difference in survival between the low- and high-expression groups. Three genes (SPDEF, BCAS, and TFF3) of the group of five genes (Fig. 18A) and two (TMEM210 and TENM4) of the remaining genes in the group of nine (Fig. 18B) have a P value lower than 0.01, suggesting a higher prognostic value.
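As an illustration of the survival analysis described above, the following minimal sketch computes a Kaplan-Meier survival curve and a two-group log-rank test using only the Python standard library. It is not the pipeline used in this study; the function names (`kaplan_meier`, `logrank_test`) and the toy follow-up times are hypothetical, and the split of patients into low- and high-expression groups is assumed to have been made beforehand.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    times  : follow-up durations
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi2, p_value), 1 degree of freedom."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy data (hypothetical): months of follow-up, all events observed.
low_expr_times = [1, 2, 3, 4, 5]
high_expr_times = [6, 7, 8, 9, 10]
chi2, p = logrank_test(low_expr_times, [1] * 5, high_expr_times, [1] * 5)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

In practice, a maintained survival analysis library would normally be preferred over a hand-written estimator; the sketch only makes explicit what the P values in Fig. 18 measure.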
The Oncotype DX Breast Recurrence Score Test (Genomic Health) analyzes, by quantitative TaqMan RT-PCR, the activity of 16 cancer-related genes and five control genes selected from a list of 250 candidate genes from a study of three independent cohorts totaling 447 patients (41). This test predicts the progression of early-stage, hormone-receptor-positive, and HER2-negative breast cancer and its response. Combined with other features of cancer, it can help the patient/doctor better evaluate whether or not to apply chemotherapy (42).
Prosigna® (PAM50) (Breast Cancer Gene Signature Assay), a prognostic gene expression assay for breast cancer, simultaneously measures the expression of 50 genes, including additional five housekeeping genes, using RT-qPCR but runs in the proprietary nCounter DX analysis system developed by NanoString (43). The genes were selected from a microarray and RT-qPCR data study of 189 prototype samples and tested on the prognosis of 761 patients and the prediction of pathological response to treatment in 133 patients (44). This panel identifies the intrinsic subtypes of tumors, particularly the differentiation between the luminal subtypes Luminal A and B, and the Risk of Recurrence (ROR). Wallden et al. (43) tested this platform in 2015 on 514 formalin-fixed, paraffin-embedded (FFPE) breast cancer patient samples after previously training the PAM50 algorithm on 304 FFPE patient samples from a well-annotated clinical cohort. They found that patients assigned to the Luminal A subtype had the best prognosis compared to the other three subtypes. The Prosigna ROR score model was verified to be significantly associated with prognosis (43).
EndoPredict (EP; Myriad Genetics, Cologne, Germany) is a panel of eight cancer-related and four control genes analyzed by RT-qPCR in FFPE breast cancer tissue samples (45, 46). The commercial test is available as EPclin, which considers clinical variables such as tumor stage and nodal status, generates a yes/no answer to recommend adjuvant chemotherapy, and provides the risk of late recurrences (EPclin low-risk < 3.3, EPclin high-risk ≥ 3.3) (47). EPclin was clinically validated in randomized Phase III trials of 1700 post-menopausal BC patients treated with endocrine therapy alone, demonstrating that EP is prognostic for early (years 1–5) and late (years 5–10) distant recurrences in node-negative and node-positive disease (48).
The Breast Cancer Index (BCI, BioTheranostics) uses seven genes, including the ratio between HOXB13 and IL17BR, to predict late recurrence after treatment (49, 50). Five genes related to cell cycle regulation, whose expression correlates with tumor grade and stage progression, were chosen from a study of two publicly available microarray data sets comprising 410 patients. The test can assist the treating physician in determining whether the patient should extend endocrine therapy in early-stage HR+ breast cancer. Ma et al. (2008) (49) conducted a validation study with two cohorts of 323 patients, using an RT-qPCR assay to analyze formalin-fixed paraffin-embedded clinical samples. They found that the combination of the expression analysis (molecular grade index or MGI) and the HOXB13:IL17BR ratio identifies a subgroup (∼30%) of early-stage estrogen receptor-positive breast cancer patients with very poor outcomes despite endocrine therapy. Sgroi et al. (2013) (50) found that a high HOXB13:IL17BR ratio predicted a higher risk of late recurrence in ER-positive patients who were disease-free after five years of tamoxifen when letrozole therapy was not extended, whereas the same high ratio predicted a decreased probability of late recurrence when extended endocrine therapy with letrozole was prescribed.
The expression pattern and ratio have been retrospectively validated using samples from the Arimidex, Tamoxifen, Alone or in Combination (ATAC) trial, the Adjuvant Tamoxifen-To Offer More? (aTTom) trial, and the Investigation of the Duration of Extended Letrozole (IDEAL) trial (51), and the test is now recommended for use in the 2021 National Comprehensive Cancer Network (NCCN) guideline (52).
MammaPrint (Agendia, The Netherlands) uses FFPE or fresh breast cancer tissue to carry out an RNA microarray assay, evaluating the expression pattern of 70 genes, which was initially validated by a cohort of 295 node-negative and node-positive breast cancer patients (47). Even though it can be used to guide clinical decisions as a prognostic tool in HR-positive, HER2-negative breast cancer with up to 3 positive nodes, it has proven not to be cost-effective and not very useful as a prospective tool to predict the benefits of chemotherapy in any of its risk groups.
To date, Prosigna, EndoPredict, and the Breast Cancer Index are recommended by several guidelines, including the NCCN guideline, as prognostic tools in HR-positive, HER2-negative breast cancer with up to 3 positive nodes.
Sestak et al. (2018) (53) carried out a within-patient comparison of the prognostic value of six multigene signatures in 774 post-menopausal women with ER-positive, ERBB2-negative breast cancer who received endocrine therapy for five years. They found that the signatures providing the most prognostic information were the PAM50-based Prosigna risk of recurrence, followed by the Breast Cancer Index and EndoPredict-EPclin, all of which provided more information than the Clinical Treatment Score, the Oncotype DX recurrence score, and the 4-marker immunohistochemical score. All six molecular tests provided substantially less information for the 183 patients with 1 to 3 positive nodes, but the BCI and EPclin provided more additional prognostic information than the other signatures (53).
Buus et al. (2021) (54) conducted another study comparing the Oncotype DX Recurrence Score, PAM50-based Prosigna risk of recurrence, EndoPredict, and Breast Cancer Index to estimate the risk of distant recurrence for patients receiving endocrine therapy. They studied 785 women with ER-positive/HER2-negative breast cancer who did not receive chemotherapy. They found moderate to strong correlations among the four molecular scores (ρ = 0.63–0.74), except for the Oncotype DX Recurrence Score versus both the Prosigna risk of recurrence (ρ = 0.32) and the Breast Cancer Index (ρ = 0.35). Most of the variation in EndoPredict and the Breast Cancer Index was accounted for by the proliferation module (50.0% and 54.3%, respectively) and much less by the estrogen module (20.2% and 2.7%, respectively) (54).
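To make the pairwise comparison of molecular scores concrete, the sketch below computes a rank correlation between two score vectors measured on the same patients. It assumes that ρ denotes Spearman's rank correlation, which the text above does not state explicitly; the function names and the toy scores are hypothetical and are not data from the cited study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy example (hypothetical scores for six patients on two assays).
score_assay_1 = [12.0, 30.5, 18.2, 44.0, 25.1, 9.7]
score_assay_2 = [35.0, 52.0, 41.0, 70.0, 38.0, 20.0]
print(round(spearman_rho(score_assay_1, score_assay_2), 3))
```

Because only the ranks enter the calculation, two assays reported on different scales can still be compared directly, which is convenient when the scores come from different commercial tests.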
Fewer studies focus on TNBC, presumably because the indication for adjuvant chemotherapy is less uncertain in most cases. Still, several gene expression studies have helped increase the understanding of this heterogeneous disease, which is not characterized by common features but rather by the absence of the three markers (Qian et al. 2021). For example, Lehmann et al. (2016) (55) categorized TNBC into four subtypes: basal-like 1, basal-like 2, luminal androgen receptor, and mesenchymal.
When comparing the genes in Oncotype DX, PAM50, EndoPredict, the Breast Cancer Index, and the top nine genes of the combined approach in this study (74 genes in total), only 13 were common among the different panels (Fig. 19). Moreover, none of the genes selected in this study were included in the commercial panels (Fig. 19). This suggests that the diagnosis and prognosis of TNBC are difficult due to the lack of consistent gene expression panels. However, our analysis provides a set of genes that could have great potential as biomarkers. All the genes from the four commercial panels are listed in the supplementary information (List of the genes of the four commercial panels).
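A panel comparison of this kind can be sketched with plain set operations. The example below is illustrative only: the panel contents are placeholder identifiers (GENE_A, GENE_B, ...), not the actual gene lists, and "common" is assumed here to mean present in at least two panels, which is one possible reading of the comparison in Fig. 19.

```python
from collections import Counter


def shared_genes(panels, min_panels=2):
    """Return {gene: number of panels containing it} for genes found in
    at least `min_panels` of the given panels."""
    counts = Counter(gene for genes in panels.values() for gene in set(genes))
    return {gene: n for gene, n in counts.items() if n >= min_panels}


def novel_genes(candidate_genes, panels):
    """Return the candidate genes that appear in none of the panels."""
    known = set().union(*panels.values()) if panels else set()
    return set(candidate_genes) - known


# Placeholder panels (hypothetical identifiers, not real panel contents).
panels = {
    "panel_1": {"GENE_A", "GENE_B", "GENE_C"},
    "panel_2": {"GENE_B", "GENE_D"},
    "panel_3": {"GENE_B", "GENE_C", "GENE_E"},
}
study_genes = {"GENE_X", "GENE_Y"}

print(sorted(shared_genes(panels).items()))
print(sorted(novel_genes(study_genes, panels)))
```

With real gene lists, gene symbols would need to be harmonized first (for example, resolving aliases to a single nomenclature), since the same gene can appear under different names in different panels.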
All gene selection approaches showed similar classification performance in the factorial designs. As a factor, the gene selection approach presented statistically significant differences between its levels in the initial analysis only. However, transmembrane and surfaceome genes encode proteins that are more likely to be located in cellular compartments accessible to capture antibodies or labeled molecules, which makes the application of diagnostic techniques and devices easier. Therefore, the genes selected by the transmembrane and surfaceome approaches could be useful for the development of cheaper diagnostic devices, which are needed for vulnerable populations and developing countries. In addition, the combination of the best genes from the three approaches showed higher performance than each approach alone.
The MLAs showed statistically significant differences compared to the control algorithm in all the factorial designs, indicating that MLAs could use patterns in the transcriptomic data to increase the classification performance. Most of the MLAs obtained an acceptable accuracy (> 70%) except in the analysis on patient sample datasets, in which the algorithms were trained on cell lines and tested on patient samples; this is expected considering the significant differences between these types of samples. Some MLAs worked better in specific conditions and datasets, but no clear pattern showed that one MLA was significantly better than the others. The classification strategy, in contrast, showed no significant differences between its levels in the initial analysis.
The number of genes presented significant differences between its levels in two of the five factorial designs in which it was analyzed (validation on independent cell line datasets and analysis on patient sample datasets). However, in most cases, even the smallest number of genes achieved an acceptable accuracy. This is relevant because it allows the number of genes to be reduced without compromising the classification performance. As a result, the panel could be reduced to the top five genes in the combined approach.
The analysis of the type of quantification showed that absolute expression counts had a similar performance to FPKM. Absolute counts are not always comparable between samples or datasets because differences in sequencing depth or library size can bias the sequenced data (56), so relative expression measures such as FPKM are generally preferred to eliminate possible biases. Nevertheless, the results presented in this study suggest that the classification performance of the MLAs was not affected by these biases. In addition, absolute counts have the advantage of requiring fewer analysis steps, so their implementation in diagnostic procedures could be easier.
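For reference, the relationship between absolute counts and FPKM can be written out in a few lines. The sketch below uses the standard definition, FPKM = fragments × 10^9 / (gene length in bp × total mapped fragments); the function name, gene identifiers, and numbers are hypothetical, and the total is taken here as the sum over the genes provided, which is a simplifying assumption.

```python
def fpkm(counts, lengths_bp):
    """Convert raw fragment counts to FPKM.

    counts     : {gene: number of fragments mapped to the gene}
    lengths_bp : {gene: gene (or transcript) length in base pairs}
    """
    total_fragments = sum(counts.values())
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total_fragments)
        for gene, count in counts.items()
    }


# Toy example (hypothetical genes): the same sample sequenced at two depths.
lengths = {"gene_1": 1000, "gene_2": 2000}
shallow = {"gene_1": 100, "gene_2": 300}
deep = {"gene_1": 200, "gene_2": 600}  # twice the sequencing depth

print(fpkm(shallow, lengths))
print(fpkm(deep, lengths))  # same FPKM values despite doubled raw counts
```

The two calls return identical values, which illustrates why FPKM removes the dependence on sequencing depth, whereas the raw counts differ by a factor of two.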
Although the performance of the combined genes approach is slightly inferior to that of the network analysis approach, the former is preferable in a scenario with limited resources because it requires much less computational power.
A search of the Web of Science and Scopus databases for research articles about sensors or biosensors using any of the nine genes of the combined approach showed that only CD44 and TFF3 have been reported. CD44 has been applied in cancer detection and immune analysis (57–59), while TFF3 has been applied in the context of inflammatory response, kidney disease, and cancer (60–62). These results indicate that there is potential for the development of biosensors using the methodologies in this study; however, most of the combined approach genes have not been studied from a biosensor perspective. For this reason, experimental studies are required to validate the utility of the remaining genes for biosensor implementation.
African, Hispanic, and Asian populations are underrepresented in the datasets analyzed in this study. The data availability and technical capabilities limited the number of analyzed patient samples, so testing in larger patient datasets is recommended. All the analyses were performed using public data, so the quality and reliability of the data depended on the policies implemented by the public repositories with transcriptomic data (GEO and TCGA). No experimental validations were performed in the present study, since the main goal here was to provide an efficient Machine Learning analytical method that can help identify new biomarkers.
Experimental and clinical validations are required to confirm that the selected biomarkers and the applied machine learning methods are adequate for early detection and diagnosis of breast cancer in medical practice. A study focused on detecting and quantifying the selected biomarkers in breast cancer patient samples using RT-PCR and proteomic techniques is recommended. A diagnostic clinical trial using the selected biomarkers can be performed if the results of that study are consistent. Moreover, the development of low-cost biosensors or diagnostic devices that do not compromise early breast cancer diagnostic accuracy can be considered, in order to reach populations of different genetic backgrounds and socioeconomic groups.