Identification of survival-related clinical features in BRCA
We first performed survival analysis of clinical features against overall survival (OS). Older age, a greater number of positive lymph nodes, higher cancer stage, higher clinical T, N and M stage, and post-menopausal status were high-risk prognostic factors for OS in the TCGA cohort (P < 0.05 for all cases, Table 1). Furthermore, the inverse associations between OS and age, number of positive lymph nodes, cancer stage, post-menopausal status, tumor size and radiotherapy were independently validated in the METABRIC cohort (P < 0.05 for all cases, supplementary Table 1).
Identification and validation of survival-related genes in BRCA
We first examined the relationship between gene expression and OS in the TCGA dataset. High expression of 1374 genes was associated with significantly prolonged OS, whereas high expression of 678 genes was associated with significantly reduced OS (P < 0.05 for all cases, log-rank test, Figure 1). After adjustment for clinical characteristics, multivariate Cox regression analysis confirmed 432 protective prognostic genes and 219 risk prognostic genes. The association between the expression of these 651 genes and OS was then analyzed in the METABRIC dataset (n = 1904), which validated 80 protective genes and 34 risk genes (P < 0.05 for all cases, log-rank test, Figure 1). Functional analysis with g:Profiler showed that the 80 protective genes were significantly enriched in the KEGG focal adhesion pathway, whereas the risk genes were significantly over-represented in GO terms such as nuclear division, organelle fission, DNA metabolic process and nucleic acid metabolic process (adjusted P value < 0.05 for all cases, supplementary Figure 1).
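As an illustration of the per-gene screening step described above, the following is a minimal sketch of a two-group log-rank test, assuming patients have already been divided into a high-expression and a low-expression group for a given gene. The function name `logrank_test` and its argument layout are hypothetical and are not taken from the original analysis, which may have relied on a dedicated statistical package.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value).

    times_*  : follow-up times for each patient in the group
    events_* : 1 if death was observed, 0 if the patient was censored
    """
    # Pool both groups, tagging each record with its group (0 = A, 1 = B).
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]

    obs_minus_exp = 0.0  # observed minus expected deaths in group A
    variance = 0.0       # hypergeometric variance accumulated over event times

    for t in sorted({ti for ti, e, _ in data if e}):
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk in A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk in B
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

In a screen of this kind, the test would be repeated for every gene and genes with P < 0.05 retained; how patients were dichotomized by expression is not stated in this section and is left to the caller here.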
Construction of a 16-gene signature and its prognostic value in BRCA
The LASSO model comprising 16 genes showed the highest AUC value and was deemed the best model for survival prediction (Figure 2A). We then established the 16-gene score formula and computed the risk score for each BRCA patient. Kaplan-Meier survival analysis and multivariate Cox regression analysis indicated that a high 16-gene score was indicative of worse OS in BRCA (P < 0.05 for all cases, OR: 3.47, 95% confidence interval: 2.08-5.78, supplementary Figure 2A). We also analyzed the association between the 16-gene score and disease-free survival (DFS) and progression-free survival (PFS) in the TCGA cohort, and found that a high 16-gene score was likewise significantly associated with shorter DFS and PFS (P value < 0.05 for all cases, supplementary Figure 3). For further verification, the 16-gene score was calculated in the METABRIC dataset, which confirmed the negative correlation between the 16-gene score and OS (Figure 2D, supplementary Figure 2B). Furthermore, the 16-gene score (AUC = 0.72, 0.71 and 0.73, respectively) outperformed cancer stage (AUC = 0.71, 0.69 and 0.66, respectively; supplementary Figure 4) in predicting 1-year, 3-year and 5-year survival in the TCGA cohort. These results were also validated in the METABRIC cohort (supplementary Figure 4), suggesting that the 16-gene score is superior to cancer stage for predicting the prognosis of BRCA patients.
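To make the score computation concrete, the sketch below assumes the common form of a LASSO-derived risk score, namely a sum of gene expression values weighted by the model coefficients, followed by a median split into high and low groups. Both the linear form and the median cutoff are assumptions, since this section does not state the formula or the cutoff. The gene names and coefficients are placeholders and are not the fitted values from the study.

```python
from statistics import median

# Placeholder coefficients for illustration only; NOT the fitted values
# of the 16-gene model reported in this study.
COEFFICIENTS = {"GENE_A": 0.40, "GENE_B": -0.25, "GENE_C": 0.10}

def risk_score(expression, coefficients=COEFFICIENTS):
    """Weighted sum of expression values: sum(coef_i * expr_i)."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())

def split_by_median(scores):
    """Label each patient 'high' or 'low' relative to the median score
    (assumed cutoff; patients exactly at the median are labelled 'low')."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}

# Hypothetical expression profiles for three patients.
patients = {
    "p1": {"GENE_A": 1.0, "GENE_B": 0.0, "GENE_C": 0.0},
    "p2": {"GENE_A": 0.0, "GENE_B": 1.0, "GENE_C": 0.0},
    "p3": {"GENE_A": 0.0, "GENE_B": 0.0, "GENE_C": 1.0},
}
scores = {pid: risk_score(expr) for pid, expr in patients.items()}
groups = split_by_median(scores)
```

Genes with positive coefficients raise the score (risk genes) and genes with negative coefficients lower it (protective genes), which is why a higher score corresponds to worse expected survival under this form.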
Correlations between the 16-gene score and clinical factors in BRCA
Linear regression analysis showed that the 16-gene score was significantly positively associated with age, HER2 status, menopause status, clinical stage, clinical T stage and clinical M stage, and negatively associated with PR status, ER status, hormone therapy and radiotherapy in the TCGA cohort (P < 0.05 for all cases, Figure 3A). The 16-gene score also exhibited a positive correlation with age, HER2 status, menopause status and clinical stage, and a negative correlation with PR status, ER status, hormone therapy and radiotherapy in the METABRIC cohort (P < 0.05 for all cases, Figure 3B). Next, we split BRCA patients into subgroups according to their clinical characteristics and conducted Kaplan-Meier survival analysis to assess the prognostic value of the 16-gene score within each clinical factor-specific subgroup. Overall, a high 16-gene score was significantly associated with worse OS within the same clinical subgroup of the TCGA cohort (P < 0.05 for all cases, log-rank test, supplementary Table 2). Similar findings were observed in the METABRIC cohort (supplementary Table 3), suggesting that the association between the 16-gene score and OS is independent of clinicopathological characteristics.
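For readers unfamiliar with the Kaplan-Meier estimator used in the subgroup analysis, a minimal sketch of the product-limit calculation follows. The function name `kaplan_meier` is hypothetical; in a subgroup analysis it would be applied separately to the high-score and low-score patients within each clinical subgroup.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time for each patient
    events : 1 if death was observed, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    survival = 1.0
    curve = []
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        # Multiply by the conditional probability of surviving past time t.
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Censored patients contribute to the at-risk count up to their last follow-up but never trigger a drop in the curve, which is what distinguishes this estimate from a simple proportion of survivors.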
Identification of signalling pathways associated with the 16-gene score
We performed GSEA to understand the biological functions related to the 16-gene score. Thirteen signalling pathways were significantly over-represented in the high 16-gene score group of the TCGA cohort. Cell cycle, RNA degradation, oocyte meiosis, progesterone-mediated oocyte maturation and DNA replication were the five most enriched pathways (Figure 4, q value < 0.25 for all cases, supplementary Table 4). In contrast, up-regulation of arachidonic acid metabolism pathway genes was significantly associated with a low 16-gene score in the TCGA cohort (Figure 4, q value < 0.25, supplementary Table 5). These results suggest that the aforementioned pathways are probably implicated in the association between the 16-gene score and OS in BRCA.
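The core of GSEA is a running-sum enrichment score computed over a ranked gene list. The sketch below illustrates that statistic only, assuming genes have been ranked by their association with the high versus low 16-gene score groups. It omits the permutation step that yields the q values reported above, and the function name and arguments are hypothetical rather than the interface of the GSEA software.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Weighted Kolmogorov-Smirnov-like running-sum statistic.

    ranked_genes : gene names ordered from most up- to most down-regulated
    scores       : dict mapping gene name to its ranking metric
    gene_set     : set of genes in the pathway being tested
    Returns the maximum deviation of the running sum from zero (signed).
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_total = sum(abs(scores[g]) ** weight
                    for g, hit in zip(ranked_genes, in_set) if hit)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")

    running, best = 0.0, 0.0
    for g, hit in zip(ranked_genes, in_set):
        if hit:
            running += abs(scores[g]) ** weight / hit_total  # step up on a hit
        else:
            running -= 1.0 / n_miss                          # step down on a miss
        if abs(running) > abs(best):
            best = running
    return best
```

A positive score indicates that the pathway genes cluster at the top of the ranking (enriched in the high-score group), and a negative score indicates clustering at the bottom (enriched in the low-score group).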
A nomogram combining the 16-gene signature and clinical variables predicts patients’ OS
In the TCGA and METABRIC cohorts, age, tumor stage, menopause status, number of positive lymph nodes and the 16-gene signature were significantly associated with OS. Based on these results, we established a 16-gene nomogram that incorporated the survival-related clinical factors and the 16-gene signature (Figure 5A). The nomogram predicted 1-year, 3-year and 5-year survival well for BRCA patients in the TCGA cohort. ROC analysis revealed that the 16-gene nomogram showed improved prediction accuracy for 1-year, 3-year and 5-year survival compared with the 16-gene score alone (AUC: 0.91, 0.79 and 0.77, respectively, Figure 5B). The improved prognosis prediction was also validated in the METABRIC cohort (AUC: 0.83, 0.77 and 0.76, respectively, Figure 5C), demonstrating the clinical value and validity of the 16-gene nomogram for OS evaluation of BRCA patients.
Assessment of diagnostic value
We used the online server cBioPortal to investigate genomic variants of the 16 genes in the TCGA datasets. DERL1, TNN, PXDNL, PCSK6 and KLRB1 were the five most frequently mutated genes, with mutation frequencies of 19%, 10%, 9%, 4% and 3%, respectively, in BRCA (supplementary Figure 5). A similar mutation distribution was observed in the METABRIC cohort (supplementary Figure 6). Comparing the expression levels of the 16 genes between 779 BRCA samples and 100 paired normal breast tissues showed that the expression of 7 genes, including C7orf63, C9orf103, IGJ, ZNF385B and TNN, was significantly lower in tumor tissues than in normal tissues. In contrast, 9 genes, including PXDNL, PCSK6, MORN3 and DERL1, were expressed at significantly higher levels in BRCA tissues (adjusted P < 0.05 for all cases, Student's t test, Figure 6A). ROC curve analysis further showed that MORN3, IGJ and DERL1 in particular were able to differentiate BRCA tissues from normal breast tissues with high accuracy (Figure 6B, adjusted P values < 0.05, AUC > 0.80 for all cases).
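The AUC of a single gene as a tumor-versus-normal classifier can be illustrated with the rank-based (Mann-Whitney) formulation, in which the AUC is the probability that a randomly chosen tumor sample has higher expression than a randomly chosen normal sample. The following is a minimal sketch under that formulation; the function name `auc` is hypothetical and the original ROC analysis may have used a dedicated package.

```python
def auc(tumor_values, normal_values):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (tumor, normal) pairs in which the tumor
    sample has the higher expression value; ties count as one half.
    """
    wins = 0.0
    for t in tumor_values:
        for n in normal_values:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(tumor_values) * len(normal_values))
```

For a gene that is lower in tumor tissue, this function returns a value below 0.5, and the discriminative ability corresponds to one minus that value.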