Clinicopathological features of breast cancer patients
1076 breast cancer patients with clinical information were collected from the TCGA database. By using a random number table, samples were divided into training (n = 514) and internal validation (n = 562) cohort. The detailed demographic and clinicopathological characteristics of the patients involved in the training, internal and entire datasets were shown in Table 1. The percentages of patients at clinical stages I, II, III and IV in the training cohort were 16.7%, 55.3%, 23.3% and 1.6%, respectively. The data was 16.4%, 58.2%, 22.6% and 2.1% in the internal validation cohort. Among them, 354 patients in the training cohort and 389 patients in the internal validation cohort had prognosis information. The follow-up periods encompassed the different pathological stages of breast cancer. The median follow-up periods were 592 days (range, 2-6434 days) and 631 days (range, 1-7125 days), respectively, in the training and internal validation sets. In the training cohort, the median age was 58 years (range, 27-90 years). And, in the internal validation cohort, the median age was 58 years (range, 26-90 years). During the follow-up, in training cohort, 48/354 (13.6%) patients died, whereas in internal validation cohort, 45/389 (11.6%) patients died.
Identification of DEGs
The exploration process of this study is shown in Figure 1. Firstly, 60483 genes were initially screened in the training cohort. Thresholds were set as fold change>3 and FDR<0.05 to investigate the differentially expressed genes between normal group and tumor group (Figure 2A). Volcano plots were used to show the differentially expressed genes (Figure 2B). 4805 genes had differential expressions between normal and tumor tissues, which consisted of 1269 upregulated genes and 3536 downregulated genes. 2805 DEGs were with protein coding functions.
Enrichment analyses of DEGs
To further understand the function of DEGs, the differentially expressed mRNAs were incorporated into functional annotation analyses. The molecular processes during the progression of breast cancer were investigated through GO enrichment analyses and KEGG pathway analyses.
The upregulated mRNAs associated with molecular function were enriched in the modulation of structural constituent of chromatin, chemokine activity and metallopeptidase activity (Figure 3A). In terms of cellular component, the upregulated genes were tightly corresponding to chromosome passenger complex, Ndc80 complex and condensed chromosome kinetochore (Figure 3B). Additionally, in the analysis on biological process, spindle assembly, chromosome segregation and negative regulation of enzyme activity were the most enriched terms mediated by upregulated genes (Figure 3C). From the prospective of biological pathways, the high expression genes were closely related to aurora B signaling, mitotic prometaphase and PLK1 signaling events (Figure 3D)
Meanwhile, downregulated genes related to molecular function were enriched in lipase activity, serotonin degradation and chemokine activity (Figure 4A). Through the investigation on cellular component, the most enriched terms were voltage-gated sodium channel complex, lipid particle and keratin filament (Figure 4B). In the exploration of biological process, downregulation genes in breast cancer were enriched in regulation of membrane potential, lipid storage and regulation of transport (Figure 4C). Besides, analysis on biological pathways proved that downregulated genes were associated with noradrenaline and adrenaline degradation, serotonin degradation and HSL-mediated triacylglycerol hydrolysis (Figure 4D).
Construction and validation of risk prognostic scoring system in the training set.
2805 protein coding genes were further selected using LASSO regression analysis, and cross validation was used to select the penalty parameters (Figure 5A). Five genes were identified for the construction of the prognostic model by a multivariate Cox regression analysis (Figure. 5B). The genes obtained in the above steps were inserted into the multivariate Cox regression model. The expression values of five independent prognostic factors and their correlation coefficients in a multivariate regression model were then used to construct prognostic signatures. Detailed information and the significance of survival prediction by the five genes are presented in Table 2.
Risk score= (expression status of EDN2× 0.014) C (expression status of CLEC3B× -0.196) + (expression status of SV2C× 0.227) + (expression status of WT1× 0.075) + (expression status of MUC2× 0.113).
Of five biomarkers, three genes (EDN2, WT1 and MUC2) were upregulated in breast cancer samples while CLEC3B and SV2C were decreased (Figure 5C-G). Moreover, the protein expression of the five genes were further explored in the Human Protein Atlas (HPA) and the representative pictures of them were shown in Figure 5H-I.
In the training cohort, the distributions of risk score of breast cancer patients and the relationships between risk score and survival time were visualized in Figure 6A and 6B. The patients’ mRNA expression levels of selected genes were shown in Figure 6C. Patients in the training cohort were then assigned to a high- or low-risk score group using the cut-off value (0.09) obtained with the “survival” and “survminer” package. 198 (56%) patients in the training cohort were categorized to the high-risk group (RS > 0.09) and 156 (44%) to the low-risk group (RS ≤ 0.09). High-risk patients also had markedly shorter OS (HR 1.88, 95% CI 1.07–3.31, p < 0.05) vs. low-risk patients in the training cohort (Figure 6D).
Validation of the five genes-model in the internal testing group and entire group.
To assess the stability and reliability of the five genes signature, the result was also tested in the internal validation cohort and entire cohort. In the internal validation cohort, according to the same median risk score that acquired from the training group, 389 breast cancer patients with follow-up information were divided into high- and low-risk group. Figure 7A and 7B showed the distributions of risk score of breast cancer patients and the relationships between risk score and survival time. Expressions of the five genes in the risk score formula in the testing group were provided in the Figure 7C. As for the survival analysis, using the “survival” and “survminer” package, the optimal cut-off value (0.14) was obtained to divide the patients in the internal validation cohort into high-risk group (160) and low-risk group (229). The relationship between the distribution of risk score and clinical information indicated that the higher patients ranking, predicted poorer overall survival (HR 1.78, 95% CI 0.96-3.31, p < 0.05) (Figure 7D).
In Figure 8A and 8B, patients in the entire group (n = 743) were divided into the high‐risk group and the low‐risk group in the same way by the median risk score. The risk score distributions and the survival status were exhibited. The expressions of five selected genes in 743 patients were also shown in Figure 8C. Calculated by the “survival” and “survminer” package, the optimal cut-off value was 0.09. Thus, patients in the entire cohort were then categorized to high-risk group (n=407) and low-risk group (n=336). The results of survival analysis were showed in the Kaplan-Meier plot (Figure 8D). With the extension of survival time, the survival rate of the high-risk group became lower, and had a poor prognosis effect (HR 1.72, 95% CI 1.15-2.59, p < 0.01).
Gain-of-function assay of selected genes
To evaluated the influence of these genes on breast cancer progression, MDA-MB-231 and MDA-MB-468 cell lines were used to assess the biological roles of the selected genes. EDN2, CLEC3B, SV2C and WT1 were overexpressed by transfection to conduct gain-of-function assay. The overexpression efficiency of selected genes was verified by qPCR (Figure 9A, B). We tested cell proliferative viability using the MTT assay in MDA-MB-231 and MDA-MB-468 cells. As shown in Figure 9C and 9D, overexpression of CLEC3B and WT1 could significantly promote the growth in both cell lines, while EDN2 and SV2C had no obvious influence on cell proliferation. The results were then further validated by the clone formation assay (Figure 9E, F). Overexpression of CLEC3B and WT1 could dramatically promote the formation of colonies in breast cancer cells.
Furthermore, cell migration assays were used to evaluate the regulation effects of selected genes on cell migration. As evidenced by scratch assay and transwell assay, the mobility of MDA-MB-231 and MDA-MB-468 cells overexpressed CLEC3B and WT1 were considerably increased compared with control group (Figure 9G-J). On the contrary, transfected with SV2C could significantly inhibited cell migration in both breast cancer cell lines, while the function of EDN2 was slight.
Influence of selected genes on epithelial-mesenchymal transition (EMT) signaling pathway in breast cancer
The EMT represented a biological process during which polarized epithelial cells lost cell identity and experienced various biochemical alterations that allowed it to assume mesenchymal phenotypes [26]. Normally observed during embryonic development, EMT could also be involved in various pathological conditions. Once hijacked by cancer cells, EMT often led to enhanced migration capability, acquisition of resistance to apoptosis, and increased cell proliferation [26-29]. Thus, we examined the role of selected genes in EMT signaling pathway in breast cancer cells. As shown in Figure. 9, gain-of-function of EDN2, CLEC3B and WT1 markedly increased the ZEB1 and β-catenin in MDA-MB-231. Besides, CLEC3B and WT1 could also enhance the expression of snail (Figure 10A). By contrast, SV2C seemed to play a key role as tumor suppressor in EMT signaling pathway. MDA-MB-231 cells transfected with SV2C showed low expression of EMT markers, such as ZEB1, vimentin, β-catenin and snail. In MDA-MB-468, EDN2, CLEC3B and WT1 were proved to be able to upregulate the protein level of β-catenin (Figure 10B). CLEC3B and WT1 transfection led to higher expression level of N-Cadherin. Moreover, a markedly increase of snail was also observed in the WT1 overexpressed MDA-MB-468 cells.