DEGs identification between normal tissue and tumor tissue
By comparing the expression levels of 113 normal tissues and 1103 breast cancer tissues from TCGA data with 33 pyroptosis-related genes, 19 differentially expressed genes (P < 0.05) were identified. Among them, nine genes are down-regulated and the other ten genes are up-regulated. The heatmap in Fig. 1A shows the RNA expression levels of these genes. Through protein–protein interaction network (PPI) analysis, the interaction of these pyroptosis-related genes is further explored and the results are presented in Fig. 1B. We found that CASP1, PYCARD, NLRP3, NLRC4, CASP5, NLRP1, AIM2, IL1B, and IL18 are hub genes (Fig. 1C). Except for NLRC4 and CASP5, others are DEGs between normal tissue and breast tumor tissue. Figure 1D shows the network of all genes related to pyroptosis.
Tumor classification based on DEGs
To explore the relationship between the expression of 19 pyroptosis-related DEGs and breast cancer subtypes, we performed a consensus cluster analysis on 1085 breast cancer patients with complete follow-up information in the TCGA cohort. By increasing the clustering variable (k) from 2 to 9, we found that when k = 3, the intra-group correlation is highest and the inter-group correlation is low. This indicates that the 1085 breast cancer patients can be divided into three clusters based on 31 DEGs (Figure 2A). Gene expression profile and clinical characteristics, including Age, Stage, ER (estrogen receptor) status, PR (progesterone receptor) status, HER2 (human epidermal growth factor receptor 2) status, Subtype, age (≤60 years or> 60 years old), and Subtype are shown in the heat map, but there is a difference between the three clusters. There is almost no difference in clinical characteristics (Figure 2B). The overall survival time between the three clusters was also compared, but no significant difference was found (P = 0.09, Figure 2C).
Establishment of prognostic gene model based on TCGA cohort
A total of 1035 breast cancer samples were matched to the corresponding patients with complete survival information. Univariate Cox regression analysis was used for the preliminary screening of survival-related genes. The eight genes (CASP1, CASP4, CASP8, CASP9, ELANE, GPX4, IL18, and PYCARD) that meet the criteria P <0.2 are retained for further analysis. These eight genes have HRs <1 and are all protective genes (Figure 3A)). By performing Least Absolute Shrinkage and Selection Operator (LASSO) Cox regression analysis, the markers of four genes were constructed according to the optimal λ value (Figure 3B, C). The risk score was calculated as follows: risk score = (−0.173×CASP9 exp.) + (−0.122×ELANE exp.) +(−1.239×GPX4 exp.) +(−0.052×IL18 exp.). Based on the median risk score, 1034 patients were divided into low-risk and high-risk groups (Figure 3D). Principal component analysis (PCA) showed that patients with different risks were satisfactorily divided into two clusters (Figure 3E). Patients in the high-risk group had more deaths and shorter survival times than those in the low-risk group (Figure 3F). Analysis of the Kaplan-Meier survival curve showed that the OS time between the two groups was significantly different (P <0.001, Figure 3G). Through a time-dependent receiver operating characteristics (ROC) analysis to evaluate the sensitivity and specificity of the prognostic model, we found that the area under the ROC curve (AUC) was 0.620 in three years, 0.619 in five years, and 0.644 in seven years (Figure 3H).
External verification of the risk model
We selected 390 breast cancer patients from the GEO cohorts (GSE7390, N=198; GSE42568, N=104; GSE20713, N=88) as the validation set. Before further verification, gene expression data were normalized and batch effects were removed. According to the median risk score of the TCGA cohort, patients in the three GEO cohorts were divided into high- and low-risk groups. (GSE7390: 113 cases of low risk, 85 cases of high risk; GSE42568: 22 cases of low risk, 82 cases of high risk; GSE20713: 23 cases of low risk, 65 cases of high-risk, as shown in Figure 4A, B, C ). The PCA of three datasets (GSE7390, GSE42568, and GSE20713) shows satisfactory separation between the two subgroups, as illustrated in Figure 4D, E, F. In addition, we found that low-risk patients have longer survival times and lower mortality than high-risk patients as illustrated in Figure 4G, H, I. The Kaplan–Meier analysis of the three GEO cohorts indicates that there were significant differences in OS between the low-risk and high-risk groups (all p <0.01, Figure 4J, K, L). The ROC curve analysis of three GEO cohorts shows that our model has good predictive performance (Figure 4M, N, O ).
Independent prognostic analysis of risk models
To evaluate whether the risk assessment of our model can be used as an independent prognostic factor, we used univariate and multivariate COX regression analysis methods. Univariate Cox regression analysis showed that in the TCGA, GSE7390, and GSE42568 cohorts, a risk score can be used as an independent factor for low survival (HR = 4.359, 95% CI: 1.694–11.221, HR: 9.129, 95% CI: 2.379–35.029 and HR: 12.190, 95% CI: 1.938−76.667 (Figure 5A, C, E). Concurrently, in the above three cohorts, multivariate COX analysis showed that our risk score was an independent prognostic factor (HR = 3.804 95% CI: 1.520–9.518, HR: 7.725, 95% CI: 2.045–29.183 and HR: 14.685, 95% CI: 2.502–86.180, Figure 5B, D, F). In the other cohort (GSE20713), we found that risk score cannot be used as a prognostic factor. We also drew a clinical heatmap of the TCGA cohort using the R package ComplexHeatmap. We found that the two clinical indicators of T stage and HER2 status were different between the high- and low-risk groups (p<0.05). (Figure 5G)
GO and KEGG analysis based on the risk model
To further explore the differences in gene function and pathways between groups, we extracted DEGs through the R package limma and set the threshold as FDR <0.05 and |log2FC | ≥ 1. We identified 278 DEGs between the high- and low-risk groups in the TCGA cohort, of which 267 genes were down-regulated and 11 genes were up-regulated (data are listed in Table S3). GO enrichment analysis and KEGG pathway analysis were performed on these DEGs. The results indicate that DEGs are mainly related to immune response, immune response-activating cell surface receptor signaling pathway, and Cytokine−cytokine receptor interaction (Figure 6A, B).
Comparison of immunological activity between subgroups
Through functional analysis, we further compared the enrichment scores of 16 types of immune cells and the activity enrichment analysis of 13 immune-related pathways (ssGSEA) between the TCGA cohort and the high- and low-risk patients in the three GEO cohorts by using a single-sample genome. In the TCGA cohort, we found that among the 16 immune cell types, the low-risk group had a higher number of immune cells, especially ADCs, B cells, CD8+ T_cells, DCs, IDCs, Mast_cells, Neutrophils, NK_cells, pDCs, T_helper_cells, Tfh, Th1_cells, Th2_cells, and TILTreg. (Figure 7A, B) Compared with the high-risk group, the low-risk group demonstrated higher activity in 13 immune pathways. When evaluating the immunization status of the three GEO cohorts, similar conclusions were reached. (Figure 7C, D, E, F, G, H)
Correlation analysis between chemotherapeutic drugs and model
We sought to determine the association between risk and the efficacy of chemotherapy in breast cancer treatment in the TCGA cohort. We found that the low-risk group was associated with the lower half inhibitory concentration (IC50) of chemotherapy treatments, such as Paclitaxel, Doxorubicin, Gemcitabine, Cisplatin, and PD.0332991 (Palbociclib) , and the higher IC50 with Lapatinib. Thus, this model can be used as a predictive indicator of chemical drug sensitivity (Figure 8).