Identification of DEGs
Three datasets of GEO (GSE138080, GSE52904, GSE67522) comprised of 951, 875 and 1048 DEGs were identified between CC tissues and normal cervix (Supplementary Fig. 1). Among them, 183 DEGs were present in all three datasets (Fig. 1a).
Functional Enrichment of the DEGs
GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway enrichment analyses were applied to discover the functions of the 183 intersected DEGs (Fig. 1b-c). The results of GO analysis revealed that 183 DEGs were associated with DNA replication, cell cycle, chromosomal region, helicase activity from the categories of biological process, cellular component and molecular function, respectively. KEGG pathway analysis revealed that the DEGs participated in DNA replication, cell cycle, mismatch repair, prostate cancer, TNF signaling pathway, bladder cancer, microRNAs in cancer, NF-kappa B signaling pathway.
Identification of Survival-Related DEGs and Establishment of the Four-Gene Prognostic Signature
We randomly divided the TCGA-CESC dataset into training set and testing set. One hundred twenty-nine patients from the training set were included in subsequent survival analyses. By the use of univariate cox regression analysis, 8 DEGs were identified to be associated with OS (Fig. 2a). Patients were divided into high-expressed and low-expressed groups according to the median expression of these survival-related genes. The corresponding survival analysis for each gene is shown as Supplementary Fig. 2 Lasso-penalized Cox regression analysis was applied to further reduce the number of DEGs in the selected panel with best predictive performance based on glmnet package (Fig. 2b-c). After Lasso-penalized Cox regression analysis, a prognostic signature comprising four genes, including procollagen-lysine,2-oxoglutarate 5-dioxygenase 2 (PLOD2), spondin1 (SPON1), secreted phosphoprotein 1 (SPP1), ribonuclease H2 subunit A (RNASEH2A), was developed by multivariate Cox analysis. The risk score was calculated as follows:
[(0.62218) × Expression value of PLOD2] + [(0.24936) × Expression value of SPON1] + [(0.27333) × Expression value of SPP1] + [(− 0.88808) × Expression value of RNASEH2A].
Patients were divided into high-risk and low-risk groups according to median value of risk scores. Kaplan-Meier survival curves showed that patients in the low-risk group had a longer survival duration compared to the high-risk group (Fig. 3a). The 1-, 3- and 5-year survival rates were evaluated by risk scores and are displayed in Fig. 3b, with AUC values of 0.836, 0.806 and 0.823 respectively. Distribution of the risk scores, survival status and the mRNA expression heat map in the training set are displayed in Fig. 3c-e. These results demonstrated that the four-gene signature had both high sensitivity and specificity and can be used to predict the survival of patients with CC.
Validation of the performance of the four-Gene Signature
The testing set and entire TCGA-CESC dataset were used to validate the robustness of the four-gene signature respectively. Risk scores were calculated with the formula listed above for each patient. According to median value, patients were divided into high-risk and low-risk groups. The outcome of the low-risk group was significantly better than that of the high-risk group (Fig. 4a-b). In the testing set, the AUCs for 1-, 3-, and 5-year OS predictions were 0.589, 0.626, and 0.710 (Fig. 4c), while in the entire TCGA-CESC dataset, the AUCs for 1-, 3-, and 5- year OS predictions were 0.680, 0.699, and 0.755, respectively (Fig. 4d). The distribution of the risk scores, the associated survival data and the mRNA expression heat map are respectively displayed in Fig. 4e-j. These two datasets effectively demonstrate that the four-gene signature performs well in predicting OS in patients with CC.
Association between the signature and patients’ survival outcomes
The survival curve demonstrated that patients with high-risk were correlated with a trend toward worse OS outcomes in training, testing, and entire sets. Afterwards, patients were divided into different subgroups according to clinical characteristics for survival analysis, we found that the risk signature could predict the OS of subgroup of CC, including patients with G1-G2 grade, G3 grade, M0 stage, N1 stage, T1 stage, T2-T3 stage, FIGO stage I – II, FIGO stage III-IV (Fig. 5a-g). However, there were no correlation between risk scores and OS in N0 stage, N1 stage (Fig. 5h-i). We further used univariate and multivariate Cox regression analyses to evaluate prognostic significances of four-gene signature and various clinicopathologic characteristics. Univariate Cox regression analysis indicated that FIGO stage and risk scores were correlated with OS of CC patients (Fig. 5j). Subsequent multivariate Cox regression analysis showed that risk scores was independently associated with OS of CC patients (Fig. 5k). These results demonstrated that the model was an independent prognostic factor for CC.
Gene Set Enrichment Analysis (GSAE)
To investigate the underlying molecular mechanism of the prognostic signature, GSEA was performed between high-risk group and low-risk group in 259 patients of the entire set (Fig. 6a, b). In the high-risk group, the enriched hallmark gene sets were mainly focused on various processes associated with tumor progression (including epithelial mesenchymal transition, TNFA signaling via NFκB, hypoxia, apoptosis, inflammatory response). In the low-risk group, four biological processes signatures including E2F targets, oxidative phosphorylation, DNA repair and spermatogenesis were enriched.
Correlation between DEGs prognostic signature for CC and the infiltration of immune cells
Considering the interaction between tumor and host immune system influencing patient prognosis[14], we used ESTIMATE algorithm to analyze the differences in tumor purity, ESTIMATE scores, immune scores and stromal scores between high-risk and low-risk patients. As shown in Fig. 6c-e, we found that ESTIMATE scores and stromal score was higher in high-risk group. Subsequently, we analyzed the correlation between the prognostic signature and the infiltration of immune cell subtypes in CC using the data from Tumor Immune Estimation Resource (TIMER) database. As shown in Fig. 6f-k, the correlation values of B cells, CD4 + T cells, CD8 + T cells, dendritic, macrophages, neutrophils with risk score were − 0.087 (P = 0.162), − 0.04 (P = 0.518), − 0.041 (P = 0.510), 0.117 (P = 0.061), 0.120 (P = 0.054) and 0.132 (P = 0.034), respectively, suggesting that the infiltration of neutrophil was significantly positive correlated with the prognosis of CC.