Initial Screening of Genes using GSEA and GSVA
We obtained the clinical features of patients with lung squamous cell carcinoma (LUSC), along with an expression data set of 56392 mRNAs, from the TCGA database. The KEGG gene sets carry expression signatures derived by aggregating multiple gene sets from the Molecular Signatures Database (MSigDB) to represent well-defined biological states or processes. GSEA was performed on these data to determine whether the gene sets showed statistically significant differences between normal tissue and LUSC tissue. GSVA estimates the variation of gene set enrichment across samples independently of any class label. These analyses yielded fifteen pathways (Fig. 1): DNA replication, Cell cycle, Homologous recombination, Mismatch repair, Proteasome, Base excision repair, Spliceosome, Aminoacyl-tRNA biosynthesis, Pyrimidine metabolism, Nucleotide excision repair, p53 signaling pathway, Basal transcription factors, RNA degradation, RNA polymerase and Oocyte meiosis. We selected the cell cycle pathway based on its number of genes and its normalized enrichment score (NES).
Identification of Cell Cycle-Related mRNAs Associated with Patient Survival
First, we applied univariate Cox regression analysis to the 125 genes for preliminary screening and obtained 20 genes with p values < 0.1. Multivariate Cox regression analysis was then used to further examine the relationship between the expression profiles of these 20 mRNAs and patient survival. As a result, 4 mRNAs (CDKN1A, CHEK2, E2F4 and RAD21) were identified as independent prognostic indicators. These mRNAs were classified into a risky type (CDKN1A, E2F4 and RAD21), with HR > 1 and shorter survival, and a protective type (CHEK2), with HR < 1 and longer survival (Table 2). We then calculated Pearson correlation coefficients among the 4 mRNAs listed in Table 2 and found correlations between E2F4 and CHEK2, between E2F4 and RAD21, and between CHEK2 and RAD21, each with R greater than 0.3 (Fig. 2).
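The pairwise correlation screen described above can be sketched as follows. This is a minimal illustration only: the expression values are invented for demonstration and are not TCGA data, and the helper names `pearson_r` and `pairs_above_cutoff` are hypothetical. Only the cut-off of R > 0.3 comes from the text.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def pairs_above_cutoff(expression, cutoff=0.3):
    """Return {(gene1, gene2): r} for every gene pair with r > cutoff."""
    result = {}
    for g1, g2 in combinations(expression, 2):
        r = pearson_r(expression[g1], expression[g2])
        if r > cutoff:
            result[(g1, g2)] = r
    return result


# Invented per-patient expression values (illustration only, not TCGA data).
expression = {
    "CDKN1A": [7.2, 6.1, 7.9, 6.4, 7.0, 6.2],
    "CHEK2": [3.9, 4.8, 4.1, 5.2, 4.9, 4.6],
    "E2F4": [5.1, 6.3, 5.8, 7.0, 6.1, 6.6],
    "RAD21": [8.0, 8.9, 8.2, 9.4, 8.8, 9.0],
}

for (g1, g2), r in pairs_above_cutoff(expression).items():
    print(f"{g1} vs {g2}: R = {r:.3f}")
```

In practice the input would be one expression vector per gene across all patients, and a significance test on R would normally accompany the cut-off.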
Construction of a Four-mRNA Signature to Predict Patient Outcome
A prognostic risk score formula was established as a linear combination of the expression levels weighted by the regression coefficients derived from the multivariate Cox regression analysis. Risk score = 0.2551 × expression of HMMR + 0.2160 × expression of B4GALT1 - 0.1570 × expression of SLC16A3 + 0.1238 × expression of ANGPTL4 + 0.2381 × expression of EXT1 + 0.1027 × expression of GPC1 + 0.1820 × expression of RBCK1 + 0.1874 × expression of SOD1 + 0.2226 × expression of AGRN. Each patient with LUSC had a single risk score. We calculated and ranked the scores and then classified the patients into high- and low-risk groups using the median value as the cut-off (Fig. 3A). The survival time (in days) of each patient is shown in Fig. 3B; patients with high risk scores showed higher mortality than those with low risk scores. A heatmap (Fig. 3C) displays the expression profiles of the four mRNAs. We then compared the prognostic value of the risk score with that of the four individual mRNAs. Fig. 4A shows the differential expression of the four genes between cancer tissue and adjacent normal tissue, Fig. 4B shows their expression at each stage, and Fig. 4C shows the survival curves for the risk score and for each of the four genes; the risk score had the most significant P value. With increasing risk score in patients with LUSC, the expression of the risky-type mRNAs (CDKN1A, E2F4 and RAD21) was clearly upregulated, whereas the expression of the protective-type mRNA (CHEK2) was downregulated.
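The scoring and grouping steps can be sketched as follows. The coefficients are copied from the formula as printed above. The patient identifiers and expression values are invented for illustration, the function names are hypothetical, and assigning a patient whose score equals the median to the low-risk group is an assumption, since the text does not say how ties are handled.

```python
from statistics import median

# Regression coefficients exactly as printed in the formula in the text.
COEFFICIENTS = {
    "HMMR": 0.2551,
    "B4GALT1": 0.2160,
    "SLC16A3": -0.1570,
    "ANGPTL4": 0.1238,
    "EXT1": 0.2381,
    "GPC1": 0.1027,
    "RBCK1": 0.1820,
    "SOD1": 0.1874,
    "AGRN": 0.2226,
}


def risk_score(expr, coefficients=COEFFICIENTS):
    """Linear combination of expression levels weighted by Cox coefficients."""
    return sum(coef * expr[gene] for gene, coef in coefficients.items())


def split_by_median(scores):
    """Label each patient 'high' or 'low' relative to the median score.

    Assumption: a score equal to the median is labelled 'low'.
    """
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Invented expression profiles for four hypothetical patients.
patients = {
    "P1": {gene: 1.0 for gene in COEFFICIENTS},
    "P2": {gene: 2.0 for gene in COEFFICIENTS},
    "P3": {gene: 0.5 for gene in COEFFICIENTS},
    "P4": {gene: 3.0 for gene in COEFFICIENTS},
}

scores = {pid: risk_score(expr) for pid, expr in patients.items()}
groups = split_by_median(scores)
for pid in sorted(scores, key=scores.get):
    print(pid, round(scores[pid], 4), groups[pid])
```

A negative coefficient, as for SLC16A3 here, lowers the score as expression rises, which is how a protective gene enters a signature of this kind.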
Generation of Risk Score from the Four mRNA Signatures as an Indicator of Prognosis
The prognostic value of the risk score was compared with the clinicopathological information by univariate and multivariate analyses. Samples with complete clinical data were used for analysis. The median age of the 504 patients with lung squamous cell carcinoma was 68 years; 373 were male and 131 were female. Among 391 patients, 79 (20.2%) had a positive tumor finding during the follow-up visit. Among 439 patients, 41 (9.3%) had residual tumors. Among 503 patients, 184 (36.6%) had lymph node metastasis, and among 497 patients, 86 (17.3%) had distant metastases. Among 154 patients, 15 (9.7%) received radiation therapy. In univariate analysis, the risk score, new event, tobacco smoking history and neoplasm cancer status were significantly associated with survival (p < 0.05; Table 3). In the subsequent multivariate analysis (Table 3), these four variables remained statistically significant (p < 0.05), indicating that they were independent prognostic indicators. In both univariate and multivariate analyses, the risk score had prominent prognostic value, with p < 0.05 (HR = 1.566, 95% confidence interval (CI) = 1.073-2.288). The clinical parameter most strongly predictive of patient survival was "neoplasm cancer status": patients with tumor had an approximately 4.871-fold higher risk of death than tumor-free patients. Judging by the P values, the risk score was a stronger predictor than the TNM classification. The 4-mRNA expression-based risk score was used to assign patients to a low-risk or high-risk group, with the median risk score as the cut-off. The area under the ROC curve was 0.661 (Fig. 5A), supporting the ability of the 4-mRNA signature to predict survival in LUSC.
We also generated ROC curves for the important clinical parameters (Fig. 5B-H) and found that the ROC curve of the risk score was clearly higher than those of the other clinical parameters, indicating that the risk score is superior to these clinical parameters as a prognostic indicator.
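Comparing predictors by the area under the ROC curve can be sketched as follows. This is an illustration under stated assumptions: the outcome is treated as a simple binary label (1 = died, 0 = alive), censoring times are ignored, and all numbers are invented. The text does not state which ROC method was used, so this is not necessarily the procedure behind Fig. 5.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    The AUC equals the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case; ties count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented data: 1 = died, 0 = alive (illustration only).
died = [0, 0, 1, 1, 0, 1]
predictors = {
    "risk score": [0.4, 0.9, 1.8, 1.1, 0.7, 1.5],
    "age": [61, 70, 66, 72, 68, 59],
}

for name, values in predictors.items():
    print(f"{name}: AUC = {roc_auc(values, died):.3f}")
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which gives a scale for reading values such as the one reported above.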
Validation of Four mRNA Markers for Survival Prediction by Kaplan-Meier Curve Analysis
Kaplan-Meier curves and the log-rank test showed a poor prognosis in patients with high risk scores (p < 0.0010) (Fig. 6A). Univariate Cox regression analysis of overall survival showed that several clinicopathological variables were effective in predicting survival in lung squamous cell carcinoma, including age, sex, T classification, N classification, M classification, new event, neoplasm cancer status, tobacco smoking history, radiation therapy and residual tumor. The K-M method was then used to confirm these results. According to the curves, age older than 68 years, tumor after treatment, T classification greater than T1, distant organ metastasis, stage greater than stage I, residual tumor, tobacco smoking history and a positive tumor finding during the follow-up visit were associated with poor prognosis (Fig. 6B, 6C, 6E, 6G, 6K, 6M, 6N, 6O). These results further confirmed the accuracy of our analysis, and a further stratified analysis was therefore performed for data mining. The results also indicate that the risk score is superior to the other clinical indicators.
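The Kaplan-Meier estimate underlying such curves can be sketched as follows. This is a minimal illustration of the product-limit estimator only; the log-rank test is not implemented here, the follow-up times and event flags are invented, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time of each patient (for example, in days)
    events : 1 if the patient died at that time, 0 if censored
    Returns a list of (time, survival_probability) at each death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        n_at_risk -= removed
    return curve


# Invented follow-up data for two hypothetical groups (illustration only).
high_risk = kaplan_meier([120, 200, 340, 400, 560], [1, 1, 0, 1, 1])
low_risk = kaplan_meier([300, 450, 700, 900, 1100], [0, 1, 0, 0, 1])

print("high risk:", [(t, round(s, 3)) for t, s in high_risk])
print("low risk: ", [(t, round(s, 3)) for t, s in low_risk])
```

Censored patients contribute to the number at risk up to their last follow-up but do not produce a step in the curve, which is why the estimate differs from a simple proportion of survivors.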
Validation of Differentially Expressed Genes between the High-Risk and Low-Risk Groups
We divided the patients with LUSC into two groups according to risk score and used GSEA and GSVA to identify the pathways enriched in the high-risk and low-risk groups. We then obtained the differentially expressed genes and subjected them to batch survival analysis to identify survival-related genes. We calculated Pearson correlation coefficients between these survival-related genes and the four signature genes and found correlations between CDKN1A and KLK5 and between CDKN1A and KLK7, each with R greater than 0.3 (Figure 7B).