Patient population
All 226 ,128 and 461 patients diagnosed with LUAD were collected from the GEO (GSE31210, GSE50081) and TCGA database, respectively. A total of 897 m6A-related genes out of 9057 expressed genes were identified in GSE31210 dataset. From Table 1, the median age of the enrolled patients was 61 years. The ratio of male vs female was 1.15:1, with 191 live cases and 35 death cases. The longest survival was 10 years. Gene expression data was mainly distributed in stage I-II of LUAD.
Construction of the risk score modle the m6A-related genes and m6A RNA methylation regulators risk score
Performing the cox regression and ROC analysis, a total of 129 genes were discovered, which were significantly associated with OS and had a good ability to predict survival (P < 0.05, AUC>0.6, Table S2). Further, we screened out ten prognostic genes by RSFVH analysis based on importance scores (Fig. 1a-b). Then we brought the prognostic genes into the risk prediction model and got 210-1= 1023 possible signatures in the training dataset. ROC analyses were performed in all the 1023 signatures to find out the signature with the strongest predictive ability (Table S3). The final signature including five genes (DENND1A, KBTBD6, KIF4A, BMPER, YTHDC2) was screened out with the maximum AUC (AUC signature=0.762; Fig. 1c; Table 2). The selected risk model is as follows: Risk score = (1.664×expression value of DENND1A) + (-1.249×expression value of KBTBD6) + (1.736×expression value of KIF4A)+ (-1.721×expression value of BMPER) + (-2.015×expression value of YTHDC2)
The validation of performance in predicting overall survival
We used the risk model to calculate the risk score for each patient. The median risk score divided patients of the training dataset into either the high-risk (n = 113) or low-risk group (n = 113). The Kaplan–Meier analysis results showed patients in the low-risk group lived longer than patients in the high-risk group ( P<0.001; Fig. 2a). Then we tested the prognostic value of the gene signature in test dataset (n=128). Kaplan–Meier analysis found the survival of patients with high risk scores was lower than that of patients with low risk scores in test group (P=0.009; Fig. 2b). In independent data (TCGA, n=461), the survival of patients with high risk scores was lower than that of patients with low risk scores in TCGA-test group (P<0.001; Fig. 2c). The relationship of gene expression, risk score and survival information was showed in Fig.3. With the increase of risk scores, death toll raised both in the training set (Fig. 3a) and test set (Fig.3b-c).
We found that the 5-year survival rate in the training group (GSE31210) was 57.52% in the low-risk group and 33.63% in the high-risk group (Fig. 2a). The overall survival rate was 45.58%, indicating that this model feature could basically differentiate the data better. Survival was also significantly improved in two independent data validation groups (GSE50081 and TCGA). In GSE50081, the low risk group was 50% and the high risk group was 35.94% (Fig. 2b). The overall survival rate was 42.97% and the grouping label was also obvious. In TCGA, we selected sample data of survival time less than 5 years (a total of 461 cases), so we saw the 3 year survival rate, respectively. In low-risk group was 22.08% and in high-risk survival rate was 14.35%(Fig. 2c). The overall survival rate was 18.22%. Through the analysis, we found that the queue sample, the effect of the model had subsided, but the original caused the change of survival rate was not high.
By using timeROC in five-year survival circumstances, we found that the label had a very good prediction effect (Fig. 2d-f) . In the training group GSE31210 , the AUC value is 0.799 . In the validation group GSE50081 , AUC value is 0.619. In the TCGA, we chose timeROC analysis of 4.9 years (because of the selection of data was not more than 5 years of patient samples), and the AUC is 0.716.
The relationship between the signature and clinical characteristics
The correlations were further explored between the prognostic model and various clinical features. In the training and test datasets, by chi-square test, and the gene signature was related with Pathological_stage (Table 3). Then we further performed univariate and multivariable Cox regression analysis to test the predictive independence of the gene signature (Table 4). Multivariable Cox regression results verified that the gene signature was an independent predictive factor and could independently predict patients’ clinical outcome in training and test datasets (High- vs. Low-risk, GSE31210, HR =17.48, 95% CI 4.16-73.53, P < 0.001, n=226; GSE50081, HR = 1.86, 95% CI 1.05–3.30, P=0.03, n=128; TCGA, HR = 1.65, 95% CI 1.20–2.23, P=0.0019, n=461,Table 4).
Functional annotation of the survival-related m6A RNA methylation-related gene set
Univariate analysis was used in GSE31210 to explore the prognostic potential of the candidate m6A RNA methylation-related gene set. The results showed that 129 candidate genes, including m6A RNA methylation regulatory factor ELAVL1, METTL14, ZC3H13 and YTHDC2, were significantly associated with OS of LUAD.
Four m6A RNA methylation regulatory factors were indicated to predict favorable overall survival (ZC3H13: HR=0.44 ; 95% CI, 0.21 to 0.90. METTL14: HR= 0.38; 95% CI, 0.19 to 0.78. YTHDC2: HR=0.13; 95% CI 0.05 to 0.34, ELAVL1: HR=0.26; 95% CI, 0.12 to 0.57). We then used the bioinformatic tool STRING to analyze functional protein association networks between these 129 candidate genes. The results indicated that lysine methyltransferase 2A (KMT2A), notch receptor 1(NOTCH1), collagen type III alpha 1 chain collagen type III alpha 1 chain (COL3A1) were the hub genes (Fig.4).
All statistically enriched terms (Gene Ontology (GO) ) by Metascape, and accumulative hypergeometric p-values and enrichment factors were calculated and used for filtering. The significant terms were then hierarchically clustered into a tree, based on Kappa statistical similarities among their gene memberships.The results indicated that basement membrane, inner ear receptor cell differentiation, and inner ear receptor cell stereocilium organization were the most significantly enriched (Fig.4).
External validation using online database about genes in the signature
Consistent with our results, DENND1A and KIF4A were found to be significantly overexpressed, while BMPER were significantly underexpressed in LUAD in the Oncomine (Fig. 5a), which was almost the same in both TIMER database (Fig. 6) and GEPIA database (Figure S1). Interestingly, the aberrant expression of these five genes was frequently observed in various cancer and showed some tissue-dependent pattern. For example, DENND1A was overexpressed in esophageal cancer, head and neck cancer. However, DENND1A was underexpressed in brain and CNS cancer.
Survival analyses for each gene in the signature (DENND1A, KBTBD6, KIF4A,and BMPER) were performed in the cohort of GSE31210, GSE50081 and TCGA datasets (Fig.7) . DENND1A high expression patients group displayed remarkable shorter OS than DENND1A low expression patients group in GSE31210. While, KIF4A high expression patients group displayed remarkable shorter OS than KIF4A low expression patients group not only in GSE31210, but also in GSE50081 and TCGA dataset . Compared to the KBTBD6 low expression groups, KBTBD6 high expression patient group had more OS in GSE31210 (Fig. 7a, P < 0.05). Compared to the BMPER low groups,BMPER high patient group had more OS in GSE31210 (Fig. 7a, P < 0.05). Compared to YTHDC2 low expression groups, YTHDC2 high expression patients group had more OS in GSE31210 and TCGA data (Fig. 7a and c , P < 0.05).
We then reviewed the proteomic data and found YTHDC2 protein was reported significantly down-expressed in non-small cell lung cancer(27). KIF4A protein was reported significantly up-expressed in non-small cell lung cancer(28). The representative protein expression of KBTBD6, KIF4A, and YTHDC2 was explored in the Human Protein Profiles and shown in Fig. 5b. However, DENND1A and BMPER was not found on the website. BMPER possessed the most frequent genetic alterations (6%) among the five genes. Meanwhile, amplification mutation and deep deletion were the most common alterations among the five genes (Fig. 5c).
Taking together, aberrant expression of the five genes were further validated in LUAD, and genetic alteration might help explain the aberrant expression of these genes to some extent.