3.1. Download of the raw data
The flow chart of this study was provided in Figure 1. The transcriptome and genome data of TCGA-LUNG were downloaded from the TCGA database (Table 1), while the hypoxia factor hypoxia genes (hallmark_hypoxia) were downloaded from the online website (http://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp).
3.2 Screening of hypoxia candidate gene set 1 (cd-hypoxia-geneset1)
A total of 5238 differentially expressed genes (DEGs) were screened between normal and primary tumors. The corresponding volcano plots and heatmaps of DEGs were provided in Supplementary Figure 1A and 1B. KEGG pathway enrichment analysis performed on all DEGs showed that no hypoxia-related pathways were found. GO term enrichment analysis performed on all DEGs showed that two hypoxia-related pathways were found: GO:0001666 (response to hypoxia) and GO:0071456 (cellular response to hypoxia). The two pathways contained a total of 118 genes, named hypoxia candidate gene set 1 (cd-hypoxia-geneset1).
3.3 Unsupervised clustering analysis based on hypoxia-DE-genes
The DEGs and hallmark hypoxia gene lists between tumor and normal samples were intersected to obtain hypoxia-DE genes, including 75 genes. According to the expression levels of 75 hypoxia-DE genes, unsupervised clustering of tumor samples was performed, and then all tumor samples were divided into two clusters (Supplementary Figure 1C and Supplementary Figure 1D). KM survival analysis was performed on both cluster 1 and cluster 2 groups, and there was a significant difference in PFS between the two groups (P<0.0001, Supplementary Figure 1E).
3.4 Screening of hypoxia candidate gene set 2 (cd-hypoxia-geneset2)
DEG analysis was performed between the cluster 1 and cluster 2 subgroups, and 2960 DEGs were screened and named hypoxia candidate gene set 2 (cd-hypoxia-geneset2), (Supplementary Figure 1F-1L). Both the heatmap of cd-hypoxia-geneset2 (Supplementary Figure 1F) and the results of principal component analysis (Supplementary Figure 1G) indicated that there were obvious differences between cluster 1 and cluster 2.
3.5 Construction of WGCNA coexpression network
The hypoxia-DE-genes cd-hypoxia-geneset1 and cd-hypoxia-geneset2 were merged, and a total of 3195 genes were obtained. Using WGCNA, a coexpression network was constructed for these 3195 genes (Supplementary Figure 1H-1L).
3.6 Screening of hub genes related to prognosis
Since the turquoise module had a strong correlation with DFI and DSS, module membership and gene significance analyses were performed on the turquoise module and DFI (Supplementary Figure 2A). Subsequently, the connectivity within each gene module and the connectivity outside the module as well as the total connectivity were calculated. The genes in the turquoise module were selected, and |GS|≥0.08 and kwithin (connectivity within the module)≥5 were used as the screening thresholds, and a total of 73 genes were obtained (Supplementary Figure 2B). PPI analysis was performed on the genes in the turquoise module, and the genes with a degree of connection ≥ 5 in the PPI analysis were selected to intersect with the 73 genes screened by WGCNA. Then, 23 hub genes were obtained (SPRR2A, SPRR1A, PIK3CA, KRT6C, KRT6B, KRT17, KRT16, KRT14, JAG1, ITGA6, IRF6, GPC1, DVL3, DSG3, DSC3, DLX5, DLG1, CSTA, CLCA2, BMP7, ATP11B, ARRB1, ABCC5).
Univariate Cox regression analysis was performed on the 23 hub genes to screen genes whose expression level had a significant effect on PFS. Witha p value < 0.05 as the screening threshold, 22 genes were obtained (SPRR2A, SPRR1A, PIK3CA, KRT6C, KRT6B, KRT17, KRT16, KRT14, JAG1, IRF6, GPC1, DVL3, DSG3, DSC3, DLX5, DLG1, CSTA, CLCA2, BMP7, ATP11B, ARRB1, and ABCC5). The PFS KM curves of six representative genes, KRT16, CLCA2, ATP11B, CSTA, ABCC5 and DAPL1, were shown in Figure2A-2F.
3.7 Construction of the hub gene regulatory network
Based on the LncMAP database (http://biobigdata.hrbmu.edu.cn/LncMAP/index.jsp), the TF-gene pairs interacting in lung cancer were screened. According to the lncRNA-gene-TF interaction relationship in lung cancer, gene-lncRNA pairs were screened (both in the LUAD LUSC cohort and the probability both in the discovery set and the probability in the validation set>0.3). Then, the regulatory network of the hub gene and its related TFs and lncRNAs was further constructed (Supplementary Figure 2C).
3.8 Construction of the hypoxia-related risk prognostic model (hypoxia-risk score)
The 22 genes screened by univariate Cox analysis were subjected to LASSO regression analysis to remove redundant factors, and 15 genes were screened, (Supplementary Figure 3A, 3B and 3D). Multivariate Cox regression survival analysis was performed on the 15 genes screened above, and 5 genes were screened (Supplementary Figure 3C). Then, a risk score formula affecting PFS was constructed:
Hypoxia-risk Score = 0.07*KRT16- 0.05*CLCA2+ 0.293*ATP11B- 0.114*CSTA- 0.173*ABCC5
3.9 Analysis of prognostic efficacy of the hypoxia-risk score model
Hypoxia-risk scores were calculated for each sample from the TCGA-LUNG cohort, and they were divided into high-risk and low-risk groups according to the median. KM analysis was performed, and the high-risk group had a shorter PFS time than the low-risk group (Figure 3A), which was consistent with the results of the distribution plot of the hypoxia-risk score (Figure 3B) and the PFS distribution plot (Figure 3C). The expression heat plot showed the low expression of hypoxia-risk score-related 5 genes in the high-risk group (Figure 3D). The predictive prognostic ROC plot illustrated a good prognostic ability of the hypoxia-risk score model (AUC for 1-year survival: 0.615; AUC for 3-year survival: 0.642; AUC for 5-year survival: 0.64) (Figure 3E).
3.10 External data validation of the prognostic efficacy of the hypoxia-risk score
model
Dataset GSE74777 was used for external data validation. The hypoxia-risk score of each sample was calculated and divided into high-risk and low-risk groups according to the median. Similar to the TCGA-LUNG cohort, the KM analysis, distribution of the hypoxia-risk score and PFS distribution plot also showed that patients in the high-risk groups had shorter PFS than patients in the low-risk groups (Figure 3F-3H). Meanwhile, the expression heat plot of hypoxia-risk score-related 5 genes in the GSE74777 dataset showed a similar trend to that in the TCGA-LUNG cohort (Figure 3I).
3.11 Stability assessment of the hypoxia-risk score model
In the TCGA-LUNG cohort, PFS survival analysis was performed on LUAD samples, LUSC samples, male samples, female samples, age ≤65 samples, and age >65 samples. These results showed that in the above subgroups, the high-risk group of the hypoxia-risk score model indicated a worse prognosis than the low-risk group (Figure 4A-4F), which demonstrated the good prognostic efficacy of the hypoxia-risk score model.
3.12 Pancancer analysis of the hypoxia-risk score model
Based on the TCGA-CESC, TCGA-PAAD, TCGA-THYM and TCGA-UCEC cohorts, hypoxia-risk scores were calculated for each sample, and they were divided into high-risk and low-risk groups according to the median. The KM analysis showed that patients in the high-risk groups had shorter PFS than patients in the low-risk groups (Figure 4G-4J).
3.13 Comparison of predictive efficacy with existing hypoxia models
Based on the reported 26-gene hypoxia model, the hypoxia score for each sample in the TCGA-LUNG cohort was calculated, and a predictive prognostic ROC curve was drawn (AUC for 1-year survival: 0.555; AUC for 3-year survival: 0.489; AUC for 5-year survival: 0.458, Figure 4K). Compared with the reported 26-gene hypoxia model, the 5-gene hypoxia-risk score model (AUC for 1-year survival: 0.615; AUC for 3-year survival: 0.642; AUC for 5-year survival: 0.64, Figure 3E) trended toward good prognostic accuracy for lung cancer patients.
3.14 Analysis of the clinical characteristics of the high- and low-risk hypoxia risk scores
groups
In the TCGA-LUNG cohort, the clinical characteristics (age, pathological type, sex, clinical stage) of the hypoxia-risk score in the high- and low-risk groups were analyzed (Figure 5A-5D). Notably, patients with lung adenocarcinoma and high clinical staging had relatively higher hypoxia-risk scores.
3.15 Correlation of hypoxia-risk score with other scores
In the TCGA-LUNG cohort, the hypoxia risk score was positively correlated with mutation load and neoantigen load but negatively correlated with the stemness index, chromosomal instability, and homologous recombination deficiency (Figure 5E-5J).
3.16 Immune infiltration analysis of hypoxia-risk score high- and low-risk groups
Differences in the infiltrating immune populations were observed between the hypoxia-risk score high- and low-risk groups, with a higher frequency of activated NK cells, resting memory CD4+ T cells, and Tregs present in the hypoxia-risk score high group and a higher frequency of resting NK cells present in the hypoxia-risk score low group (Figure 5M).
3.17 Analysis of immune score, stroma score, and tumor purity corresponding to
The hypoxia-risk score high- and low-risk groups
In the TCGA-LUNG cohort, the high-risk samples showed higher immune scores, matrix scores, and ESTIMATE scores than the low-risk samples (Figure 5K, 5L and 5N).
3.18 Somatic mutation analysis of hypoxia-risk score high- and low-risk groups
Based on the TCGA-LUNG cohort, somatic mutation analyses of hypoxia-risk score high- and low-risk groups showed that there were 8 genes with higher mutation frequencies in common between the high- and low-risk groups, including TP53, TIN, CSMD3, MUC16, RYR2, LRP1B, USH2A, and ZFHX4 (Supplementary Figure 4A and 4B). In addition, the samples in the low-risk group had specific gene mutations, such as KMT2D, while the samples in the high-risk group had specific gene mutations, such as KRAS (Supplementary Figure 4A and 4B).
3.19 Copy number variation analysis of hypoxia-risk score high and low risk
groups
The copy number variation analysis of the hypoxia-risk score high- and low-risk groups showed that the most frequent CNVs were 12 gain and 16 loss in both the low-risk group and the high-risk group (Supplementary Figure 4C-4E).
3.20 Immunotherapy efficacy analysis of high- and low-risk hypoxia-risk scores
groups
The GSE135222 dataset with immunotherapy data was used. The KM analysis of PFS indicated no significant differences between the high-risk and low-risk groups (Figure 5O). Immunotherapy efficacy analysis indicated that 36% of patients in the high-risk group benefited from immunotherapy, while 30% of patients in the low-risk group benefited from immunotherapy (Figure 5P). Although patients who benefited from immunotherapy showed a slightly higher hypoxia-risk score, there was no significant difference in hypoxia-risk score compared with those who did not benefit from immunotherapy (Figure 5Q).
3.21 Chemotherapy efficacy analysis of high- and low-risk hypoxia-risk scores
groups
Furthermore, chemotherapy efficacy analysis indicated that 74% of lung cancer patients in the low-risk group achieved a CR or PR to chemotherapy, compared with 65% in the high-risk group (Supplementary Figure 5A). Accordingly, patients who achieved CR to chemotherapy had a lower hypoxia-risk score than patients with PD after chemotherapy (Supplementary Figure 5B). Moreover, the hypoxia-risk score had an ROC-AUC of 0.621 in terms of predicting chemotherapy response (Supplementary Figure 5C).
3.22 Chemotherapy resistance assessment in the hypoxia-risk score high and low
risk groups
The IC50 values of four different chemotherapeutic drugs (cisplatin, docetaxel, gemcitabine and paclitaxel) involved in each sample were calculated, and the results showed that a lower IC50 value corresponded to a lower hypoxia-risk score (Supplementary Figure 5D-5G).
3.23 Independent prognostic efficacy assessment of the hypoxia-risk score
Univariate Cox analysis was performed for clinical characteristics and hypoxia-risk scores and indicated that histological type, clinical stage and hypoxia-risk score could provide prognostic prediction of PFS in the TCGA-LUNG cohort (Supplementary Figure 6A). Then, multivariable analysis confirmed the independent predictive value of the hypoxia-risk score (HR, 2.6; 95% CI, 1.7 to 4.1; P <0.001) (Supplementary Figure 6B).
3.24 Construction of the nomogram
A nomogram prognostic model of clinical information and hypoxia-risk score versus PFS were constructed to predict patient outcomes in lung cancer (Supplementary Figure 6C-6F).
3.25 Overall survival analyses of KRT16 protein expression in patients with NSCLC
In total, samples from 32 NSCLC patients were evaluable for KRT16 protein expression using immunohistochemistry (IHC). KRT16 staining was mainly localized in the cytoplasm, and the staining intensity was scored from 0 to 3+, corresponding to negative (0), weak (1+), moderate (2+) or strong (3+) staining (Figure 6A-6D). The proportion of positive cells with any intensity was estimated as the percentage. The H-score, which involved multiplying the percentages of stained cells by their staining intensity, was defined as the KRT16 IHC score (ranging from 0 to 300) in this study. A KRT16 IHC score of <100 was set as low KRT16 protein expression, and a KRT16 IHC score of ≥100 was set as high KRT16 protein expression. The clinicopathological characteristics of KRT16 protein expression in NSCLC patients were shown in Table 2. Twenty-three patients (71.9%) had tumors with low KRT16 protein expression, and nine patients (28.1%) had tumors with high KRT16 protein expression. Normal alveolar epithelium also had certain KRT16 protein expression, but they had lower KRT16 protein expression than tumor cells (Figure 6E), and the paired plots also showed that KRT16 protein was highly expressed in NSCLC compared to paired normal tissues (Figure 6F). The KRT16 protein high-expression patients had poorer survival than KRT16 protein low-expression patients (HR (95% CI) =3.05(0.60–15.44), log-rank p=0.041) (Figure 6G). Together, these results validated the prognostic value of KRT16 in NSCLC, that is, high KRT16 protein expression served as a poor prognostic factor for NSCLC patients.