2.1 Data resource
From the Cancer Genome Atlas (TCGA, https://portal.Gdc.Cancer.gov) database, we downloaded the TCGA-head and neck squamous cell carcinoma (HNSC). Further, the sample of OSCC was selected and named TCGA-OSCC. The RNA-sequencing (RNA-seq), clinic characters, and overall survival (OS) profiles of the TCGA-OSCC cohort were downloaded. TCGA-OSCC contained 359 tumors and 32 paracancerous tissues of OSCC patients. GSE41613 (GPL570) included the OS profile of 97 OSCC patients, which was resourced from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/). Moreover, 9 LRGs were obtained from the Molecular Signatures Database (MSigDB, https://www.gsea-msigdb.org/gsea/msigdb), with ‘lipophagy’ as the keyword.
The genome-wide association study (GWAS) data, ieu-b-4961, was downloaded from the Integrative Epidemiology Unit (IEU) Open GWAS database (https://gwas.mrcieu.ac.uk/) as an outcome variable in two-sample MR. There were 7,723,107 single nucleotide polymorphisms (SNPs) between oral cavity cancer (OCC) patients (n = 357) and healthy controls (n = 372,016) in ieu-b-4961, and all participants were Europeans. Moreover, the expression quantitative trait loci (eQTL) data for differentially expressed LRGs (DE-LRGs) were also obtained from the IEU Open GWAS database as exposure variables.
2.2 Differential expression analysis
By the ‘DEseq2 (v. 3.4.1)’ package[11], the differentially expressed genes (DEGs) and differentially expressed lncRNAs (DE-LncRNAs) were screened between the tumors and paracancerous tissues of OSCC patients in TCGA-OSCC (|log2fold-change (FC)|>1, adj.P < 0.05). The volcano maps of DEGs and DE-LncRNAs were drawn via the ‘ggpubr (v. 0.6.0)’ package (https://CRAN.R-project.org/package=ggpubr), meanwhile, the heatmaps of DEGs and DE-LncRNAs were generated via the ‘ComplexHeatmap (v. 2.14.0)’ package[12] .
2.3 Weighted gene co-expression network analysis (WGCNA)
Based on the 9 LRGs, the LRGs score of each sample in TCGA-OSCC was calculated by the single sample gene set enrichment analysis (ssGSEA) algorithm of the ‘GSVA (v. 1.46.0)’ package[13]. The LRGs score difference between tumors and paracancerous tissues was compared by the Wilcoxon test (P < 0.05). In order to obtain the key module genes highly related to the LRGs score, the WGCNA was performed by the ‘WGCNA (v. 1.71)’ package [13]. Firstly, hierarchical clustering was used to detect and filter out outlier samples. Next, the optimal soft threshold was determined to make the co-expression network more compatible with the scale-free topology. Ultimately, a co-expression network was constructed that contained several modules based on the dynamic tree cutting (minModuleSize = 100). Pearson’s correlation between the modules and LRGs score was analyzed, and then the strongest correlated module with LRGs score was selected as a key module (|cor|>0.3, P < 0.05). Moreover, the key module genes were screened from the key module based on the module membership (MM) and gene significance (GS) (|MM|>0.4, |GS|>0.4).
2.4 Function enrichment analysis and protein-protein interaction (PPI) network
Utilizing the ‘VennDiagram (v. 1.7.3)’ package (https://CRAN.R-project.org/package=VennDiagram), the intersection of DEGs and key module genes was used to obtain the DE-LRGs. To further explore the function of DE-LRGs, the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were executed by the ‘clusterProfiler (v. 4.7.1.3)’ and ‘org.Hs.eg.db (v. 3.16.0)’ packages (P < 0.05) [14]. The results of GO and KEGG analyses were displayed via the ‘enrichplot (v. 1.18.0)’ package (https://yulab-smu.top/biomedical-knowledge-mining-book/). Furthermore, a PPI network of DE-LRGs was established to understand the interactions at protein levels based on the STRING database (https://string-db.org) (medium confidence > 0.7).
2.5 Two-sample MR analysis
The two-sample MR analysis was performed by the ‘TwoSampleMR (v. 0.5.8)’ package [15]. In two-sample MR analysis, the filtering criteria for instrument variables (IVs) were as follows: (1) IVs must be highly associated with exposure variables; (2) IVs must be not associated with confounding factors; (3) the IVs should only influence the outcome variable through the exposure variables. At that point, the IVs in this study were selected by the extract instruments (P < 5×10− 8, clump = TRUE, r2 = 0.001, kb = 50) and extract outcome data (proxies = TRUE, rsq = 0.8) functions, meanwhile, the weak IVs were filtered out (F < 10). Especially, the MR Egger test [16], weighted median [17], Inverse variance weighted (IVW) test [18], Simple mode [19], and weighted mode[20] were used for the two-sample MR analysis, among them, the IVW test was the main algorithm (P < 0.05). The odds ratio (OR) > 1 was a risk factor and OR < 1 was a protective factor. The scatter plots, forest plots, and funnel plots were applied to display the above results.
2.6 Sensitivity analysis and MR Steiger filtering
The sensitivity analysis was carried out to again test the above results by the ‘TwoSampleMR (v. 0.5.8)’ package [19], including heterogeneity, pleiotropy, and Leave-one-out (LOO) tests. More specifically, the heterogeneity test was performed by the MR heterogeneity function (IVW, P > 0.05). The pleiotropy test was proceeded to see if there were confounding factors by the mr_pleiotropy_test and the ‘MRPRESSO (v. 1.0)’ package (P > 0.05) [Verbanck M (2017). MRPRESSO: Performs the Mendelian Randomization Pleiotropy RESidual Sum and Outlier (MR-PRESSO) test. R package version 1.0.]. LOO analysis was used to check the effect of each SNP on outcomes using the MR LOO. What’s more, MR Steiger filtering was applied to verify the directionality of the above results. In the end, the exposure factors that passed all the tests were further studied as candidate genes.
2.7 Identification of prognostic genes
In TCGA-OSCC, a total of 359 tumor samples with OS were considered as a training dataset. Later, the signature genes related to prognosis were screened through the ‘survival (v. 0.4.9)’ package (P < 0.2) [21], and then the proportional hazards (PH) test of signature genes was performed (P > 0.05). Next, the Least Absolute Shrinkage and Selection Operator (LASSO) regression analysis was performed to further screen the prognostic genes using the ‘glmnet (v. 4.1.4)’ package [22].
2.8 Gene set enrichment analysis (GSEA)
To learn more about potential pathways of prognostic genes, the GSEA was executed by the ‘clusterProfiler (v. 4.2.2)’ package [23]. ‘C2: KEGG gene sets’ was downloaded from the MSigDB database as background gene sets. Spearman’s correlation coefficients between the prognostic genes and all other genes were calculated and ranked. Subsequently, the prognostic genes were enriched in the KEGG pathways (P < 0.05). In the end, the top 10 KEGG pathways of each prognostic gene were displayed by the ‘enrichplot (v. 1.18.0)’ package (https://yulab-smu.top/biomedical-knowledge-mining-book/).
2.9 Establishment of a prognostic risk model
Based on the prognostic genes, the risk coefficients of them were obtained by the multivariate Cox analysis. The formula for the risk score was as follows:
$$\text{R}\text{i}\text{s}\text{k}\text{s}\text{c}\text{o}\text{r}\text{e} = \sum _{\text{i} = 1}^{\text{n}}\text{c}\text{o}\text{e}\text{f}\left({\text{g}\text{e}\text{n}\text{e}}_{\text{i}}\right)\ast \text{e}\text{x}\text{p}\text{r}\left({\text{g}\text{e}\text{n}\text{e}}_{\text{i}}\right)$$
The ‘coef’ represented the risk coefficients of prognostic genes, and the ‘expr’ represented the expression of prognostic genes. The risk scores of each sample in the training dataset were calculated and ranked. Subsequently, the 359 tumor samples fell into high- (n = 179) and low-risk (n = 180) teams based on the median risk score (median risk score = 0.9661). The risk curves and survival status of the two risk teams were drawn based on the ‘survminer (v. 0.4.9)’ package (https://CRAN.R-project.org/package=survminer). Meanwhile, the Kaplan-Meier (K-M) curve was generated to compare the survival differences between high- and low-risk teams by the Wilcoxon test (P < 0.05). Eventually, the receiver operating characteristic (ROC) curve was applied to assess the forecast accuracy of the prognostic risk model.
2.10 Verification of a prognostic risk model
After the prognostic model was constructed, an external dataset GSE41613 was introduced to evaluate the prognostic model. Similarly, the 97 OSCC patients were divided into high- (n = 48) and low-risk (n = 49) teams based on the based on the median risk score (median risk score = 2.4196). In GSE41613, the risk curves, survival status, K-M curve, and ROC curves were generated by the same method as above.
2.11 Construction of nomogram model
Here, we introduced eight clinic characters to further explore their impact on prognosis in patients with OSCC, including age, race, gender, pathologic TN, stage, grade, radiation therapy, and examined lymph node (ELN). Based on the univariate Cox analysis, the features highly related to prognosis were selected from the clinic characters and risk score (P < 0.05). Afterwards, the PH tests of the above features were conducted (P > 0.05), and then the multivariate Cox analysis was used to screen independent prognostic factors (P < 0.05). Next, a nomogram was generated to predict the OS survival of OSCC patients using the ‘rms (v. 6.5-0)’ package (https://CRAN.R-project.org/package=rms). Meanwhile, the calibration curves, ROC curves, and DCA curves were used to evaluate the predictive accuracy of the model. Moreover, the risk score differences of different clinic subgroups were contrasted by the Kruskal-Wallis (P < 0.05). K-M curves were used to evaluate the survival probability of different clinic subgroups (Kruskal-Wallis, P < 0.05).
2.12 Tumor microenvironment
Firstly, we assessed the differences of the immune score, stromal score, ESTIMATE score, and tumor purity between the two risk teams in TCGA-OSCC using the ‘estimate (v. 1.0.13)’ package [24]. Subsequently, the infiltration abundances of 28 immune cells in the two risk teams of TCGA-OSCC were calculated based on the ssGSEA algorithm of the ‘GSVA (v. 1.46.0)’ package [25]. The infiltration differences between the two risk teams were contrasted by the Wilcoxon test (P < 0.05), and the differential immune cells were confirmed. Spearman’s correlation between the differentially immune cells and prognostic genes was analyzed (|cor|>0.3, P < 0.05). What’s more, a total of 48 immune checkpoints were collected from the published literature [26], and the expression differences of 47 immune checkpoints between the two risk teams contrasted by the Wilcoxon test (P < 0.05). Meanwhile, Spearman’s correlation analysis of differentially expressed immune checkpoints was performed (|cor|>0.3, P < 0.05).
2.13 Molecular regulatory network
Here, the miRNAs were predicted based on the prognostic genes from the miRDB website (https://mirdb.org/). Subsequently, based on these miRNAs, the lncRNAs were predicted from the starBase database (http://starbase.sysu.edu.cn/). The hub lncRNAs were obtained by overlapping the lncRNAs and DE-LncRNAs using the ‘VennDiagram (v. 1.7.3) package (https://CRAN.R-project.org/package=VennDiagram). Finally, based on the hub lncRNAs, miRNAs, and prognostic genes, a lncRNA-miRNA-mRNA network was constructed by the Cytoscape (v. 3.10.0) software [27]. In addition, transcription factors (TFs) were predicted on the basis of the prognostic genes by the NetworkAnalyst (https://www.networkanalyst.ca/). Likewise, a TF-mRNA network was generated.
2.14 Drug sensitivity analysis
In TCGA-OSCC, we assessed differences in response to chemotherapy drugs in the two risk groups. A total of 138 chemotherapy drugs were collected from the Genomics of Drug Sensitivity in Cancer (GDSC) database (https://www.cancerrxgene.org/), the biochemical half maximal inhibitory concentration (IC50) differences were compared between the two risk groups by the Wilcoxon test (P < 0.05). In the end, the results were displayed by the Hiplot platform (https://hiplot.com.cn/).
2.15 Statistical analysis
All statistical analyses were carried out by the Rstudio (v. 4.2.2) software. In this study, the networks were constructed by the Cytoscape (v. 3.10.0) software [27]. P < 0.05 was considered statistically significant in all conditions (two-tailed).