Identification of DE-OSRlncRNA in THCA patients
A total of 296 DE-OSRGs (163 up-regulated and 133 down-regulated) were identified from TCGA-THCA, while a total of 51 DE-OSRGs (17 up-regulated and 34 down-regulated) were identified in GSE53157 and only 21 DE-OSRGs (8 up-regulated and 13 down-regulated) were identified in GSE27155. The DE-OSRGs in the 3 datasets The volcano and heat maps are shown in Supplementary Figure 1. To obtain more accurate DE-OSRGs, the DE-OSRGs in the 3 datasets were intersected to obtain 7 key DE-OSRGs (Figure 1A). 189 DE-OSRlncRNA were identified (Figure 1B) (Pearson correlation coefficient > 0.6, p < 0.01).
Screening of characteristic genes
In the training set, 41 DE-OSRlncRNA were obtained using univariate Cox regression (Figure 1C). To prevent over-fitting of prognostic features, 4 prognostic DE-OSRlncRNA were obtained using LASSO regression analysis (Figure 1D). Finally, multivariate Cox was used to identify the most robust features, and 2 lncRNA (SAP30-DT, DPP4-DT) were identified as characteristic genes (Figure 1E) and used for prognostic risk model construction.
Construction and validation of the prognostic risk model
The formula for calculating the risk score: Risk Score =exp(1.8*SAP30-DT+(-0.76)*DPP4-DT). The risk score was calculated for each THCA patient in the training set and THCA was divided into high-risk (n=171) and low-risk groups (n=171) using a cut-off=0.87 value. The distribution of risk scores, patient survival status, and expression levels of the 2 characteristic genes between high and low-risk groups are shown in Figure 2A. Kaplan-Meier (K-M) results indicated that high-risk patients had a worse prognosis (Figure 2A). the AUC values of the ROC curves at 1, 3, and 5 years were greater than 0.6, indicating the feasibility of the model (Figure 2A). To further validate the stability of the model, we used the test set and the entire TCGA-THCA cohort for validation (Figure 2B and Figure 2C).
Differences of risk scores in clinicopathological characteristics
As shown in Figure 3A, risk scores differed significantly in Age, T, zone N, and Location. the SAP30-DT correlation was higher in the high-risk group and the opposite was true for the DPP4-DT (Figure 3B). Survival analysis was performed between the high- and low-risk groups according to clinical characteristics, and we found that OS was lower in the high-risk group than in the low-risk group (Figure 4)
Construction of the nomogram model
A univariate Cox analysis was performed combining the clinical characteristics and risk scores of the entire TCGA-THCA cohort (Figure 5A). The results showed that RiskScore, Stage, Age, T, and M were associated with patient recurrence. In contrast, the multifactorial Cox analysis showed that only Risk Score and Age were independent prognostic factors (Figure 5B). Constructing a nomogram model based on the independent prognostic factors (Figure 5C), it was found that the higher the total score, the lower the survival rate. The calibration curve c-index = 0.942 for this nomogram model indicated that the prediction of it was true and reliable (Figure 5D).
Biological characteristics of high and low-risk groups
To explore the role played by DEGs between high and low-risk groups. We identified 854 DEGs using the DESeq2 package (Supplementary Figure 2). The DEGs were subsequently analyzed for GO and KEGG enrichment. BP results indicated that regulation of membrane potential, regulation of ion transmembrane transport, and regulation of metal ion transport plasma transport were the main biological processes. CC results showed that synaptic cell components such as presynapse, neuronal cell body, and neuron-to-neuron synapse were mainly enriched by DEGs. Various channel activities such as passive transmembrane transporter activity and channel activity were demonstrated in the MF results (Figure 6A). The GSEA was performed by sorting the fold change of DEGs in high and low-risk groups, and only TOP10 enrichment results are shown. We found that T cell-mediated immunity was suppressed in GO (Figure 6B), Thyroid hormone synthesis, Gastric acid secretion, and Neuroactive ligand-receptor interaction were activated in KEGG (Figure 6C). Also, we took the intersection of the two enrichment results (Figure 6D), and 108 GO-enriched functions and 5 KEGG pathways were particularly important in THCA (Supplementary Table 1), but the importance of 365 GO-enriched functions and 48 KEGG pathways cannot be excluded.
Construction of lncRNA-mRNA network
The progression of THCA was influenced by DEGs (mRNA) in the high- and low-risk groups influenced, so we mapped the lncRNA-mRNA co-expression network to explore how two characteristic genes regulate these mRNAs and thus influence THCA. 88 mRNAs were regulated by DPP4-DT as shown in (Figure 6E), especially DPP4, LIPH, SLC22A31, LGALS3, and SYTL5, which were highly correlated and significant mRNAs. Only 33 DEGs were associated with SAP30-DT. DPP6 could be affected by both signature genes. 121 mRNAs are shown in Supplementary Table 2.