2.1 The m7G-clusters
This study included 220 patients with PSC and 320 healthy individuals. Figure 1A shows the expression levels of the m7G-related genes in the two groups. Of the 27 m7G-related genes, 20 were differentially expressed; we refer to these as m7G-DEGs. The m7G-DEGs also showed distinct transcription profiles between PSC patients and healthy individuals (Figure 1B). The PAM algorithm was used to determine the optimal consensus matrix, which was obtained at k=2 (Figure 1C). Clustering separated the PSC patients into two distinct groups: m7G-cluster A and m7G-cluster B. Cluster A showed elevated expression of DCP2, NUDT16, NUDT4, AGO2, EIF4E3, EIF4G3, IFIT5, and LSM1, whereas cluster B showed elevated expression of METTL1, NUDT3, EIF4E2, GEMIN5, EIF3D, EIF4A1, and SNUPN (Figure 1D). The expression profiles of the 20 m7G-DEGs differed between cluster A and cluster B (Figure 1E).
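The PAM (partitioning around medoids) step used for clustering can be illustrated with a minimal sketch. The toy expression matrix, the Euclidean distance, and the naive initialisation below are assumptions made for illustration only; the sketch covers a single PAM run and not the resampling used to build a consensus matrix.

```python
from itertools import product

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    # Sum of distances from each sample to its nearest medoid.
    return sum(min(euclidean(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(range(k))  # naive initialisation (assumption): first k samples
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        # SWAP phase: try replacing each medoid with each non-medoid sample.
        for i, cand in product(range(k), range(len(points))):
            if cand in medoids:
                continue
            trial = medoids[:i] + [cand] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best - 1e-12:  # accept only strict improvements
                medoids, best, improved = trial, cost, True
    # Assign each sample to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda j: euclidean(p, points[medoids[j]]))
              for p in points]
    return medoids, labels

# Hypothetical expression values (rows: patients, columns: two genes).
toy = [
    [5.1, 1.0], [4.9, 1.2], [5.3, 0.9],  # gene 1 high, gene 2 low
    [1.0, 4.8], [1.2, 5.1], [0.9, 5.0],  # gene 1 low, gene 2 high
]
medoids, labels = pam(toy, k=2)
print(medoids, labels)
```

On well-separated toy data such as this, the two groups of patients end up in different clusters regardless of the starting medoids.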
PCA showed a clear separation between the two clusters (Figure 2A). We then examined immune cell infiltration in the two clusters, which revealed distinct profiles. Cluster B showed significantly higher infiltration of activated B cells, activated CD8+ T cells, CD56dim natural killer cells, and immature B cells. Conversely, cluster A displayed a higher abundance of activated dendritic cells, gamma delta T cells, myeloid-derived suppressor cells (MDSCs), macrophages, neutrophils, plasmacytoid dendritic cells, and type 2 T helper cells (Figure 2B). We also examined the associations between the 20 m7G-DEGs and immune cell populations (Figure 2C). EIF4E3 showed a stronger positive association with immune cells than the other genes. Patients were then divided into high and low EIF4E3 expression groups, and the immune cell profiles of the two groups were compared. Patients with low EIF4E3 expression showed increased infiltration of activated B cells and immature B cells. In contrast, those with high EIF4E3 expression showed significantly increased infiltration of activated dendritic cells, CD56bright natural killer cells, gamma delta T cells, MDSCs, macrophages, mast cells, monocytes, neutrophils, and plasmacytoid dendritic cells (Figure 2D).
We identified 4,535 DEGs between m7G-cluster A and m7G-cluster B and subjected them to gene ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. The GO enrichment analysis indicated that the DEGs were primarily associated with ribosome biogenesis, immune response-regulating signaling pathways, immune response-activating signaling pathways, mononuclear cell differentiation, rRNA metabolic processes, and immune response-regulating cell surface receptor signaling pathways (Figure 2E). The KEGG pathway analysis indicated that the DEGs were primarily enriched in human infectious diseases, Coronavirus disease, the NF-kappa B signaling pathway, osteoclast differentiation, Th17 cell differentiation, PD-L1 expression and PD-1 checkpoint pathway in cancer, and hematopoietic cell lineage (Figure 2F).
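Enrichment analyses of this kind are commonly based on an over-representation test. The following is a minimal sketch of a one-sided hypergeometric test, which is assumed here for illustration; the source does not state which test or software was used, and all counts in the example are hypothetical.

```python
from math import comb

def enrichment_p(universe, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `annotated` belong to the term."""
    upper = min(annotated, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(annotated, x) * comb(universe - annotated, selected - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 1,000 background genes, 50 annotated to a term,
# 100 DEGs, 15 of which carry the annotation (5 expected by chance).
p = enrichment_p(universe=1000, annotated=50, selected=100, overlap=15)
print(f"p = {p:.3g}")
```

In practice the resulting p-values would also be corrected for multiple testing across all terms, which this sketch omits.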
2.2 The gene clusters
The 220 patients with PSC were then re-grouped according to the expression levels of the DEGs, and the PAM algorithm again identified two groups (Figure 3A). The patients were separated into two clusters, designated gene cluster A and gene cluster B, and a heat map was constructed to illustrate the transcriptional profiles of these clusters (Figure 3B). Analysis of the 20 m7G-DEGs in the two clusters showed that METTL1, NUDT3, EIF4E2, EIF3D, and EIF4A1 were upregulated in gene cluster A, whereas NSUN2, NUDT4, AGO2, EIF4E3, and IFIT5 were more highly expressed in gene cluster B (Figure 3C). The immune infiltration profiles of the two groups were also assessed. Gene cluster A displayed higher infiltration of activated B cells, immature B cells, and regulatory T cells. In comparison, gene cluster B had increased infiltration of CD56bright natural killer cells, MDSCs, and monocytes (Figure 3D).
2.3 Development of m7G-score
Furthermore, we developed the m7G-score, a scoring system that summarizes m7G gene expression as a composite score and assigns an individual m7G score to each PSC patient. In the m7G-cluster grouping, cluster A was associated with a higher m7G score (p<0.001) (Figure 3E). Conversely, in the gene-cluster grouping, cluster B had a higher m7G score (p<0.05) (Figure 3F). Based on the median m7G score, the 220 PSC patients were divided into a high-score group and a low-score group. To better differentiate the three classifications (m7G-cluster, gene cluster, and m7G-score group), we analyzed the distribution of patients across them (Figure 3G).
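The median split described above can be sketched as follows. The patient identifiers and score values are hypothetical, and assigning scores equal to the median to the low-score group is an assumption; how the m7G score itself is computed is not reproduced here.

```python
from statistics import median

def split_by_median(scores):
    """Label each patient 'high' or 'low' relative to the median score.
    Scores equal to the median go to 'low' (an assumption)."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}

# Hypothetical m7G scores for four patients.
toy_scores = {"P1": -1.2, "P2": 0.4, "P3": 2.1, "P4": -0.3}
groups = split_by_median(toy_scores)
print(groups)
```

With an even number of patients and no tied scores, as in this toy example, the split yields two groups of equal size.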
2.4 Construction of disease risk model
Before developing a nomogram model to assess disease risk, we identified characteristic genes using machine learning. We first compared random forest (RF) and support vector machine (SVM) models to determine which was better suited to gene selection, based on the residuals and the reverse cumulative distribution of the residuals. RF outperformed SVM in screening for PSC-specific characteristic genes (Figure 4A-B), with an AUC of 1 compared with 0.964 for SVM (Figure 4C). We therefore selected RF to identify PSC-associated signature genes and chose the ntree value that gave the lowest error rate (ntree = 428) (Figure 4D).
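To make the AUC comparison concrete, the following minimal sketch computes the AUC as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The labels and scores are hypothetical and are not the study data.

```python
def auc(labels, scores):
    """Rank-based AUC: the fraction of (case, control) pairs in which
    the case has the higher score; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical predictions: 1 = PSC, 0 = healthy.
y = [1, 1, 0, 0]
perfect = auc(y, [0.9, 0.8, 0.3, 0.1])    # every case outranks every control
imperfect = auc(y, [0.9, 0.2, 0.3, 0.1])  # one case ranks below a control
print(perfect, imperfect)
```

An AUC of 1 therefore means that every case was scored above every control in the evaluated data.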
We then calculated importance scores for the m7G-related genes. Using an importance score greater than 10 as the threshold, we identified eight genes: NSUN2, EIF4G3, EIF4E3, EIF3D, AGO2, LSM1, DCPS, and DCP2 (Figure 5A). Using these eight genes, we constructed a nomogram model to predict the risk of PSC (Figure 5B). The calibration curve indicated satisfactory calibration of the nomogram (Figure 5C). Clinical impact curve analysis (CICA) indicated that the high-risk PSC patients identified by the nomogram model closely matched the actual positive cases (Figure 5D). Decision curve analysis (DCA) demonstrated that the nomogram model provides substantial net benefit in predicting PSC risk (Figure 5E).
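The net benefit reported in a decision curve is conventionally defined as TP/n - (FP/n) x pt/(1 - pt), where pt is the threshold probability. The following minimal sketch applies this standard definition to hypothetical outcomes and predicted risks; it does not reproduce the nomogram or the study data.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit at one threshold probability:
    TP/n - FP/n * threshold / (1 - threshold)."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

# Hypothetical outcomes (1 = PSC) and predicted risks.
y_true = [1, 1, 0, 0]
risk = [0.9, 0.7, 0.6, 0.2]
nb = net_benefit(y_true, risk, threshold=0.5)
print(nb)
```

A decision curve plots this quantity over a range of thresholds and compares it with the strategies of classifying everyone or no one as high risk.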