3. 1.PBC differential genes
A total of 827 down-regulated genes and 639 up-regulated genes were obtained and their volcano maps were plotted (Figure. 2A).All differentially expressed up-regulated genes and down-regulated genes were extracted from the gene expression matrix, which were categorized into the PBC group and the control group and plotted in their heat maps (Figure. 2B).
3. 2. Differential enrichment analysis
DEGs were enriched in BP for cellular respiration, aerobic respiration, and the electron transport chain; in CC for the inner mitochondrial membrane, mitochondrial protein complex, and ribosome; and in MF for ribosome structural composition, primary active transmembrane transporter activity, and electron transfer activity. In the KEGG analysis, DEGs were significantly enriched in multiple neurodegenerative diseases, including amyotrophic lateral sclerosis and Alzheimer's disease. In the DO disease analysis, DEGs were significantly enriched in primary immune diseases, bronchial diseases, and bacterial infectious diseases. The visualization results are shown in Fig. 3.
3. 3. Immune infiltration analysis
Gene expression was extracted from the GSE119600 dataset for PBC and controls. The CIBERSORT algorithm was applied to evaluate the immune cell types of PBC patients and draw a scale diagram (Fig. 4A). Then, immune infiltration analysis was performed between PBC and control groups using the CIBERSORT (Fig. 4B), EPIC (Fig. 4C), and quanTIseq (Fig. 4D) algorithms, respectively. In the CIBERSORT algorithm, memory B cells, CD8 + T cells, resting memory CD4 + T cells, and regulatory T cells were lowly expressed in PBC patients, while immature CD4 + T cells, M0-type macrophages, and dendritic cell activation were highly expressed. In the EPIC algorithm, CD8 + T cells were likewise lowly expressed in PBC patients. In the quanTIseq algorithm, M2-type macrophages and neutrophils were highly expressed in PBC patients, and monocytes were lowly expressed.
3. 4.WGCNA
The PBC patients in the GSE119600 dataset were evaluated for scale independence and average connectivity after the outlier removal process (Figure. 5A, B). The optimal power value of 10 was selected based on a correlation coefficient greater than 0.85 (Figure. 5C), and the topological overlap matrix (TOM) was constructed with a soft threshold power of 10.A total of 12 gene modules were identified (Figure. 5E), and the topological overlap of the gene network was shown by a heat map (Figure. 5D). The correlation value of the blue module with gene significance was 0.18 (p = 1.2e-15) (Figure. 5F). Therefore, further studies focused on the genes within the blue module, which contained a total of 1,949 genes.
3. 5.Machine learning
Candidate genes for diagnosing PBC were obtained by taking the intersection of DEGs, secreted protein-related genes, and WGCNA blue module genes and plotting a Venn diagram (Fig. 6A). A series of machine learning analyses were performed on the expression matrices of the candidate genes to filter out the hub genes.
First, using Random Forest, we identified 18 candidate genes (Fig. 6D) and visualized the prediction accuracy (Fig. 6B) and gene importance shubs (Fig. 6C) of the Random Forest model. Second, by SVM-RFE analysis, we constructed the model and achieved the highest accuracy of 0.13 (Fig. 6F) and the lowest error rate of 0.87 when the model included 18 hub genes (Fig. 6E). Then, LASSO regression analysis identified 8 hub genes at the lowest lambda value and showed the regression coefficients (Fig. 6G) and the results of regression cross-validation (Fig. 6H). Finally, by GMM analysis, we identified 14 hub genes at the highest AUC value of the ROC curve (0.96) and visualized the AUC value of the individual regression models (Fig. 6I). The best models for diagnosing PBC constructed by the above machine learning methods were validated with the validation group, respectively. The corresponding ROC curves were plotted and the AUC values were calculated (Figure. 6J, K, L). The AUC values for RF, SVM-RFE, and LASSO were 0.919, 0.839, and 0.861,respectively.
The hub genes screened by each of RF, SVM-RFE, LASSO, and GMM were intersected to obtain the hub genes for diagnosing PBC: CSF1R, PLCH2, SLC38A1, and CST7, and their Venn diagrams were plotted (Fig. 7A). The hub genes were subjected to Bayesian testing and modeling, and the diagnostic efficacy of the model was tested using validation and test groups. Their ROC curves were plotted and the AUC values were calculated (Figure. 7B, C), which were 0.867 and 0.722, respectively. The diagnostic error rates were calculated using LDA and QDA algorithms. The total error rate was 0.32 for both, and heat maps were plotted to visualize the results (Figure. 7D, E).
3. 6.Hub gene expression and ROC
The expression matrix of hub genes was extracted from the GSE11960 dataset, divided into PBC and control groups. CSF1R, PLCH2, and SLC38A1 were highly expressed in PBC patients, while CST7 was lowly expressed. Box plots were drawn to show the differences in hub gene expression (Figure. 7F). The diagnostic potency of the respective hub genes for PBC patients in the GSE11960 dataset was calculated, and their ROC curves were plotted with AUC values calculated (Fig. 7G), with CSF1R having the highest AUC at 0.736.
3.7.Nomogram
Nomogram diagnostic plots of hub genes were drawn (Fig. 7H), and ROC curves (Fig. 7I), correction curves (Fig. 7J), and DCA curves (Fig. 7K) of the Nomogram were plotted. The AUC was 0.778 and the corrected curve as well as the DCA curve showed a good diagnosis.
3. 8. ssGSEA
Immunoinfiltration analysis using ssGSEA was performed for each hub gene, categorizing PBC patients into high and low expression groups based on whether their expression levels were above the median. Effector memory CD8 T cells, natural killer T cells, plasma-like dendritic cells, T follicular helper cells, and type 2 T helper cells were more predominant in the CSF1R high-expression group than in the low-expression group. In the CST7 high-expression group, activated dendritic cells were more predominant than in the low-expression group. Central memory CD4 T cells, immature dendritic cells, and type 2 T helper cells were more predominant in the CST7 low-expression group than in the high-expression group. Natural killer T cells and T follicular helper cells were more predominant in the PLCH2 high-expression group than in the low-expression group, and macrophages, activated dendritic cells, and neutrophils were more predominant in the PLCH2 low-expression group than in the high-expression group. Activated B cells, activated CD4 T cells, activated CD8 T cells, central memory CD4 T cells, effector memory CD8 T cells, young B cells, type 1 T helper cells, and type 17 T helper cells were more predominant in the SLC38A1 high-expression group than in the low-expression group. Eosinophils, mast cells, neutrophils, and myeloid-derived suppressor cells were more predominant in the SLC38A1 low-expression group than in the high-expression group. The results are visualized in the heat map shown in Fig. 8.
3. 9.CSF1R
The best-expressed CSF1Rs among the hub genes were subjected to GSEA analysis. PBC patients were divided into CSF1R high and low expression groups based on whether their CSF1R expression was higher than the median, and were subjected to GSEA enrichment analysis using hallmark gene sets. The absolute value of NES was taken as the top five positive values and the two negative values to plot the GSEA enrichment analysis shubs, as shown in Fig. 9. The CSF1R high-expression group was mainly enriched in allograft rejection, inflammatory response, interferon α response, interferon γ response, and unfolded protein response. The CSF1R low-expression group was mainly enriched in heme metabolism and oxidative phosphorylation. CSF1R gene expression was extracted, and the results of the CIBERSORT immune infiltration analysis of PBC patients were plotted as immune component correlation scatter plots, as shown in Fig. 10. Figure 10 showed that CSF1R gene expression was significantly correlated with B memory cells, CD4 + memory resting T cells, CD8 + T cells, and regulatory T cell components.
3. 10. Animal experiment results
HE staining results showed that in PBC mice, multiple inflammatory cell aggregates could be seen in the hepatic portal vein area. Granulomas were observed in the area of inflammatory cell aggregates, and a large number of lymphocytic infiltrations were present in the biliary epithelium. The results of Western blot and qRT-PCR showed that the mRNA and protein levels of PLCH2, CSF1R, p-CSF1R, and SLC38A1 were significantly increased compared to the Normal group, whereas those of CST7 were decreased, consistent with previous bioinformatics analysis results, As shown in Fig. 11.