Determination of DEGs in SLE/AF training datasets
A total of 2388 DEGs were detected between the SLE and control groups using GEO2R. Among these, 995 genes were found to be up-regulated, while 1393 genes were down-regulated. These findings were visually represented in Fig. 2A through a volcano plot. To eliminate inter-batch discrepancies before merging the gene expression datasets from GSE41177 and GSE79768, we use the 'sva' R package. Using the R package 'Limma', we identified a total of 1131 DEGs between healthy control individuals and patients with AF. Among these DEGs, 566 were found to be significantly down-regulated, while 565 were significantly up-regulated in AF (Fig. 2B).
DEGs identified in the SLE/AF training datasets. (A) The volcano plot of SLE training dataset. Downregulated genes are marked in light blue; upregulated genes are marked in red. (B) The volcano plot of AF training dataset. Downregulated genes are marked in green; upregulated genes are marked in red.
WGCNA of DEGs in GSE50772 datasets
The soft threshold power in the GSE50772 dataset was recorded at 14 (Fig. 3A). Afterwards, we proceeded to identify dynamic modules, making sure that each module contained a minimum of 30 genes (Figs. 3B). A total of seven important modules were clustered. Out of all the modules analyzed, the gray module was labeled as non-functional as it failed to effectively group genes together. Additionally, both the brown module and the turquoise module showed a significant negative correlation with SLE (r = -0.60, p = 8.4 × 10 − 9 and r = -0.50, p = 2.0 × 10 − 6 respectively, as shown in Fig. 3C). In SLE, there is a correlation between the brown/turquoise module and gene significance, as depicted in Fig. 3D.
WGCNA of DEGs in SLE training dataset. (A) The soft threshold selection. β = 0.9 was choose as the most appropriate threshold. (B) Gene cluster tree of different modules. (C) The association between brown/turquoise module membership and gene significance in SLE. (D) Correlation analysis of modules and groups.
Identification of common DEGs between SLE and AF
To delve deeper into our investigation, we isolated a combined total of 932 genes from both the brown and turquoise modules for further scrutiny. By intersecting the DEGs related to AF (n = 1131) and the genes in the brown/turquoise module associated with SLE (n = 932), we were able to identify a total of 71 DEGs that were shared between the two diseases. After excluding genes that displayed contrasting patterns of expression from the original group of 71 common DEGs, we ended up with a final tally of 29 DEGs (Fig. 4A, B, Additional file 1: Table S1).
The common DEGs of SLE and AF. (A) 71 DEGs were identified by intersecting the DEGs in AF and the brown/turquoise module genes in SLE. (B) 29 DEGs with same expression trends were retained.
Analysis of functional enrichment for common DEGs
Figure 5A displays the KEGG functional enrichment analysis findings for the 29 DEGs. The pathways associated with the 'FoxO signaling pathway', 'ras signaling pathway' and 'MAPK signaling pathway' stood out as the primary enrichment points for the 29 DEGs. The GO terms for BP that showed significant enrichment included 'Hepatocyte apoptotic process', 'positive regulation of carbohydrate metabolic process' and 'regulation of epithelial cell apoptotic process'. Furthermore, the terms 'bicellular tight junction', 'tight junction' and 'apical junction complex' exhibited enrichment for CC. Furthermore, the expression levels of the 29 DEGs showed a strong correlation with 'histone methyltransferase activity (H3-K9 specific)', 'growth factor binding', 'histone methyltransferase activity (H3-K36 specific)', and 'protein tyrosine kinase activity' (Fig. 5B-D). In addition, the co-expression network built by the GeneMANIA database showcased an intricate protein-protein interaction (PPI) network, comprising of 53.20% co-expression, 2.73% co-localization, 42.29% physical interactions and 1.77% genetic interactions (Fig. 5E).
Functional enrichment analysis and co-expression network of 29 common DEGs. (A) KEGG pathway analysis of common DEGs. (B-D) GO analysis (BP, CC, MF) of common DEGs. (E) DEGs and their co-expression genes were analyzed via GeneMANIA.
undefined
3.1 Determination of DEGs in SLE/AF training datasets
A total of 2388 DEGs were detected between the SLE and control groups using GEO2R. Among these, 995 genes were found to be up-regulated, while 1393 genes were down-regulated. These findings were visually represented in Figure 2A through a volcano plot. To eliminate inter-batch discrepancies before merging the gene expression datasets from GSE41177 and GSE79768, we use the 'sva' R package. Using the R package 'Limma', we identified a total of 1131 DEGs between healthy control individuals and patients with AF. Among these DEGs, 566 were found to be significantly down-regulated, while 565 were significantly up-regulated in AF (Figure 2B).
3.2 WGCNA of DEGs in GSE50772 datasets
The soft threshold power in the GSE50772 dataset was recorded at 14 (Figure 3A). Afterwards, we proceeded to identify dynamic modules, making sure that each module contained a minimum of 30 genes (Figures 3B). A total of seven important modules were clustered. Out of all the modules analyzed, the gray module was labeled as non-functional as it failed to effectively group genes together. Additionally, both the brown module and the turquoise module showed a significant negative correlation with SLE (r = -0.60, p = 8.4 × 10−9 and r = -0.50, p = 2.0 × 10−6 respectively, as shown in Figure 3C). In SLE, there is a correlation between the brown/turquoise module and gene significance, as depicted in Figure 3D.
3.3 Identification of common DEGs between SLE and AF
To delve deeper into our investigation, we isolated a combined total of 932 genes from both the brown and turquoise modules for further scrutiny. By intersecting the DEGs related to AF (n = 1131) and the genes in the brown/turquoise module associated with SLE (n = 932), we were able to identify a total of 71 DEGs that were shared between the two diseases. After excluding genes that displayed contrasting patterns of expression from the original group of 71 common DEGs, we ended up with a final tally of 29 DEGs (Figure 4A, B, Additional file 1: Table S1).
3.4 Analysis of functional enrichment for common DEGs
Figure 5A displays the KEGG functional enrichment analysis findings for the 29 DEGs. The pathways associated with the 'FoxO signaling pathway', 'ras signaling pathway' and 'MAPK signaling pathway' stood out as the primary enrichment points for the 29 DEGs. The GO terms for BP that showed significant enrichment included 'Hepatocyte apoptotic process', 'positive regulation of carbohydrate metabolic process' and 'regulation of epithelial cell apoptotic process'. Furthermore, the terms 'bicellular tight junction', 'tight junction' and 'apical junction complex' exhibited enrichment for CC. Furthermore, the expression levels of the 29 DEGs showed a strong correlation with 'histone methyltransferase activity (H3-K9 specific)', 'growth factor binding', 'histone methyltransferase activity (H3-K36 specific)', and 'protein tyrosine kinase activity' (Figure 5B-D). In addition, the co-expression network built by the GeneMANIA database showcased an intricate protein-protein interaction (PPI) network, comprising of 53.20% co-expression, 2.73% co-localization, 42.29% physical interactions and 1.77% genetic interactions (Figure 5E).
3.5 SVM-RFE algorithm of common DEGs
According to the SVM-RFE findings, the 27 top features of DEGs exhibited the greatest precision and the least amount of mistakes when identifying potential biomarkers for AF (Figure 6). The gene ranking of the 29 common DEGs is demonstrated in Additional file 2: Table S2. The top 27 genes were extracted for follow-up analysis.
3.6 Identification of candidate biomarkers
In each dataset, the expression of the 27 DEGs obtained above in each dataset were assessed employing the Wilcoxon rank-sum test. The R packages 'ggplot2' and 'ggpubr' were utilized to illustrate the outcomes in a visual format (Figure 7A-D). Candidate biomarkers were identified by isolating genes with significant expression (statistically significant defined as a p-value less than 0.05) from all datasets and then intersecting them. Finally, ANKRD36B, SLC4A4, ANKRD12, MTUS1 and DSC1 these five genes were identified as candidate biomarkers and underwent subsequent analysis (Figure 7E).
3.7 Single-sample gene-set enrichment analysis (ssGSEA)
According to the information provided in Figure 8A-D, it was observed that the five potential biomarkers exhibited significant down-regulation in both the training and validation datasets when compared to the control. The correlation analysis (Figure 8E) showed that several pathways in SLE, including 'CHOLESTEROL_HOMEOSTASIS', 'MTORC1_SIGNALING', 'INFLAMMATORY_RESPONSE', 'P53_PATHWAY', 'FATTY_ACID_METABOLISM', 'GLYCOLYSIS' and 'REACTIVE_OXYGEN_SPECIES_PATHWAY' had a significant negative association with the five candidate genes. These hallmark gene sets were widely acknowledged for their close association with lipid metabolism, the inflammatory response, as well as cell proliferation and differentiation. AF was significantly influenced by the inflammatory response, which was a pivotal element in its initiation and advancement. Additionally, these findings indicated that the occurrence of AF in individuals diagnosed with SLE was linked to the expression of the 5 potential biomarkers.
3.8 Assessment of the prognostic significance of the 5 biomarkers and nomograms
The training and validation datasets were subjected to ROC analysis to assess how effectively the 5 (ANKRD36B, SLC4A4, ANKRD12, MTUS1 and DSC1) biomarkers can predict the diagnostic accuracy for SLE and AF. In the SLE training dataset, the AUC values and the related 95% CI of the 5 biomarkers were depicted in Figure 9A: ANKRD36B (AUC 0.925, 95%CI 0.858–0.993), SLC4A4 (AUC 0.791, 95%CI 0.696–0.886), ANKRD12 (AUC 0.884, 95%CI 0.804–0.965), MTUS1 (AUC 0.787, 95%CI 0.680–0.893), DSC1 (AUC 0.815, 95%CI 0.772–0.908). The AUC values and the corresponding 95% CI of the biomarkers in AF training dataset were illustrated in Figure 9B: ANKRD36B (AUC 0.702, 95%CI 0.541–0.862), SLC4A4 (AUC 0.808, 95%CI 0.693–0.923), ANKRD12 (AUC 0.678, 95%CI 0.518–0.837), MTUS1 (AUC 0.728, 95%CI 0.582–0.875), DSC1 (AUC 0.685, 95%CI 0.541–0.830). In Figure 9C, we could see that the AUC value and 95%CI in SLE validation dataset of the biomarkers were: ANKRD36B (AUC 0.767, 95%CI 0.639–0.895), SLC4A4 (AUC 0.933, 95%CI 0.867–0.999), ANKRD12 (AUC 0.707, 95%CI 0.567–0.848), MTUS1 (AUC 0.739, 95%CI 0.604–0.873), DSC1 (AUC 0.853, 95%CI 0.746–0.961). The Figure 9D represented the AUC value and 95%CI of the biomarkers in AF validation dataset were: ANKRD36B (AUC 0.753, 95%CI 0.578–0.929), SLC4A4 (AUC 0.73, 95%CI 0.562–0.898), ANKRD12 (AUC 0.753, 95%CI 0.583–0.924), MTUS1 (AUC 0.837, 95%CI 0.689–0.984), DSC1 (AUC 0.88, 95%CI 0.77–0.99). A predictive nomogram was developed for each dataset, assigning a score to the relative expression of each gene. The total score was then calculated by adding up the individual scores of all the genes. (Figure 9E-H). The results of ROC analysis of the nomogram in the 4 datasets were shown in Figure 9I-L: SLE training dataset (AUC 0.995, 95%CI 0.985–1.000), AF training dataset (AUC 0.837, 95%CI 0.709–0.965), SLE validation dataset (AUC 0.955, 95%CI 0.907–1.000), AF validation dataset (AUC 0.943, 95%CI 0.862–1.000).
3.9 Immune infiltration analysis in SLE/AF training datasets
The boxplot (Figure 10A) and barplot (Additional file 3: Figure S1 A) indicated that the proportion of Monocytes, Neutrophils, B cells memory and Plasma cells was greater in SLE than in control groups, whereas the percentage of NK cells resting, T cells CD4 memory resting and Mast cells resting was lower. While in AF, there was a higher abundance of Monocytes and Neutrophils compared to the control groups, while T cells CD4 memory resting, T cells CD4 memory activated and Mast cells activated showed a decrease. This was demonstrated by both the barplot (Additional file 3: Figure S1 B) and boxplot (Figure 10B).
In the study on SLE, the correlation analysis highlighted a strong positive relationship between the activation of NK cells and Mast cells, while it was found that Macrophages M1 exhibited a significant positive correlation with resting Mast cells (Additional file 3: Figure S2 A). However, in AF many immune cells had significant correlations between them, according to correlation analysis. The Mast cells activated showed a notable and robust positive correlation with T cells CD4 naive. Moreover, Plasma cells exhibited a notable and powerful positive correlation with B cells memory. Furthermore, NK cells resting exhibited a significant and strong positive correlation with Neutrophils. In contrast, A clear and striking inverse connection was observed between monocytes and T cells gamma delta, while T cells regulatory displayed a noteworthy and powerful negative correlation with T cells gamma delta (Additional file 3: Figure S2 B).
It is worth noting that Figure 10C provides compelling evidence of a positive association between the five candidate biomarkers and the infiltration of T cells CD4 memory resting in SLE. In contrast, the study conducted in AF revealed a association between the five potential biomarkers and the infiltration of Monocytes, showing a negative correlation. However, a positive relationship was observed between these biomarkers and the infiltration of Macrophages M1 and T cells gamma delta. (Figure 10D).
3.10 Consensus clustering analysis
Our approach involved employing consensus clustering analysis to examine the DEG expression profile within the SLE dataset GSE50772. Figure 11A indicated that the strongest clustering effect between different groups was observed when the parameter k was assigned a value of 3. By employing the CDF, the k value was determined as the distribution reached its maximum, signifying the attainable stability in the given context. In Figure 11B, when k=3, the curve had a plateau. Moreover, the tracking plot (Figure 11C) provided insights into how projects were clustered together using consensus clustering at various k values. It could be seen from the figure that the sample clustering is clear when k=3. The maximum number of clusters was determined by calculating the average consistency within each group of clusters. According to Figure 11D, it became evident that when k=3, every cluster displayed a significantly high value. Therefore, we divided SLE into A, B and C three unsupervised clustering subgroups. In Figure 11E's t-SNE plot, it is worth mentioning that there was a significant differentiation observed among the three subtypes A, B, and C.
3.11 Analysis of subtypes of SLE
The expression of the five potential biomarkers in three subtypes was shown to vary significantly in Figure 12A. More precisely, when it came to gene expression, ANKRD12 and MTUS1 were found to be more prominent in subtype A as opposed to subtypes B and C, while subtype B exhibited a higher expression of SLC4A4 than subtypes A and C. Furthermore, in subtype C, there was a noticeable increase in the expression levels of ANKRD36B and DSC1 compared to subtypes A and B. The boxplot in Figure 12A and the heatmap in Figure 12B both displayed the same trend. In subtype A, we observed a significant increase in the presence of T cells CD4 memory resting and Neutrophils compared to subtypes B and C. On the other hand, subtype B showed a higher level of infiltration of NK cells resting when compared to subtypes A and C (Figure 12C). Furthermore, subtype C exhibited higher levels of infiltration by memory B cells and Plasma cells in comparison to subtypes A and B.