Data download and filtering
The gene expression datasets downloaded from the GEO were manually checked to remove duplicates and irrelevant studies, after which 20 studies remained. Out of the 20 studies, we selected the studies that reported gene expression in whole blood, PBMC or blood cell components, and excluded studies regarding other tissues such as synovial fluid or other tissues to ensure comparable gene expression. This filtering also ensured the removal of any potential bias due to tissue-specific gene expression. We also excluded studies based on drug-treated samples. The filtering, as mentioned above, led to 14 definitive studies to be included for our analysis.
After a thorough search and excluding datasets as specified above, seven SLE datasets (GSE11909, GSE50772, GSE22098, GSE8650, GSE4588, GSE61635, and GSE10325) and seven RA datasets (GSE56649, GSE15573, GSE4588, GSE17755, GSE11827, GSE57383, and GSE68689) were selected for further analysis. A total of 904 samples were considered for downstream analysis, representing data from 366 SLE patients, 168 healthy controls for SLE, 222 RA patients and 148 healthy controls for RA.
Data division and meta-analysis
To achieve an extensive, unbiased study of the shared signatures underlying RA and SLE, we identified and downloaded the publicly available GEO gene expression datasets. From the initial datasets obtained from public databases, we randomly selected 14 studies from both the diseases that matched the study inclusion criteria (see methods). The seven datasets for RA (5 PBMC, 1 WB, 1 CD4 T cells and B cells) and 7 of SLE (5 PBMC, 1 WB and 1 CD4 T cells and B cells) were processed for further analysis, contained samples from 222 RA patients with 148 controls, and 366 SLE patients with 168 controls, the complete workflow is shown in Fig. 1a and Fig. 1b. A detailed summary of the included datasets and samples can be found in Table 1. The previous findings suggest that discovery datasets with a sample size of 250-300 are enough to identify robust gene signatures for any disease using the MetaIntegrator framework. The Discovery dataset includes samples from PBMC, whole blood and CD4 T and B cells of RA and SLE patients with the respective controls. Based on the meta-analysis using the MetaIntegrator R package, using Random Effect Model (REM), 160 RA genes and 1630 SLE genes were identified as significantly differentially expressed in the RA and SLE datasets versus healthy controls (FDR<0.05 in the REM). A list of 50 most significantly up or downregulated genes for RA and SLE are shown in Additional file 1: Supplementary Table S1.
Identification of common gene signatures in RA and SLE
For meta-analysis of the study dataset, the MetaIntegrator R package was used, with a selected cut-off for effect size (ES) and false discovery rate (FDR) to determine the DEGs for both the diseases, among healthy controls vs. diseased. ES is the ratio of the mean difference between groups to the standard deviation denoted as d, the convention values for small, medium, and high effect size were 0.20, 0.40 and 0.60, respectively. We identified 162 (56 upregulated, 106 downregulated) and 1630 (574 upregulated, 1056 downregulated) DEGs (ES>0.4, FDR<0.05) of RA and SLE, respectively. As per our hypothesis, we identified the 33 common gene signatures (9 upregulated and 24 downregulated) in both RA and SLE with a false discovery rate (FDR<0.05), effect size >0.4 and a minimum of 3 studies should support the same (see Additional file 2: Supplementary Table S2). Venn diagram highlights the unique and common gene signatures of RA and SLE (Fig. 2). In discovery datasets of RA and SLE Meta-scores distinguish patient samples from the healthy controls with an Area Under the Receiver Operating Characteristics (AUROC) of 0.871 (95% confidence interval (CI): 0.70-1) and 0.935 (95% CI: 0.73-1) respectively (Fig. 3a, Fig. 3b). Amongst the 33 common gene signatures, 20 are already reported to be associated with the diseases. However, to the best of our knowledge, 13 are not reported to be associated with these diseases. Particularly striking was the highly significant up-regulation of the genes NEXN, IRF9, B4GALT5, MMP8, HIT1H1C, NFIL3, HIST2H2AA4, DDX60L and VMP1. Out of these, IRF9 is an interferon regulatory factor involved in the IFN alpha signaling pathway and bone remodeling pathways [25,26]. NEXN was recently found deregulated between RA and TB [24]. Similarly, MMP8 is a matrix metalloproteinase associated with many autoimmune diseases, including RA [27]. Another important gene, NFIL3, is a nuclear factor interleukin 3 regulated gene involved in many autoimmune disorders, including SLE [28]. In brief, Interferons (IFN), Interleukins (IL), lymphokine and Tumor Necrosis Factor (TNF) are the types of cytokines that regulate the development of different cells of the immune system. In agreement with the previous studies, we find that these genes play a regulatory role in the pathogenesis of both diseases. A heatmap of the highly differentially regulated genes of RA, SLE and common gene signatures is shown in Additional file 3: Supplementary Figure S1. The complete description of the common signatures is listed in Additional file 4: Supplementary Table S3. In our study, we have also evaluated the gained and lost genes, gained are those DEGs identified in the meta-analysis, not present in the individual analysis and lost are those DEGs found in any individual study, but not in the meta-analysis, they may show low signals but show consistent expression pattern throughout the datasets. In the case of RA, we obtained 82 gained and 1313 lost genes, whereas, for SLE, the gained genes were 903 and 1631 lost genes. A list is shown in Additional file 5: Supplementary Table S4 for all the gained and lost genes via this meta-analysis.
Hub genes network analysis
Based on the methodology, we generated three interaction networks as described in the methods section. The interaction network for the identified common gene signatures comprise of 33 seeds with 630 connecting nodes and of 980 edges representing the interaction between these proteins. To arrive at more biologically relevant information, this analysis identified key hub genes among the common signatures, and the top differentially regulated genes for RA and SLE. The protein-protein interaction network for the common gene signatures is shown in Fig. 4. We analyzed the protein-protein interaction network for the common gene signatures and found 15 hub genes with a high degree of centrality and betweenness. The top hub genes were RPS2 (degree: 156, betweenness: 34190), PABPC1 (degree: 152, betweenness: 63040), RPL5 (degree: 146, betweenness: 49501), RPS8 (degree: 145, betweenness: 31190), and EEF1G (degree: 122, betweenness: 55075).
From the interaction network of differentially regulated genes in RA, it was found that for the upregulated genes, the hub genes with the highest degree and betweenness are PKM (degree: 99, betweenness: 50885), IRF9 (degree: 47, betweenness: 17423), and OS9 (degree: 42, betweenness: 15352) for downregulated genes- DHX9 (degree: 185, betweenness: 165422), RPL5 (degree: 146, betweenness: 60548) and RPS11 (degree: 117, betweenness: 38181).
For the SLE interaction network, the notable hub genes include STAT1 (degree: 223, betweenness: 183938), ISG15 (degree: 188, betweenness: 167205), and MYD88 (degree: 87, betweenness: 60392) were the topmost differentially regulated hub genes. For the downregulated genes, the significant hub genes are DDX5 (degree: 156, betweenness: 85071), NAP1L1 (degree: 88, betweenness: 47262), CD44 (degree: 56, betweenness: 31368).
All the details of hub genes for the common gene signatures and disease-specific genes (RA and SLE) are listed in Additional file 6: Supplementary Table S5. The networks of RA and SLE for the top 50 DEGs are shown in Additional file 3: Supplementary Figure S3 and Supplementary Figure S4. The common gene signatures showed expression in all the studies, shown by forest plots in Fig. 5. The forest plots represent very similar expression for the common (upregulated and downregulated) genes of RA and SLE across studies suggests their important role in both diseases. Many of these genes such as NFIL3, NMT2, EIF4B, PTGDS etc. identified as hub genes in our network analysis too, which further proves their importance in underlying mechanism of RA and SLE.
Identification of overrepresented biological pathways and Gene Ontology terms
Pathway enrichment analysis was performed using the Enrichr web-based tool by uploading the list of RA and SLE differentially regulated genes obtained by meta-analysis. For RA upregulated genes, TNF signaling pathways, Interleukin related pathways, osteoclast differentiation, bone remodeling, haemoglobin’s chaperon-related pathways, TNFR1 signaling, and virus infection-related pathways were enriched. For the top downregulated genes, legionellosis, mTOR signaling pathway, skeletal muscular hypertrophy regulation related pathway, SODD/TNFR1 signaling related pathways were enriched. For SLE upregulated genes, interferon-alpha/beta signaling, immune system signaling by interferons, cytokine signaling in the immune system, cell cycle checkpoints related pathway, NOD-like receptor signaling, toll-like receptor signaling pathways and many viral infections related signaling pathways were found enriched. For the downregulated genes, T cell receptor signaling pathway, hematopoietic cell lineage, mTOR signaling pathway, multi-drug resistance factor-related pathways, and skeletal muscular hypertrophy regulation related pathway and various gene expression and transcription factors related pathways were enriched. The common gene signatures identified by us include the enriched pathways related to the mTOR signaling pathway, necroptosis, bone remodeling, Regulation of eIF4e and p70 S6 kinase pathway, translational silencing related and virus infection-related pathways. The enriched pathways complete information for common and disease-specific are listed in Additional file 7: Supplementary Table S6. For common genes, the enriched Gene Ontology (GO) term for GO: biological processes, nuclear-transcribed mRNA catabolic processes (GO: 0000956), viral gene expression (GO: 0019080), viral transcription (GO: 0019083), and the regulation of cytoskeleton organization (GO: 0051493). For GO: Molecular function, mRNA binding (GO: 0003729), ubiquitin-protein transferase inhibitor activity (GO: 0055105), beta-N-acetyl glucosaminylglycopeptide beta-1, 4-galactosyltransferase activity (GO: 0003831), and translation elongation factor activity (GO: 0003746). The details about the GO-terms for common and specific to RA and SLE are listed in Additional file 8: Supplementary Table S7.