3.1 Weighted co-expression networks (WGCNA) construction and key modules identification
The workflow of this work was shown in Fig. 1. We combined the most comprehensive sets so far of MetS-associated high-throughput transcriptomics data from GEO database (Additional file 1: Table S1). Gene expression profiles from GSE98895 were used to perform WGCNA analysis. After pre-processing and batch effect removal, we identified 25148 gene expression data of 20 MetS and 20 control group patients’ peripheral blood. Sample clustering analysis base on Pearson's correlation approach and average linkage approach showed no outliers (Fig. 2A). Then, in order to achieve scale-free topology (scale free R2 > 0.9), we selected soft-thresholding power β = 2 (Fig. 2B). Subsequently WGCNA network construction and average linkage hierarchical clustering detected 14 gene modules. Detailed hierarchical clustering information was shown in (Fig. 2C,D). Finally, through correlation analysis between these modules and MetS, we found red module (618 genes) and black module (546 genes) are highly associated with MetS (Fig. 2E). Hence, these two modules were identified as key module of MetS for further analysis. Scatter diagrams contain key module GS and MM information were shown in (Fig. 2F,G).
3.2 Gene ontology and pathway enrichment analysis
The results of Gene ontology (GO) functional enrichment analysis showed that these MetS-associated genes in red and black module mainly enriched in biology process (BP) such as receptor guanylyl cyclase signaling pathway, central nervous system neuron differentiation, response to calcium ion, platelet activation and so on. Besides, these genes were associated with molecular function (MF) such as tumor necrosis factor (TNF) receptor and lipid transporter activity. Cellular components (CC) like cellular junction and guanyl-nucleotide exchange factor complex may correlated with the development of MetS. As for KEGG signaling pathway enrichment analysis, our results indicated that these genes were significantly enriched in signaling pathways such as cell adhesion molecules, leukocyte transendothelial migration, calcium signaling pathway and so on (Fig. 3A,B). In addition, we performed GSEA analysis to further reveal the function and signaling pathway of these genes which showed the similar result as CC gene ontology and KEGG pathway enrichment analysis. BP GSEA analysis and MF GSEA analysis suggested that BP such as regulation of lymphocyte activation, drug metabolic process and MF such as lyase activity, hydrolase activity, molecular transducer activity, G-protein coupled receptor activity may involved with the development of MetS (Fig. 3C-F).
3.3 PPI network construction and hub gene identification
PPI network were established by STRING database based on genes in red and black modules. We calculated the connectivity degree and selected the top 5% genes (51 genes) with highest connectivity degree as hub genes associated with MetS. Hub genes with high connectivity degree such as MYC, UBE2E2, MIB2, ANAPC1, TCEB1, CTLA4, SPI1 may play important roles in the development of MetS and serve as potential biomarkers and therapy targets. The visualization of hub gene PPI network were shown in Fig. 4. These 51 hub genes (Additional file 2: Table S2) were used for further feature reduction analysis and model construction.
3.4 Hub gene expression level in plasma
A total of 12954 genes were discovered in our high-throughput sequencing data of plasma samples of 5 patients in metabolic syndrome group and 5 patients in healthy control group. Genes with P value < 0.05 and |log2FC| ≥ 1 were considered as differentially expressed genes (DEGs). Finally, we identified 45 up-regulated and 186 down-regulated DEGs in metabolic syndrome group compared with control group (Additional file 3: Table S3). We found that the expression of these 51 hub genes in plasma have no significant difference between metabolic syndrome patients and healthy control patients (Additional file 4: Table S4). Our results indicated that we should focused on the potential function and diagnostic value of these 51 hub genes in peripheral blood cellular components instead of plasma component which defined the sampling type of further noninvasive MetS screening or diagnostic tools.
3.5 Novel hub gene feature selection strategy
In this study, LASSO regression analysis and random forest approach were used to achieve feature selection. The expression data of these 51 hug genes were entered into LASSO regression models and 10-fold cross-validation was performed to detect the optimal classification accuracy (Fig. 5A,B). Hence, we obtained 15 hub gene features based on LASSO regression analysis including ADRA2A, CXCR5, FZD1, HLA.DPA1, HSPA5, KCTD7, KLHL9, P2RY14, P2RY2, PRKACG, PSMD1, PTTG1, REEP4,SPTAN1, TSPAN14. In addition, We constructed random forest model via these 51 hug genes expression profiles and measured the classification importance of hub gene feature by mean decrease in Gini coefficient (MeanDecreaseGini). Then, 15 hub genes features were chosen by random forest approach including SPTAN1, KCTD7, IL2RG, ITPR3, PSMD1, ITGB7, FZD1, DCTN4, KLHL9, PTTG1, TSPAN14, RNF19B, XCR1, P2RY2, CXCR5 (Fig. 5C). Finally, we combined the results of these two gene feature selection method by taking the intersection and selected 9 hub gene features (SPTAN1, KCTD7, PSMD1, FZD1, KLHL9, PTTG1, TSPAN14, P2RY2, CXCR5) for further analysis.
3.6 Web nomogram calculator construction and validation of 9-hub-gene signature
The expression profiles of these 9 selected gene features were entered into a logistic regression and then, in order to validate the diagnostic value of this 9-hub-gene signature, we established a web nomogram calculator for MetS risk based on training set (https://xjtulgz.shinyapps.io/DynNomapp/). Then, ROC curve analysis (Fig. 6A) showed this MetS diagnostic nomogram had excellent classification ability (AUC = 0.968 in training set, AUC = 0.883 in internal validation set, AUC = 0.861 in external validation set). The ROC curves of every single hub genes are shown in Additional file 5: Figure S1. In addition, we performed calibration curve analysis and Hosmer-Lemeshow good of fit test (P = 0.915) which showed good calibration of this nomogram (Fig. 6B). Furthermore, decision curve ploted the standardized net benefit of our MetS diagnostic nomogram in different decision thresholds (Fig. 6C). These results indicated that the application of this MetS diagnostic nomogram would lead to ideal clinical outcomes.