1. Access of expression patterns and identification of DEGs.
The study calculated the correlation based on the gene expression matrix in the chip and performed cluster analysis in samples from GSM2183532 to GSM2183548 to explore the composition of the dataset, the potential classification and the internal relationship using the corrplot package to draw the correlation matrix. The results show that 17 samples can be basically distinguished in hierarchical clustering, and samples with the same phenotype are mainly clustered together (Fig. 1A). Meanwhile, dimensionality reduction is performed by PCA, and the distribution of sample points is displayed on two-dimensional plane. The study shows that the samples with different conditions (HC and OA) are obviously distinguished, and the distance between biological replicates is relatively close, indicating that the consistency of biological replicates and the difference of groups are comparatively obvious (Fig. 1B). Subsequently, the gene expression profiles of 17 samples in the gene expression dataset GSE82107, including 7 control and 10 OA samples, were analyzed. A total of 1676 DEGs between the control and OA samples were identified, including 218 upregulated and 1458 downregulated genes (Fig. 1C).
2. GO function and KEGG pathway enrichment analysis.
GO and KEGG pathway enrichment analyses were performed using DAVID and GSEA in order to gain a comprehensive understanding of the functions of the genes. DAVID focuses on the gene sets contained upregulated or downregulated genes, while GSEA can observe the consistency of gene expression in entire dataset with specific functional gene sets to interpret the biological information. The results of the GO analysis from DAVID indicated that upregulated and downregulated genes were enriched for various BP terms, which are showed in Supplemental Fig. 1 and Supplemental Table 1 (top 5). For upregulated DEGs these included ‘response to organic substance’, ‘protein modification process’, ‘cellular protein modification process’, ‘cellular response to chemical stimulus’ and ‘cellular response to organic substance’, and for downregulated DEGs they included ‘regulation of signaling’, ‘regulation of cell communication’, ‘positive regulation of metabolic process’, ‘positive regulation of cellular metabolic process’, ‘regulation of signal transduction’. In the molecular functions category, the upregulated genes were enriched for ‘receptor binding’, ‘protein complex binding’, ‘glycosaminoglycan binding’, ‘cytokine activity’ and ‘anion binding’, while the downregulated genes were enriched for ‘adenyl ribonucleotide binding’, ‘adenyl nucleotide binding’, ‘ATP binding’, ‘molecular function regulator’ and ‘cytoskeletal protein binding’. A cellular components analysis further demonstrated that the upregulated genes were enriched for ‘extracellular region’, ‘extracellular region part’, ‘membrane-bounded vesicle’, ‘extracellular exosome’ and ‘extracellular vesicle’, while the downregulated genes were enriched for ‘intrinsic component of plasma membrane’, ‘integral component of plasma membrane’, ‘cell junction’, ‘neuron part’ and ‘plasma membrane region’. In addition, 10 KEGG pathways were identified, as listed in Supplemental Fig. 2 and Supplemental Table 2 (top 5), involving the ‘cytokine-cytokine receptor interaction’, ‘chemokine signaling pathway’, ‘ECM-receptor interaction’, ‘TNF signaling pathway’ and ‘osteoclast differentiation’ for upregulated DEGs, and ‘neuroactive ligand-receptor interaction’, ‘calcium signaling pathway’, ‘cAMP signaling pathway’, ‘oxytocin signaling pathway’ and ‘dopaminergic synapse’ for downregulated DEGs.
GSEA analysis revealed that 8 GO terms, including ‘bone development’ and ‘bone morphogenesis’, ‘bone remodeling’, ‘cartilage development involved in endochondral bone morphogenesis’, ‘chondrocyte development’, ‘chondrocyte differentiation’, as well as ‘positive regulation of osteoblast differentiation’, and ‘regulation of bone development’, shown significantly differential enrichment in OA phenotype based on NES and NOM P value (Fig. 2, Table 1).
Table 1
Gene Set Enrichment Analysis of genes associated with osteoarthritis.
Term | Size | NES | NOM P value | FDR q value | Leading edge |
GO_BONE_DEVELOPMENT | 151 | -1.63 | 0.000 | 0.115 | tags = 36%, list = 19%, signal = 44% |
GO_BONE_MORPHOGENESIS | 77 | -1.95 | 0.000 | 0.038 | tags = 36%, list = 15%, signal = 43% |
GO_BONE_REMODELING | 35 | -1.65 | 0.008 | 0.114 | tags = 40%, list = 14%, signal = 47% |
GO_CARTILAGE_DEVELOPMENT_ INVOLVED_IN_ENDOCHONDRAL_ BONE_MORPHOGENESIS | 19 | -1.73 | 0.010 | 0.089 | tags = 53%, list = 15%, signal = 62% |
GO_CHONDROCYTE_DEVELOPMENT | 21 | -2.39 | 0.000 | 0.003 | tags = 57%, list = 13%, signal = 66% |
GO_CHONDROCYTE_DIFFERENTIATION | 57 | -1.57 | 0.006 | 0.135 | tags = 33%, list = 13%, signal = 38% |
GO_POSITIVE_REGULATION_ OF_OSTEOBLAST_DIFFERENTIATION | 58 | -1.91 | 0.000 | 0.047 | tags = 38%, list = 15%, signal = 45% |
GO_REGULATION_OF_BONE_DEVELOPMENT | 17 | -2.23 | 0.000 | 0.010 | tags = 65%, list = 16%, signal = 77% |
NES: normalized enrichment score. Size represents the total number of genes in the gene set. NOM P value indicates the credibility of enrichment results. FDR q value represents the corrected P value of multiple hypothesis testing. Tags represents the proportion of core genes in the total number of genes in the gene set, while list represents the proportion of core genes in the total number of genes. For a gene set, when the number of core genes is the same as the total number of genes under the gene set, the signal value is the largest. |
3. Construction of weighted gene correlation network analysis.
In this study, WGCNA was used to construct the gene correlation module associated with the sample trait. In total, 5879 genes were included for analysis, and the soft threshold β was calculated before construction the weighted co-expression network. We set the correlation coefficient to 0.9 as screening criteria and calculated the value of soft threshold β as 6 (Fig. 3A). Then a total of 13 gene modules with different colors were recognized by hierarchical clustering (Fig. 3B). Each module was assigned a unique color: black, blue, brown, green, greenyellow, magenta, pink, purple, red, salmon, tan, turquoise and yellow represented 231, 835, 472, 298, 84, 132, 134, 96, 257, 48, 72, 2829 and 326 genes, respectively. Of all genes, 65 were not assigned. We clustered modules based on gene expression and obtained a correlation heatmap between modules (Fig. 3C), and then transformed them into a topological overlap matrix (TOM) and visualized the system clustering tree of gene (Fig. 3D). Eigengenes was correlated with external traits in order to search for the significant associated modules. It was clear that the MEgreen (298 genes) was most positive associated with OA (Fig. 3E).
4. Sub-modules in the PPI network and identification of core genes.
We detected densely linked regions in large PPI networks evaluated by STRING online tool that may represent molecular complexes using an MCODE analysis, and several significant modules were identified. The 3 modules with the highest score were selected (Fig. 4A ~ Fig. 4C). The study also calculated degree using NetworkAnalyzer, and degree > 30 was set as the thresholds. So far, the gene sets obtained by the analysis of MCODE, Degree, GSEA and WGCNA were put together and the common genes were detected (Fig. 4D). The results showed that golgi glycoprotein 1 (GLG1), secreted frizzled related protein 2 (SFRP2), secreted protein acidic and cysteine rich (SPARC), 3'-phosphoadenosine 5'-phosphosulfate synthase 2 (PAPSS2), vitamin K epoxide reductase complex subunit 1 (VKORC1) and cathepsin K (CTSK) were coexisted in GSEA and WGCNA gene sets, as well as TIMP metallopeptidase inhibitor 1 (TIMP1) and syndecan 1 (SDC1) were commonly contained in Degree and WGNCA gene sets, and C-C motif chemokine receptor 5 (CCR5) was covered by Degree, MCODE, and WGCNA gene sets. The CTD database showed that these genes targeted several musculoskeletal diseases and these data appear in Fig. 4E. ROC analysis showed that 5 genes such as GLG1, PAPSS2, CTSK, TIMP1 and SDC1 could serve as valuable biomarkers for distinguishing patients with OA from healthy controls (Fig. 4F, Table 2).
Table 2
Receiver operator characteristic analysis of 9 candidate genes.
Genes | AUC | Std. Error | P value |
GLG1 | 0.8143 | 0.1116 | 0.0318 |
SFRP2 | 0.7429 | 0.1251 | 0.0971 |
SPARC | 0.6286 | 0.1385 | 0.3798 |
PAPSS2 | 0.8571 | 0.09201 | 0.0147 |
VKORC1 | 0.5143 | 0.1495 | 0.9223 |
CTSK | 0.8286 | 0.1018 | 0.0248 |
TIMP1 | 0.8429 | 0.09748 | 0.0192 |
SDC1 | 0.9571 | 0.04869 | 0.0018 |
CCR5 | 0.7857 | 0.1192 | 0.0510 |
AUC: area under curve. AUC value is the area covered by the ROC curve |