Identification and analysis of differentially expressed genes in idiopathic pulmonary fibrosis
Three microarray datasets were chosen to identify DEGs in pulmonary fibrosis from the GEO. The GSE485 dataset contained 2 control samples and 2 pulmonary fibrosis samples. The GSE97546 dataset contained 3 lung samples of healthy mice and 3 lung samples of mice with lung fibrosis. The GSE37635 included 6 timepoints: the control (n = 7), 1 week (n = 7), 2 weeks (n = 6), 3 weeks (n = 6), 4 weeks (n = 6), and 5 weeks (n = 6) after bleomycin-treatment to induce lung fibrosis. We first extracted gene expressions changes at each time points in the GSE37635 dataset (Fig. S1), then the results were intersected with the other two datasets.
61 genes overlapped across the three datasets as shown in the Venn map. Genes with opposite trends were then excluded. Finally, we identified the 37 overlapping DEGs in the 3 datasets, consisting of 32 genes that were up-regulated and 5 genes that were down-regulated (Fig. 2A).
Differentially expressed gene ontology and KEGG pathway analysis in pulmonary fibrosis
DAVID was used to perform GO term and KEGG pathway enrichment analysis for further insight into the function of the identified DEGs. We selected the top 5 terms of biological processes and cell components according to p value. We found cell components of DEGs mainly enriched in extracellular regions, extracellular exosome and extracellular space (Fig. 2B). The biological processes were mainly related to inflammation and immune pathways (Fig. 2C). Further, molecular function mainly involved chemokine activity, cytokine activity and protein homodimerization activity (Fig. 2D). Moreover, eight KEGG pathways were overrepresented in DEGs, including staphylococcus aureus infection, prion diseases, systemic lupus erythematosus, chemokine signaling pathway, pertussis, complement and coagulation cascades, cytokine-cytokine receptor interaction and chagas disease (Fig. 2E).
Protein-protein interaction network construction and hub genes selection
The PPI network, which was consisted of 25 nodes and 74 edges, was constructed using STRING and was visualized by Cytoscape software to explore the association between the DEGs. The MCODE plugin was used to identify hub genes. C1qa, Fcgr1, C1qb, C1qc, Ccr5, Slc11a1, Aif1, Emr1 and Cxcl10 were identified as the most closely connected module which were highlighted in yellow, including 9 nodes and 32 edges (Fig. 2F), and the genes in this region were upregulated in pulmonary fibrosis.
Screening and analysis of genes related to hub co-expressed genes in lung cancer
Considering IPF increased the risk of LC development especially NSCLC(22), we selected the GSE31013 dataset containing 6 samples of lung tumors and 6 samples of age-matched normal lung tissue. After the DEGs between lung tumors and control samples were identified via GEO2R online tools with |logFC| > 1 and p value < 0.05, it was surprising to find that nearly half of the DEGs in pulmonary fibrosis also exhibited had significant difference in spontaneous lung tumors formations (Fig. 3A-3B, Table S2).
The DEGs gene ontology and KEGG pathway of these genes were re-analyzed via DAVID to better understand their functions. The results showed that genes of the cellular component of this module were also mainly enriched in the extracellular region and the molecular function mainly enriched in protein homodimerization activity. The biological process analysis results suggested complement activation and cellular potassium ion homeostasis together with immune and inflammatory pathways might contribute to the occurrence and development of pulmonary fibrosis and LC (Fig. 3C-3D). The top hub (C1qa, C1qb, C1qc and Ccr5) genes were also identified by the MCODE plugin, including 4 nodes and 6 edges (Fig. 3E-3F).
Validation of C1q in IPF and lung cancer
As C1q ranked high among differentially expressed genes (Table S3), we focused on the role of C1q in IPF and LC. First, pulmonary fibrosis was induced by intratracheal instillation of bleomycin, while the control group was given normal saline in the same condition. The levels of fibrosis were measured by QPCR (Fig. 4A) and Masson staining (Fig. 4C-4D) in IPF group increased, indicating the accomplishment of modeling. We further detected the levels of C1q by QPCR and immunohistochemistry, and found that the expression of C1q increased in IPF (Fig. 4B-4C, Fig. 4E), consistent with previous analysis. Besides, we found the concentrations of C1q in the blood and balf of IPF increased as well (Fig. S2, Fig. S3).
The immunohistochemical stainings from the PHA also showed C1q expression of lung adenocarcinoma was higher than that of normal lung, while squamous cell carcinoma showed no significance up-regulated trend (Fig. 5A, Fig. S4). We detected a relationship between C1q and TGF-β1 as well as other mesenchymal markers (ACTA2, COL1A1, and CTGF) in lung adenocarcinoma using the online tool TIMER2.0. These factors involved in pulmonary fibrosis were positively related to the level of C1q expression (Fig. 5B).
As methylation induced aberrant epigenetic regulation is an important common pathogenic mechanism of IPF and lung cancer(23), we explored DNA methylation levels of C1q in NSCLC (LUAD and LUSC) and IPF separately using the SurvivalMeth database and GSE63704. The results were shown in the box plot. We found most probes of hub genes decreased significantly in LUAD (n=438) compared to normal samples (n=32) (p < 0.05) (Fig. S5A-D). In LUSC, decreased probe of C1q also showed significant difference (Fig. S6A-D). Moreover, 4 probes of C1qa, 5 probes of C1qb, and 6 probes of C1qc were significantly decreased in IPF group (Fig. S7A-D). Common significant probes were listed in Table 3 and Table 4.
The prognostic value of C1q in IPF and LUAD
Expression of C1qa, C1qb and C1qc genes were examined and their clinical prognostic significance was investigated within the Kaplan–Meier Plotter database including 1925 cases of NSCLC. We noted that NSCLC patients with elevated C1q (C1qa, C1qb, C1qc) levels had lower OS (p < 0.05) (Fig. 6A). However, patients with higher concentrations of Ccr5 showed higher survival probability, which was inconsistent with previous results (Fig. S8). Furthermore, we investigated the Kaplan–Meier curve in 672 patients with lung adenocarcinoma (LUAD) and 271 patients with lung squamous cell carcinoma (LUSC). The results demonstrated that high levels of C1q expression revealed worse prognosis of adenocarcinoma (p < 0.05) (Fig. 6B) compared with squamous cell carcinoma (p > 0.05) (Fig. 6C).
We further investigated the expression of C1q in IPF human patients to verify its role by using GSE124685 which had 84 samples including control samples (n=35) and IPF samples (n=49) from 10 control patients and 6 IPF patients (Table S4). The results were consistent with previous studies. The FPKM of C1q was also significantly increased in IPF patients (Fig. 6D). And then we divided samples IPF into two groups according to alveolar surface density (ASD) which was negatively correlated with degree of fibrosis. We identified samples with ASD > 6/µm as early or progressive fibrosis and grouped them to IPF1. Other samples (ASD < 6/µm) with end-stage fibrosis were grouped to IPF2. Although the difference between the IPF1 and IPF2 was not significant, the average expression levels of C1q in IPF2 group had an uptrend compared to IPF1 group (Fig. 6E). These results suggested that C1q might also have prognostic value for IPF progression.