In this study, we developed diagnostic models based on asthma-associated gene signatures using RNA-seq data from a total of 109 AEC and 393 NEC subjects. Initially, we performed differential gene expression analysis for asthma vs. controls using the AEC and NEC datasets, which yielded 235 DEGs in the AEC dataset and 4802 DEGs in the NEC dataset. In parallel, WGCNA identified a number of modules significantly correlated with asthma status (asthmatic vs. control) in the AEC or NEC datasets. Co-expressed genes within the modules significantly associated with asthma were extracted. Overlapping the DEGs with the co-expressed genes within these modules revealed a total of 150 and 2399 asthma-associated differentially co-expressed genes in the AEC and NEC datasets, respectively. Then, using the AEC RNA-seq data, four machine learning methods combined with WGCNA (WGRF, WGRFE, WGBoruta and WGLasso) prioritized 23, 26, 22 and 60 differentially co-expressed genes strongly associated with asthma, respectively. Comparison of diagnostic performance across multiple metrics showed that the WGRF algorithm with the 23-gene set performed consistently in distinguishing asthmatic from control subjects, with high accuracy both in the AEC data and in independent validation datasets from different tissue/cell types. Moreover, the WGRF algorithm prioritized the 34 most discriminative signatures in the NEC data, which can efficiently classify asthmatic from control subjects.
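The overlap step described above can be sketched as a set intersection between the DEG list and the genes of the asthma-associated modules. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name `differentially_coexpressed`, the module names and the gene symbols are all hypothetical placeholders.

```python
def differentially_coexpressed(degs, module_genes, asthma_modules):
    """Return DEGs that also belong to an asthma-associated module.

    degs           -- iterable of differentially expressed gene symbols
    module_genes   -- dict mapping module name -> iterable of gene symbols
    asthma_modules -- names of modules significantly associated with asthma
    """
    # Pool the genes of every asthma-associated module into one set.
    coexpressed = set()
    for name in asthma_modules:
        coexpressed.update(module_genes[name])
    # Keep only genes that are both differentially expressed and co-expressed.
    return sorted(set(degs) & coexpressed)


# Toy input (hypothetical symbols and modules, for illustration only).
toy_degs = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
toy_modules = {
    "module_1": ["GENE_A", "GENE_X"],
    "module_2": ["GENE_C", "GENE_D"],
    "module_3": ["GENE_B"],  # assumed NOT associated with asthma in this toy case
}
overlap = differentially_coexpressed(toy_degs, toy_modules, ["module_1", "module_2"])
print(overlap)
```

In this toy case, `GENE_B` is dropped because its module is not among the asthma-associated ones, and `GENE_X` is dropped because it is not a DEG.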
Ideally, gene-signature-derived asthma diagnostic models would be developed from the target tissues of disease development (e.g., lung tissue); however, this is difficult, particularly when a large sample size is needed to develop diagnostic tools with robust statistical power. Our asthma diagnostic classifiers were therefore developed based on the common biology of surrogate and target cell/tissue types, a concept that previous findings have suggested for clinical practice 9–10, 30. An experimental study suggested using nasal epithelial cells as a surrogate for bronchial epithelial cells in asthma 10. Although several previous studies developed classification models to predict asthma, most focused on gene expression data from a single tissue 11, 13. A previous study compared different tissue types, including AECs, NECs and peripheral blood mononuclear cells, to predict asthma and suggested that diagnostic models based on AECs and NECs predicted asthma better than a model derived from peripheral blood mononuclear cells 15.
To the best of our knowledge, our study developed asthma diagnostic models using co-expression networks combined with machine learning based on the largest RNA-seq dataset of AEC and NEC sample types in asthma. Prioritizing and identifying potential gene signatures to construct asthma classification models from easily accessible tissue sample types is vital for elucidating the pathological process of asthma at the molecular level and for providing evidence for the development of therapeutic targets. The main contributions of the current study are to identify potential gene signatures; to compare the diagnostic performance of different machine learning methods in classifying asthmatic from control subjects based on the AEC and NEC datasets; to validate the developed diagnostic models; and to suggest the most suitable diagnostic models, namely those with stable and robust performance in classifying asthmatic from control subjects, which has rarely been discussed in previous research. Our method prioritized and identified potential asthma-associated differentially co-expressed genes, suggesting that several of these genes may play a significant role in the pathogenesis of asthma.
The top ten prioritized AEC-derived asthma-correlated genes, namely PDPN, SORL1, PRKAA2, TMPRSS11E, HTRA1, COL1A1, CLIP2, DKK3, PRSS27 and MUC1, are linked with the progression of asthma pathogenesis. Abnormal expression of various collagen genes, including COL1A1, can contribute to subepithelial fibrosis, which is a hallmark of asthmatic airways 31. Another top-ranked gene in our method, mucin 1 (MUC1), is involved in many pathological processes in asthma and acts as an anti-inflammatory molecule in chronic rhinosinusitis, chronic obstructive pulmonary disease, and severe asthma 32. The top-ranked genes from the NEC data include CDH26, CTSC, ELOVL5, PRR15, CEP72, ANO1, LRRC8D, PCSK6, TSPAN3, and MOCS. A recent study found that abnormality of cadherin-26 (CDH26) characterizes IL-13 stimulation and T2 inflammation of the airway epithelium in asthma development 33. A previous study by Yang et al. (2017) showed that CTSC was overexpressed in asthma and associated with methylation marks in asthma and allergy subjects 34. Another study demonstrated that CTSC is matured through a multistep proteolytic process and secreted by activated cells during inflammatory lung diseases 35. Our study also confirmed that CTSC is upregulated and co-expressed in the nasal epithelium of asthmatic subjects. We also observed upregulation of CTSC in multiple tissue/cell types of asthmatic subjects, which suggests that CTSC upregulation across multiple tissue/cell types may be functionally associated with the development and progression of asthma.
Next, we performed functional enrichment analysis of the prioritized, differentially co-expressed genes in asthmatic subjects derived from the AEC and NEC data. The analysis showed that correlated genes derived from the AEC data were mainly enriched in 29 pathways. For example, AEC-derived gene signatures including COL1A1, COL8A2, COL4A2 and MUC1 were found to be involved in the pulmonary fibrosis idiopathic signaling pathway, which is associated with the development and progression of the pathological process of asthma 36–37. Similarly, the NEC-derived gene signatures from six asthma-correlated modules (yellow, midnightblue, green, blue, greenyellow and grey) were enriched for the Th1 and Th2 activation pathway, which has known roles in asthma pathobiology 38. Co-expression and module-based approaches are of particular importance for identifying large sets of genes that are important for a particular biological process beyond known candidate genes, or when the process has not previously been studied with genetic methods.
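To make the idea of pathway enrichment concrete, the sketch below computes a right-tailed hypergeometric p-value, one common way to test whether a gene list overlaps a pathway more than expected by chance. This is an assumption for illustration only: the text above does not state which enrichment test or tool was used, and the function name `enrichment_p_value` and the toy numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(universe_size, pathway_size, selected_size, overlap):
    """Right-tailed hypergeometric p-value, P(X >= overlap).

    universe_size -- number of genes in the background set
    pathway_size  -- number of background genes annotated to the pathway
    selected_size -- number of genes in the prioritized list
    overlap       -- number of prioritized genes that are in the pathway
    """
    upper = min(pathway_size, selected_size)
    # Number of ways to draw the prioritized list from the background.
    total = comb(universe_size, selected_size)
    # Sum the probability mass for overlaps at least as large as observed.
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20 background genes, 5 in the pathway,
# 4 prioritized genes, 3 of which fall in the pathway.
p = enrichment_p_value(universe_size=20, pathway_size=5, selected_size=4, overlap=3)
print(f"p = {p:.4f}")
```

In practice, p-values from many pathways would also need multiple-testing correction, which this sketch omits.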
More recently, machine learning and statistical methods have been commonly used in RNA-seq data analysis in biomedical studies 39–40. However, the analysis of high-dimensional RNA-seq data poses a number of challenges, including model overfitting and multicollinearity (e.g., the existence of co-expressed genes). Addressing these problems requires appropriate statistical machine learning methods. Here, to select an appropriate model for classifying asthmatic from control subjects, we evaluated different feature selection methods based on the results of the DEG and WGCNA analyses in the derivation datasets and in independent validation datasets. In terms of classification performance, the random forest-based model, WGRF, was identified as a robust method for selecting potential gene features to improve diagnostic performance. Notably, all methods showed better diagnostic performance in the derivation set; therefore, the robustness of the models should be validated. In our study, the developed diagnostic models were validated to examine whether they perform well in external datasets from different tissue/cell types, including bronchial epithelial cells (BECs), airway smooth muscle (ASM) cells and whole blood (WB). The validation analyses showed that the diagnostic models performed better in the BEC and ASM datasets than in the WB dataset. Similarly, diagnostic models/genes derived from the NEC data performed better when tested on the BEC and ASM datasets than on the WB dataset. A possible reason is that gene expression levels derived from WB may not be stable, so gene signatures identified from the AEC and NEC data yielded relatively low diagnostic performance in the WB data. In contrast, validation of the diagnostic models on gene expression from the target tissue sources, BECs and ASM cells, showed better performance; these target tissue/cell types have well-known roles in asthma exacerbations and remodeling 7, 41.
Most models predict well in the derivation dataset but poorly in external validation datasets 42, which may be due to weak extrapolation resulting from overfitting. The best model should have high AUC, F-measure and MCC values 28. Our gene-signature-based diagnostic models derived from the AEC and NEC data showed high accuracy and stable performance in external data from different tissue/cell types. Using validation datasets from multiple tissue/cell types guards against overoptimistic results and supports general reproducibility. Although our diagnostic models showed promising performance in predicting asthma, the current study still has some limitations. Since this study focused on computational analysis of retrospective samples, the identified signatures should be validated in future functional experiments. The sample size in some public datasets is small, which may hide potential correlations between gene expression signatures and the outcome variable. In the future, other feature selection strategies could be considered to improve the diagnostic prediction performance for asthma.
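The three criteria named above (AUC, F-measure and MCC) can be computed from predicted scores and labels as sketched below. This is a minimal stdlib-only illustration, not the evaluation code used in the study: the helper names, the 0.5 decision threshold and the toy labels and scores are assumptions, with 1 denoting an asthmatic subject and 0 a control.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, defined as 0 when a margin is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and classifier scores (hypothetical, for illustration only).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"AUC={auc(y_true, scores):.3f}  F={f_measure(tp, fp, fn):.3f}  MCC={mcc(tp, tn, fp, fn):.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance between asthmatic and control subjects.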
In conclusion, we identified a small number of co-expressed gene signatures and established stable and powerful diagnostic models, based on an integrated analysis combining bioinformatics and machine learning methods, to predict asthma diagnosis using airway epithelium gene expression data. Based on multiple diagnostic performance criteria, we found comparable diagnostic performance between AECs and NECs, which highlights that gene-signature-based diagnostic models derived from AEC and NEC data can serve as suitable surrogate models for predicting asthma diagnosis. More importantly, our diagnostic models are promising tools for improving decision making and may provide potential gene signatures for the diagnosis of asthma and other airway diseases.