Training the denoising autoencoder
We used a one-layer denoising autoencoder model comprised of an encoder and a decoder (Fig 1). The encoder embeds the original input data into a lower-dimensional space, the hidden layer, and the decoder reconstructs the input from the values of the hidden layer. We tuned the parameters of the model using cross-validation by training on 90% of the randomly selected sample and testing on the remaining 10% and repeated this process 10 times. Both the training and testing loss showed proper convergence (Fig S1, S2), indicating that we largely avoided overfitting. We then projected the input data from all non-control input samples to the embedding space of the trained model, obtaining a set of 50-dimensional vectors.
Hidden units associate with TEA clusters
By encoding the original data into the hidden vector space, the model produced a sparse embedding space; moreover, we could observe distinct patterns related to clinical traits (Fig 2a). Some hidden units showed approximately complementary behaviors (e.g., H26 and H38), indicating their distinct relevance to key molecular mechanisms and clinical subgroups of patients.
To better interpret the learned patterns, we evaluated the embeddings of all input data for their correlations with identified clinical traits. We first removed hidden units with variance < 0.001. Using the embeddings of the highly variable hidden units, the samples formed several small clusters corresponding to the previously defined TEA clusters, with high homogeneity (Fig 2a). We found that TEA1 contained a larger proportion of severe patients than TEA3 (Fig S4). TEA2 was relatively harder to distinguish because it was somewhat of an intermediate between TEA1 and TEA3, as some of the TEA2 samples were spread across several clusters.
Generally, the values of hidden units showed gradual monotonic changes, either increasing or decreasing, from TEA1 to TEA3 (Fig S5), indicating associations of the hidden units with clinical traits of the samples. We selected five hidden units (H26, H27, H36, H38, H45) that were significantly correlated with TEA cluster labels (Spearman correlation > 0.65), namely Hsig. Based on the performance of these five hidden units, we further categorized them into two major classes: one that was negatively correlated with TEA cluster labels (H26, H27; Fig 2b) and one that was positively correlated (H36, H38, H45; Fig 2c).
Annotation of relevant hidden units
To further understand the biological significance of the hidden units, we tried to associate them with functional enrichment based on the weights of the encoder network that mapped the input to the hidden unit space. This weight represents the contribution of the gene to the value of the hidden unit and can be considered the component of the hidden unit. Thus, we could infer biological significance of the hidden units from the weights of the encoder.
The weight distribution of the encoder layer showed similar patterns for hidden units belonging to the same set (Fig 3a). The negative set showed a nearly symmetric distribution with a slight negative skew, whereas the positive set showed a highly positively skewed shape. Using encoder weights for all 22,148 genes as the ranking score, we performed gene set enrichment analysis using Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway gene sets to obtain functional enrichments of Hsig. Similar to the weight distribution, hidden units from the same set showed strong resemblance with respect to enrichment of functional terms (Fig S6). Notably, many of the enriched functional terms were related to molecular functions and signaling pathways associated with disease pathogenesis and autoimmune response, whereas deficient terms included “gene expression machinery” and “metabolism”. Specifically, the enrichment of the term “asthma” showed high significance in all five hidden units (Fig 3b, Fig S7).
We then extracted the top-weighted genes of Hsig as potential clinical markers for downstream analysis. For each hidden unit, we extracted the top 200 genes with the highest weights; we removed ribosome-related genes from the analysis to prevent ribosome functional roles from dominating the selected gene set. Concordant with previous observations, the gene sets for hidden units belonging to the same class were highly similar. By merging the top-weighted genes from all five hidden units, we obtained a list of 330 genes. We identified several genes that showed distinct differential patterns of weights between the positive and negative set (Fig 3c). Gene ontology (GO) analysis revealed an enrichment of disease-related terms like “antigen processing and presentation”, “immune response” and “interferon-gamma signaling pathway” (Fig S8).
Top-weighted genes associate with asthma severity
In order to assess the clinical relevance of the learned model, we investigated the association between hidden unit values and clinical data. Generally, the performance of hidden units showed distinct correlations with various clinical traits (Fig S9). We then tested the predictive performance of some clinical traits using the embedding values of Hsig and the expression of the combined list of their top-weighted genes.
We first associate the value of all hidden units, Hsig and the top-weighted genes with the asthma severity levels of the patients. We only used samples labeled as “mild” or “severe” for the classification analysis. Given each training dataset, we trained a random forest classifier and assessed the predictive accuracy on the test data, represented by the area under the receiver operating characteristic curve (AUROC) value (Fig 4a).
From the top-weighted genes, we then performed feature selection to obtain the most relevant genes. We considered 50 genes with the top average importance as the most relevant to the prediction of severity (Fig 4b, Table S1). Expression of these genes reached an average AUROC of 0.8066 (average AUROC is 0.6221 for randomly selected 50 genes from all expressed genes) in predicting severity level. This indicates the top weighted genes are related to severity and the latent patterns unsupervised learned by the autoencoder are biological relevant.
We also evaluate the performance of differentially expressed genes (DEGs) from severe versus mild patient group (AUROC: 0.9318, see Methods for details) (Fig S10). There is no overlapping between our selected genes with DEGs, but network analysis showed that our selected genes have significantly higher interconnectivity and higher centrality in the network (Wilcoxon test, p-value = 0.003) compared to the DEGs (Fig S11). This is not surprising since the severity related DEGs is specifically selected to be discriminative. In contrast, the top-weighted genes are defined from the most representative patterns among all the samples (including control, mild, moderate and severe groups) learned by unsupervised learning. They are not necessarily associated with severity, but can still achieve high predictive performance, though not as high as specific DEGs. Also, they play more central roles through network interactions. This further indicates our framework can capture the biological relevant information.
Our list of selected genes included genes related to autoimmune responses, such as antigen processing and presentation, T-cell toxicity, interferon signaling and cell adhesion (Fig 4c). In the context of a protein-protein interaction network, we identified genes residing in various functional modules (Fig 4d). Specifically, several genes, such as human leukocyte antigen genes and cathepsin L1, belonged to a module related to immune response. In addition, the list included genes with potential relevance to asthma pathogenesis, such as immunoglobulin heavy chain (IGHG1, IGHA1 and IGHV3-48), inflammation (S100A9) (13) and iron response (IREB2) (14, 15) genes.
Prediction of clinical measurements
To evaluate an individual’s lung function, clinicians use the ratio of the volume forcefully exhaled in one second versus the maximum volume of a forceful exhale (i.e., the FEV1/FVC ratio) (16). A significantly reduced FEV1/FVC ratio is a sign of airflow obstruction, and is considered a criterion for asthma severity.
We found that the hidden units in Hsig generally showed a higher absolute value of correlation with the FEV1/FVC ratio, in both the positively and negatively correlated subsets, compared to other hidden units, indicating clinical relevance of these hidden units (Fig 5a).