Identification of significant immune-related genes associated with childhood asthma
Sample quality was detected using the training set GSE40732 by the ‘flash cluster’ package in R software. The results indicated that there were no outlier samples in the data set (Fig. 1 A). Immune-related gene expression profile data in the training set were used to establish the co-expression network. First, we used the degree of correlation between immune-related genes to obtain the similarity matrix. Next, we converted the similarity matrix into an adjacency matrix, according to the formula, amn =|cmn|b. Finally, we converted the adjacency matrix into a topological matrix using an appropriate soft threshold (b=4; R2= 0.98; Figs. 1 B-E). Five gene modules were obtained via the dynamic shear tree (Fig. 2 A). Module-trait relationships indicated that the yellow module showed the highest correlation with childhood asthma (Fig. 2 B; p=0.01). All 69 genes in the yellow module were selected as significant immune-related genes suitable for further research (Fig. 2 C). To explore the potential mechanisms underlying the role of significant immune-related genes in childhood asthma, GO analysis was performed using the ‘clusterProfiler’ package in R software. The results revealed that significant immune-related genes may participate in childhood asthma by inducing immune activation (Fig. 3).
Random forest model construction
The 69 significant immune-related genes were screened to determine the optimal variables capable of predicting the occurrence of childhood asthma, using the generalized linear (GLM), RF, and support vector machine (SVM) models, all of which were established separately using the ‘DALEX’ package. The RF model, which showed the least residual distribution, was selected as the best model to predict childhood asthma (Figs. 4 A, B). We evaluated the importance of significant immune-related genes in predicting the occurrence of childhood asthma based on the established RF model. We visualized the top 30 genes after ranking these genes according to their importance (Fig. 4 C), where ‘%IncMSE’ represents an increase in mean squared error. The more important a predictor gene is, the greater the prediction error will be, when its value is randomly replaced. ‘IncNodePurity’ represents the increase in node purity, which indicates the effect of each gene on the heterogeneity of observations at each node of the classification tree. A 10-fold cross validation was performed to select genes. These results indicated that the predicted results were more stable and exhibited lower cross validation errors when the number of genes was in the top 10. Selection of the top 10 genes was based on the ‘IncNodePurity’ values corresponding to these genes, and the 10 genes selected were CD244, DDX58, CMTM2, NRG1, HLA-DPA1, TNFSF13B, TNFSF13, CMKLR1, PSME2, and SEMA6B.
Nomogram model construction
A nomogram model based on the selected top 10 genes was constructed, using the ‘rms’ package in R, to predict the prevalence of childhood asthma patients (Fig. 5 A). Calibration curves revealed that the predictivity of the nomogram model was accurate (Fig. 5 B). The ‘rmda’ package was used to plot the DCA curve. The red line remained above the grey and black lines from 0.2 to 0.8, indicating that decisions based on the nomogram model may benefit childhood asthma patients (Fig. 5 C). We further plotted the clinical impact curve to more intuitively assess the clinical impact of the nomogram model. The predicted number of high-risk patients was greater than that of high-risk patients per event, indicating that the predictive power of the nomogram model was remarkable. (Fig. 5 D).
Identification of two molecular subtypes of childhood asthma
Consensus clustering enables childhood asthma patients to be divided into two molecular subtypes based on the expression profiles of the top 10 genes selected from the training dataset. Matrix heat maps were clearly separated at k = 2 (Fig. 6 A). The rows and columns of the matrix heat map represent samples. The values of the consistency matrix are colored white to dark blue from 0 (impossible to cluster together) to 1 (always clustered together). The cumulative distribution function (CDF) curve indicated that the CDF reaches a value that approximates the maximum value at k=2 (Fig. 6 B). The item cluster membership across different k numbers that tracks cluster history is presented (Fig. 6 C). Classification stability based on the selected top 10 genes was successfully verified using the testing dataset (Supplementary Fig. 1).
Landscape of immune infiltration in childhood asthma
ssGSEA was used to evaluate the abundance of 28 infiltrating immune cells in childhood asthma patients. Myeloid-derived suppressor cells (MDSCs) exhibited the highest infiltration abundance among these 28 immune cell types (Fig. 7 A; Supplementary Fig. 2 A). Differential analysis of the training dataset indicated that activated CD8 T, gamma delta T, T follicular helper, CD56bright natural killer, immature dendritic, and natural killer T cells displayed higher levels of infiltration abundance in asthma patients compared to those in non-asthma patients (Fig. 7 B). However, the abundance of activated CD8 T cell infiltration in asthma patients in the testing dataset was lower than that in non-asthma patients (Supplementary Fig. 2 A). The imbalance between the numbers of non-asthma and asthma patients in the testing dataset may lead to immune infiltration results that are different from those in the training dataset. Moreover, we explored relative immune cell infiltration levels in the following two molecular subtypes: Sub1 and Sub2. Childhood asthma patients in Sub1 displayed higher immune cell infiltration levels than those in Sub2 (Figs. 8 A & B). The major histocompatibility class 1 (MHC1) complex is necessary to provide endogenous cellular antigens to circulating T cells. In the training dataset, the expression levels of HLA-DMB, HLA-DOA, and HLA-DOB in Sub1 were higher than those in Sub2 (Fig. 8 C). In the testing dataset, the expression levels of HLA-DRA, HLA-J, HLA-DPB1, HLA-DQB2, HLA-C, HLA-DMA, HLA-DPA1, HLA-DMB, HLA-DPA2, and HLA-DRB1 in Sub1 were higher than those in Sub2 (Fig. 8 D).