Identification of differentially expressed genes between ARDS and COVID-19
Principal component analysis (PCA) was used to visualize the distribution of these samples before and after correction for batch effects (Fig. 1A, 1B). We performed data correction and normalization on three datasets (GSE32707, GSE66890, and GSE213313) and identified 1114 DEGs in ARDS, including 575 up-regulated and 539 down-regulated genes, and 3587 DEGs in COVID-19, including 1738 up-regulated and 1849 down-regulated genes. Meanwhile, by plotting the Wayne diagram to screen the common DEGs between ARDS and COVID-19, the results showed that 180 and 61 overlapping DEGs were found in the up-regulated and down-regulated DEGs, respectively (Fig. 1C, 1D).
Enrichment analysis
To explore the biological functions and pathways of the common DEGs, we performed GO enrichment analysis (Fig. 1E, 1F).The results of GO analysis showed that upregulated DEGs were mainly enriched in mitotic cell cycle, cytoplasmic vesicle lumen, and protein kinase regulator activity, while downregulated DEGs were mainly enriched in cellular defense response, endocytosis vesicles, and T cell receptor binding.
Weighted gene co-expression network analysis of ARDS and COVID-19
We used co-expression analysis to construct co-expression networks to explore the correlation between clinical traits and genes. In this study, clustering analysis was performed using the "Flash clust" function. When the threshold was set to 75, 6 outlier samples were detected and deleted, and 145 samples were retained (Fig. 2A, 2B).
The "Select Soft Threshold" function of "WGCNA" filters out power parameters from 1 to 30. A power of β = 5 is selected as the most appropriate soft threshold to ensure a scale-free network, and the results show that the optimal soft power value is 10, and a total of 11 modules are identified.
"Cutree" dynamics and module characterizing gene functions were used to construct cluster maps (Fig. 2C, 2D), and a total of 11 modules consisting of genes with similar co-expression traits were obtained. Heat maps of module-trait relationships were then plotted according to Spearman correlation coefficients to evaluate each module with disease clinical features (Fig. 2E). The purple-red module indicated high connectivity between ARDS and COVID-19 (ARDS: r = 0.16, p = 0.05; COVID-19: r = 0.17, p = 0.04). The purple-red module contained positively associated genes of ARDS and positively associated genes of COVID-19 (Fig. 2F, 2G).GO analysis of the module genes of ARDS and COVID-19 showed that the co-expressed genes of BP were mainly enriched in blood coagulation, trauma repair, and regulation of humoral level, and CC were mainly concentrated in the actin cytoskeleton, platelet α-granules, and cytoplasmic vesicle lumen. MF was mainly associated with actin binding, adhesin binding, and aminoglycans, and KEGG analysis of ARDS and COVID-19 showed that it was mainly associated with focal adhesion, PI3K-Akt pathway, and regulation of actin cytoskeleton (Fig. 3A).
Machine Learning Identification of Intersecting Genes in ARDS and COVID-19
We used RF method to screen the intersecting genes of ARDS and COVID-19, and further screened the intersecting genes by RF algorithm while visualizing them in the order of gene importance.
The significance of the top 30 significant genes was also visualized (Fig. 3B, 3C). We further performed dimensionality reduction by LASSO to obtain the last 5 genes, which were HIST1H2BK, TCF4, OLFM4, KIF14, and HK1 (Fig. 3D, 3E).We constructed candidate gene diagnostic models from the GSE66890 training set using XGBoost and validated them in the GSE32707 dataset. In the GSE66890 dataset of ARDS, the AUC of ROC curves was 0.952 and the AUC of PRCurves was 0.961, while in the validation set GSE32707, the AUC of ROC curves was 0.725 and the AUC of PRCurves was 0.543; the AUCs of ROC curves of ARDS models were all greater than 0.7, indicating that the model has good diagnostic value (Fig. 3F, 3G). Meanwhile, to verify whether it could identify COVID-19 patients, we used the COVID-19 dataset GSE213313 in the same model, and the results showed that the AUC of ROC curves was 1 and the AUC of PR curves was 1, indicating that the model was also used in COVID-19 patients with a high diagnostic effect (Fig. 3H).
We further evaluated the diagnostic value of the five central genes screened, and the ROC curves of HIST1H2BK (AUC = 0.802), OLFM4 (AUC = 0.716), KIF14 (AUC = 0.812), and HK1 (AUC = 0.740) were all greater than 0.7, which had high diagnostic value, and TCF4 (AUC = 0.692) had a diagnostic value that was relatively slightly worse (Fig. 4A-E).
Immune infiltration analysis
The pathomechanisms of both ARDS and COVID-19 are related to inflammatory responses due to overstimulation of the immune system, and therefore we performed immune infiltration analyses on the datasets of both. We analyzed disease and immune infiltrating cell correlations based on 22 types of immune cells using the CIBERSORT method (Fig. 4F-G). Violin plots showed that naïve B cells and regulatory T cells appeared increased in combined ARDS samples compared to control samples; in COVID-19 samples, activated mast cells, macrophages, resting NK cells and plasma cells showed an upward trend compared to normal samples, and memory B cells, CD8 T cells, resting memory CD4 T cells, activated NK cells and activated dendritic cells showed a downward trend.
Single-cell sequencing in patients with combined ARDS and sepsis
We downloaded the single-cell RNA sequencing dataset (GSE168522) and selected one healthy and one AD patient in the dataset as a pruning subject for analysis. First, we performed data quality control. We retained cells with less than 10% mitochondrial genes and less than 3% erythrocytes. Cells with gene number (nFeature RNA) greater than 2000 or less than 200 were filtered out (Fig. 5A-C). The batch effects were also merged and corrected by the "Harmony" package (Fig. 5D), with a high overlap between samples, and then the number of PCs with smooth curves was selected, and the top 10 dimensions were taken for the subsequent analysis, and the downscaling effects of UMAP and tSNE were shown.
We further clustered the cells using the FindCluster function, which showed that the percentages of monocytes, B cells, T cells, NK cells and platelets increased in the ARDS group. The cell ratio analysis also showed that T cells, monocytes, and B cells were more abundant in patients with sepsis combined with ARDS, and the number of NK cells tended to be higher in patients with sepsis combined with ARDS(Fig. 6A-B).
We performed cellular annotation of five genes previously screened by machine learning and diagnostic prediction modeling in the merged ARDS group and the unmerged ARDS single-cell RNA sequencing dataset. The results showed that TCF4, HK1, and HIST1H2BK were annotated in all five cell clusters, and KIF14 was mainly annotated by monocytes (Fig. 6C).TCF4-annotated cells were mainly concentrated in B cells and monocytes, HK1-annotated cells were mainly concentrated in monocytes and T cells, and HIST1H2BK-annotated cells were mainly concentrated in monocytes. TCF4, HK1 were highly expressed in both groups, but the gene expression was higher in the combined ARDS group than in the unincorporated ARDS group, and the expression of HIST1H2BK and KIF14 was lower in both groups, and KIF14 was almost not expressed (Fig. 6D). Due to the number of samples and the sequencing method, OLFM4 was not detected in the single-cell sequencing of the dataset.
Next, we analyzed the cell ratio and expression of the four genes; TCF4 had elevated expression in B cells, monocytes and platelets in the combined ARDS group, and B cells in the uncomplicated ARDS group, but the cell percentage was smaller than that in the combined ARDS group; HK1 had elevated expression in monocytes, NK cells and platelets in the combined ARDS group, and the cell percentage was larger than that in the uncomplicated ARDS group (Fig. 7A); HIST1H2BK had elevated platelet expression in the uncomplicated ARDS group and a greater cell percentage than in the uncomplicated ARDS group (Fig. 7B). Previous studies have shown that oxidative stress plays an important role in the pathogenesis of sepsis and ARDS, and that oxidative stress and inflammatory response interact with each other to promote disease progression. Analysis of the ssGSEA metabolic pathway showed that there was a difference in oxidative scores between the two groups, with the combined ARDS group having a higher oxidative score than the uncomplicated ARDS group (Fig. 7C), suggesting that the patients with combined ARDS had intense oxidative stress and a more severe disease.
TCF4 and HK1 were further analyzed due to their higher expression and percentage in the five cell clusters. The results of the analysis showed that the expression of TCF4 and HK1 was increased in monocytes and B cells, and the gene coexpression overlap was higher in both monocytes and B cells in patients with combined ARDS compared with patients without combined ARDS (Fig. 7D). Correlation analysis of TCF4 and oxidation scores showed a high correlation between TCF4 and oxidation scores in B cells of patients with combined ARDS and a high correlation between TCF4 and oxidation scores in monocytes of patients without combined ARDS (Fig. 7E).