In total, 41 combinations between six BEC methods (ZINB-WaVE13, MNNCorrect14, scMerge15, Seurat v316, limma_BEC10 and ComBat17), covariate models, four meta-analysis methods (Fisher, weighted Fisher (wFisher)12, fixed effects model (FEM)11 and random effects model (REM)11), and eight DE methods (DESeq29, edgeR18, edgeR_DetRate19, limma10, limmatrend20, moderated t-test21, MAST22 and the Wilcoxon ranksum test) were tested. These 41 methods are denoted as integrative (DE) methods. We note that all the six BEC methods tested here yield BEC data that can be used for DE analysis. Refer to Supplementary Information on how each integrative method was implemented. We focused on comparisons between two cell groups (test vs. control groups) and considered two, four, or seven batches. For each integrative method, a threshold of q-value < 0.05 was used to select differentially expressed genes (DE genes). For simulated data, F-score and area under precision-recall curve (AUPR) were compared between methods. In particular, we compared \({\text{F}}_{0.5}\)-scores and partial AUPR (denoted as pAUPR) for recall rates < 0.5, both of which weighed precision higher than recall, considering the high noise rate and sparsity of scRNA-seq data (see Methods). For real scRNA-seq data (LUAD), the ranks of known disease genes and prognostic genes, false-positive rates (p-value < 0.05), and false discoveries (q-value < 0.05) were compared. The real data contained seven batches for which we additionally tested four pseudobulk analysis methods23. Throughout this study, unless otherwise stated, sparsely expressed genes (zero rate > 0.95) were filtered for reliable DE analysis. We observed similar trends when a less strict threshold (zero rate > 0.99) was used; however, we focus on the results using the more strict condition, considering that genes that are very rarely expressed in a cell type are less likely to play a crucial role in disease.
Model-based simulation tests
ScRNA-seq data were simulated on the basis of negative binomial (NB) model24. Sparse data with a high overall zero rate (> 80%) after the gene filtering were simulated for each batch. We used a rather challenging condition with large batch effects and small group differences to analyze the effects of each integrative strategy (Fig. 1a-b). The sizes of batch effects and group differences were compared using the principal variance component analysis (PVCA)25. The \({\text{F}}_{0.5}\)-scores and precision-recall analysis results for the two-batch case are shown in Fig. 1c-d. The experiment was performed for several combinations of “dropout” parameter values and test-control ratios, and the resulting \({\text{F}}_{0.5}\)-scores and pAUPRs were represented as boxplots and averaged curves, respectively. Parametric methods based on MAST, DESeq2, edgeR, and limmatrend showed good \({\text{F}}_{0.5}\)-scores and pAUPRs. Wilcoxon (ranksum) test applied to uncorrected data (denoted as Raw_Wilcox) is currently the most widely used for scRNA-seq DE analysis23; however, its performance was relatively low, with or without BEC method.
Next, we checked whether each integrative strategy truly improved the analysis of pooled uncorrected data. ZINB-WaVE provides the “observation weights” (dropout probability) that are used to “unlock” bulk RNA-seq tools to analyze single-cell data26. These weights were applied to the parametric methods, edgeR and DESeq2 (denoted as ZW_edgeR and ZW_DESeq2, respectively). First, incorporating the weights of ZINB-WaVE considerably improved edgeR and the use of BEC data from MNNCorrect moderately improved limmatrend (denoted as MNN_limmatrend), in both \({\text{F}}_{0.5}\)-scores and pAUPRs. However, no other BEC data improved DE analysis for these sparse data. Second, most covariate models (tagged with “_Cov”) improved their corresponding parametric methods, such as ZW_edgeR, DESeq2, limmatrend, and MAST. In particular, the performance of ZW_edgeR_Cov was among the highest. We also tested four batches and obtained similar results (Supplementary Fig. 1a, c-d).
Model-free simulation tests
We then devised a model-free simulation using real scRNA-seq data to incorporate realistic and complex batch effects and avoid potential bias toward parametric methods. First, we used the two batches from the human pancreas data27 (named as “human1” and “human2”) that were produced by the same laboratory using the same sequencing platform (inDrop28). The alpha cells were used for our simulation. Second, we used the two batches from Mouse Cell Atlas (MCA), which were produced by different laboratories using different sequencing platforms (Illumina HiSeq 250029 and NovaSeq 600030). For MCA data, the T-cells were used for our simulation. Because these cell types contained several subtypes, the largest clusters that were matched between batches were selected for our simulation. After removing sparsely expressed genes, the overall zero rates of the pancreas and T-cell data were 83% and 73%, respectively. Each batch dataset was randomly split into test and control groups with several different ratios, and then 20% of DE genes (10% up and 10% down) were simulated by downsampling positive counts using binomial distribution (see Methods). As expected, small and large batch effects were observed for the pancreas and MCA data, respectively, after PVCA (Fig. 2a-b). The corresponding \({\text{F}}_{0.5}\)-scores and precision-recall analysis results are shown in Fig. 2c-f. For the pancreas data, the integrative strategies rarely improved the DE analysis of uncorrected data, and parametric methods, such as limmatrend, DESeq2, and edgeR, performed well with minor differences in pAUPR. However, for MCA data that exhibited large batch effects, several integrative methods were effective. For example, edgeR-based methods exhibited relatively low pAUPRs compared to other parametric methods and the weights of ZINB-WaVE considerably improved edgeR in pAUPR, and incorporating batch covariate further improved the method, making ZW_edgeR_Cov the top-performer in both \({\text{F}}_{0.5}\)-scores and pAUPRs (Fig. 2f). The use of BEC data for ComBat and MNNCorrect also enhanced the performance of limmatrend. However, no improvement was observed for Wilcoxon test with any BEC method.
Whereas BEC methods can be used to reduce batch effects and denoise the data, their benefit in DE analysis was realized for only large batch effects and parametric methods. Covariate modeling also improved their corresponding parametric methods for large batch effects. Many parametric methods outperformed Wilcoxon test, and ZW_edgeR_Cov was among the best performers in both model-based and model-free tests.
Preservation of signs of differential expression
For the simulation cases described above, we counted the DE genes that reversed their signs by using an integrative method to compare the extent of data distortion. For the simulated DE genes, the signs of the computed logFC values in each integrative method were compared with the known ground truth. Meta-analyses were performed in both right- and left-tailed tests and the sign for the smaller p-value was used for each gene. Figure 3 and Supplementary Fig. 2 show the ratios of DE genes that altered their signs by each integrative method. Methods based on voom-transformation, including limma, and many meta-analysis methods tended to show relatively high error ratios. Relatively accurate and consistent results were obtained by using Wilcoxon test, MAST, and edgeR-based methods. The performance of Wilcoxon test worsened using BEC methods. We then compared the error ratios among the significantly detected DE genes (q-value < 0.05). In this case, most methods showed a much improved sign prediction. In summary, most integrative methods provided quite accurate results for the analysis of individual genes; however, some of the methods may yield inaccurate results for gene-set level analysis31,32, as it reflects the expression of all genes.
Effect of sparsity
We also compared the methods for less sparse data. The scRNA-seq data with 60% and 40% zero rates were simulated (see Methods). An overall increase for \({\text{F}}_{0.5}\)-scores and pAUPRs was observed as the data became less sparse (Supplementary Fig. 3). For these data, DESeq2 and limmatrend combined with integrative methods performed well. In particular, the best pAUPRs were achieved using DESeq2 combined with the FEM meta-analysis. Intriguingly, the use of weights from ZINB-WaVE did not improve edgeR anymore for 40% zero rate, whereas the use of BEC data from MNNCorrect consistently improved limmatrend. The pAUPR for Raw_Wilcox ranked only 22nd at the 80% zero rate, but its rank was rather improved to 12th and 15th at the 60% and 40% zero rates, respectively. Whereas all BEC methods tested here did not improve the results of Wilcoxon test at the 80% zero rate, some of the BEC methods improved the results of this nonparametric method for less sparse data. For example, at the 40% zero rate, Wilcoxon test combined with limma_BEC or ComBat performed better than Raw_Wilcox in both \({\text{F}}_{0.5}\)-scores and pAUPRs, and ranked as high as 5th and 8th in pAUPR, respectively. Overall, the sparsity of data considerably affected the integrative methods. Conventionally used BEC or meta-analysis methods developed for bulk sample data performed well for less sparse data, whereas single-cell-specific strategies such as MAST or ZINB-WaVE achieved a high performance for very sparse data.
Test for heterogeneous samples
Kim and colleagues33 conducted a comprehensive analysis of scRNA-seq data for LUAD with over 200K cells containing various cell types. Supplementary Fig. 4a shows the distribution of cancer and normal epithelial cells for seven patients with LUAD (stage I). Whereas the clusters of normal cells comprised multiple patients, those of cancer cells were clearly separated between patients. This heterogeneity in cancer samples seems to be related to the different causes or progression of the disease between patients. In other words, different patients may not share exactly the same set of DE genes; this case was referred to as “incomplete association” 12. We tested this case using four batches with 4% batch-specific DE genes simulated in each batch in addition to 15% common DE genes across the batches. We attempted to detect all the genes that were DE genes in at least one batch (Supplementary Fig. 1b, e-f). The best pAUPR was achieved using MAST_Cov, and ZW_edgeR_Cov also performed well. wFisher that combined the p-values of DESeq2 (denoted as DESeq2_wFisher) ranked 2nd in both \({\text{F}}_{0.5}\)-scores and pAUPR. These three methods can be considered when analyzing data complicated with both batch effects and biological variation between samples.
Control of false positives and false discoveries
We used the data for normal epithelial cells in the seven patients with LUAD (stage I ) to compare false-positives and false discoveries between integrative methods. The data for each patient were randomly split into two groups with several different ratios (e.g., 2:8, 3:7, 4:6, 5:5), and integrative DE analysis was performed with no DE genes included. We repeated this experiment four times, and the numbers of genes with p-value < 0.05 (false-positive) and q-value < 0.05 (false discovery) were compared between methods (Fig. 4a). The worst false-positive controls were obtained using the p-value combination methods, Fisher and wFisher combined with edgeR or limmatrend. These methods also yielded tens to hundreds of false discoveries. edgeR or edgeR_DetRate showed relatively poor controls of false positives and false discoveries, whereas ZW_edgeR improved the results. Other methods showed a reasonable control of false positives and false discoveries. We obtained similar results for a weaker gene filtering (zero rate < 0.99) (Supplementary Fig. 5).
We then performed the same test for four batches generated by model-based simulation (Fig. 4b). These data represent a simplified condition without subtypes, whereas cell types in real data often include subtypes. Moreover, these data do not represent correlations between genes. However, these two results exhibited some similarity: (1) Poor controls of false-positives and false discoveries were observed using Fisher- and wFisher-based methods. (2) A number of false discoveries were observed using edgeR- and DESeq2-based methods. (3) Good controls of false-positives and false discoveries were achieved using Wilcoxon test, analysis of pseudobulk data, and single-cell-specific methods such as MAST and ZW_edgeR. We note that an increasing number of false discoveries were observed for increasing variations between samples using Wilcoxon test when “independent” samples were analyzed23; however, reliable results were obtained for Wilcoxon test when analyzing “paired” samples that included both test and control conditions in each batch.
Detection of known disease genes
We selected the cells from seven patients with LUAD (stage I), and performed DE analysis between tumor and normal cells for three main cell types: epithelial cells, myeloid cells, and immune cluster composed of T lymphocytes and natural killer cells. These cell types together occupied 68.8% and 74.6% of normal and tumor cells in the scRNA-seq data, respectively (Supplementary Fig. 4b). Because true DE genes are not known for real data, we used the known lung cancer-related genes as the “standard positives”. In total, 221 standard positive genes were obtained from two disease gene databases, DisGeNET34 and CTD35. These genes were weighted by the disease-association score (gda_score > 0.3) provided by DisGeNET (see Methods). All the genes analyzed were sorted by the DE p-values in each integrative method, and the cumulative sum of gda_scores of standard positive genes, denoted as cumulative score, was compared between methods in the respective cell types (Fig. 5a-c). In other words, we compared the weighted counts of known disease genes included in the top-k DE genes in each method.
To assess the ranks of known disease genes, we devised a truncated Kolmogorov–Smirnov (KS) test that only reflected the ranks of standard positives within the top 20% DE genes, with those in the remaining 80% forced to be evenly distributed. This approach can be particularly useful in selecting methods that are capable of prioritizing important genes in high ranks (see Methods), whereas the conventional KS test risks assessing a large number of middle ranks as significant36. Even with this conservative test, many integrative methods exhibited significantly high ranks of the standard positives when epithelial cells were analyzed (p-value < 0.01) (Fig. 5d). To compare the performance of integrative DE methods, the area under the cumulative score curves for the top 20% DE genes, denoted as pAUC, was used. The top-performer was ZW_edgeR_Cov, which was closely followed by edgeR_Cov, Raw_Wilcox, limmatrend_Cov, and MAST_Cov (Supplementary Fig. 6a). Covariate modeling marginally improved the corresponding parametric methods, but other integrative strategies hardly improved the DE analyses. This seems to be attributed to the small batch effects between patients (Supplementary Fig. 4c). Interestingly, when myeloid cells and immune cluster were analyzed, none of the integrative methods showed a significance (Fig. 5d).
We then performed DE analysis with the bulk RNA-seq data for LUAD in The Cancer Genome Atlas (TCGA)37 comprising 493 cancer and 53 normal samples. The corresponding cumulative scores for the known disease genes are also represented in Fig. 5a-c. Remarkably, integrative analysis of epithelial cells for only seven patients outperformed the analysis of hundreds of bulk samples, demonstrating the high potential of scRNA-seq DE analysis to discover disease genes. Figure 5e compares the ranks of 12 genes with high disease scores (gda_score > 0.5) obtained using five integrative methods and four bulk sample analysis methods. The five integrative methods for scRNA-seq data detected the 12 genes with the average rank percentiles of 33.6% – 40.7% with ZW_edgeR_Cov performing the best, whereas much worse percentiles of 64.2% – 69.7% were achieved using the four TCGA analysis methods. In particular, EGFR, KRAS, CTNNB1, and ERBB2 genes were captured within the top 20% rank by at least three integrative DE methods, and the two genes EGFR and KRAS, which are most common in lung cancer, were ranked in the top 5.3% and 9% by ZW_edgeR_Cov, respectively. In contrast, none of the 12 genes were included within the top 20% by analyses of TCGA data; specifically, EGFR and KRAS were only ranked 61% – 94.7% and 20.4% – 35.8%, respectively. These four genes are known to play important roles in the development of tumor malignancy related to RAS/RAF/MAPK and Wnt signaling pathways38–40 (see Supplementary Information). The top 20% DE genes in LUAD epithelial cells obtained using three integrative DE methods as well as TCGA analysis results are shown in Supplementary Table 1, which suggests novel LUAD-related genes.
We analyzed two more large-scale bulk sample expression datasets for LUAD that were obtained from GEO database41 (GSE31210 and GSE43458), where integrative analyses of scRNA-seq data still outperformed the analyses of these bulk sample data in detecting the known disease genes (see Methods and Supplementary Fig. 7a-b). Furthermore, integrative DE methods surpassed both the analyses of scRNA-seq data for individual patients and pseudobulk data23 (Supplementary Fig. 8). Whereas the analysis of pseudobulk data exhibited strict controls of false positives and false discoveries (Fig. 4), its predictive power for disease-related genes was not high in our analysis of paired data.
Detection of prognostic genes
Next, we performed the same analysis as above using another set of disease-related genes. These genes were selecteded from an integrated survival analysis of five microarray gene expression datasets for patients with LUAD (GSE29013, GSE30129, GSE31210, GSE37745, and GSE50081). The Cox proportional hazards model incorporating covariates of age, sex, and tumor stage42 was applied to each dataset, and the resulting p-values were combined for each gene using wFisher considering the signs of hazards ratios (HRs)12. These integrated p-values were adjusted for multiple testing correction, yielding 448 genes with q-value < 0.05, denoted as “prognostic (standard positive) genes”. We note that only seven of these genes were also included in the 221 known disease genes. Many integrative methods applied to epithelial cells detected the prognostic genes with significantly high ranks, and outperformed both the analyses of TCGA and pseudobulk data (Supplementary Fig. 9). Interestingly, several integrative methods applied to myeloid cells also detected the prognostic genes with significantly high ranks (p-value < 0.01), suggesting the association of DE genes in those cell types with the survival of patients.
Pathway analyses for lung epithelial cells and TCGA LUAD data
We further tested the pathway enrichments for scRNA-seq (epithelial cells) and TCGA data to compare the functional relevance of each DE analysis in cancer. The gene-set enrichment analysis (GSEA) was applied to the ranked genes in each DE method31,43. From the pathway database “wikipathway_2021”44, 192 pathways that were most relevant to cancer progression were selected as standard positives. These pathways were selected on the basis of the ten oncogenic signaling pathways45 and the seven cancer associated processes46 as well as those including the keyword(s) tumor, cancer, or carcinoma in their names (see Methods). We classified these pathways into 16 categories for detailed interpretation of the GSEA results (Supplementary Table S2). The cumulative counts of the standard positive pathways showed that GSEA for scRNA-seq data compared favorably with that for TCGA data (Fig. 6).
Interestingly, the analyses of scRNA-seq and TCGA data exhibited distinct functional categories. For example, “Ciliopathies (WP4803)” in the “cell polarity and migration” category ranked first in all the six scRNA-seq analyses, whereas it only ranked 22th to 48th in TCGA analyses. “Genes related to primary cilium development (based on CRISPR) (WP4536)”, which belongs to the same category, was also detected within the top five ranks by all the six scRNA-seq analyses, whereas none of the TCGA analyses detected this pathway. These results represent the cell-type-specific perturbation of pathways in lung epithelial cells. Several pathways in the “cell survival” category, including “Apoptosis (WP254)” and “Senescence and Autophagy in Cancer (WP615)”, were also detected in scRNA-seq analyses, whereas none of them in that category was detected in TCGA analyses. Moreover, the five categories “WNT”, “PI3K”, “HIPPO”, “NOTCH”, and “P53” in oncogenic signalling pathways were detected in scRNA-seq analyses, but none of them were detected in TCGA analyses. In contrast, GSEA for TCGA data detected five and seven pathways in the two categories “genomic instability” and “inflammation”, respectively, whereas GSEA for scRNA-seq data detected none and at most three in the respective categories. By analyzing the epithelial cell data, we were able to detect many canonical oncogenic pathways as well as cell-type-specific pathways that the bulk sample analyses missed. The GSEA results for selected integrative DE methods and TCGA analysis results are available from Supplementary Table 3.
A gross comparison
All the test results in this study, including the speed and scalability of integrative methods are summarized in Fig. 7. pAUPR shows how precisely a method can detect DE genes while maintaining a low type I error and the results are summarized in Fig. 7a. Other measurements are shown in Fig. 7b. We have classified the methods into three levels (“high”, “medium” and “low”) in each category based on the test results for all datasets used in this study: model-based simulation for three sparsity levels (40%, 60%, and 80%), four-batch cases with or without additional biological variations, three model-free simulation cases, and three cell types in the lung cancer data. See Methods for the criteria used in each category.