CircRNA expression data are characterised by a high proportion of small and zero counts
The signal available to measure circRNA abundance from RNA-seq is limited compared to that available for linear transcripts and genes. CircRNAs are generally less abundant than linear transcripts and can be quantified unambiguously only by the reads encompassing the backsplice junctions [10,27] (Figure 1 a). Moreover, gene expression abundance is measured by counting both the spliced and unspliced reads aligned to the whole gene region, thus summing the expression of all transcript isoforms of a gene [28]. In contrast, each circRNA represents one single transcript, and the backsplice junction reads (BJRs) originate only from the specific site of the circRNA sequence where the junction ends were joined [19] (Figure 1 a). Furthermore, BJRs are computationally harder to identify than unspliced and linearly spliced reads, as they require non-collinear alignments and additional processing to remove spurious hits [5], causing most circRNA detection tools to suffer from low detection rates [19,26].
The combination of circRNA biological features and computational hindrances in estimating their expression can result in data sets with a large fraction of small counts. We verified this characteristic in 34 RNA-seq data sets of matched ribosomal RNA-depleted and circRNA-enriched libraries from 17 human tissues (Table 1). CirComPara2 [26] was used to obtain linear and circular read mappings on circRNA-host genes. We distinguished four read alignment sets representing the expression signal available for (i) estimating gene expression, (ii) studying alternative splicing, (iii) comparing the abundance of circular and linear transcripts expressed by a gene, and (iv) estimating circRNA abundance. We compared the magnitude of the expression signals by counting (i) the unspliced and linearly spliced reads together, (ii) only the linearly spliced reads, (iii) only the linearly spliced reads aligned across backsplice junction sites, and (iv) only the BJRs (Figure 1 a).
Regardless of the circRNA library enrichment, we observed that the highest signal was obtained for gene expression estimates, followed by the linearly spliced reads (Figure 1 b; Supplementary Figure 1). In turn, the spliced read counts slightly diminished when considering only those mapped on backsplice junction sites. The BJRs showed the lowest values, even in the circRNA-enriched samples (Figure 1 b). Notably, median BJR counts were less than or equal to 10 in most samples (Supplementary Figure 1). These observations supported the hypothesis that, in RNA-seq data, circRNA expression estimates rely on a low signal biased by the quantification procedure.
We further considered circRNA expression in RNA-seq data sets with multiple biological replicates to analyse the BJR count distribution of circRNA expression matrices. From the Sequence Read Archive public repository [29], we collected RNA-seq data of four independent circRNA studies of human tumours and healthy tissues, each comparing groups of at least five samples, sequenced with more than 40 million paired-end reads, and comprising 10 to 50 samples in total (Table 1). In each data set, the samples showed high circRNA expression correlation within conditions, denoting sample homogeneity (Table 1).
In these data sets, most BJR counts lay below 10 (Figure 1 c), indicating that the circRNA small counts were not data set specific. Moreover, most circRNAs had a median BJR count of less than 10 (Figure 1 d), and the more samples in which a circRNA was detected, the higher its median BJR count (Figure 1 e).
These results suggested that the less expressed circRNAs might remain undetected in some samples because of a sampling bias [36], which could inflate the zero counts. Notably, the zero count fraction was large in all data sets (Figure 1 c). The unfiltered data set sparsity ranged from 44% to 72% null counts but was not significantly correlated with the library size (Pearson's r = 0.05, p-value > 0.9) (Supplementary Figure 2).
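For illustration, per-sample sparsity and its relation to the library size can be computed in a few lines of R; this is a sketch with a hypothetical `bjr` circRNA-by-sample BJR count matrix, not the exact code used in the analysis.

```r
# Per-sample sparsity (fraction of zero counts) and library size,
# assuming `bjr` is a circRNA-by-sample matrix of BJR counts
sparsity <- colMeans(bjr == 0)
lib_size <- colSums(bjr)

# Test the sparsity-library size association with Pearson's correlation
cor.test(lib_size, sparsity, method = "pearson")
```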
To ascertain that these observations were not artefacts of the circRNA expression estimation algorithm, we computed the BJR counts with six additional circRNA quantification pipelines. We observed that the BJR count distribution was comparable among the quantification methods (Supplementary Figure 3), and the proportion of zero counts was high in all data sets regardless of the quantification pipeline (Figure 1 f).
In addition, as is common practice in RNA-seq expression analysis [37], we applied five independent expression filtering strategies to the BJR count matrices, each discarding circRNAs according to the number of samples in which they were detected. As expected, the expression filters reduced the number of zero counts, but at the cost of discarding a substantial fraction (from 30% up to 95%) of circRNAs (Figure 1 f). Moreover, the number of low BJR counts remained high (Supplementary Figure 3), suggesting that the circRNAs detected in multiple samples also yielded a low expression signal.
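As an illustration, a detection-based filter such as the “Half samples” strategy mentioned below can be sketched as follows; the function name and threshold choice are ours, not from a specific package.

```r
# Keep circRNAs detected (BJR count > 0) in at least `min_samples` samples;
# min_samples = half the samples reproduces a "Half samples"-style filter
filter_by_detection <- function(counts, min_samples) {
  counts[rowSums(counts > 0) >= min_samples, , drop = FALSE]
}

bjr_filtered <- filter_by_detection(bjr, min_samples = ceiling(ncol(bjr) / 2))
```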
Statistical modelling of circRNA expression count data
RNA-seq count data are often modelled with a negative binomial (NB) distribution [38]. However, when zero counts are in excess, a zero-inflated NB (ZINB) distribution may fit the data better [39]. We thus evaluated whether a ZINB distribution models BJR counts better than an NB by calculating the goodness-of-fit (GOF) on the BJR count matrices, both unfiltered and upon applying the expression filters. For each circRNA, the NB and ZINB GOF were compared according to the root mean square error (RMSE) of the estimated mean counts and of the probability of observing a zero, and according to the Akaike information criterion (AIC).
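A minimal per-circRNA sketch of this comparison, assuming `y` holds the BJR counts of one circRNA across samples, can be written with the MASS and pscl packages (one possible implementation, not necessarily the one used here):

```r
library(MASS)  # glm.nb: NB fit
library(pscl)  # zeroinfl: ZINB fit

# Intercept-only fits of the two candidate distributions
fit_nb   <- glm.nb(y ~ 1)
fit_zinb <- zeroinfl(y ~ 1 | 1, dist = "negbin")

# AIC comparison: the lower score indicates the better-fitting model
c(NB = AIC(fit_nb), ZINB = AIC(fit_zinb))

# Probability of observing a zero under each model, to be compared with
# the observed zero fraction (squared errors are averaged into the RMSE)
p0_obs  <- mean(y == 0)
p0_nb   <- dnbinom(0, size = fit_nb$theta, mu = exp(coef(fit_nb)))
pi_zi   <- plogis(coef(fit_zinb, model = "zero"))
mu_zi   <- exp(coef(fit_zinb, model = "count"))
p0_zinb <- pi_zi + (1 - pi_zi) * dnbinom(0, size = fit_zinb$theta, mu = mu_zi)
```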
Both the NB and ZINB distributions obtained a small error for the mean count estimation (RMSE < 0.07), independently of the expression filtering procedure and data set (Supplementary Figure 4). The ZINB model provided better estimates of the observed zero proportion than the NB for each expression filter and data set, except in the MS data set upon application of the two filters discarding most circRNAs (i.e. “By condition” and “Half samples”) (Supplementary Figure 5). However, we observed small errors also for the NB (RMSE < 0.09). According to the AIC, the ZINB distribution modelled the BJR count data better than the NB for most circRNAs across data sets and expression filters (mean 68±17%) (Figure 1 g), suggesting that a model accounting for an excess of zeros might fit circRNA expression data better.
Comparison of differential expression assessment methods on circRNA data
In this work, we focus on the problem of assessing circRNA differential abundance. The traditional methods for bulk RNA-seq data analysis have been the primary choice when analysing circRNA expression, with DESeq2 [23], edgeR [24,37,40], and Limma-Voom [41] arguably the most used. However, the circRNA expression characteristics shown above suggest that circRNA BJR count data might not comply with the assumptions of the traditional differential expression methods (DEMs), degrading their performance.
The high proportion of small counts and the sparsity of circRNA expression data are comparable to those observed in single-cell RNA-seq (scRNA-seq) and whole metagenome shotgun sequencing (WMS). In particular, the small counts and library sizes of circRNA data are similar to those of droplet-based scRNA-seq data [36]. Further, the sparsity of circRNA data is analogous to that of scRNA-seq and WMS data, which range between 12 and 75% and between 35 and 89% zero counts, respectively [17]. Finally, we observed that a ZINB distribution fits circRNA data better than an NB in most cases, as described in full-length scRNA-seq [36].
Therefore, we benchmarked 18 DEMs, including bulk RNA-seq DEMs and a few tools conceived for scRNA-seq and WMS data, selecting those freely available as R packages or functions. Furthermore, we explored different parameter settings, the ZINB-WaVE package weighting strategy [42,43], and normalisation approaches [44,45] coupled with DESeq2, edgeR, and Limma-Voom to specifically handle small counts and sparse data.
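For example, following the zinbwave Bioconductor vignette, the observational weights can be computed and passed to edgeR roughly as below; `se` and its `condition` column are illustrative placeholders, and this is a sketch of the general approach rather than our exact pipeline code.

```r
library(zinbwave)
library(edgeR)

# Compute observation-level weights that down-weight excess zeros
# (K = 0 and a large epsilon, as recommended in the zinbwave vignette)
zinb <- zinbwave(se, K = 0, epsilon = 1e12, observationalWeights = TRUE)
w <- assay(zinb, "weights")

# Weighted edgeR analysis with the zero-inflation-adjusted F test
dge <- DGEList(assay(se, "counts"))
dge <- calcNormFactors(dge)
design <- model.matrix(~ condition, data = as.data.frame(colData(se)))
dge$weights <- w
dge <- estimateDisp(dge, design)
fit <- glmFit(dge, design)
res <- glmWeightedF(fit, coef = 2)
```

For DESeq2, the same weights can instead be supplied as a "weights" assay of the DESeqDataSet before running the standard workflow.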
In total, we compared 38 differential expression analysis pipelines (Supplementary Table 1), evaluating their type I error control, false discovery rate (FDR), true positive rate (TPR, or recall), F1-scores, area under the precision-recall curve (AUPRC), and computation time. Moreover, we calculated the similarity of predictions between DEMs according to two similarity indexes.
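For reference, given a vector of adjusted p-values and the simulation ground truth, the threshold-based metrics reduce to a confusion-matrix computation; the variable names in this sketch are hypothetical.

```r
# padj: adjusted p-values; is_dec: logical ground truth from the simulation
alpha  <- 0.05
called <- !is.na(padj) & padj <= alpha

TP <- sum(called & is_dec);  FP <- sum(called & !is_dec)
FN <- sum(!called & is_dec); TN <- sum(!called & !is_dec)

FPR <- FP / (FP + TN)               # type I error, estimated on 'null' sets
FDR <- if (TP + FP > 0) FP / (TP + FP) else 0
TPR <- TP / (TP + FN)               # recall
F1  <- 2 * TP / (2 * TP + FP + FN)
```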
Benchmark data set simulation with a semiparametric approach
We generated 720 simulated data sets using SP-SimSeq [46], a semiparametric approach that preserves the real circRNA identities and the circRNA-circRNA correlations observed in real data. Specifically, for each of the four multiple-sample data sets, we simulated 30 expression matrices with an equal number of samples in two conditions, considering three (N03), five (N05), and ten (N10) samples per group. ‘Null’ data sets with no differentially expressed circRNAs (DECs) and ‘signal’ data sets with 10% DECs were generated. We evaluated the quality of the simulated data sets according to expression levels, fractions of zeros, and the relation between the two, as in Soneson and Robinson [47]. All measures were not significantly different from the original data sets, confirming that the simulated data followed the original data characteristics (Supplementary Tables 2-4).
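A hedged sketch of one such simulation call, following the SPsimSeq package vignette (argument values are illustrative; `bjr` and `group` denote a source BJR matrix and its condition labels):

```r
library(SPsimSeq)

sims <- SPsimSeq(n.sim = 30,                 # 30 simulated matrices
                 s.data = bjr,               # real source counts
                 group = group,
                 n.genes = nrow(bjr),
                 tot.samples = 10,           # e.g. N05: 5 samples per group
                 group.config = c(0.5, 0.5), # two equally sized groups
                 pDE = 0.1,                  # 10% DECs ('signal'; 0 for 'null')
                 model.zero.prob = TRUE,     # model the excess zeros
                 result.format = "list")
```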
The following paragraphs report the results for the N05 data sets at the 0.05 significance threshold, which we reasoned is a common scenario for circRNA RNA-seq experiments. The results from the N03 and N10 simulations at the 0.01 or 0.1 significance thresholds are available in the Supplementary Material.
Type I error control
We evaluated the type I error rate for each DEM, i.e., the probability of falsely predicting a DEC, by computing the false positive rate (FPR) in the ‘null’ data sets. The methods could be grouped into those with (i) liberal, (ii) conservative, and (iii) sufficient control of the type I error (Figure 2). Among the liberal methods, Seurat-BIM-LRT showed a largely uncontrolled type I error (FPR = 0.38), consistent with a previous assessment by Soneson and Robinson in scRNA-seq [15]. Other methods with moderately liberal type I error control included Seurat-WLX and three edgeR pipelines (TWSP, RBST, and 50DF), with a median FPR between 0.10 and 0.12, whereas slightly liberal methods included edgeR-ZW, NBID, Voom-LF, and Voom-QN (0.07 ≤ FPR ≤ 0.08). In contrast, conservative results (0 ≤ FPR < 0.03) were obtained by MAST, lncDIFF, DEsingle, the Wilcoxon test, limma-VST, and all the DESeq2 pipelines but DESeq2-ZW. The remaining methods achieved an FPR close to the nominal value (0.03 ≤ FPR < 0.07), and most of them (10 out of 16) were bulk RNA-seq methods. Results from NOISeqBIO were not suitable for type I error estimation because NOISeqBIO's scores are comparable to adjusted p-values rather than p-values, which explains its low FPR; we include NOISeqBIO in this analysis only for completeness. The quasi-likelihood framework [37,48], devised to improve type I error control when a linear model contains fitted values that are exactly zero, was effective with edgeR (edgeR-RBST-QFT) but not with the Limma-Voom pipeline (Voom-LF-MFT), which obtained a slightly higher FPR than the other Limma-Voom versions. Interestingly, DESeq2 showed a type I error closer to the imposed α only when using the ZINB-WaVE weights.
The performance of the methods was consistent regardless of the α threshold (Supplementary Figure 6). Moreover, a larger sample size improved the error rates only slightly for most methods, except for DESeq2, especially DESeq2-ZW, which showed much better results in larger data sets. Conversely, Seurat-WLX, metagenomeSeq, and the three edgeR pipelines mentioned above showed larger FPRs with data sets of increased size (Supplementary Figure 6).
Expression estimate characteristics of the false-positive differentially expressed circRNAs
We calculated signal-to-noise statistics for each tool that reported five or more false-positive (FP) DECs in at least one ‘null’ data set. Similarly to a previous work by Soneson and Robinson [15], we compared the significant and non-significant DECs according to their average counts per million (CPM), coefficient of variation (CV), variance, and mean fraction of zeros (Figure 2). In general, we did not observe marked differences between FP DECs and non-significant circRNAs. The signal-to-noise statistics of the CPM CV and variance were slightly positive for all methods, particularly the edgeR pipelines, except the CV for SAMseq and PoissonSeq, which was mostly negative. Likewise, the average CPM of FP DECs was slightly higher than that of non-significant circRNAs, except for Seurat-BIM, ROTS, metagenomeSeq, and the few FP DECs predicted by limma-VST. We observed a heterogeneous behaviour regarding the fraction of zeros: limma-VST, metagenomeSeq, ROTS, Voom-ZW, monocle, edgeR (except edgeR-QFT), and Seurat-BIM failed more often on circRNAs with higher fractions of zero counts. FP DECs originated approximately equally from all scenarios, except for circMeta-DT, which showed a higher FP number on the IDC and MS data set instances (Figure 2). Overall, the zero counts did not affect the type I error control as much as the expression abundance and variance did.
False discovery rate, power, F1-score, and area under the precision-recall curve
We used the ‘signal’ data sets to evaluate the methods’ false discovery and true positive rates. The Wilcoxon test and Seurat-WLX did not generate any significant prediction (Figure 3 a), thus resulting in a null TPR (Figure 3 b), and lncDIFF, MAST, and DEsingle returned significant predictions only in a few simulation instances. Similarly, DESeq2, PoissonSeq, and limma-VST did not predict any DEC in a relevant fraction of simulation instances, especially from the MS data sets. Only Seurat-BIM and circMeta-LC provided predictions below the imposed significance level in all the simulation instances.
Most methods (22 out of 38) showed a higher FDR than the imposed 0.05 level: lncDIFF and Seurat-BIM scored the worst FDRs (FDR = 1 and 0.95, respectively), followed by PoissonSeq, circMeta, all edgeR pipelines, NBID, metagenomeSeq, monocle, ROTS, and all Voom pipelines but Voom-ZW (FDR > 0.09). On the contrary, NOISeqBIO, MAST, DEsingle, limma-VST, and all DESeq2 pipelines but DESeq2-ZW kept the FDR below the nominal value (0 ≤ FDR ≤ 0.01). In DESeq2, a slightly more conservative FDR was obtained using the likelihood ratio test (DESeq2-LRT) compared to the Wald test (DESeq2-WaT). Voom-ZW, DESeq2-ZW, glmGamPoi, and SAMseq controlled the FDR close to the nominal value. Notably, every method except DEsingle and MAST presented a false discovery proportion (FDP) close to 1 in some instances. All methods obtained better FDR control on the N10 data sets but maintained the characteristics observed in the N05 data sets (Supplementary Figure 7).
The sensitivity was generally low, with a median below 50% for all methods (Figure 3 b). The highest TPRs (0.41 ≤ TPR ≤ 0.43) were obtained by three edgeR pipelines (TWSP, RBST, and 50DF). SAMseq, NBID, monocle, and four Voom pipelines (Voom-QN, Voom-LF, Voom-RBST, and Voom-DT) obtained TPRs between 0.31 and 0.36, whereas the remaining methods identified less than 30% of the true DECs. The choice of parameters greatly influenced the sensitivity of the edgeR pipelines. Interestingly, the quasi-likelihood framework produced opposite results when applied to edgeR or Limma-Voom, yielding the lowest and the highest TPR among the respective pipeline configurations. Similarly, the ZINB-WaVE weights allowed higher recall rates with edgeR and DESeq2 but a lower TPR with Limma-Voom. Regarding the DESeq2 pipelines, the scRNA-seq-oriented configurations obtained higher recall rates than the bulk RNA-seq configurations (Figure 3 b; Supplementary Figures 8-9). Notably, the adaptation of the circMeta test to low counts markedly improved the recall rate (median TPR 0.23 versus 0.03). Poor performance, close to zero, was achieved by limma-VST, lncDIFF, DEsingle, MAST, the Wilcoxon test, Seurat-WLX, and the bulk RNA-seq configurations of DESeq2.
All methods achieved significantly higher sensitivity with increased sample sizes, except lncDIFF and Seurat-WLX, which did not detect any true DEC, and metagenomeSeq, which improved only slightly (Supplementary Figure 9). The highest TPR among all settings (TPR = 0.9) was achieved by edgeR-RBST and edgeR-TWSP when allowing a 0.1 adjusted p-value threshold in the N10 data sets (Supplementary Table 5). In the smallest data sets (N03), NOISeqBIO had the highest recall rate (TPR = 0.7 at a 0.1 adjusted p-value), which was surprisingly higher than in the larger sets (Supplementary Figure 9).
We inspected the p-value distributions obtained in the ‘signal’ data sets to better understand the DEMs’ predictions (Supplementary Figure 10). CircMeta, edgeR, glmGamPoi, limma-VST, NBID, PoissonSeq, ROTS, SAMseq, and Voom showed p-value histograms of the expected shape [49]. The other DEMs did not show a uniform distribution of the p-values, most having an overabundance of large p-values or a distribution skewed towards p = 1. Interestingly, the DESeq2 overabundance of large p-values was mitigated by using the weights for zero counts. Comparing the p-value histograms between the N05 (Supplementary Figure 10) and N10 (Supplementary Figure 11) simulations, we observed better p-value distributions in the larger data sets, indicating that the conservative p-value distributions were due to the insufficient power of the methods with a small number of samples [16,49]. We observed a worse performance of Seurat-WLX compared to the simple Wilcoxon rank-sum test. Since Seurat-WLX implements an extended Wilcoxon rank-sum test that considers correlations between cases, the presence of positive correlations between circRNAs possibly increased the variance of the test statistic, making the test more conservative.
Similarly to the analysis of type I error, we calculated the signal-to-noise statistics of the variability, fraction of zeros, and expression abundance, comparing for each method the false negative (FN) and true positive (TP) predictions, i.e. the circRNAs not detected as differentially expressed compared to those correctly identified. We did not observe significantly different characteristics of the FN compared to the TP predictions (Supplementary Figures 12-14). The poor recall rate could be related to an imprecise dispersion estimation of the models [50] or a systematic deviation from the theoretical null distribution of the test statistics [49].
We calculated the F1-score of each method to evaluate precision and recall simultaneously (Figure 3 c; Supplementary Table 5). Monocle and SAMseq obtained the highest F1-score (F1 = 0.61), followed by Voom-QN (F1 = 0.58) and by edgeR-TWSP and edgeR-RBST (F1 = 0.57 and 0.56, respectively).
We observed that the methods generally achieved better precision than recall and that the precision scores were less spread out than the recall scores. In particular, edgeR-TWSP and edgeR-RBST owed their high F1-scores mainly to their high recall rates. Instead, the SAMseq, monocle, and Voom-QN scores were driven mainly by a high precision (PPV ≥ 0.88) (Supplementary Table 5). Interestingly, the circMeta tests, designed explicitly for circRNA expression, ranked amongst the lowest F1-scores. SAMseq also held the highest score in the N03 data sets (Supplementary Figures 15-16). However, we observed a different ranking in the N10 data sets: Voom and monocle still achieved the top scores, but ROTS, glmGamPoi, DEsingle, and five edgeR configurations ranked ahead of SAMseq (Supplementary Table 5; Supplementary Figures 15-16).
Finally, we inspected the ability of the methods to rank true DECs ahead of non-significant ones by computing the AUPRC, which is informative for data sets with a significant skew in the class distribution [51,52], as in our simulations. DEsingle obtained the highest AUPRC scores notwithstanding its poor performance in the above analyses, indicating that DEsingle could almost perfectly rank true DECs in the top positions and suggesting an overly conservative assignment of the p-values (Figure 3 d). SAMseq, Voom-DT, Voom-RBST, and monocle obtained the next best scores (median AUPRC ≥ 0.7) (Supplementary Table 6). Interestingly, some methods showing poor performance according to the above metrics, including DESeq2-ZI, the Wilcoxon-based methods, and MAST, obtained AUPRC scores comparable to the best-performing tools. In larger data sets, only lncDIFF and Seurat-BIM showed small AUPRC scores and a modest improvement (Supplementary Figure 17).
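The AUPRC can be computed, for instance, with the PRROC package, scoring circRNAs by their evidence of differential expression; this sketch reuses the hypothetical `padj` and `is_dec` vectors introduced above.

```r
library(PRROC)

score <- 1 - padj   # higher score = stronger differential expression evidence
pr <- pr.curve(scores.class0 = score[is_dec],    # true DECs
               scores.class1 = score[!is_dec])   # non-DECs
pr$auc.integral
```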
Analysis with nonparametric simulations from an independent data set
To corroborate the outcomes of the semiparametric simulation analysis, we performed nonparametric simulations from the data set of an independent study. We obtained the BJRs of 8,239 circRNAs computed with CIRI2 in 20 normal tissue and 76 tumour samples from a recent study on prostate cancer [35] (Table 1). In this data set, the number of replicates was sufficient to use the SimSeq tool [53], which performs nonparametric simulations without imposing any distributional assumption on the simulated data. Similarly to the previous analysis, we generated 90 instances of ‘null’ and ‘signal’ data sets with 6, 10, and 20 samples split into two equally sized condition groups. We analysed these data sets as the semiparametric data and ranked the methods according to FPR, TPR, and FDR in each simulation type (Supplementary Figures 18-20). We observed a significant positive correlation between the mean ranks of the nonparametric and semiparametric simulations (Spearman’s rho > 0.5, p-value < 0.001; Supplementary Table 7), indicating a generally consistent performance of the methods in the two simulation settings. Unexpectedly, DEsingle showed an AUPRC score opposite to that obtained in the semiparametric simulations.
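For illustration, a ‘signal’ instance can be resampled from the prostate-cancer matrix roughly as follows (a hedged sketch of the SimSeq usage with illustrative arguments; `bjr` and `treatment` denote the real counts and sample labels):

```r
library(SimSeq)

sim <- SimData(counts = bjr, treatment = treatment,
               sort.method = "unpaired",         # two independent groups
               k.ind = 5,                        # samples simulated per group
               n.genes = nrow(bjr),
               n.diff = round(0.1 * nrow(bjr)))  # number of true DECs

names(sim)  # the list holds the simulated counts and the true-DEC identities
```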
Similarity of differential expression methods’ predictions
We explored the similarity of the methods’ predictions in the semiparametric simulations according to two metrics that quantify the prediction overlap while capturing different aspects of how the predictions are used. First, we evaluated the similarity between method pairs according to the overlap of their DECs with an adjusted p-value ≤ 0.05, from which we calculated the Jaccard similarity coefficient. Second, for each method pair, we considered the area under the concordance at the top (CAT) curve, defined as the overlap of the top 100 circRNAs ranked according to adjusted p-values, regardless of any fixed threshold on the adjusted p-values.
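Both indexes are straightforward to compute for a pair of methods; a sketch, assuming `padj_A` and `padj_B` hold their adjusted p-values over the same circRNAs:

```r
# Jaccard similarity of the DEC sets at adjusted p-value <= 0.05
decs_A <- which(padj_A <= 0.05)
decs_B <- which(padj_B <= 0.05)
jaccard <- length(intersect(decs_A, decs_B)) / length(union(decs_A, decs_B))

# Concordance at the top: overlap of the top-k circRNAs for k = 1..100,
# summarised by the area under the resulting curve (here, its mean)
cat_curve <- sapply(1:100, function(k) {
  length(intersect(order(padj_A)[1:k], order(padj_B)[1:k])) / k
})
auc_cat <- mean(cat_curve)
```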
Clustering the DEMs according to the similarity indexes, we observed that DEMs based on the same tool tended to cluster together (Figure 4). In particular, DESeq2 and Voom showed a high degree of similarity within the respective pipelines, suggesting that modifying the parameters of these tools did not affect their outcomes much. Instead, the three edgeR pipelines characterised by high FPR and TPR clustered apart from the other edgeR configurations. Consistent with the results above, Voom-LF clustered apart from edgeR-QFT. DESeq2 and edgeR using the ZINB-WaVE weights reported similar predictions, albeit slightly different from Voom-ZW. Interestingly, the edgeR pipelines clustered closer to the Voom than to the DESeq2 pipelines according to the Jaccard similarity (Figure 4 a), whereas three edgeR configurations grouped closer to DESeq2 when considering the CAT (Figure 4 b), indicating more conservative p-values provided by DESeq2. Further, the scRNA-seq and bulk RNA-seq DEMs did not form distinct groups, indicating that they can provide similar results.
In the N10 data sets, allowing adjusted p-values ≤ 0.1, the DESeq2 pipelines showed the most consistent predictions regardless of the parameter configuration, according to both the Jaccard index and the CAT (Supplementary Figures 21-22). Conversely, the other DEMs showed a consistent ranking of their predictions (Supplementary Figure 22) but a great variation according to the Jaccard similarity (Supplementary Figure 21), suggesting that the parameter configurations influenced the p-value magnitudes but maintained the DEC ranking.
Overall ranking of the methods
To compare the methods' performances overall, we computed each method's rank relative to the other DEMs according to the F1 score, FDR, TPR, AUPRC, and FPR measures, independently in each simulated data set, with lower ranks corresponding to better-performing methods. The mean ranks and standard deviations computed on the N05 data sets are represented in Figure 5.
LncDIFF, MAST, Seurat-WLX, the simple Wilcoxon test, and DEsingle consistently performed worse than the other methods in all simulations. DEsingle achieved a good ranking according to the AUPRC, but the above analyses showed its unreliable behaviour in different data sets. NOISeqBIO, DESeq2-BP, DESeq2-LRT, DESeq2-WaT, limma-VST, and PoissonSeq showed poor performance, ranking close to or above the third quartile. Seurat-BIM ranked the worst according to the AUPRC and FPR. DESeq2 obtained poor rankings according to the F1 score, FDR, and TPR, while scoring average ranks for the AUPRC and FPR. Interestingly, DESeq2-ZW showed a slightly better ranking than the other DESeq2 configurations.
EdgeR-RBST, edgeR-TWSP, edgeR-50DF, and NBID obtained the best mean ranks (below or close to the first quartile) according to the F1 scores, owing mainly to their high TPRs. However, the edgeR pipelines ranked poorly according to the FPRs, raising some concerns about the reliability of their predictions. Moreover, NBID was outperformed by more than half of the DEMs according to the AUPRC, suggesting that it is suboptimal when modulating the significance threshold. All the Voom pipelines except Voom-ZW obtained ranks below the median in all measures, indicating the consistently good performance of the Limma-Voom models, especially Voom-DT and Voom-RBST. The other edgeR-based methods were a close second. SAMseq and monocle showed interesting results on average but with a large variation, which indicates less consistent performance. Notably, all the DEMs’ mean FPR ranks were above the first quartile, indicating that no method consistently outperformed the others in controlling the type I error.
Different rankings were obtained on the data sets with three and ten replicates per group (Supplementary Figure 23), confirming that the sample size greatly influenced the methods’ performance. DESeq2 obtained the largest improvement in larger data sets, whereas circMeta, NOISeqBIO, and NBID showed better rankings with small numbers of samples.
Computational time
We compared the methods according to the CPU time required for the analysis (Supplementary Figure 24). Most methods ran rapidly, completing in a few seconds to less than one minute. Conversely, computing the weights with ZINB-WaVE was the most time-demanding task. Monocle, DEsingle, NBID, and ROTS were the slowest methods, requiring two to eight minutes to complete the analysis of one simulated data set.