CircRNA expression data are characterised by a high proportion of small and zero counts
The signal available to measure circRNA abundance from RNA-seq is limited compared to that available for linear transcripts and genes. CircRNAs are generally less abundant than linear transcripts and can be quantified unambiguously only by the reads encompassing the backsplice junctions [10,27] (Figure 1 a). Moreover, gene expression abundance is measured by counting both the spliced and unspliced reads aligned to the whole gene region, thus summing the expression of all transcript isoforms of a gene [28]. In contrast, each circRNA represents one single transcript, and the backsplice junction reads (BJRs) originate only from the specific site of the circRNA sequence where the junction ends were joined [19] (Figure 1 a). Furthermore, BJRs are computationally harder to identify than unspliced and linearly spliced reads, as they require non-collinear alignments and additional processing to remove spurious hits [5], causing most circRNA detection tools to suffer from low detection rates [19,26].
The combination of circRNA biological features and computational hindrances in estimating their expression can result in data sets with a large fraction of small counts. We verified this characteristic in 34 RNA-seq data sets of matched ribosomal RNA-depleted and circRNA-enriched libraries from 17 human tissues (Table 1). CirComPara2 [26] was used to obtain linear and circular read mappings on circRNA-host genes. We distinguished four read alignment sets representing the expression signal available for (i) estimating gene expression, (ii) studying alternative splicing, (iii) comparing the abundance of circular and linear transcripts expressed by a gene, and (iv) estimating circRNA abundance. We compared the magnitude of the expression signals by counting (i) the unspliced and linearly spliced reads together, (ii) only the linearly spliced reads, (iii) only the linearly spliced reads aligned across backsplice junction sites, and (iv) only the BJRs (Figure 1 a).
Regardless of the circRNA library enrichment, we observed that the highest signal was obtained for gene expression estimates, followed by the linearly spliced reads (Figure 1 b; Supplementary Figure 1). In turn, the spliced read counts slightly diminished when considering only those mapped on backsplice junction sites. The BJRs showed the lowest values, even in the circRNA-enriched samples (Figure 1 b). Notably, median BJR counts were less than or equal to 10 in most samples (Supplementary Figure 1). These observations supported the hypothesis that, in RNA-seq data, circRNA expression estimates rely on a low signal biased by the quantification procedure.
We further considered circRNA expression in RNA-seq data sets with multiple biological replicates to analyse the BJR count distribution of circRNA expression matrices. From the Sequence Read Archive public repository [29], we collected RNA-seq data of four independent circRNA studies of human tumours and healthy tissues, each comparing groups of at least five samples, sequenced with more than 40 million paired-end reads, and comprising 10 to 50 samples in total (Table 1). In each data set, the samples showed high circRNA expression correlation within conditions, denoting sample homogeneity (Table 1).
In these data sets, most BJR counts lay below 10 (Figure 1 c), indicating that the circRNA small counts were not data set specific. Moreover, most circRNAs had a median BJR count of less than 10 (Figure 1 d), and the more samples in which a circRNA was detected, the higher its median BJR count (Figure 1 e).
These results suggested that the less expressed circRNAs might remain undetected in some samples because of a sampling bias [36], which could inflate the zero counts. Notably, the zero count fraction was large in all data sets (Figure 1 c). The unfiltered data set sparsity ranged from 44% to 72% null counts but was not significantly correlated with the library size (Pearson's r = 0.05, p-value > 0.9) (Supplementary Figure 2).
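For illustration, per-sample sparsity and its relation to the library size can be computed in a few lines of R; this is a sketch with a hypothetical `bjr` circRNA-by-sample BJR count matrix, not the exact code used in the analysis.

```r
# Per-sample sparsity (fraction of zero counts) and library size,
# assuming `bjr` is a circRNA-by-sample matrix of BJR counts
sparsity <- colMeans(bjr == 0)
lib_size <- colSums(bjr)

# Test the sparsity-library size association with Pearson's correlation
cor.test(lib_size, sparsity, method = "pearson")
```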
To ascertain that these observations were not artefacts of the circRNA expression estimation algorithm, we computed the BJR counts with six additional circRNA quantification pipelines. We observed that the BJR count distribution was comparable among the quantification methods (Supplementary Figure 3), and the proportion of zero counts was high in all data sets regardless of the quantification pipeline (Figure 1 f).
In addition, as is common practice in RNA-seq expression analysis [37], we applied five independent expression filtering strategies to the BJR count matrices, each discarding circRNAs according to the number of samples in which they were detected. As expected, the expression filters reduced the number of zero counts, but at the cost of discarding a substantial fraction (from 30% up to 95%) of circRNAs (Figure 1 f). Moreover, the number of low BJR counts remained high (Supplementary Figure 3), suggesting that the circRNAs detected in multiple samples also yielded a low expression signal.
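As an illustration, a detection-based filter such as the “Half samples” strategy mentioned below can be sketched as follows; the function name and threshold choice are ours, not from a specific package.

```r
# Keep circRNAs detected (BJR count > 0) in at least `min_samples` samples;
# min_samples = half the samples reproduces a "Half samples"-style filter
filter_by_detection <- function(counts, min_samples) {
  counts[rowSums(counts > 0) >= min_samples, , drop = FALSE]
}

bjr_filtered <- filter_by_detection(bjr, min_samples = ceiling(ncol(bjr) / 2))
```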
Statistical modelling of circRNA expression count data
RNA-seq count data are often modelled with a negative binomial (NB) distribution [38]. However, when zero counts are in excess, a zero-inflated NB (ZINB) distribution may fit the data better [39]. We thus evaluated whether a ZINB distribution models BJR counts better than an NB by calculating the goodness-of-fit (GOF) on the BJR count matrices, both unfiltered and upon applying the expression filters. For each circRNA, the NB and ZINB GOF were compared according to the root mean square error (RMSE) of the estimated mean counts and of the probability of observing a zero, and according to the Akaike information criterion (AIC).
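A minimal per-circRNA sketch of this comparison, assuming `y` holds the BJR counts of one circRNA across samples, can be written with the MASS and pscl packages (one possible implementation, not necessarily the one used here):

```r
library(MASS)  # glm.nb: NB fit
library(pscl)  # zeroinfl: ZINB fit

# Intercept-only fits of the two candidate distributions
fit_nb   <- glm.nb(y ~ 1)
fit_zinb <- zeroinfl(y ~ 1 | 1, dist = "negbin")

# AIC comparison: the lower score indicates the better-fitting model
c(NB = AIC(fit_nb), ZINB = AIC(fit_zinb))

# Probability of observing a zero under each model, to be compared with
# the observed zero fraction (squared errors are averaged into the RMSE)
p0_obs  <- mean(y == 0)
p0_nb   <- dnbinom(0, size = fit_nb$theta, mu = exp(coef(fit_nb)))
pi_zi   <- plogis(coef(fit_zinb, model = "zero"))
mu_zi   <- exp(coef(fit_zinb, model = "count"))
p0_zinb <- pi_zi + (1 - pi_zi) * dnbinom(0, size = fit_zinb$theta, mu = mu_zi)
```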
Both the NB and ZINB distributions obtained a small error for the mean count estimation (RMSE < 0.07), independently of the expression filtering procedure and data set (Supplementary Figure 4). The ZINB model provided better estimates of the observed zero proportion than the NB for each expression filter and data set, except in the MS data set upon application of the two filters discarding most circRNAs (i.e. “By condition” and “Half samples”) (Supplementary Figure 5). However, we observed small errors also for the NB (RMSE < 0.09). According to the AIC, the ZINB distribution modelled the BJR count data better than the NB for most circRNAs across data sets and expression filters (mean 68±17%) (Figure 1 g), suggesting that a model accounting for an excess of zeros might fit circRNA expression data better.
Comparison of differential expression assessment methods on circRNA data
In this work, we focus on the problem of assessing circRNA differential abundance. The traditional methods for bulk RNA-seq data analysis have been the primary choice when analysing circRNA expression, with DESeq2 [23], edgeR [24,37,40], and Limma-Voom [41] arguably the most used. However, the circRNA expression characteristics shown above suggest that circRNA BJR count data might not comply with the assumptions of the traditional differential expression methods (DEMs), degrading their performance.
The high proportion of small counts and the sparsity of circRNA expression data are comparable to those observed in single-cell RNA-seq (scRNA-seq) and whole metagenome shotgun sequencing (WMS). In particular, the small counts and library sizes of circRNA data are similar to those of droplet-based scRNA-seq data [36]. Further, the sparsity of circRNA data is analogous to that of scRNA-seq and WMS data, which range between 12 and 75% and between 35 and 89% zero counts, respectively [17]. Finally, we observed that a ZINB distribution fits circRNA data better than an NB in most cases, as described in full-length scRNA-seq [36].
Therefore, we benchmarked 18 DEMs, including bulk RNA-seq DEMs and a few tools conceived for scRNA-seq and WMS data, selecting those freely available as R packages or functions. Furthermore, we explored different parameter settings, the ZINB-WaVE package weighting strategy [42,43], and normalisation approaches [44,45] coupled with DESeq2, edgeR, and Limma-Voom to specifically handle small counts and sparse data.
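For example, following the zinbwave Bioconductor vignette, the observational weights can be computed and passed to edgeR roughly as below; `se` and its `condition` column are illustrative placeholders, and this is a sketch of the general approach rather than our exact pipeline code.

```r
library(zinbwave)
library(edgeR)

# Compute observation-level weights that down-weight excess zeros
# (K = 0 and a large epsilon, as recommended in the zinbwave vignette)
zinb <- zinbwave(se, K = 0, epsilon = 1e12, observationalWeights = TRUE)
w <- assay(zinb, "weights")

# Weighted edgeR analysis with the zero-inflation-adjusted F test
dge <- DGEList(assay(se, "counts"))
dge <- calcNormFactors(dge)
design <- model.matrix(~ condition, data = as.data.frame(colData(se)))
dge$weights <- w
dge <- estimateDisp(dge, design)
fit <- glmFit(dge, design)
res <- glmWeightedF(fit, coef = 2)
```

For DESeq2, the same weights can instead be supplied as a "weights" assay of the DESeqDataSet before running the standard workflow.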
In total, we compared 38 differential expression analysis pipelines (Supplementary Table 1), evaluating their type I error control, false discovery rate (FDR), true positive rate (TPR, or recall), F1-scores, area under the precision-recall curve (AUPRC), and computation time. Moreover, we calculated the similarity of predictions between DEMs according to two similarity indexes.
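For reference, given a vector of adjusted p-values and the simulation ground truth, the threshold-based metrics reduce to a confusion-matrix computation; the variable names in this sketch are hypothetical.

```r
# padj: adjusted p-values; is_dec: logical ground truth from the simulation
alpha  <- 0.05
called <- !is.na(padj) & padj <= alpha

TP <- sum(called & is_dec);  FP <- sum(called & !is_dec)
FN <- sum(!called & is_dec); TN <- sum(!called & !is_dec)

FPR <- FP / (FP + TN)               # type I error, estimated on 'null' sets
FDR <- if (TP + FP > 0) FP / (TP + FP) else 0
TPR <- TP / (TP + FN)               # recall
F1  <- 2 * TP / (2 * TP + FP + FN)
```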
Benchmark data set simulation with a semiparametric approach
We generated 720 simulated data sets using SP-SimSeq [46], a semiparametric approach that preserves the real circRNA identities and the circRNA-circRNA correlations observed in real data. Specifically, for each of the four multiple-sample data sets, we simulated 30 expression matrices with an equal number of samples in two conditions, considering three (N03), five (N05), and ten (N10) samples per group. ‘Null’ data sets with no differentially expressed circRNAs (DECs) and ‘signal’ data sets with 10% DECs were generated. We evaluated the quality of the simulated data sets according to expression levels, fractions of zeros, and the relation between the two, as in Soneson and Robinson [47]. All measures were not significantly different from the original data sets, confirming that the simulated data followed the original data characteristics (Supplementary Tables 2-4).
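A hedged sketch of one such simulation call, following the SPsimSeq package vignette (argument values are illustrative; `bjr` and `group` denote a source BJR matrix and its condition labels):

```r
library(SPsimSeq)

sims <- SPsimSeq(n.sim = 30,                 # 30 simulated matrices
                 s.data = bjr,               # real source counts
                 group = group,
                 n.genes = nrow(bjr),
                 tot.samples = 10,           # e.g. N05: 5 samples per group
                 group.config = c(0.5, 0.5), # two equally sized groups
                 pDE = 0.1,                  # 10% DECs ('signal'; 0 for 'null')
                 model.zero.prob = TRUE,     # model the excess zeros
                 result.format = "list")
```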
The following paragraphs report the results for the N05 data sets at the 0.05 significance threshold, which we reasoned is a common scenario for circRNA RNA-seq experiments. The results from the N03 and N10 simulations at the 0.01 or 0.1 significance thresholds are available in the Supplementary Material.
Type I error control
We evaluated the type I error rate for each DEM, i.e., the probability of falsely predicting a DEC, by computing the false positive rate (FPR) in the ‘null’ data sets. The methods could be grouped into those with (i) liberal, (ii) conservative, and (iii) sufficient control of the type I error (Figure 2). Among the liberal methods, Seurat-BIM-LRT showed a largely uncontrolled type I error (FPR = 0.38), consistent with a previous assessment by Soneson and Robinson in scRNA-seq [15]. Other methods with moderately liberal type I error control included Seurat-WLX and three edgeR pipelines (TWSP, RBST, and 50DF), with a median FPR between 0.10 and 0.12, whereas slightly liberal methods included edgeR-ZW, NBID, Voom-LF, and Voom-QN (0.07 ≤ FPR ≤ 0.08). In contrast, conservative results (0 ≤ FPR < 0.03) were obtained by MAST, lncDIFF, DEsingle, the Wilcoxon test, limma-VST, and all the DESeq2 pipelines but DESeq2-ZW. The remaining methods achieved an FPR close to the nominal value (0.03 ≤ FPR < 0.07), and most of them (10 out of 16) were bulk RNA-seq methods. Results from NOISeqBIO were not suitable for type I error estimation because NOISeqBIO's scores are comparable to adjusted p-values rather than p-values, which explains its low FPR; we include NOISeqBIO in this analysis only for completeness. The quasi-likelihood framework [37,48], devised to improve type I error control when a linear model contains fitted values that are exactly zero, was effective with edgeR (edgeR-RBST-QFT) but not with the Limma-Voom pipeline (Voom-LF-MFT), which obtained a slightly higher FPR than the other Limma-Voom versions. Interestingly, DESeq2 showed a type I error closer to the imposed α only when using the ZINB-WaVE weights.
The performance of the methods was consistent regardless of the α threshold (Supplementary Figure 6). Moreover, a larger sample size improved the error rates only slightly for most methods, except for DESeq2, especially DESeq2-ZW, which showed much better results in larger data sets. Conversely, Seurat-WLX, metagenomeSeq, and the three edgeR pipelines mentioned above showed larger FPRs with data sets of increased size (Supplementary Figure 6).
Expression estimate characteristics of the false-positive differentially expressed circRNAs
We calculated signal-to-noise statistics for each tool that reported five or more false-positive (FP) DECs in at least one ‘null’ data set. Similarly to a previous work by Soneson and Robinson [15], we compared the significant and non-significant DECs according to their average counts per million (CPM), coefficient of variation (CV), variance, and mean fraction of zeros (Figure 2). In general, we did not observe marked differences between FP DECs and non-significant circRNAs. The signal-to-noise statistics of the CPM CV and variance were slightly positive for all methods, particularly the edgeR pipelines, except the CV for SAMseq and PoissonSeq, which was mostly negative. Likewise, the average CPM of FP DECs was slightly higher than that of non-significant circRNAs, except for Seurat-BIM, ROTS, metagenomeSeq, and the few FP DECs predicted by limma-VST. We observed a heterogeneous behaviour regarding the fraction of zeros: limma-VST, metagenomeSeq, ROTS, Voom-ZW, monocle, edgeR (except edgeR-QFT), and Seurat-BIM failed more often on circRNAs with higher fractions of zero counts. FP DECs originated approximately equally from all scenarios, except for circMeta-DT, which showed a higher FP number on the IDC and MS data set instances (Figure 2). Overall, the zero counts did not affect the type I error control as much as the expression abundance and variance did.
False discovery rate, power, F1-score, and area under the precision-recall curve
We used the ‘signal’ data sets to evaluate the methods’ false discovery and true positive rates. The Wilcoxon test and Seurat-WLX did not generate any significant prediction (Figure 3 a), thus resulting in a null TPR (Figure 3 b), and lncDIFF, MAST, and DEsingle returned significant predictions only in a few simulation instances. Similarly, DESeq2, PoissonSeq, and limma-VST did not predict any DEC in a relevant fraction of simulation instances, especially from the MS data sets. Only Seurat-BIM and circMeta-LC provided predictions below the imposed significance level in all the simulation instances.
Most methods (22 out of 38) showed a higher FDR than the imposed 0.05 level: lncDIFF and Seurat-BIM scored the worst FDRs (FDR = 1 and 0.95, respectively), followed by PoissonSeq, circMeta, all edgeR pipelines, NBID, metagenomeSeq, monocle, ROTS, and all Voom pipelines but Voom-ZW (FDR > 0.09). On the contrary, NOISeqBIO, MAST, DEsingle, limma-VST, and all DESeq2 pipelines but DESeq2-ZW kept the FDR below the nominal value (0 ≤ FDR ≤ 0.01). In DESeq2, a slightly more conservative FDR was obtained using the likelihood ratio test (DESeq2-LRT) compared to the Wald test (DESeq2-WaT). Voom-ZW, DESeq2-ZW, glmGamPoi, and SAMseq controlled the FDR close to the nominal value. Notably, every method except DEsingle and MAST presented a false discovery proportion (FDP) close to 1 in some instances. All methods obtained better FDR control on the N10 data sets but maintained the characteristics observed in the N05 data sets (Supplementary Figure 7).
The sensitivity was generally low, with a median below 50% for all methods (Figure 3 b). The highest TPRs (0.41 ≤ TPR ≤ 0.43) were obtained by three edgeR pipelines (TWSP, RBST, and 50DF). SAMseq, NBID, monocle, and four Voom pipelines (Voom-QN, Voom-LF, Voom-RBST, and Voom-DT) obtained TPRs between 0.31 and 0.36, whereas the remaining methods identified less than 30% of the true DECs. The choice of parameters greatly influenced the sensitivity of the edgeR pipelines. Interestingly, the quasi-likelihood framework produced opposite results when applied to edgeR or Limma-Voom, yielding the lowest and the highest TPR among the respective pipeline configurations. Similarly, the ZINB-WaVE weights allowed higher recall rates with edgeR and DESeq2 but a lower TPR with Limma-Voom. Regarding the DESeq2 pipelines, the scRNA-seq-oriented configurations obtained higher recall rates than the bulk RNA-seq configurations (Figure 3 b; Supplementary Figures 8-9). Notably, the adaptation of the circMeta test to low counts markedly improved the recall rate (median TPR 0.23 versus 0.03). Poor performance, close to zero, was achieved by limma-VST, lncDIFF, DEsingle, MAST, the Wilcoxon test, Seurat-WLX, and the bulk RNA-seq configurations of DESeq2.
All methods achieved significantly higher sensitivity with increased sample sizes, except lncDIFF and Seurat-WLX, which did not detect any true DEC, and metagenomeSeq, which improved only slightly (Supplementary Figure 9). The highest TPR among all settings (TPR = 0.9) was achieved by edgeR-RBST and edgeR-TWSP when allowing a 0.1 adjusted p-value threshold in the N10 data sets (Supplementary Table 5). In the smallest data sets (N03), NOISeqBIO had the highest recall rate (TPR = 0.7 at a 0.1 adjusted p-value), which was surprisingly higher than in the larger sets (Supplementary Figure 9).
We inspected the p-value distributions obtained in the ‘signal’ data sets to better understand the DEMs’ predictions (Supplementary Figure 10). CircMeta, edgeR, glmGamPoi, limma-VST, NBID, PoissonSeq, ROTS, SAMseq, and Voom showed p-value histograms of the expected shape [49]. The other DEMs did not show a uniform distribution of the p-values, most having an overabundance of large p-values or a distribution skewed towards p = 1. Interestingly, the DESeq2 overabundance of large p-values was mitigated by using the weights for zero counts. Comparing the p-value histograms between the N05 (Supplementary Figure 10) and N10 (Supplementary Figure 11) simulations, we observed better p-value distributions in the larger data sets, indicating that the conservative p-value distributions were due to the insufficient power of the methods with a small number of samples [16,49]. We observed a worse performance of Seurat-WLX compared to the simple Wilcoxon rank-sum test. Since Seurat-WLX implements an extended Wilcoxon rank-sum test that considers correlations between cases, the presence of positive correlations between circRNAs possibly increased the variance of the test statistic, making the test more conservative.
Similarly to the analysis of type I error, we calculated the signal-to-noise statistics of the variability, fraction of zeros, and expression abundance, comparing for each method the false negative (FN) and true positive (TP) predictions, i.e. the circRNAs not detected as differentially expressed compared to those correctly identified. We did not observe significantly different characteristics of the FN compared to the TP predictions (Supplementary Figures 12-14). The poor recall rate could be related to an imprecise dispersion estimation of the models [50] or a systematic deviation from the theoretical null distribution of the test statistics [49].
We calculated the F1-score of each method to evaluate precision and recall simultaneously (Figure 3 c; Supplementary Table 5). Monocle and SAMseq obtained the highest F1-score (F1 = 0.61), followed by Voom-QN (F1 = 0.58) and by edgeR-TWSP and edgeR-RBST (F1 = 0.57 and 0.56, respectively).
We observed that the methods generally achieved better precision than recall and that the precision scores were less spread out than the recall scores. In particular, edgeR-TWSP and edgeR-RBST owed their high F1-scores mainly to their high recall rates. Instead, the SAMseq, monocle, and Voom-QN scores were driven mainly by a high precision (PPV ≥ 0.88) (Supplementary Table 5). Interestingly, the circMeta tests, designed explicitly for circRNA expression, ranked amongst the lowest F1-scores. SAMseq also held the highest score in the N03 data sets (Supplementary Figures 15-16). However, we observed a different ranking in the N10 data sets: Voom and monocle still achieved the top scores, but ROTS, glmGamPoi, DEsingle, and five edgeR configurations ranked ahead of SAMseq (Supplementary Table 5; Supplementary Figures 15-16).
Finally, we inspected the ability of the methods to rank true DECs ahead of non-significant ones by computing the AUPRC, which is informative for data sets with a significant skew in the class distribution [51,52], as in our simulations. DEsingle obtained the highest AUPRC scores notwithstanding its poor performance in the above analyses, indicating that DEsingle could almost perfectly rank true DECs in the top positions and suggesting an overly conservative assignment of the p-values (Figure 3 d). SAMseq, Voom-DT, Voom-RBST, and monocle obtained the next best scores (median AUPRC ≥ 0.7) (Supplementary Table 6). Interestingly, some methods showing poor performance according to the above metrics, including DESeq2-ZI, the Wilcoxon-based methods, and MAST, obtained AUPRC scores comparable to the best-performing tools. In larger data sets, only lncDIFF and Seurat-BIM showed small AUPRC scores and a modest improvement (Supplementary Figure 17).
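The AUPRC can be computed, for instance, with the PRROC package, scoring circRNAs by their evidence of differential expression; this sketch reuses the hypothetical `padj` and `is_dec` vectors introduced above.

```r
library(PRROC)

score <- 1 - padj   # higher score = stronger differential expression evidence
pr <- pr.curve(scores.class0 = score[is_dec],    # true DECs
               scores.class1 = score[!is_dec])   # non-DECs
pr$auc.integral
```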
Analysis with nonparametric simulations from an independent data set
To corroborate the outcomes of the semiparametric simulation analysis, we performed nonparametric simulations from the data set of an independent study. We obtained the BJRs of 8,239 circRNAs computed with CIRI2 in 20 normal tissue and 76 tumour samples from a recent study on prostate cancer [35] (Table 1). In this data set, the number of replicates was sufficient to use the SimSeq tool [53], which performs nonparametric simulations without imposing any distributional assumption on the simulated data. Similarly to the previous analysis, we generated 90 instances of ‘null’ and ‘signal’ data sets with 6, 10, and 20 samples split into two equally sized condition groups. We analysed these data sets as the semiparametric data and ranked the methods according to FPR, TPR, and FDR in each simulation type (Supplementary Figures 18-20). We observed a significant positive correlation between the mean ranks of the nonparametric and semiparametric simulations (Spearman’s rho > 0.5, p-value < 0.001; Supplementary Table 7), indicating a generally consistent performance of the methods in the two simulation settings. Unexpectedly, DEsingle showed an AUPRC score opposite to that obtained in the semiparametric simulations.
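For illustration, a ‘signal’ instance can be resampled from the prostate-cancer matrix roughly as follows (a hedged sketch of the SimSeq usage with illustrative arguments; `bjr` and `treatment` denote the real counts and sample labels):

```r
library(SimSeq)

sim <- SimData(counts = bjr, treatment = treatment,
               sort.method = "unpaired",         # two independent groups
               k.ind = 5,                        # samples simulated per group
               n.genes = nrow(bjr),
               n.diff = round(0.1 * nrow(bjr)))  # number of true DECs

names(sim)  # the list holds the simulated counts and the true-DEC identities
```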
Similarity of differential expression methods’ predictions
We explored the similarity of the methods’ predictions in the semiparametric simulations according to two metrics that quantify the prediction overlap while capturing different aspects of how the predictions are used. First, we evaluated the similarity between method pairs according to the overlap of their DECs with an adjusted p-value ≤ 0.05, from which we calculated the Jaccard similarity coefficient. Second, for each method pair, we considered the area under the concordance at the top (CAT) curve, defined as the overlap of the top 100 circRNAs ranked according to adjusted p-values, regardless of any fixed threshold on the adjusted p-values.
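Both indexes are straightforward to compute for a pair of methods; a sketch, assuming `padj_A` and `padj_B` hold their adjusted p-values over the same circRNAs:

```r
# Jaccard similarity of the DEC sets at adjusted p-value <= 0.05
decs_A <- which(padj_A <= 0.05)
decs_B <- which(padj_B <= 0.05)
jaccard <- length(intersect(decs_A, decs_B)) / length(union(decs_A, decs_B))

# Concordance at the top: overlap of the top-k circRNAs for k = 1..100,
# summarised by the area under the resulting curve (here, its mean)
cat_curve <- sapply(1:100, function(k) {
  length(intersect(order(padj_A)[1:k], order(padj_B)[1:k])) / k
})
auc_cat <- mean(cat_curve)
```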
Clustering the DEMs according to the similarity indexes, we observed that DEMs based on the same tool tended to cluster together (Figure 4). In particular, DESeq2 and Voom showed a high degree of similarity within the respective pipelines, suggesting that modifying the parameters of these tools did not affect their outcomes much. Instead, the three edgeR pipelines characterised by high FPR and TPR clustered apart from the other edgeR configurations. Consistent with the results above, Voom-LF clustered apart from edgeR-QFT. DESeq2 and edgeR using the ZINB-WaVE weights reported similar predictions, albeit slightly different from Voom-ZW. Interestingly, the edgeR pipelines clustered closer to the Voom than to the DESeq2 pipelines according to the Jaccard similarity (Figure 4 a), whereas three edgeR configurations grouped closer to DESeq2 when considering the CAT (Figure 4 b), indicating more conservative p-values provided by DESeq2. Further, the scRNA-seq and bulk RNA-seq DEMs did not form distinct groups, indicating that they can provide similar results.
In the N10 data sets, allowing adjusted p-values ≤ 0.1, the DESeq2 pipelines showed the most consistent predictions regardless of the parameter configuration, according to both the Jaccard index and the CAT (Supplementary Figures 21-22). Conversely, the other DEMs showed a consistent ranking of their predictions (Supplementary Figure 22) but a great variation according to the Jaccard similarity (Supplementary Figure 21), suggesting that the parameter configurations influenced the p-value magnitudes but maintained the DEC ranking.
Overall ranking of the methods
To compare the methods' performances overall, we computed each method's rank relative to the other DEMs according to the F1 score, FDR, TPR, AUPRC, and FPR measures, independently in each simulated data set, with lower ranks corresponding to better-performing methods. The mean ranks and standard deviations computed on the N05 data sets are represented in Figure 5.
LncDIFF, MAST, Seurat-WLX, the simple Wilcoxon test, and DEsingle consistently performed worse than the other methods in all simulations. DEsingle achieved a good ranking according to the AUPRC, but the above analyses showed its unreliable behaviour in different data sets. NOISeqBIO, DESeq2-BP, DESeq2-LRT, DESeq2-WaT, limma-VST, and PoissonSeq showed poor performance, ranking close to or above the third quartile. Seurat-BIM ranked the worst according to the AUPRC and FPR. DESeq2 obtained poor rankings according to the F1 score, FDR, and TPR, while scoring average ranks for the AUPRC and FPR. Interestingly, DESeq2-ZW showed a slightly better ranking than the other DESeq2 configurations.
EdgeR-RBST, edgeR-TWSP, edgeR-50DF, and NBID obtained the best mean ranks (below or close to the first quartile) according to the F1 scores, owing mainly to their high TPRs. However, the edgeR pipelines ranked poorly according to the FPRs, raising some concerns about the reliability of their predictions. Moreover, NBID was outperformed by more than half of the DEMs according to the AUPRC, suggesting that it is suboptimal when modulating the significance threshold. All the Voom pipelines except Voom-ZW obtained ranks below the median in all measures, indicating the consistently good performance of the Limma-Voom models, especially Voom-DT and Voom-RBST. The other edgeR-based methods were a close second. SAMseq and monocle showed interesting results on average but with a large variation, which indicates less consistent performance. Notably, all the DEMs’ mean FPR ranks were above the first quartile, indicating that no method consistently outperformed the others in controlling the type I error.
Different rankings were obtained on the data sets with three and ten replicates per group (Supplementary Figure 23), confirming that the sample size greatly influenced the methods’ performance. DESeq2 obtained the largest improvement in larger data sets, whereas circMeta, NOISeqBIO, and NBID showed better rankings with small numbers of samples.
Computational time
We compared the methods according to the CPU time required for the analysis (Supplementary Figure 24). Most methods ran rapidly, completing in a few seconds to less than one minute. Conversely, computing the weights with ZINB-WaVE was the most time-demanding task. Monocle, DEsingle, NBID, and ROTS were the slowest methods, requiring two to eight minutes to complete the analysis of one simulated data set.