During the last decades, Next-Generation Sequencing (NGS) technologies have developed at a fast pace with the improvement of data quality coupled with a reduction of experimental costs. Since the early years of NGS, the use of RNAseq to profile transcriptomes became the method of choice replacing in time microarray-based analyzes [1]. Plant biologists use RNAseq-based transcriptomic extensively, generating knowledge about transcriptional regulation in several biological processes [2–5]. Differential gene expression analysis across different experimental conditions is classically used to gain insight into gene regulation events and gene co-expression analysis to identify functional modules.
A classical analysis workflow will start with a data normalization step to account for technical biases that affect the number of reads mapped to a gene. Several methods are available, and among the most used, we can find RPKM (Reads Per Kilobase per Million mapped reads) [6], Upper quartile normalization [7], RLE (Relative Log Expression) [8] and TMM (Trimmed Mean of the M-values) [9]. Multiple methods, based on different statistical modeling of data, are available to perform differential expression analysis. Negative binomial-based models with robust mean-variance modeling, have been used extensively at the beginning, and they are available in the R-packages edgeR [10] and DESeq [8]. More recently, the linear models and their generalized extensions for negative binomial distributions (GLM) have been proposed to account for the versatility of multifactorial experiments. They are available in the R-package limma [11] for the linear models and in the R-packages edgeR [12] and DESeq2 [13] for the generalized linear models. Following differential gene expression analysis, several approaches to identify and group co-expressed genes have been in use over the years. Pearson’s or Spearman’s correlations, WGCNA (Weighted correlation network analysis) method [14], hierarchical clustering and K-means are the most conventional approaches found in the literature [15,16]. With these approaches, the number of clusters is chosen, either a priori or a posteriori,by the user. Mixture models offer a different approach by identifying an underlying structure which corresponds to clusters of co-expressed genes. Moreover a model selection criterion allows determinining the most appropriate cluster number [17,18].
To perform such analysis, tools associated with the methods are available in R to quite quickly get from data to results [19]. In parallel bioinformatics tools offering a Graphical User Interface (GUI), and interactive visualization tools were developed. First-generation tools included the RNAseq read-mapping step [20–22], more often realized independently at present, depending on data and genome availability. Several of these GUI tools [23–29] ease the use of the main RNAseq analysis R-packages for normalization and differential expression analysis such as limma [12], DESeq [9], DESeq2 [14] and/or edgeR [13]. The role of these GUI is to realize R-based RNAseq data analysis with little or no experience in the command line. More recent tools take advantage of the R-shiny framework that eases the creation of a GUI for R-packages and pipelines [30]. The majority of these GUI tools include a high number of data visualization options and the possibility to generate figures for publications.
Even with all these tools, a biologist is often in front of a real dilemma on how to analyze his dataset correctly. Indeed a characteristic shared by the majority of GUI tools developed up to date is to offer the user the possibility to choose among multiple statistical methods for each step of the analysis with no specific propositions. However, the RNAseq data specificities, such as heterogeneity of counts or overdispersion among biological replicates, represent a methodological challenge that has to be addressed by proper statistical modeling of the gene expression. It is worth noting that in the case of multifactorial experiments, if interaction terms are included in the modeling, the writing of the contrasts might become tricky, requiring a good understanding of some statistical concepts, not always mastered by a biologist. As a result, the large-scale data analysis of RNAseq data is not straightforward for a biologist.
DiCoExpress aims to offer a tool usable without advanced statistical knowledge and/or programming skills to analyze RNAseq projects with complete and balanced designs with at most two biological factors and one technical factor. To offer a validated set of tools, we based our choice on three neutral comparison studies [18,31,32]. The idea of such studies is to design and implement a framework to generate realistic benchmark datasets with known truth to make an objective and reproducible performance assessment. Comparing normalization methods, Dillies et al. [31] showed that the RLE method implemented in the package DESeq2 [14] and the TMM method implemented in the package edgeR [11] demonstrate satisfactory behavior in the presence of highly expressed genes. Both these methods maintain a reasonable false-positive rate without loss of power. The choice of both methods was confirmed by Reddy et al. and Evans et al. [33,34] even in experiments with slightly asymmetric differential expression or different amounts of mRNA/cell per condition. Based on these detailed evaluations, both RLE and TMM are suitable, but we decided to choose the TMM normalization as the default method and proposed RLE as an alternative for normalization due to the choice made for the differential analysis described below.
Rigaill et al. [32] made a neutral comparison study among differential gene expression methods, including negative binomial-based, generalized linear models, and linear models on transformed data. Performance analyzes based on the p-value distributions, ROC curves, and proportion of true and false-positive rates show a clear difference of behavior between negative binomial-based methods and the others. Linear models on transformed data or generalized linear models are consequently the most adapted for the differential analysis. Among these models, as also observed in Schurch et al. [35], when the proportion of differentially expressed genes is low, the results obtained with the method implemented in the edgeR package are more satisfying. We thus chose the statistical model implemented in the edgeR package as a method of choice for differential expression data analysis. Moreover, we propose automatic writing of a large number of contrasts in order to facilitate the comparisons between the biological conditions considered in the experimental design. This automatic writing is a real advantage because, in the available R-packages, most contrasts in GLM with interactions between two factors must be handwritten and require thus an excellent understanding of the statistical modeling.
For the co-expression analysis, we preferred mixture models to correlation-based approaches. Mixture models aim at identifying an underlying structure in modeling the unknown distribution by a weighted sum of parametric distributions, each one representing a group of co-expressed genes. Gaussian mixture models were relevant for microarray data and were applied with success on several datasets [36,37]. For RNAseq data, which are discrete, Rau et al. [18] first concluded that normalized expression profiles modeled with a Poisson mixture are relevant for co-expression analysis. However, in the Poisson mixture, the dependence structure between samples is not considered and can mislead the results. To tackle this problem, they proposed then a Gaussian mixture after a transformation of the normalized expression profiles [19]. This model seems to be more suitable for RNAseq co-expression analysis by providing a proper identification of the groups of co-expressed genes because it accounts for per-cluster correlation structures among samples. For these reasons, we chose this Gaussian mixture implemented in the coseq R-package.
In conclusion, using these neutral comparison studies, we combined the most adapted tools for each step of a standard RNAseq analysis. DiCoExpress is a workspace, illustrated in Figure 1, to be installed on a computer to create a user-friendly workspace for analyzing RNAseq datasets. The directory Data will store all the projects, and the directory Results will contain a subdirectory per project with all the results of the different steps. The directory Sources contains the R functions used by DiCoExpress. Finally, the directory Template_scripts will contain an R script file for each project, allowing a semi-automated data analysis where the user is guided through all the steps from normalization to co-expression analysis. Hence, with DiCoExpress, our objective is to focus on the statistical modeling of gene expression according to the experimental design and on the interpretation of the results of such analysis in biological terms.