DiCoExpress: a workspace to process multifactorial RNAseq experiments from quality controls to co-expression analysis through differential analysis based on contrasts inside GLM models.

doi:10.21203/rs.2.19732/v1

Software

This work is licensed under a CC BY 4.0 License

Journal Publication

Published 12 May, 2020 in Plant Methods. You are reading an older preprint version.

Background RNAseq is nowadays the method of choice for transcriptome analysis. Over the last decades, a large number of statistical methods and associated bioinformatics tools have been developed for RNAseq analysis. More recently, statisticians have conducted neutral comparison studies on benchmark datasets, shedding light on the most appropriate approaches for RNAseq data analysis. Nevertheless, performing an RNAseq analysis remains a challenge for biologists.

Results DiCoExpress is a workspace implemented in R that includes methods chosen based on their performance in neutral comparison studies. DiCoExpress uses pre-existing R packages, including FactoMineR, edgeR and coseq, to perform quality control, differential, and co-expression analysis of RNAseq data. Users can perform the full analysis by providing a mapped read expression data file and a file describing the experimental design. Following the quality control step, the user can move on to the differential expression analysis, which is performed using generalized linear models and requires little effort thanks to the automated contrast writing function. DiCoExpress proposes a list of comparisons based on the experimental design, and the user only needs to choose the one(s) relevant to the research question. A co-expression analysis is implemented using the coseq package. Identified co-expression clusters are automatically analyzed for enrichment of annotations provided by the user, and several result outputs are proposed. We used DiCoExpress to analyze a publicly available Brassica napus L. RNAseq dataset on the transcriptional response to silicon treatment in plant roots and mature leaves. This dataset, including two biological factors and three replicates for each condition, allowed us to demonstrate in a tutorial all the features of DiCoExpress.

Conclusions DiCoExpress is an R workspace that allows users without advanced statistical knowledge and programming skills to perform a full RNAseq analysis, from quality controls to co-expression analysis through differential analysis based on contrasts inside generalized linear models. Hence, with DiCoExpress, the user can focus on the statistical modeling of gene expression according to the experimental design and on the interpretation of the results of such analysis in biological terms.

Plant Molecular Biology and Genetics

Bioinformatics

RNA-seq

analysis workspace

differential expression

contrasts

co-expression

During the last decades, Next-Generation Sequencing (NGS) technologies have developed at a fast pace, with improvements in data quality coupled with a reduction of experimental costs. Since the early years of NGS, RNAseq has become the method of choice to profile transcriptomes, progressively replacing microarray-based analyses [1]. Plant biologists use RNAseq-based transcriptomics extensively, generating knowledge about transcriptional regulation in several biological processes [2–5]. Differential gene expression analysis across experimental conditions is classically used to gain insight into gene regulation events, and gene co-expression analysis to identify functional modules.

A classical analysis workflow starts with a data normalization step to account for technical biases that affect the number of reads mapped to a gene. Several methods are available; among the most used are RPKM (Reads Per Kilobase per Million mapped reads) [6], upper quartile normalization [7], RLE (Relative Log Expression) [8] and TMM (Trimmed Mean of the M-values) [9]. Multiple methods, based on different statistical modeling of the data, are available to perform differential expression analysis. Negative binomial-based models with robust mean-variance modeling were used extensively at the beginning, and they are available in the R-packages edgeR [10] and DESeq [8]. More recently, linear models and their generalized extensions for negative binomial distributions (GLM) have been proposed to account for the versatility of multifactorial experiments. They are available in the R-package limma [11] for the linear models and in the R-packages edgeR [12] and DESeq2 [13] for the generalized linear models. Following differential gene expression analysis, several approaches to identify and group co-expressed genes have been in use over the years. Pearson’s or Spearman’s correlations, the WGCNA (Weighted correlation network analysis) method [14], hierarchical clustering and K-means are the most conventional approaches found in the literature [15,16]. With these approaches, the number of clusters is chosen by the user, either a priori or a posteriori. Mixture models offer a different approach by identifying an underlying structure that corresponds to clusters of co-expressed genes. Moreover, a model selection criterion allows determining the most appropriate number of clusters [17,18].

To perform such analyses, tools associated with these methods are available in R to get from data to results quite quickly [19]. In parallel, bioinformatics tools offering a Graphical User Interface (GUI) and interactive visualization tools were developed. First-generation tools included the RNAseq read-mapping step [20–22], which is now more often performed independently, depending on data and genome availability. Several of these GUI tools [23–29] ease the use of the main RNAseq analysis R-packages for normalization and differential expression analysis, such as limma [11], DESeq [8], DESeq2 [13] and/or edgeR [12]. The role of these GUIs is to make R-based RNAseq data analysis possible with little or no command-line experience. More recent tools take advantage of the R-shiny framework, which eases the creation of a GUI for R-packages and pipelines [30]. The majority of these GUI tools include a large number of data visualization options and the possibility to generate figures for publications.

Even with all these tools, a biologist often faces a real dilemma about how to analyze a dataset correctly. Indeed, a characteristic shared by the majority of GUI tools developed to date is that they let the user choose among multiple statistical methods for each step of the analysis without recommending any of them. However, the specificities of RNAseq data, such as heterogeneity of counts or overdispersion among biological replicates, represent a methodological challenge that has to be addressed by proper statistical modeling of gene expression. It is worth noting that in multifactorial experiments, if interaction terms are included in the model, writing the contrasts can become tricky and requires a good understanding of statistical concepts that biologists do not always master. As a result, large-scale analysis of RNAseq data is not straightforward for a biologist.

DiCoExpress aims to offer a tool usable without advanced statistical knowledge and/or programming skills to analyze RNAseq projects with complete and balanced designs with at most two biological factors and one technical factor. To offer a validated set of tools, we based our choices on three neutral comparison studies [18,31,32]. The idea of such studies is to design and implement a framework that generates realistic benchmark datasets with known truth, in order to make an objective and reproducible performance assessment. Comparing normalization methods, Dillies et al. [31] showed that the RLE method implemented in the package DESeq2 [13] and the TMM method implemented in the package edgeR [10] demonstrate satisfactory behavior in the presence of highly expressed genes. Both methods maintain a reasonable false-positive rate without loss of power. The choice of both methods was confirmed by Reddy et al. and Evans et al. [33,34], even in experiments with slightly asymmetric differential expression or different amounts of mRNA per cell between conditions. Based on these detailed evaluations, both RLE and TMM are suitable. We chose TMM as the default normalization method and propose RLE as an alternative, because of the choice made for the differential analysis described below.

Rigaill et al. [32] made a neutral comparison study of differential gene expression methods, including negative binomial-based methods, generalized linear models, and linear models on transformed data. Performance analyses based on the p-value distributions, ROC curves, and true- and false-positive rates show a clear difference in behavior between negative binomial-based methods and the others. Linear models on transformed data and generalized linear models are consequently the most suitable for the differential analysis. Among these models, as also observed by Schurch et al. [35], when the proportion of differentially expressed genes is low, the results obtained with the method implemented in the edgeR package are more satisfactory. We thus chose the statistical model implemented in the edgeR package as the method of choice for differential expression analysis. Moreover, we propose automatic writing of a large number of contrasts in order to facilitate the comparisons between the biological conditions considered in the experimental design. This automatic writing is a real advantage because, in the available R-packages, most contrasts in a GLM with interactions between two factors must be written by hand and thus require an excellent understanding of the statistical modeling.

For the co-expression analysis, we preferred mixture models to correlation-based approaches. Mixture models aim at identifying an underlying structure by modeling the unknown distribution as a weighted sum of parametric distributions, each one representing a group of co-expressed genes. Gaussian mixture models were relevant for microarray data and were applied with success to several datasets [36,37]. For RNAseq data, which are discrete, Rau et al. [17] first concluded that normalized expression profiles modeled with a Poisson mixture are relevant for co-expression analysis. However, the Poisson mixture does not consider the dependence structure between samples, which can lead to misleading results. To tackle this problem, they then proposed a Gaussian mixture applied after a transformation of the normalized expression profiles [18]. This model seems more suitable for RNAseq co-expression analysis: it provides a proper identification of the groups of co-expressed genes because it accounts for per-cluster correlation structures among samples. For these reasons, we chose the Gaussian mixture implemented in the coseq R-package.

In conclusion, using these neutral comparison studies, we combined the most suitable tools for each step of a standard RNAseq analysis. DiCoExpress, illustrated in Figure 1, is installed on a computer to create a user-friendly workspace for analyzing RNAseq datasets. The directory Data stores all the projects, and the directory Results contains a subdirectory per project with all the results of the different steps. The directory Sources contains the R functions used by DiCoExpress. Finally, the directory Template_scripts contains an R script file for each project, allowing a semi-automated data analysis in which the user is guided through all the steps from normalization to co-expression analysis. Hence, with DiCoExpress, our objective is to let the user focus on the statistical modeling of gene expression according to the experimental design and on the interpretation of the results of such analysis in biological terms.

To create DiCoExpress, we used the R programming language and several R-packages from CRAN and Bioconductor [39]. Each step of the analysis has a dedicated function available in the directory Sources. Seven functions compose the core of DiCoExpress (Fig. 2). They are combined in a script, stored in the directory Template_scripts for each project, that specifies the steps of the analysis and the parameters to use. A full description of these seven functions is available in Additional File 1.

Input files and data quality controls

To run DiCoExpress on a project, the user has to provide only two input files: one containing a count table summarizing the mapped reads for each gene, named Project_Name_COUNTS.txt, and a second one with a description of the project design according to the experimental factors, named Project_Name_TARGET.txt. If functional gene annotations are available in a file, the user has the option to upload it. This information will be integrated into the result tables and can also be used to perform enrichment tests.

The (1) Load_Data_Files function allows the user to upload the Project_Name_COUNTS.txt and Project_Name_TARGET.txt files. A check is done to be sure that both files are properly built: the samples in the Project_Name_COUNTS.txt file must be organized in the same order as the rows of the Project_Name_TARGET.txt file. If the order is inconsistent, the columns of Project_Name_COUNTS.txt are reorganized according to the row order of the target file. DiCoExpress performs analyses for a complete and balanced experimental design. If this condition is not met, an error message appears and the script stops running. A filter option in Load_Data_Files is proposed to extract a subset of samples leading to a complete and balanced design, thus avoiding manual modifications of the expression file. The filtering rules are described according to the Project_Name_TARGET.txt file (see section Results for an example). The (2) Quality_Control function produces several representations of the dataset before and after filtering of low-expressed genes and correction of the library size effect. This step is optional, but we advise users to perform it to evaluate the quality of the RNAseq data before further analyses.
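The two checks described above (sample order and a complete, balanced design) can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R code of Load_Data_Files; the function names and the in-memory representation of the two files are assumptions.

```python
from collections import Counter
from itertools import product


def align_counts_to_target(count_header, target_samples):
    """Return the column indices that put the count columns in target row order."""
    position = {name: i for i, name in enumerate(count_header)}
    missing = [s for s in target_samples if s not in position]
    if missing:
        raise ValueError(f"samples absent from the count table: {missing}")
    return [position[s] for s in target_samples]


def check_complete_balanced(design_rows):
    """design_rows: one tuple of biological factor levels per sample.

    Complete: every combination of factor levels is observed.
    Balanced: every combination has the same number of samples.
    """
    cells = Counter(design_rows)
    levels = [sorted(set(column)) for column in zip(*design_rows)]
    absent = [c for c in product(*levels) if c not in cells]
    if absent:
        raise ValueError(f"incomplete design, missing conditions: {absent}")
    if len(set(cells.values())) != 1:
        raise ValueError(f"unbalanced design: {dict(cells)}")
    return True


# Hypothetical sample names: the count columns are reordered to match the target rows.
order = align_counts_to_target(["S2", "S1", "S4", "S3"], ["S1", "S2", "S3", "S4"])
```

Raising an error on an incomplete or unbalanced design mirrors the behavior described in the text, where the script stops rather than silently analyzing a design it does not support.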

Differential expression analysis

The differential analysis is based on a negative binomial GLM, where the log of the gene expression is modeled by all the factors describing the experiment. When the number of observations is more than twice the number of parameters of the model, we advise including interaction terms between the biological factors. Such terms in the gene expression model might reveal meaningful interactions, such as genotype-environment interactions, and answer some biological questions directly [40–42]. The (3) GLM_Contrasts function automatically writes a list of contrasts based on the model specified by the user. We focused on contrasts involving the biological factors, and their names are sufficiently explicit to understand the biological question addressed. For example, we propose automatic writing of the difference between two modalities of a biological factor, either averaged over the second factor or for a given modality of the second factor. Thanks to this list of proposed contrasts, the user is able to choose the ones that are relevant for the project without worrying about the complexity of their statistical formulation. Running this function is a prerequisite to running the differential expression analysis. The (4) DiffAnalysis_edgeR function uses the edgeR R-package to estimate the parameters of the GLM and then test every contrast chosen by the user. As proposed by Rigaill et al. [32], the distribution of raw p-values of each contrast is inspected to assess the quality of the statistical modeling of the gene expression. Since the distribution of raw p-values is theoretically dominated by a uniform distribution, the fit between the statistical model and the data can be assessed on these raw p-value histograms. If the raw p-value distribution is not satisfactory (see the example in the Results below and Fig. 4), we advise repeating the analysis using a more stringent cut-off for the filtering step or another filtering rule.
If the raw p-value distribution remains unsatisfactory, the problem might come from a large number of parameters compared to the number of observations available to estimate them. In this case, we advise modifying the modeling of the gene expression, for example by removing the interaction term. For each contrast, a subdirectory is created to store the differentially expressed gene (DEG) lists and other useful result files.
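To make the idea of automated contrast writing concrete, the following Python sketch enumerates contrasts for two biological factors and expresses each one as weights on the cell means (one cell per combination of modalities). It is an illustration under stated assumptions, not the GLM_Contrasts implementation: the function name is hypothetical, the naming pattern is inferred from the contrast names reported in the Results, and the actual function works on the coefficients of the edgeR model rather than on cell means.

```python
from itertools import combinations


def two_factor_contrasts(levels_a, levels_b):
    """Return {contrast name: {(level of A, level of B): weight}} on cell means."""
    contrasts = {}

    def pairwise(first, second, first_is_a):
        # Orient every cell key as (level of A, level of B).
        def cell(x, y):
            return (x, y) if first_is_a else (y, x)

        for u, v in combinations(first, 2):
            averaged = {}
            for w in second:
                # Difference u - v for a given modality w of the other factor.
                contrasts[f"[{w}_{u}-{w}_{v}]"] = {cell(u, w): 1.0, cell(v, w): -1.0}
                averaged[cell(u, w)] = 1.0 / len(second)
                averaged[cell(v, w)] = -1.0 / len(second)
            # Difference u - v averaged over the other factor.
            contrasts[f"[{u}-{v}]"] = averaged

    pairwise(levels_a, levels_b, True)
    pairwise(levels_b, levels_a, False)

    # Interaction: (b1 - b2 within a1) minus (b1 - b2 within a2).
    for a1, a2 in combinations(levels_a, 2):
        for b1, b2 in combinations(levels_b, 2):
            name = f"[{a1}_{b1}-{a1}_{b2}]-[{a2}_{b1}-{a2}_{b2}]"
            contrasts[name] = {(a1, b1): 1.0, (a1, b2): -1.0,
                               (a2, b1): -1.0, (a2, b2): 1.0}
    return contrasts


# Two factors with two modalities each give seven contrasts.
for name in two_factor_contrasts(["MatureLeaf", "Root"], ["NoSi", "Si"]):
    print(name)
```

Every set of weights sums to zero, which is what makes each entry a contrast; the interaction entry is the only one that involves all four cells of a 2 x 2 design.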

Co-expression analysis

The (5) Venn_Intersection_Union function helps the biologist interpret the results by comparing different DEG lists. This function also generates the union and/or the intersections of these DEG lists in order to perform a co-expression analysis with the (6) Coexpression_coseq function. The latter function uses the coseq R-package [18] to transform the raw data into normalized expression profiles. We kept the filter function of coseq that removes the genes with low mean normalized counts. These discarded genes are assigned to Cluster 0. A co-expression analysis is performed on the remaining genes using a Gaussian mixture after an arcsin transformation of the normalized expression profiles. In practice, multidimensional Gaussian mixtures of 5–30 subpopulations with unequal proportions and a general covariance matrix are estimated. The EM algorithm used to estimate the model parameters is known to be sensitive to the initialization point. The coseq package uses a small-EM strategy, and in DiCoExpress we go further to get robust results. First, mixture models with 5, 10, 15, 20, 25, and 30 subpopulations are estimated 5 times each to identify an interval for the final number of co-expressed gene clusters. A second collection of models over this interval of subpopulation numbers is then estimated 40 times each (by default). The best mixture model is the one minimizing the Integrated Completed Likelihood (ICL). The ICL curve is expected to be a convex function of the number of subpopulations, and we use this criterion to check that the chosen model fits the data well. A different behavior of the ICL curve means that the dataset is too heterogeneous. In this case, we advise users to modify the dataset by removing genes whose normalized profile is too flat. For the co-expression analyses, we recommend using a powerful computing server. The RData object of the second loop is saved at each iteration so that the analysis can be resumed if the function is stopped.
The RData of the selected model is also saved. Moreover, several tables and graphics are saved in order to check the quality of the analysis and to explore the co-expression results.
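The two-stage selection of the number of clusters can be sketched as follows, in Python and independently of coseq. The sketch assumes that the ICL values of the repeated fits are already available, and it follows the convention stated above that the lowest ICL is best. The rule used here to derive the refinement interval (the grid neighbours of the coarse optimum) is a hypothetical choice for illustration; the text does not specify how DiCoExpress identifies this interval.

```python
def best_by_icl(runs):
    """runs: dict mapping a number of clusters K to the ICL values of repeated fits.

    Keep the lowest ICL obtained for each K, then return the K with the lowest value.
    """
    best = {k: min(values) for k, values in runs.items() if values}
    k_star = min(best, key=best.get)
    return k_star, best


def refinement_interval(coarse_grid, k_star):
    """Hypothetical rule: refine between the grid neighbours of the coarse optimum."""
    grid = sorted(coarse_grid)
    i = grid.index(k_star)
    low = grid[max(i - 1, 0)]
    high = grid[min(i + 1, len(grid) - 1)]
    return list(range(low, high + 1))


# Made-up ICL values for the first pass on a coarse grid.
coarse = {5: [10.0, 9.0], 10: [7.0, 8.0], 15: [9.5], 20: [12.0]}
k_coarse, _ = best_by_icl(coarse)
second_pass = refinement_interval([5, 10, 15, 20, 25, 30], k_coarse)
```

Keeping the best value per K before comparing across K is one simple way to reduce the sensitivity of EM to its initialization, which is the motivation given for repeating each fit.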

Enrichment analysis

Once an RNAseq analysis is complete, the next step is to evaluate the coherence of the results by comparing them with biological knowledge. To this end, the (7) Enrichment function performs hypergeometric tests in order to find annotation terms that are specifically enriched or depleted in a given list of genes with respect to a reference specified by an annotation file. The enrichment analysis following the co-expression analysis is automatically performed on all the co-expressed gene clusters. This function can also be applied to any list, e.g., lists of differentially expressed genes from the GLM analysis.
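As an illustration of the hypergeometric test mentioned above, the following Python sketch computes enrichment and depletion p-values from four counts. The function and argument names are assumptions, and the sketch does not reproduce the Enrichment function itself (for instance, it applies no multiple-testing correction).

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """P(X >= k): probability of at least k annotated genes in a list of n genes,
    drawn from a reference of N genes of which K carry the annotation term."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def depletion_pvalue(k, n, K, N):
    """P(X <= k): probability of at most k annotated genes in the list."""
    total = comb(N, n)
    upper = min(k, n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(0, upper + 1)) / total
```

For example, with a reference of N = 10 genes of which K = 5 are annotated, observing k = 2 annotated genes in a list of n = 2 gives an enrichment p-value of 10/45.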

We illustrate the use of DiCoExpress by analyzing a dataset associated with the publication of Haddad et al. [43]. This RNAseq dataset describes gene expression in roots and mature leaves of Brassica napus with or without silicon (Si) treatment. Three biological replicates are available. The experimental design can be described by two biological factors, Tissue and Treatment, and a technical factor, Replicate, with three modalities (Fig. 3). To illustrate the outputs of enrichment tests, we used the annotation of B. napus v.5 from the Brassica genome database [44] to perform enrichment analyses.

We tested DiCoExpress on the full dataset, in contrast to Haddad et al., who focused only on the root samples. The procedure is described in Additional File 2 as a tutorial of DiCoExpress. We started the analysis by filtering out genes that are not expressed and those with low counts. We used the Counts Per Million (CPM) method with CPM_Cutoff = 1 and Filter_Strategy = NbConditions, which are the default arguments of the Quality_Control function. We chose the default method, TMM, to normalize the RNAseq libraries. Checking the quality control results in the Brassica_napus_Data_Quality_Control.pdf output file, we observe a higher number of reads in the mature leaf samples compared to the root samples; nonetheless, the normalization seems suitable for further analysis since the boxplots of normalized counts are very similar across all the samples (Supplementary Fig. 1A and 1B). A hierarchical clustering heatmap and principal component analysis graphs are generated to examine sample similarities. In our analysis, we observe, as expected, a clear difference between the two tissues as well as an apparent clustering of mature leaf samples according to the treatment (Supplementary Fig. 1C and 1D).
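The CPM-based filtering step can be sketched in Python as follows. The function names and the min_samples parameter are assumptions introduced for illustration: the text does not define the exact rule behind Filter_Strategy = NbConditions, so the number of samples that must exceed the cut-off is left as an explicit argument.

```python
def counts_per_million(counts):
    """counts: dict gene -> list of raw counts, one per sample (same order for all genes)."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total number of counts in each sample.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {gene: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
            for gene, row in counts.items()}


def filter_low_expression(counts, cpm_cutoff=1.0, min_samples=1):
    """Keep genes whose CPM exceeds cpm_cutoff in at least min_samples samples.

    min_samples is a hypothetical parameter standing in for the filtering strategy.
    """
    cpm = counts_per_million(counts)
    return {gene: counts[gene] for gene, row in cpm.items()
            if sum(value > cpm_cutoff for value in row) >= min_samples}
```

Raising cpm_cutoff removes more weakly expressed genes, which is the adjustment made later in this analysis when the raw p-value histograms are not satisfactory.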

We performed a differential expression analysis using a GLM with both biological factors and the technical replicate factor. We included an interaction between the two biological factors in the model. We checked the quality of the modeling by looking at the raw p-value histograms of the seven contrasts automatically written by the GLM_Contrasts function. For three contrasts, [MatureLeaf-Root], [NoSi_MatureLeaf-NoSi_Root], and [Si_MatureLeaf-Si_Root], the right-hand part of the raw p-value histogram corresponds to a uniform distribution, indicating a good fit of the GLM. However, on the histograms of the four other contrasts, [NoSi-Si], [MatureLeaf_NoSi-MatureLeaf_Si], [Root_NoSi-Root_Si] and [MatureLeaf_NoSi-MatureLeaf_Si]-[Root_NoSi-Root_Si], we observe an increase of the frequency around 1, which usually suggests that the data are not properly filtered (Fig. 4A). Following this observation, we went back to the beginning of the analysis, setting a more stringent CPM_Cutoff = 5, and we obtained satisfactory raw p-value histograms for the seven contrasts (Fig. 4B). We observed, as expected, that the highest numbers of differentially expressed genes are found for the comparisons between the two tissues, with 28,261, 25,734, and 25,757 DEGs for the contrasts [MatureLeaf-Root], [NoSi_MatureLeaf-NoSi_Root] and [Si_MatureLeaf-Si_Root], respectively. A small number of differentially expressed genes is identified between the two treatments: 218, 754 and 173 DEGs for the contrasts [NoSi-Si], [MatureLeaf_NoSi-MatureLeaf_Si] and [Root_NoSi-Root_Si], respectively (Supplementary Fig. 2A). An advantage of using a GLM with an interaction term is that genes responding differently to the silicon treatment in the two tissues can be identified in a straightforward way using the [MatureLeaf_NoSi-MatureLeaf_Si]-[Root_NoSi-Root_Si] contrast. In this interaction analysis, we found 106 differentially expressed genes, and an example of a gene in this list is shown in Supplementary Fig. 2B.
DiCoExpress also proposes a hierarchical clustering of the top 50 DEGs, ranked by p-value, for this contrast (Supplementary Fig. 2C). At the bottom of this plot, we observe groups of genes with clearly opposite behavior between the two tissues. For the other genes, the behavior is more variable, but all these genes are identified as those most affected by the treatment, in different ways in the two tissues.
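The raw p-value histogram check used in this analysis is a visual inspection. As a rough numerical counterpart, the Python sketch below bins the p-values and flags an excess in the last bin relative to the bins between 0.5 and the last one. The heuristic, its threshold, and the function names are all hypothetical and are not part of DiCoExpress.

```python
def pvalue_histogram(pvalues, n_bins=20):
    """Count p-values in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        # p = 1.0 falls in the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def has_peak_near_one(pvalues, n_bins=20, tolerance=1.5):
    """Hypothetical heuristic: flag the histogram when the last bin exceeds
    `tolerance` times the mean height of the bins between 0.5 and the last bin."""
    counts = pvalue_histogram(pvalues, n_bins)
    reference = counts[n_bins // 2:-1]
    baseline = sum(reference) / len(reference)
    return counts[-1] > tolerance * baseline
```

A flagged histogram would correspond to the situation described above, where an increase of the frequency around 1 suggests that the data are not properly filtered; the histogram itself should still be inspected before changing the cut-off.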

As users often need to compare DEG lists, DiCoExpress provides the Venn_Intersection_Union function to generate these comparisons quickly. In the Brassica napus dataset, we took the union of three contrasts, [MatureLeaf_NoSi-MatureLeaf_Si], [Root_NoSi-Root_Si] and [MatureLeaf_NoSi-MatureLeaf_Si]-[Root_NoSi-Root_Si], to study genes whose transcription is affected by the treatment (Supplementary Fig. 3A). Within the Venn diagram, we can distinguish genes whose expression varies in response to treatment in a specific tissue, or in both tissues with a similar or different behavior depending on the tissue; examples from each class are shown in Supplementary Fig. 3B. This grouping of genes using a Venn diagram is based only on the results of the single-contrast differential analyses. However, by performing a co-expression analysis, we can go further in the interpretation by clustering these genes according to their average expression profile in all samples. We applied the Coexpression_coseq function with default parameters to group the 945 DEGs from the union of the three contrasts. The ICL curve is convex with a clear minimum (a marker of a good-quality clustering analysis), and we found seven clusters of co-expressed genes (Fig. 5). Three genes with low mean normalized counts were assigned to Cluster 0, i.e., coseq could not assign them to a cluster. Clusters 3 and 6 (71 and 106 genes, respectively) contain genes with low expression and no change in the roots, but whose expression varies in response to the Si treatment in the mature leaves (over-expression in Cluster 3 and under-expression in Cluster 6). Conversely, the genes of Cluster 7 (201 genes) show low expression with no significant change in the mature leaves, but they are strongly expressed in the roots, with a slight reduction following the treatment.
Clusters 1 and 4 (203 and 146 genes, respectively) include genes more expressed in one tissue than in the other (higher expression in mature leaves for Cluster 4 and higher expression in roots for Cluster 1) but without a significant response to the Si treatment. Clusters 2 (85 genes) and 5 (130 genes) group genes showing a small difference in expression levels between the two tissues. In both clusters, genes show over-expression following the treatment in roots (more apparent in Cluster 5), but no significant change in leaves. The cluster composition based on probabilistic modeling of the normalized gene profiles is, as could be expected, different from the groups of genes found with the DEG list comparisons (Additional File 3). The co-expression analysis completes the statistical analysis of this dataset. We then performed enrichment analyses on these 7 clusters, and also on the 106 DEGs of the interaction contrast; the results are available in the tutorial (Additional File 2). Further interpretation and discussion of the biology behind these enrichments are beyond the scope of our presentation of the DiCoExpress usage.
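The comparison of DEG lists described above amounts to set operations. The following Python sketch, with hypothetical function names and toy gene identifiers, computes the union and intersection of several lists as well as the regions of the corresponding Venn diagram; it is not the Venn_Intersection_Union function.

```python
def union_and_intersection(deg_lists):
    """deg_lists: dict mapping a contrast name to an iterable of gene identifiers."""
    sets = {name: set(genes) for name, genes in deg_lists.items()}
    union = set().union(*sets.values())
    intersection = set.intersection(*sets.values()) if sets else set()
    return union, intersection


def venn_regions(deg_lists):
    """Group genes by the exact combination of lists in which they appear."""
    membership = {}
    for name, genes in deg_lists.items():
        for gene in set(genes):
            membership.setdefault(gene, set()).add(name)
    regions = {}
    for gene, names in membership.items():
        regions.setdefault(frozenset(names), set()).add(gene)
    return regions


# Toy identifiers, for illustration only.
lists = {"leaf": ["g1", "g2", "g3"], "root": ["g2", "g4"], "interaction": ["g2", "g3"]}
union, common = union_and_intersection(lists)
```

The union is the kind of gene set that is passed on to the co-expression analysis, while the regions correspond to the classes of genes distinguished in the Venn diagram.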

DiCoExpress is a user-friendly workspace for efficiently analyzing multifactorial RNAseq transcriptome experiments, from quality controls to co-expression analysis through differential expression analysis. We based the development of DiCoExpress on neutral comparison studies, combining the best-performing statistical approaches for each step of a standard RNAseq analysis. In DiCoExpress, we used generalized linear models (GLM) implemented in the R-package edgeR for differential gene expression analysis and Gaussian mixture models implemented in the R-package coseq to perform the co-expression analysis. DiCoExpress simplifies the GLM analysis by proposing automated writing of all possible contrasts and optimizes the co-expression analysis by re-estimating the collection of Gaussian mixture models. DiCoExpress produces a collection of files to visualize the results and multiple summary files of the data for further data exploration. The integrated enrichment analysis with the hypergeometric test gives the user a first glimpse of the potential biological functions underlying the different gene lists. In conclusion, DiCoExpress allows biologists without advanced statistical knowledge and programming skills to use these two packages in a pre-existing and organized workspace.

Project name: DiCoExpress
Project home page: https://forgemia.inra.fr/ilana.lambert/dicoexpress/
Operating system(s): Windows, Mac OS, Linux
Programming language: R
Other requirements: R version 3.5.0 or higher with FactoMineR_2.0
License: GPL-2 | GPL-3
Any restrictions to use by non-academics: None

CPM: Counts Per Million
DEGs: Differentially Expressed Genes
FDR: False Discovery Rate
GLM: Generalized Linear Models
ICL: Integrated Completed Likelihood
LM: Linear Model
NB: Negative Binomial
NGS: Next-Generation Sequencing
PCA: Principal Component Analysis
RLE: Relative Log Expression
RPKM: Reads Per Kilobase per Million mapped reads
TMM: Trimmed Mean of the M-values

Acknowledgments

This work was mainly supported by the ANR PSYCHE (ANR-16-CE20-0009). IPS2 benefits from the support of the LabEx Saclay Plant Sciences-SPS (ANR-10-LABX-0040-SPS).

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

IL, SC, and MLMM planned the pipeline. SC provided end-user biologist suggestions and tests. MLMM provided statistical expertise. IL and MLMM wrote the software, with contributions from CPLR. IL, CPLR, SC, and MLMM tested the pipeline. IL, SC, and MLMM wrote the paper. All authors read and approved the final manuscript.

DECLARATIONS

Ethics approval and consent to participate: not applicable

Consent for publication: not applicable

Availability of data and materials: https://forgemia.inra.fr/ilana.lambert/dicoexpress/
Funding: This work was mainly supported by the ANR PSYCHE (ANR-16-CE20-0009). IPS2 benefits from the support of the LabEx Saclay Plant Sciences-SPS (ANR-10-LABX-0040-SPS).

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63.
Agarwal P, Parida SK, Mahto A, Das S, Mathew IE, Malik N, et al. Expanding frontiers in plant transcriptomics in aid of functional genomics and molecular breeding. Biotechnol J. 2014;9:1480–92.
O’Rourke JA, Bolon Y-T, Bucciarelli B, Vance CP. Legume genomics: understanding biology through DNA and RNA sequencing. Ann Bot. 2014;113:1107–20.
Rutley N, Twell D. A decade of pollen transcriptomics. Plant Reprod. 2015;28:73–89.
Bashir K, Matsui A, Rasheed S, Seki M. Recent advances in the characterization of plant transcriptomes in response to drought, salinity, heat, and cold stress. F1000Research. 2019;8.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–8.
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11:R25.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinforma Oxf Engl. 2010;26:139–40.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47–e47.
McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012;40:4288–97.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559.
Kumari S, Nie J, Chen H-S, Ma H, Stewart R, Li X, et al. Evaluation of Gene Association Methods for Coexpression Network Construction and Biological Knowledge Discovery. PLOS ONE. 2012;7:e50411.
D’haeseleer P. How does gene expression clustering work? Nat Biotechnol. 2005;23:1499–501.
Rau A, Maugis-Rabusseau C, Martin-Magniette M-L, Celeux G. Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinforma Oxf Engl. 2015;31:1420–7.
Rau A, Maugis-Rabusseau C. Transformation and model choice for RNA-seq co-expression analysis. Brief Bioinform. 2018;19:425–36.
Law CW, Alhamdoosh M, Su S, Dong X, Tian L, Smyth GK, et al. RNA-seq analysis is easy as 1–2–3 with limma, Glimma and edgeR. F1000Research. 2016;5.
Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, et al. RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res. 2012;40:W622–7.
Russo F, Angelini C. RNASeqGUI: a GUI for analysing RNA-Seq data. Bioinforma Oxf Engl. 2014;30:2514–6.
Russo F, Righelli D, Angelini C. Advancements in RNASeqGUI towards a Reproducible Analysis of RNA-Seq Experiments. BioMed Res Int [Internet]. 2016 [cited 2019 Jul 8];2016. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4764726/
Nelson JW, Sklenar J, Barnes AP, Minnier J. The START App: a web-based RNAseq analysis and visualization resource. Bioinformatics. 2017;33:447–9.
Varet H, Brillet-Guéguen L, Coppée J-Y, Dillies M-A. SARTools: A DESeq2- and EdgeR-Based R Pipeline for Comprehensive Differential Analysis of RNA-Seq Data. PLoS ONE [Internet]. 2016 [cited 2019 Jul 8];11. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4900645/
Su S, Law CW, Ah-Cann C, Asselin-Labat M-L, Blewitt ME, Ritchie ME. Glimma: interactive graphics for gene expression analysis. Bioinforma Oxf Engl. 2017;33:2050–2.
Li Y, Andrade J. DEApp: an interactive web interface for differential expression analysis of next generation sequence data. Source Code Biol Med. 2017;12:2.
Zhu Q, Fisher SA, Dueck H, Middleton S, Khaladkar M, Kim J. PIVOT: platform for interactive analysis and visualization of transcriptomics data. BMC Bioinformatics [Internet]. 2018 [cited 2019 Jul 8];19. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5756333/
Choi K, Ratner N. iGEAK: an interactive gene expression analysis kit for seamless workflow using the R/shiny platform. BMC Genomics [Internet]. 2019 [cited 2019 Jul 8];20. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6404331/
Kucukural A, Yukselen O, Ozata DM, Moore MJ, Garber M. DEBrowser: interactive differential expression analysis and visualization tool for count data. BMC Genomics. 2019;20:6.
shiny: Web Application Framework for R version 1.3.2 from CRAN [Internet]. [cited 2019 Jul 11]. Available from: https://rdrr.io/cran/shiny/
Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2013;14:671–83.
Rigaill G, Balzergue S, Brunaud V, Blondet E, Rau A, Rogier O, et al. Synthetic data sets for the identification of key ingredients for RNA-seq differential analysis. Brief Bioinform. 2018;19:65–76.
Reddy R. A Comparison of Methods: Normalizing High-Throughput RNA Sequencing Data. bioRxiv. 2015;026062.
Evans C, Hardin J, Stoebel DM. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Brief Bioinform. 2018;19:776–92.
Schurch NJ, Schofield P, Gierliński M, Cole C, Sherstnev A, Singh V, et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA N Y N. 2016;22:839–51.
Zaag R, Tamby JP, Guichard C, Tariq Z, Rigaill G, Delannoy E, et al. GEM2Net: from gene expression modeling to -omics networks, a new CATdb module to investigate Arabidopsis thaliana genes involved in stress response. Nucleic Acids Res. 2015;43:D1010–1017.
Frei dit Frey N, Garcia AV, Bigeard J, Zaag R, Bueso E, Garmier M, et al. Functional analysis of Arabidopsis immune-related MAPKs uncovers a role for MPK3 as negative regulator of inducible defences. Genome Biol. 2014;15:R87.
R: The R Project for Statistical Computing [Internet]. [cited 2019 Nov 28]. Available from: https://www.r-project.org/
Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015;12:115–21.
Brady SM, Burow M, Busch W, Carlborg Ö, Denby KJ, Glazebrook J, et al. Reassess the t Test: Interact with All Your Data via ANOVA. Plant Cell. 2015;27:2088–94.
Boussardon C, Martin-Magniette M-L, Godin B, Benamar A, Vittrant B, Citerne S, et al. Novel Cytonuclear Combinations Modify Arabidopsis thaliana Seed Physiology and Vigor. Front Plant Sci. 2019;10.
Varet H, Shaulov Y, Sismeiro O, Trebicz-Geffen M, Legendre R, Coppée J-Y, et al. Enteric bacteria boost defences against oxidative stress in Entamoeba histolytica. Sci Rep. 2018;8:1–12.
Haddad C, Trouverie J, Arkoun M, Yvin J-C, Caïus J, Brunaud V, et al. Silicon supply affects the root transcriptome of Brassica napus L. Planta. 2019;249:1645–51.
Montenegro et al. The pangenome of hexaploid bread wheat. Plant J [Internet]. 2017 [cited 2019 Jul 31]. Available from: https://onlinelibrary.wiley.com/doi/full/10.1111/tpj.13515
Dun X, Tao Z, Wang J, Wang X, Liu G, Wang H. Comparative Transcriptome Analysis of Primary Roots of Brassica napus Seedlings with Extremely Different Primary Root Lengths Using RNA Sequencing. Front Plant Sci [Internet]. 2016 [cited 2019 Jul 11];7. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4990598/
Maere S, Heymans K, Kuiper M. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinforma Oxf Engl. 2005;21:3448–9.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. 2003;13:2498–504.
Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, et al. InterProScan 5: genome-scale protein function classification. Bioinforma Oxf Engl. 2014;30:1236–40.
Guo A-Y, Chen X, Gao G, Zhang H, Zhu Q-H, Liu X-C, et al. PlantTFDB: a comprehensive plant transcription factor database. Nucleic Acids Res. 2008;36:D966–9.
Thimm O, Bläsing O, Gibon Y, Nagel A, Meyer S, Krüger P, et al. MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J Cell Mol Biol. 2004;37:914–39.


Review #2 received at journal
10 Feb, 2020
Editorial decision: Major revision
10 Feb, 2020
Reviewer #2 agreed at journal
29 Jan, 2020
Review #1 received at journal
20 Jan, 2020
Reviewer #1 agreed at journal
06 Jan, 2020
Reviewers invited by journal
04 Jan, 2020
Editor assigned by journal
26 Dec, 2019
Editor invited by journal
25 Dec, 2019
Submission checks completed at journal
24 Dec, 2019
First submitted to journal
23 Dec, 2019
