Transcriptome assembly from RNA-Seq data
We generated 167.8 million paired reads (86.2 million reads from poly-A enriched and 81.6 million from rRNA depleted total RNA samples) by Illumina paired-end sequencing of RNA prepared from the BmN4 cell line of Bombyx mori. To prove our concept of being able to produce a proper annotation without a genome and due to the unclear genetic background of the cell line, we applied a genome-free de-novo approach using the Trinity suite [12] (Fig 1A). After quality filtering, adapter trimming and erroneous k-mer removal almost 165 M paired reads and 158,589,380 bases were assembled into 186,401 ‘Trinity transcripts’ constituting 120,287 distinct ‘Trinity genes’. The assembled raw transcriptome represents 98.32% of the input reads, which shows that the assembly is highly representative (Additional file 1: Table S1). The traditional N50 statistics describe the minimal transcript length of transcripts that are assembled from at least 50% of the reads. We found the N50 length for our assembly to be 1553 bases. A better representation excludes lowly expressed transcripts as they might exhibit bigger biases. Hence, we investigated the N50 values across different expression level bins (ExN50) (Additional file 1: Fig S1). We found that the ExN50 peaks between the 80 and 90 expression percentiles. Thus a better representation is the E90N50 statistic, which represents the minimal transcript length of transcripts in the 90th expression percentile that are assembled from at least 50% of the reads mapping to these transcripts. The E90N50 transcript contig length is 2270 bases (Additional file 1: Table S2). We used TransRate [13] to validate the quality of the raw assembly (Additional file 1: Table S3). TransRate assesses the accuracy and completeness of a transcriptome assembly using only the input reads. It proceeds by mapping the reads to the assembled contigs, analyzing the alignments, calculating metrics for each individual contig, integrating these contig-level metrics to provide a contig score, and then combining the accuracy of the assembly with the score of each contig to produce an overall assembly score. The crude overall and optimal Transrate assembly score is 0.31 and 0.41, respectively, of which both are in the 70th percentile range of Transrate assembly scores of 155 published de-novo assembled transcriptomes [13] (Additional file 1: Fig S2). The expression-level-weighted assembly score, which weights each constituent contig score by the relative abundance level of each contig raises to 0.54 validating the high quality of the assembly and indicating that most of the low TransRate scores stem from contigs that are of relatively low abundance.
To extract all potential protein-coding transcripts from the assembled contigs, we applied TransDecoder [14] and kept transcripts that comprise an open reading frame of at least 20 amino acids. This filtering resulted in a list of 317,031 potential protein-coding open reading frames based on 95,817 individual genes. These potential protein-coding sequences are the basis of our further analysis. Using BUSCO [15], we detected that our assembled protein coding transcripts cover 94.8% of the arthropod BUSCO gene set (Fig 1B and Additional file 1: Table S4). Currently, three main initiatives have provided Bombyx mori genome assembly and annotation. For precision estimation, we compared our data to the currently available annotations of the different Bombyx mori varieties from UniProt UP000005204 with 14,776 gene models [16], NCBI Annotation Release 101 with 14,998 gene models, SilkBase 2017 with 16,880 gene models [4] and SilkDB 3.0 with 16,069 gene models [17]. In general, correspondence between current annotations and de-novo predicted proteins is high, with the majority of transcripts sharing protein full sequence coverage (90-100%) to the respective UniProt, NCBI, SilkBase and SilkDB 3.0 proteins (Fig 1C). When analyzing the genetic sequence variation between the BmN4 cell line and the transcripts from the NCBI and SilkBase annotation by mapping the RNA-Seq data to the respective CDS sequences of predicted gene models, we found on average exchange of 1 in 129 bases for NCBI annotations (i.e. 126,300 changes in 16,393,027 bases) and 1 in 105 bases for SilkBase annotations (i.e. 187,534 changes in 19,826,985 bases). The results of both comparisons are highly consistent. Approximately 75% of the detected changes are silent (synonymous) mutations while around 25% have missense (non-synonymous) and 0.4% nonsense effects (Additional file 1: Table S5) . These results unveil an unexpected quite large variation between Bombyx mori strains and emphasize the importance of applying a genome-free approach to provide exact CDS sequences especially for molecular biology applications.
Functional annotation of TransDecoder predicted protein sequences was performed using Trinotate [18] including blastp searches against all model species Swissprot databases, HMMER to identify protein domains, signalP to predict signal peptides, tmHMM to predict transmembrane regions and RNAMMER to identify rRNA transcripts. Furthermore, we used deeploc to predict protein localizations from the respective protein sequences. All functional annotations are included in the annotation file (Additional file 2: Table S10).
High resolution mass spectrometry data provides peptide evidence for protein coding transcripts
In order to provide evidence for the protein coding capability of our predicted protein coding open reading frames (ORF), we performed mass spectrometry measurements of protein extracts from the BmN4 cell line. Using LOPIT-DC [11] as a strategy to fractionate our sample, we aimed to increase protein detection depth, while also gaining cellular localization information for the detected protein sequences. Our full Bombyx mori proteome data set with 10 fractions contained 3,685,257 MS/MS spectra. Using the predicted open reading frames of the Trinity transcriptome assembly as search space for the mass spectrometry data, we noted a higher rate of identified MS/MS spectra compared to using the three currently available protein annotations from UniProt, NCBI, SilkBase and SilkDB 3.0 (two-sided paired Wilcoxon signed rank test, p=3.7*10-8 ,p=4*10-8, p=3.7*10-8 and p=3.71*10-8, respectively (Fig 2A). Applying stringent filtering criteria to have at least 2 identified peptides (at least one of them being unique), we identified a total of 6,273 protein groups (fasta files of the CDS and protein sequences of these proteins are provided Material S1 and S2, respectively). This was 16%, 18%, 14% and 24% higher than for the currently available UniProt (5,254 identified protein groups), NCBI (5,125 identified protein groups), SilkBase (5,396 identified protein groups) and SilkDB 3.0 annotations (4794 identified protein groups) (Fig 2B), emphasizing the power of strain specific gene sequences to increase proteome coverage and also validating the high quality of our genome-free de-novo transcriptome assembly. In order to investigate whether peptide identification could be hindered by provided protein annotations that include strain specific differences in protein sequences, we extracted peptide identification for relevant proteins from both Trinity and the most similar SilkBase annotations and chose as representatives those pairs that have an overlap of more than 80% in sequence but are not 100% identical in their protein sequence. We extracted all missense mutations that were identified for these annotation pairs and calculated for the respective locations the proportion of peptides that were not identified in the SilkBase search, although they could be identified in our de-novo Trinity annotation. We found that 88% of peptides assigned to predicted missense mutations (2325 out of 2653 peptides in 1988 protein groups affected) indeed hamper peptide identification when using the SilkBase annotation as search base.
Detected protein groups show improved quality statistics when compared to the raw in silico predicted potential protein coding transcripts, e.g. better overall correspondence with current UniProt, NCBI, SilkBase and SilkDB 3.0 annotations (Fig 2C, Additional file 1: Fig S3A), higher assembly scores (Fig 2D) and longer transcript lengths (Additional file 1: Fig S3B). We further observed that assembled contigs with high TransRate scores are indeed enriched with identification by mass spectrometry emphasizing the validity of the scoring approach used by TransRate, which is based on the raw read alignment features only (Additional file 1: Fig S3C).
Although the correspondence between our annotation and the annotation provided by NCBI and SilkBase is overall high (Fig 2C), there are still almost 20% of predicted coding sequences that correspond with less than 80% hit percentage coverage to the current annotations. We noted that some of the current SilkBase annotated transcripts are split into several (mostly two) separated genes in our annotation. This observation can have two main explanations: either the current gene annotations are interdependent (SilkBase includes NCBI annotations in the prediction process) and thus an erroneous annotation from earlier predictions could have been transferred to the newest SilkBase annotation, or, the separation of the non-corresponding annotations in our genome-free approach is wrong. To decide between these two possibilities, we checked the RNA-Seq read coverage across the gap between two separated proteins in our annotations that were suggested by SilkBase to be a single protein (941 pairs in total corresponding to 631 SilkBase genes). For this we investigated the RNA-Seq reads coverage for each of the relevant pairs of our Trinity predicted proteins and the respective genomic gap between these using the SilkBase genome assembly. Overall we observe that 76% of the Trinity annotated splits (629 out of 826 protein pairs) are well supported by clear gaps in the RNA-Seq raw data alignment at the split site (Fig S4 A and B). Only 5.5% of all Trinity predicted proteins (348 unique proteins in 195 protein pairs out of 6,273 detected proteins) do not show an evident gap in read coverage and hence likely have been falsely split in the annotation process. Our set of identified ORFs also includes 188 predictions that have a less than 85% hit coverage with a SilkBase annotation entry. The length differences are shown in Additional file 1: Fig S5A. In order to investigate if the shorter ORFs are supported by read mapping data, we calculated the difference between the read mapping frequency in the ORF region with the coverage at the edges of the transcripts. 124 (66% of the short ORFs) identified proteins show an evident absence of reads at the edges of the transcripts, while the remaining 64 might have been falsely split as we could observe mapped reads adjacent to the edges (Additional file 1: Fig S4C). Based on these observations, we conclude that our method has a precision rate of at least 93.4% (5,861 out of 6,273) for assigning individual genes. The respective categorizations into “high correspondence”, “evidently split”, “probable false split”, “evidently shorter than SilkBase” and “probably falsely shorter than SilkBase” have been included in the annotation table for clarity (Additional file 2: Table S10). Furthermore, we found mass spectrometry evidence for 164 predicted proteins that have been missed in any of the current annotations (marked with “newly annotated” in the annotation Additional file 2: Table S10, peptide evidence information for all novel proteins are provided in Additional file 4: Table S12). Another group of genes (513 genes (8% of all ORFs detected by MS); marked with “longer than SilkBase”) are longer than annotated in the current SilkBase annotation, however differences are mostly neglectable (Additional file 1: Fig S5B). We provide a website (http://butterlab.org/bombyxviewer) which incorporates data regarding all ORFs with mass spectrometric evidence including transcript and protein sequences and a genome viewer based on the SilkBase genome showing gene structure and individual RNA-Seq read mapping. Peptide evidence information and annotated MS-MS spectra of newly identified proteins are therewith also provided for download.
Subcellular localization of proteins determined by LOPIT-DC
To assign the Bombyx mori proteins to sub-cellular compartments, we performed label-free quantitative spatial proteomics adapting the recently released LOPIT-DC protocol [11]. Using differential centrifugation steps, we generated 10 subcellular fractions in independent quadruplicates. Fractionation replicates correlate very well (average Pearson correlation coefficient >0.97) (Fig 3A) and cluster together in the first two PCA dimensions (Fig 3B). Within the gradient, fraction series [1-3], [4-7] and [8-9] are similar to each other, while fraction 10 is most different from all others. This is consistent with fraction 10 constituting an acetone precipitation of all proteins that were not separated in the previous fractionation steps. Analyzing significant changes in pairwise comparisons of all fractions, efficient separation can be recapitulated, i.e. an increasing diversity of proteins can be observed throughout subsequent fractionation steps (Fig 3C). LFQ data and differential expression statistics across fractionation samples can be found in Additional file 3: Table S11.
We applied an unsupervised clustering approach to detect unique fractionation profiles. Self-organizing map (SOM) clustering of the normalized protein intensities of all proteins showing significant changes in at least one of the pairwise fractionation comparisons and appropriate clustering assessment and filtering (see Methods and Additional file 1: Fig S6) revealed 8 main clusters encompassing 3,942 protein groups as depicted in Fig 4A. The mean expression profiles across fractionation of the different clusters are depicted in Fig 4B (and in Additional file 1: Fig S7 for individual genes). In order to characterize the different clusters in terms of their potential subcellular localization or function, enrichment of specific categories as determined by signalP (prediction of signal peptides), tmHMM (prediction of transmembrane regions) and the Gene Ontology cellular-component (GO_cc) annotation were calculated (Fig 4C, Additional file 1: Table S7). Clusters 1-4 represent proteins that show relatively high intensities in the early fractionation steps with diminished intensities in later steps. Generally, membrane associated proteins (framed with black box in Fig 4C) are highly enriched in clusters 2 and 3. To get a more specific insight into the subcellular localization, we subsequently checked for enrichment of the GO_cc terms ‘lysosome’, ‘peroxisome’, ‘golgi apparatus’, ‘nucleus’, ‘chromatin’, ‘endoplasmic reticulum’ (ER), ‘mitochondrion’ and ‘ribosomes’, which were inferred by orthology to well-annotated model organisms in each cluster (framed with red box in Fig 4C). The most prominent enrichment was observed for cluster 1, which exhibits high levels in the first three fractions and low levels in later fractions. This cluster is highly enriched with mitochondrial genes (Fisher’s exact test, P = 10-197, fold-enrichment = 35.56). Cluster 2 shows reduced intensity after fractionation step 4 and is enriched with peroxisome, ER and lysosome annotated proteins (Fisher’s exact test, P = 10-16 , P = 10-6 and 10-5, fold-enrichment = 24.14, 3.2 and 4, respectively). Cluster 3 has lower intensity starting at fractionation step 5 and represents mainly endosplasmatic reticulum (ER) proteins (Fisher’s exact test, P = 10-7, fold-enrichment = 3.37). The profile of cluster 4 shows reduction of protein intensities after fractionation step 8 and contains proteins from Golgi apparatus (Fisher’s exact test, P = 10-5, fold-enrichment = 2.9). The second highest enrichment could be observed for cluster 5, where measured protein intensities peak in fraction 7-9 and which is highly enriched with ribosomal 40s and 60s proteins (Fisher’s exact test, P = 10-25, fold-enrichment = 11.76). Cluster 6 encompasses proteins with low abundance in the initial fractionation steps, which increase until step 9 and decline to minimal levels in fraction 10. This cluster exhibits a mixed enrichment profile of nuclear and chromatin associated proteins, but also with ribosomal proteins (Fisher’s exact test, P = 10-14, P = 10-3 and P = 10-13, fold-enrichment = 2.62, 2.97 and 6.92, respectively). Cluster 7 is the only cluster, that show exclusively high enrichment with nucleus associated proteins (Fisher’s exact test, P = 10-19, fold-enrichment = 3.18). Cluster 8, which shows abrupt protein intensity increase in the last two fractions couldn't be associated with any of the known localizations and hence was not further analyzed. Overall the cellular localization profiles of the different clusters correspond very well (average Spearman’s correlation coefficient =0.88) with those established in the LOPIT-DC method paper [11] where the assignment was established using experimentally validated marker genes (Additional file 1: Fig S8). Further, orthologous proteins of established Drosophila melanogaster cellular compartment marker genes show highly coherent profiles to the ones established by our unsupervised approach (Additional file 1: Fig S9). This shows that the fractionation strategy is robust and widely applicable. Comparing the fractionation profile of each protein to each of the clusters enables localization prediction also for proteins that could not be annotated properly by previous in silico analysis. The resulting clusters allow to assign proteins especially to the following 6 compartments (in descending order of certainty): Mitochondria (cluster 1), Ribosome (cluster 5), Nucleus (cluster 7), Peroxisome (cluster 2), Endoplasmic reticulum (cluster 3) and Golgi apparatus (cluster 4). For each individual MS detected protein, we calculated the correlations between its expression profile across fractionations and the predicted median profile of the above depicted localization. These correlation values are provided as confidence score for the localization probability of each protein. The information for each detected protein, including all relevant information such as transcripts type, length, score, annotations and localization categorization weights are provided in Additional file 2: Table S10).