Improving the Scaffolding of the Tcas Genome Assembly
The first published Tribolium genome sequence (NCBI Tcas3.0) was based on a Sanger 7x draft assembly [38] totaling 160 Mb, 90% of which was anchored to pseudomolecules or Linkage Groups (LGs) representing linkage groups in the molecular recombination map [39]. However, several large scaffolds (up to 1.17 Mb) were not included. To improve this draft assembly, we sequenced the paired ends of three large-insert jumping libraries (appr. 3,200 bp, 6,800 bp, and 34,800 bp inserts, respectively). These sequences were used to link scaffolds in the Sanger assembly and fill small gaps. Further, whole genome physical maps produced from images of ultra-long individual molecules of Tribolium DNA labeled at restriction sites (BioNano Genomics) were used to validate the assembly and merge scaffolds. The entire workflow and key steps are described below.
Using the long-insert jumping libraries, Atlas-Link (Baylor College of Medicine; www.hgsc.bcm.edu/software/atlas-link) joined neighboring anchored scaffolds and added several unplaced scaffolds, reducing the total number of scaffolds from 2,320 to 2,236. Of these, three were manually split because the joined scaffolds were known to be on different linkage groups based on the molecular genetic recombination map, leading to a total of 2,240 scaffolds. This analysis added formerly unplaced scaffolds to all LGs except LG4. In addition, 16 unplaced scaffolds were linked together.
We also took advantage of the new Illumina sequence information gained from the long insert jumping libraries to fill small gaps and extend contigs. GapFiller [40] added 77,556 nucleotides and closed 2,232 gaps. Specifically, the number of gaps of assigned length 50, which actually included gaps less than 50 nucleotides long or potentially overlapping contigs, was reduced by 65.6% (from 1,793 to 615).
Finally, BioNano Genomics consensus maps were used to validate and further improve the assembly (for details, see [41]). More than 81% of Tcas5.2 was directly validated by alignment with BioNano Genomics Consensus maps, the number of scaffolds was reduced by 4% to 2,148, and the N50 increased 3-fold to 4,753.0 kb. In total, the N50 was increased almost 5-fold where superscaffolding with BioNano Genomics optical maps improved the contiguity of the assembly the most. Table 1 shows the extent to which each step of the workflow impacted the quality of the genome assembly.
Re-Annotation of the Tribolium genome assembly
Re-annotation was performed using the gene finder AUGUSTUS [42]. For the current release, new data were available and incorporated as extrinsic evidence including RNA-Seq, ESTs (Expressed Sequence Tags) and protein sequences. The most impactful new information was the extensive RNA-Seq data (approximately 6.66 billion reads) covering different life stages and tissues. This allowed us to determine UTRs and alternative splice variants, which were not annotated in the previous official gene set. This increased both transcript coverage (Table 2) and the accuracy of the predicted gene features. The parameters of automated annotation were adjusted based on manual quality control of more than 500 annotations of previously published genes. The new gene set, OGS3, consists of 16,593 genes with a total of 18,536 transcripts. 15,258 (92%) genes have one isoform, 944 (5.7%) genes have two, 270 (1.6%) have three and 121 (0.7%) genes have more than three isoforms. During the re-annotation of the Tribolium gene set a basic parameter set for AUGUSTUS was developed and is now delivered with AUGUSTUS as parameter set “tribolium2012” (link for download: see Materials and Methods).
Major Changes in the OGS3
We compared the previous official gene set OGS2 [37], which was ‘lifted’ to the new assembly, Tcas5.2, with the new OGS3 and found that 9,294 genes have identical protein sequences, while 3,039 genes have almost identical protein sequences (95% minimum identity and 95% minimum coverage). 1,452 genes were completely new, meaning that they did not overlap any lifted OGS2 gene above the given thresholds. A similar number (1,420) of predicted genes from OGS2 do not exist anymore in OGS3. We further analyzed the “lost” and “new” genes and found that our procedure was efficient in removing false positive annotations and in detecting novel true genes. First, based on the lack of a BLAST hit in invertebrates (e-value cutoff: e-05), GO annotation or RNA-Seq coverage we assume that the “lost” OGS2 annotations had been falsely annotated. Second, when examining the newly found genes, we observe that 528 of 1,452 (36%) genes had significant BLAST hits in other insect species. Further, 690 of 997 (69.2%) of the new genes have at least one intron supported by RNA-Seq. New single exon genes have an average read coverage of about 550,000 reads per gene with minimum coverage of 11 reads per gene. The percentage of missing BUSCO genes was reduced from 0.7% to 0.4%. Together, these metrics indicate that real genes were newly annotated. Table 3 compares important characteristics between the previous and the current OGS.
We further examined gene structure changes (not including the identification of splice variants). For this, we counted both, gene join and split events that occurred in the new gene set. Joins are indicated when the CDS of an OGS3 gene overlaped the CDSs of two or more genes from the previous gene set on the same strand. In total, we observe 949 such join events. In 485 (51%) of these events, the new intron of an OGS3 gene was supported by spliced read alignments spanning the gap between two neighboring OGS2 genes, suggesting that the annotations had erroneously been split in the previous annotation. We detected gene split events by counting gene join events where an old OGS2 gene joined multiple OGS3 genes. We observed 424 such events. In 45 cases (10%) the joining OGS2 intron had RNA-Seq support. Taken together, while >50% of the joined genes were supported by sequencing data only 10% of the split events turned out to be likely false positives. This indicated that the parameter set was adequate to enrich for true annotations in the new gene set.
RNA-Seq support for the gene sets
Analysis of differential gene expression has become an essential tool in studying the genetic basis of biological processes. Such analyses profit from a better gene model where a higher number of reads can be mapped. To test whether the new gene set performed better in such analyses, we mapped our collection of RNA-Seq reads to both (Table 2). In this analysis 6.66 billion RNA-Seq reads from Tribolium where mapped against the two gene sets (transcriptome) OGS3 and, for comparison, OGS2 with the alignment tool BLAT [43]. Alignments with less than 90% identity were discarded and only the best alignment was kept for each read. About 70% of the reads mapped to OGS2 whereas 81% mapped to OGS3.
To evaluate the splice sites in the new gene set we compiled a set of splices suggested by gaps in RNA-Seq read alignments compared to the genomic sequence (intron candidates). These RNA-Seq read alignments where filtered by a range of criteria (see Methods). In total this set contained 65,274 intron candidates. We refer to the term multiplicity of an intron candidate as the number of reads that were found to cross a given exon-exon boundary at the identical position. Some candidate introns are likely not introns of coding genes, e.g. from alignment errors or from spliced noncoding genes. Overall, candidate introns had an average multiplicity of 7,898. 1,403 candidate introns had a multiplicity of one while 3,362 had a multiplicity smaller or equal to five. OGS3 contains about 30% more RNA-Seq supported introns than OGS2: 41,921 out of 54,909 introns in OGS2 (76.3%) and 54,513 out of 63,211 in OGS3 (86.2%) are identical to an intron suggested by RNA-Seq spliced read alignments (Table 3).
BUSCO analysis reveals very high accuracy of the gene set
The completeness of OGS3 was assessed using BUSCO (Benchmarking Universal Single-Copy Orthologs) and compared to the value for OGS2 [44] and to those of other sequenced genomes [45–47]. The genome of Drosophila melanogaster can be assumed to be the best annotated genome of insects, the genome of Apis mellifera was recently re-annotated and is therefore comparable to the OGS3 from Tribolium and for Parasteatoda tepidariorum, for Which the first genome version was just published with the peculiarity of large duplication events. Nearly all of the conserved genes from the BUSCO Arthropoda set where found in OGS2 and OGS3 (Table 4). OGS3 (99.6%) scored slightly better than OGS2 (99.3%). The completeness of OGS3 rivals that of Drosophila (99.8%) and is better than Apis (97.9%) or Parasteatoda (94.4%) (Table 4).
Official gene set and NCBI RefSeq genome
The genome assembly as well as the gene models have been submitted to Genbank (NCBI) as the RefSeq genome (GCF_000002335.3) and Tribolium (OGS3) (GCA_000002335.3) [48]. Genome assembly 5.2 and gene set OGS3 are available on the NCBI website (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/335/GCF_000002335.3_Tcas5.2) and are available as a preselection in several NCBI services, such as the BLAST search.
Protein sequence conservation
Drosophila melanogaster and Caenorhabditis elegans are the main invertebrate models for functional genetics and have contributed tremendously to the understanding of cellular and molecular processes relevant for vertebrate biology. However, their protein sequences are quite diverged compared to Apis mellifera or the annelid Platynereis dumerilii [49]. The transferability of findings to other taxa may depend, among other things, on the biochemical conservation of the proteins involved. Hence, when choosing a model system, the conservation of the proteome is an important aspect. In Tribolium, the genetic toolkit is more developed compared to other insects (except for Drosophila) or annelids. Unbiased genome-wide screening has been established making Tribolium an excellent alternative model for studying basic biological processes. We therefore asked how the protein sequences of the red flour beetle compare to other invertebrate model systems. As outgroup we used the main vertebrate model organism for medical research, the mouse Mus musculus.
We identified 1,263 single-copy orthologs across five species, made an alignment and calculated a phylogenetic tree (Fig. 1A). The Tribolium branch is shorter compared to those of Drosophila and C. elegans indicating that the Tribolium proteome is more similar to that of the mouse than are the proteomes of Drosophila and Caenorhabditis. In this comparison the annelid proteome appears to be even more similar to that of the mouse proteome. In such alignment-based sequence comparisons, the less conserved non-aligneable parts of the proteins are not considered. Therefore, we used an alignment-free method for measuring sequence distances [50, 51] on the same dataset and found it to basically reflect the same conclusion albeit with less resolution (Fig. 1B).
Prediction of microRNA binding sites
MicroRNAs are short non-coding RNAs that regulate gene expression by guiding the RNA-induced silencing complex (RISC) to complementary sites in the 3’UTR regions of target mRNAs (reviewed in [52]). The principal interaction between microRNAs and their targets occurs through the so-called “seed” region, corresponding to the 2nd and 8th position of the mature microRNA sequence [53], and this complementarity can be used for computational predictions of microRNA-target pairs. Previous studies experimentally identified 347 microRNA genes in the Tribolium castaneum genome, each of which can generate two mature microRNAs derived from the two arms (5p and 3p) of the microRNA precursor hairpin (Supplementary Table 1)[54, 55] . We extracted the 3’UTR sequences of Tribolium protein-coding genes and annotated potential microRNA binding sites in these regions using an algorithm based on the microRNA target recognition principles described in [53]. In addition, we generated an alternative set of computational microRNA target predictions using an algorithm based on the thermodynamic properties of microRNA-mRNA duplexes irrespective of seed complementarity [56]. The two algorithms identified 309,675 and 340,393 unique putative microRNA-target pairs, with approximately 60% overlap. Moreover, a similar number of genes in each set, 13,136 and 13,057 respectively, had at least one microRNA target site.
Comparison of microRNA target gene sets
MicroRNAs are recognized as important players in animal development, and their role in insects is best understood in the classical model organism Drosophila melanogaster. Comparative genomic analyses showed that 83 Tribolium castaneum microRNAs have one or more homologs in Drosophila [54, 55]. To assess whether conserved microRNAs also have a conserved target repertoire, we sought to assess the number of orthologous genes targeted by each conserved microRNA pair. To this end, we used an identical target prediction approach to determine microRNA-target pairs in Drosophila melanogaster, and calculated the numbers of homologous and non-homologous targets for each conserved microRNA pair in the two species (Supplementary Table 1). Results indicated that even though the majority of homologous microRNAs have conserved seed sequences for at least one mature product, their target repertoires diverged.
Nonetheless, a subset of well-conserved microRNAs had higher numbers of common predicted targets than expected by chance, especially based on seed complementarity. These included members of the bantam, mir-184, 279/miR-996, mir-2/11/13/2944/6, mir-9, mir-14, mir-1, mir-7, mir-34 seed families, which have been previously identified for their roles in key developmental processes in Drosophila, and are highly expressed in both fruit fly and beetle embryos.
Given the large number of target predictions identified for individual microRNAs we examined the specific conserved targets for one of the microRNAs that both exhibited significant target conservation and had well characterized targets in Drosophila. The miR-279/miR-996 family has been extensively characterized for its role in regulating the emergence of CO2 sensing neurons and in circadian rhythms. in Tribolium, of the nine characterized targets identified in Drosophila, one had no clear ortholog (upd), four did not have conserved targeted sequences in their UTRs (STAT, Rho1, boss, and gcm), but four targets (nerfin-1, esg, ru, and neur) had strongly conserved predicted target sites. microRNA regulation of all these four targets has clear functional importance in these developmental processes and two of them (nerfin-1 and esg) work together as key players in the formation of CO2 sensing neurons [57].
In summary, we provide an example where conserved microRNA regulate similar developmental pathways between the two taxa. It will be interesting to determine the degree of conservation of the entire microRNA set. The predicted microRNA binding sites are now available as tracks in the genome browser at iBeetle-Base (https://ibeetle-base.uni-goettingen.de/gb2/gbrowse/tribolium/)