Genome resequencing and assembly
Reference genome files
The T. castaneum reference genome assembly (Tcas_3.0, NCBI accession number AAJJ01000000) was downloaded from NCBI. The following 23 contigs, which had been marked by NCBI as contaminants were removed: AAJJ01000455, AAJJ01001129, AAJJ01001336, AAJJ01001886, AAJJ01003084, AAJJ01003125, AAJJ01003874, AAJJ01004029, AAJJ01004493, AAJJ01004617, AAJJ01005150, AAJJ01005727, AAJJ01005755, AAJJ01006305, AAJJ01006331, AAJJ01007110, AAJJ01007612, AAJJ01007893, AAJJ01008452, AAJJ01009546, AAJJ01009593, AAJJ01009648, and AAJJ01009654. In addition, the first 411 nucleotides from AAJJ01009651, and the first 1,846 and last 46 nucleotides from AAJJ01005383 were removed after being identified as contaminants. The remaining 8,815 contigs (N50 = 43 Kb) had been used to construct the 481 scaffolds (N50 = 975 Kb) included in Tcas 3.0. Information from a genetic recombination map based on molecular markers [39], was used to anchor 176 scaffolds in 10 superscaffolds (often referred to as pseudomolecules or chromosome builds). In Tcas 3.0 these are referred to as ChLGX and ChLG2–10, representing the linkage groups in the recombination map. The remaining 305 scaffolds and 1,839 contigs that did not contribute to the superscaffolds were grouped together in Beetlebase (http://beetlebase.org or ftp://ftp.bioinformatics.ksu.edu/pub/BeetleBase/3.0/Tcas_3.0_BeetleBase3.0.agp) (unknown placement).
Description of Illumina libraries
The DNA used to construct three long-insert jumping libraries (3, 8, and 20 kb target size) was isolated at the Baylor Human Genome Sequencing Center in 2004 for Sanger-based sequencing. Thus, the source of DNA for these data is the same as for the original reference genome. The insert sizes for the three libraries are 3,173 bp, 6,775 bp, and 34,825 bp, respectively, with 10–15% standard deviation. Library construction, Illumina sequencing and cleaning were performed by MWGOperon (Europe). For all libraries, reads of minimum length 30 bp and maximum 100 bp were retained after cleaning and removal of the internal spacer. The “_1” files contain the forward reads while the “_2” files contain the reverse reads. Reads lacking the spacer or containing insert sequence only on one side of the spacer were not used. Table 4 lists the number of reads and their length for the jumping libraries.
Scaffolds linked with Atlas-Link v0.01
Atlas-Link is a software tool that links and orients scaffolds using mate pair libraries (www.hgsc.bcm.edu/software/atlas-link). Scaffolds were indexed using the IS algorithm in BWA (bwa index -a is tcas3.contigs.fa), the SA coordinates of the input reads were determined (E.g., bwa aln -t 4 tcas3.contigs.fa 3kb_1.fastq > 3kb_1.fastq.aln.), and alignments generated given the single-end reads, respectively (E.g., bwa samse tcas3.contigs.fa 3kb_1.fastq.aln 3kb_1.fastq > 3kb_1.fastq.sam). The mapped reads were merged (We used merge_pair_in_sam.pl from ATLAS GapFill v2.2, downloaded from www.hgsc.bcm.edu/software/atlas-gapfill. E.g., merge_pair_in_sam.pl 3kb_1.fastq.sam 3kb_2.fastq.sam > 3kb.merged.sam) and Atlas-Link was run on each long insert jumping library (Ran first ATLAS_link/generate_input_for_ATLAS-link.pl -l sam_file_list -f tcas3.contigs.fa to generate the input files for Atlas-Link. Then, ran ATLAS_link/do-Atlas-Link.pl -l sam_file_list.libfile -t sam_file_list.vScaf.contig -f atlas.link.configure.file -a tcas.agp -o Output), with the settings described in Supplementary file 2. Table 5 shows the improvements that where achieved by Atlas-Link. Scaffold order and placement within Chromosome LG builds was used to validate the Atlas -Link output. We found that using a low value (3–10) for “minimum number of links to merge scaffolds” tended to scramble the order and orientation of scaffold that had been placed on Chromosome LG builds in the original assembly Tcas_3.0. Using a value of 300 minimum links reproduced most of the original order, linking neighboring scaffolds and including scaffolds that were unplaced in Tcas_3.0. In addition, we also updated the output AGP file, because Atlas-Link modified the contig_start and contig_stop coordinates making them all start from one. This is not an issue for the contigs that actually start from one, but for the ones that do not, i.e., the ones which NCBI has with x and x+y contig_start and contig_stop coordinates, Atlas-Link changed their coordinates to 1 and 1+y, thus indicating a different fragment of the contig to be kept. We fixed these to reflect the NCBI coordinates.
Contigs extended and gaps closed with GapFiller v1.10
GapFiller was run with the following parameters (perl GapFiller_v1–10_linux-x86_64/GapFiller.pl -l libraries.txt -s scaffolds.fasta -m 29 -o 10 -r 0.7 -n 10 -d 100 -t 0 -T 10 -i 10 -b standard_out. The results of the GapFiller run are listed in Table 6, the contents of libraries.txt are listed and explained in Supplementary file 2): minimum number of overlapping bases with the edge of the gap 29, minimum number of reads needed to call a base during an extension 10, percentage of reads that should have a single nucleotide extension in order to close a gap in a scaffold 0.7, minimum overlap required between contigs to merge adjacent sequences in a scaffold 10, maximum difference between the gapsize and the number of gapclosed nucleotides (extension is stopped if it matches this parameter + gap size) 100, number of reads to trim off the start and begin of the sequence 0, number of threads to run 10, and number of iterations to fill the gaps 20.
Note that the output from Atlas-Link is an AGP file, while GapFiller uses a batch FASTA file with the scaffolds as input. To generate the input file for GapFiller we used the AGP file generated by Atlas-Link and the batch FASTA file with the contigs and generated the scaffolds using the make_scaffolds_fa.pl (make_scaffolds_fa.pl tcas3.contigs.fasta atlas_link_with_fixed_contig_coords.agp scaffolds.fasta) script listed in Supplementary file 2. In the process, we merged the overlapping, and the adjacent contigs with no gaps in between them. We split the scaffolds into contigs with minimum gap length of 10, using fasta_to_agp.pl listed in Supplementary file 2. An intermediate assembly (Tcas_4.0) reflected these changes but lacked the BioNano Genomics mapping (see below).
Scaffolds joined using BioNano Genomics consensus maps
Annotation
The reannotation of the protein-coding genes of Tribolium castaneum was done in three main steps: 1) automatic gene prediction based on an unpublished intermediate assembly 4.0 with AUGUSTUS [42] incorporating evidence from multiple sources, 2) merging the gene prediction with the previous official gene set OGS2 [37] and 3) a mapping of the new gene set to assembly 5.2 using liftover [60]. Additionally, manual curation and correction was done for 399 genes.
Protein-coding Genes
AUGUSTUS is a gene prediction tool based on a hidden Markov model that allows to incorporate extrinsic evidence such as from RNA-Seq or protein homology. Such extrinsic evidence is summarized in the form of so-called ‘hints’ that are input to AUGUSTUS and that represent mostly soft evidence on the location of exons, introns and other gene features.
RNA-Seq libraries of around 6.66 billion reads from the iBeetle consortium and 9 external contributors constitute the majority of evidence. All reads were aligned against the repeat masked genome assembly 4.0 with GSNAP [61]. Hits were filtered according to three criteria. A hit must reach a minimum identity threshold of 92%. Furthermore, a paired read filter was applied: Reads that are paired must not exceed a genomic distance of 200 Kbp and must be correctly oriented towards each other. Subsequently, reads that could not be unambiguously aligned to a single locus (the identities of the two highest-scoring alignments were within 4% of each other) were discarded in order to avoid false positives such as from pseudogenes.
It is often hard to correctly align spliced reads, especially when they are spliced near the beginning or end of the read. Therefore, an iterative mapping approach was applied. First a set of preliminary introns was generated by using the spliced alignments found by GSNAP and by predicting introns ab initio with AUGUSTUS. Removing sequences of these introns produced partial spliced transcripts to which all reads were aligned a second time. We obtained an improved spliced alignment set with additional spliced alignments via a coordinate change induced by the coordinates of the preliminary introns (http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n = IncorporatingRNAseq.GSNAP). From the gaps in the read alignments hints on the location of introns were compiled, including the number of reads that support each intron. Further, from the RNA-Seq genome coverage hints on the location of (parts of) exons were generated.
In addition, evidence from 64,571 expressed sequence tags (ESTs), 19,284 proteins of invertebrates (from uniprot/swissprot database), repetitive regions in the genome detected by RepeatMasker (Smit, AFA, Hubley, R & Green, P. RepeatMasker Open–4.0.2013–2015, http://www.repeatmasker.org), 387 published coding genes from NCBI, 69 odorant binding Proteins [62] and 60 “gold standard” sequences, that were manually produced from cDNA by different groups within the Tribolium community. The RNA-Seq reads are available at public databases in the Bioproject PRJNA275195.
Integration of the previous gene set
By and large the AUGUSTUS gene set is more accurate. However, unclear loci remain, in which the true annotation is yet unknown. In order to introduce some stability in the gene set update we kept the old genes when in doubt whether a newly predicted gene with another structure is indeed a correction of the old gene structure. We address the problem of finding such gene structures by introducing the concept of specifically supported genes. Consider a gene gOGS2 from the previous gene set and a set of overlapping genes GAUG from the AUGUSTUS prediction. gOGS2 is said to be specifically supported, if it has at least one intron supported by RNA-Seq, that none of the genes in GAUG have. Additionally, every supported intron of genes in GAUG is also in gOGS2. In OGS3 we kept all specifically supported OGS2 genes and discarded all AUGUSTUS genes overlapping them.
The set of supported intron candidates was compiled from spliced RNA-Seq reads with a number of restrictions. Each intron candidate had to have a length between 32 and 350,000 bp, all splice sites had to be biologically feasible and the number of hints supporting a contradicting gene structure had to be at most 9 times higher than the number of hints supporting the intron candidate itself.
Additionally, we kept an OGS2 gene that did not overlap any AUGUSTUS gene, if it had homologs in Drosophila or other invertebrates or an annotated function (GO term listed in the Gene Onthology database [63]) or was covered by RNA-Seq reads with FPKM ≥ 0.01 (calculated with eXpress [64]). In total we kept 3,087 OGS2 genes and 13,413 AUGUSTUS genes.
Liftover from assembly 4.0 to assembly 5.2
After a Tribolium community call many genes were manually reviewed and edited based on an intermediate assembly 4.0. To preserve manually curated gene structures, we decided to transfer the new gene set to assembly 5.2. We created an assembly map that assigns each base of assembly 4.0 to a base in the new assembly 5.2, if possible. This map file was used to ‘lift’ above gene set to the updated assembly 5.2 using liftOver taken from the UCSC Genome Toolbox (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64.v287/). 337 genes could not be unambiguously and completely mapped. We applied our annotation pipeline to the new assembly and merged the result with the lifted gene set from the previous assembly. Consequently, we were able to identify gene structures for which the improved assembly allowed a better annotation. The new gene set was complemented by 469 gene structures, that could only be predicted based on the new assembly. Furthermore, we corrected 745 of the lifted gene structures according to the concept of specific supported genes as described above.
The standard Viterbi algorithm used in AUGUSTUS predicted 159 transcripts with an in-frame stop codon spliced by an intron. To replace them with alternative gene structures that do not contain in-frame stop codons we ran AUGUSTUS with the option –mea = 1 on the affected regions. MEA is an alternative algorithm that can prohibit spliced in-frame stop codons but needs more computational time.
During the GenBank submission process some gene models were revised and a small number of genes had to be manually edited or deleted.
Orthology assignment and proteome analyses
Orthologs and paralogs between T. castaneum and D. melanogaster were found using the OrthoDB database [65] and results were formatted accordingly using custom Perl scripts.
For the phylogenetic analysis, we compared T. castaneum (Insecta:Coleoptera) with three other invertebrates; Drosophila melanogaster (Insecta:Diptera), Caenorhabditis elegans (Nematoda) and Capitella teleta (Annelida). The mammalian Mus musculus was used as outgroup. More specifically, we used OrthoDB and obtained 1,263 single-copy orthologs, in order to perform a phylogenomics analysis with RAxML [66]. Briefly, a multiple sequence alignment was built for each orthologous group separately, using MUSCLE [67]. Then, the resulting alignments were trimmed using trimAl [68] with parameters “-w 3 -gt 0.95 -st 0.01” and concatenated using custom Perl scripts. The concatenated alignment was subsequently used to perform a phylogenomic analysis using RAxML 7.6.6 (PROTGAMMAJTT model of amino acid substitutions) with 100 bootstrap replicates. The final tree was edited with EvolView [69] and InkScape 0.91.
The same set of genes was analyzed separately in an alignment independent approach (see Supplementary file 2 for details). Two approaches were performed using six distance measures (d1, …, d6): In the first approach, we used ‘gdist’ to determine the pairwise distances between sequences inside the groups, then ‘phylip neighbor’ to compute corresponding phylogenetic trees, rooted by setting MMUSC as outgroup, and computing the consensus tree using ‘phylip consense’. In the second approach, we concatenated sequences in the groups in random order to form five artificial “whole proteom” sequences (one for each of the species), determined their pairwise distances and computed a phylogenetic tree using ‘phylip neighbor’, again setting the MMUSC sequence as outgroup. To check for robustness of the approach and also the influence of sequence lengths we performed these experiments with different subsets: (1) with all 1263 groups and (2) with a subset of the all groups. The subsets we considered were: (2a) groups with a certain minimum sequence length, (2b) only groups whose sequence lengths differed by at most a certain percentage, and (2c - only for experiment (B)) a random selection of groups (for instance, randomly select 80% of all groups for concatenation). Concatenation experiment (B) produced phylogenies that turned out to be almost immune against changes in order of concatenation and considerably robust against restricting consideration to all groups or subsets of groups concatenation. Best signals where obtained by distance d6, which resulted in the phylogeny displayed in Fig. 1B.