Overview of the TRITEX workflow
TRITEX is a computational pipeline for plant genome sequence assembly pipeline. It uses an input sequence assembly, Hi-C data, and a guide map to construct pseudomolecules, i.e. in silico representatives of entire chromosomes (Fig. 1). Contigs and Hi-C are well-known datatypes in contemporary sequence analysis. The concept of the guide map is more bespoke. It is a set of “markers” – sequence tags of variable length that are arranged in a linear order along the chromosomes. In contrast to other workflows such as 3D-DNA (25), TRITEX does not use Hi-C data to partition sequences into chromosomes. Instead, guide maps lift an existing solution to the problem of assigning sequences to chromosomes to a new assembly. In using guide maps, we borrow conceptually from reference-guided assembly methods such as RaGOO and RagTag (12, 13). TRITEX also uses the guide map to constrain the Hi-C map while allowing minor perturbations relative to the reference to accommodate structural variation. In the original TRITEX approach, the guide map was a dense genome-wide genetic map. We have since extended TRITEX to work with guide maps derived from a reference genome.
TRITEX operates on sequence assemblies, the constituent parts of which we refer to as “scaffolds”. The initial short-read TRITEX included a complex, multi-step toolchain to assemble primary sequence contigs from short reads and scaffold them with mate-pairs and linked reads. This procedure has since been rendered obsolete by the rise of accurate long-read assembly, which is easier, faster and more powerful than short-read assembly (2). The assembly of HiFi reads is a single-step process that can be accomplished by doing no more than running a single command. The resultant sequences are in most cases contigs, i.e. contiguous sequences without gaps, and often span tens of megabases. For these reasons, primary sequence assembly is not a focus of TRITEX anymore, which has in the era of long-read assemblies become a tool for chromosome-scale scaffolding and is agnostic about which programs are used to assemble contigs as long as the latter are contiguous and complete enough to scaffold them with Hi-C.
The TRITEX workflow can be broadly divided into two stages. In the first, the user runs shell scripts combining standard bioinformatics tools such as the read-mapper minimap2 (26) and the alignment record processors SAMtools (27) and BEDTools (28) into a pipeline for processing Hi-C reads and aligning guide map markers. The second phase is done interactively at the prompt of the R statistical environment (29). The outputs of phase 1 are read into main memory and a TRITEX assembly object with tables listing Hi-C links and guide map alignment records is initiated. The core algorithm for Hi-C map construction searches for a minimum spanning tree in the graph induced by Hi-C contact matrix and further refines it to include as many scaffolds as possible and to orient them relative to the chromosomal orientation of the guide map. The algorithm has been described in detail by Beier et al. (30) and has been unaffected by the changes brought about by much improved input assemblies and denser guide maps.
Once the assembly R object has been set up, Hi-C-based pseudomolecule construction can be proceeded with immediately. However, it is advisable before doing so to scrutinize the assembly for any obviously chimeric scaffolds, which join together sequences that are far apart in the actual genome. TRITEX provides static and interactive visualizations to help users display chimeric scaffolds produced in the input long-read assembly (with HiFi reads) and spot misplaced inverted scaffolds in the contact matrices. Static “diagnostic plots” show Hi-C coverage along the lengths of scaffolds as well as the collinearity of scaffolds to the guide map. Breakpoints in chimeras, i.e. the boundaries between falsely joined sequences, often co-colocalize with drops in physical Hi-C coverage (the number of Hi-C links spanning a genomic window). Diagnostic plots also show guide map alignments. If sequences from different chromosomes are joined, the chimeric scaffolds bear guide map markers from more than one chromosome (Fig. S1, Additional File 1). Intra-chromosomal chimeras also have markers from spatially separated regions, but this pattern is shared with true structural variations.
To come up with a list of putatively chimeric scaffolds, we use a simple heuristic: we look for scaffolds where Hi-C coverage falls by at least eight-fold below the scaffold-wide average in internal regions (less than 100 kb away from the scaffold ends). Troughs in Hi-C coverage tend to be shallower the closer two mis-joined sequences are to each other. Wrongly adjoining sequences that are not that far apart in reality (separated by less than 2% of the length of the chromosome) may result in only small disturbances of Hi-C coverage that may be spotted only by comparison to a high-density guide map or as off-diagonal signals in the contact matrix (see below).
To break chimeric scaffolds, users have to manually specify the coordinates of breakpoints to the TRITEX functions that update the assembly object so that Hi-C links and guide map alignment refer to corrected coordinates in the newly broken scaffolds. Often not all chimeras can be spotted from the get-go in diagnostic plots. Rather, cycles for chimera breaking, Hi-C map construction, and visual inspection of contact matrices may need to be repeated several times. Once a Hi-C map has been completed, the contact matrix, i.e. a heatmap showing the number of Hi-C links between genomic windows of fixed size (1 Mb by default), misplaced or misoriented scaffolds become manifest as off-diagonal signals (Fig. S2, Additional File 1). To help the inspection of contact matrices, we developed the Hi-C map inspector, an R Shiny App that is accessible through a web browser (Fig. S2, Additional File 1) after deployment on an R Shiny server. Genomic regions containing putatively chimeric scaffolds can be clicked on to get their names, create diagnostic plots for them and pinpoint the sites of culpable misjoins.
Even if errors are absent from scaffolds, the Hi-C map constructed from them may have some. Oft-seen mistakes are wrongly oriented (“flipped”) scaffolds or groups of scaffolds that were flipped (Fig. 2A) or inserted in the wrong places (Fig. 2B). To correct this, we devised a method to manually edit Hi-C maps as ordered lists of contigs with their orientations in a spreadsheet application. One such application is Microsoft Excel, whose overzealous autocorrection unadapted to genomic sequence identifiers poses certain risks (31), which we believe in the present case to be offset by the ease of editing in a graphical user interface most people are accustomed to. One measure of risk mitigation is that after editing, tables are read into R, and even inconspicuously misformatted rows will throw errors. Formally valid, but biologically wrong edits will become evident in the comparison of contact matrices before and after editing. Users will know that their edits must have begot the oddities and can repeat the step.
After one or more correct-map-inspect cycles to set aright chimeras or Hi-C map glitches, the map is written out and “compiled”, meaning an AGP (“a golden path”) tabular file recording contig positions is written to disk and used together with the sequence file to piece together a FASTA file of the pseudomolecules. The scripts doing this are no different in principle from those in short-read TRITEX but are now much simpler, as the complex, multi-step scaffolding was retired. One noteworthy change is that sequences of unplaced scaffolds are now written to a multi-FASTA file with one sequence record for each of them instead of a single concatenated sequence (“chrUn”) because the latter format is refused by the archive of the International Nucleotide Sequence Database Collaboration.
After this walkthrough, we illustrate some aspects of our pipeline with an example dataset for the maize inbred lines B73, which we also use to assess the impact of less dense linkage information on Hi-C map construction.
TRITEX as run on a maize dataset
We downloaded HiFi and Hi-C reads of maize B73 from public archives. Assembly of the Hi-C reads with hifiasm yielded 928 contigs with an N50 of 37.8 Mb (Table 1) and a total length of 2.17 Gb, reasonably close to the flow-cytometric genome size estimate reported by Arumuganathan & Earle (32). Out of a total of 859 million Hi-C read pairs, 89 million mapped with some degree of uniqueness (Q10 or better) to the assembly. The hifiasm assembly had three easily found chimeras (Fig. S1, Additional File 1) and no other surfaced in later steps.
Table 1
HiFi assembly statistics for maize B73.
Number of contigs | 928 |
Total length | 2,176,313,099 bp |
N50 | 37,817,658 bp |
N90 | 7,282,328 bp |
Mean contig length | 2,345,164 bp |
Maximum contig size | 153,870,576 bp |
Minimum contig size | 14,160 bp |
We constructed two guide maps. The first, which we refer to as the “reference” map, was built from the maize RefGen_v5 reference genome sequence assembly (33) and mimics the density and resolution afforded to pan-genome projects that add more sequence assemblies of diverse germplasm to an existing reference genome. We used the “mask_assembly.zsh” script included in TRITEX to extract from the reference pseudomolecules 538,812 single-copy sequences 100 bp or longer as described by Jayakodi et al. (14). The second guide map, referred to as “the genetic map one”, was a linkage map (34) of the Intermated B73 x Mo17 (IBM) population with 3,686 SNP markers, which are defined as 100-bp tags positioned on the maize AGPv1 (35). If a species does not have a genome assembly yet, genetic maps are often the best genomic coordinate system to position sequences in a genomic infrastructure under construction and the use of the IBM map to re-assemble the maize genome illustrates the performance of TRITEX for de novo genome assembly.
The reference and genetic guide maps yielded nearly the same Hi-C maps and assigned the vast majority of the assembled sequence to chromosomal locations. Hi-C contact matrices and alignment to the B73 RefGen_v5 reference (Figs. 3 and 4) confirmed the integrity of the pseudomolecules. Manual curation of the reference-guided assembly resolved a few inversions at chromosome termini. Uncommon signals decorated the interval 10–40 Mb on chromosome 6, which was composed of many small contigs. That region contains a highly repetitive ribosomal DNA locus (Fig. S3, Additional File 1), explaining the shortcomings of the contig assembly and the oddities in the Hi-C matrix. A region on chromosome 9 had very few Hi-C links, presumably because of its high repeat content and the attendant difficulties in mapping short Hi-C reads (Fig. S4, Additional File 1). As the region sat in the middle of a larger contig and Hi-C signals in its flanking regions are not out of the ordinary, a misassembly can be ruled out.
With the initial Hi-C matrices that were guided by a genetic map, manual curation was more involved and required more correct-map-inspect cycles because of the relative paucity of markers (Fig. 4C, chromosome 4 in Fig. S4, Additional File 1). Some smaller contigs (< 5 Mb) were assigned to chromosomes 3 and 4 by genetic markers whereas the reference-based correctly placed them on chromosome 6. We moved them to the unplaced scaffolds as without knowledge gleaned from the reference genome, we could not have placed them correctly. Even so, the final product was about as good as the reference-based Hi-C map (Table 2).
Table 2
Pseudomolecule statistics.
| Reference | Genetic map |
No. of contigs (chromosomes + unanchored) | 10 + 808 | 10 + 810 |
Length pseudomolecules (bp) | 2,121,154,572 | 2,114,179,800 |
Length of unanchored contigs (bp) | 55,170,127 | 62,144,699 |
N50 (Mb) | 222 | 222 |
Low-density guide maps are good enough
We asked ourselves how dense a linkage map has to be to serve as a guide map for TRITEX. This is a pertinent question in species where no reference genome, and possibly few other genomic resources, are available and TRITEX is used to construct the very first genome sequence of a species. To assess how the density of the guide map affects the outcome of Hi-C scaffolding, we constructed uncurated Hi-C maps from subsets of different sizes of markers in the IBM guide map. We randomly selected 50%, 25%, and 12.5% markers from the IBM universe. On one hand, some chromosomes had correct Hi-C maps even if only 35 guide markers were placed on them (Fig. 6). On the other hand, some Hi-C maps left out or put contigs in the wrong places at higher downsampling levels (Fig. 6H and 6I; Fig. S6, Additional File 1). As a non-visual measure of Hi-C map quality, we computed the correlation coefficients between positions of contigs in the reference genome and downsampled Hi-C maps. Even when retaining only 1/8 of markers (a total of 460 and as little as 24 per chromosome), the correlation is still high (mean r2 = 0.958, minimum across 30 replications: 0.83). This level of discordance between an uncurated Hi-C map and the ground truth will require more manual effort on part of the user, but it will most likely not disrupt their ability to piece together accurate pseudomolecules. Importantly, the recommendations derived from downsampling Hi-C markers and guide-map markers are predicated on highly contiguous and mostly accurate primary assemblies as can be generated from contemporary long-reads platform. The picture may look different in fragmented short-read assemblies with more misassemblies.