Identifying the optical assemblers for vOTU detection using short-, long- and hybrid-sequencing data
To comprehensively evaluate the effect of different assembly and binning tools on viral genome discovery, we used 95 viral-like particle (VLP) enriched human fecal samples sequenced on both Illumina (next-generation sequencing, NGS, or short-reads) and PacBio (third-generation sequencing, TGS or long-reads) platforms from our previous study [10] (Methods).
Our evaluation was shown in Fig. 1. First, genome assembly. We selected a total of twelve state-of-the-art assemblers for (meta)-genome analysis, including three NGS, five TGS and four hybrid assemblers (Table 1). Briefly, for the NGS data, we used IDBA-UD (v1.1.4) [37], MEGAHIT (v1.2.9) [18], and metaSPAdes (v3.15.4) [17] for the genome assembly. For the TGS data, we selected Canu (v2.2) [15], FALCON (v1.8.1) [38], Hifiasm-meta (v0.3) [39], metaFlye (v2.9.1) [40], and wtdbg2 (v2.5) [16]. For hybrid assembly that combines the NGS and TGS data, we used IDBA-hyb (v1.1.3) (GitHub - loneknightpy/idba), hybridSPAdes (v3.15.4) [41], metaViralSPAdes (v3.15.4) [42] and OPERA-MS (v0.83) [43]. All assemblers were run on all samples with default parameters (Methods). Secondly, we performed an in-sample dereplication of the contigs assembled from all samples for each tool. Thirdly, we identified viral contigs using a customized bioinformatics pipeline adopted from the human Gut Virome Database (GVD) [9] with modifications, and clustered them into the non-redundant species-level viral contigs referred to as vOTUs (Methods). Fourthly, for the viral contigs generated by each assembler, we used CONCOCT (v1.1.0) [54], MetaBAT2 [55], AVAMB [56], and vRhyme [57] for multi-coverage binning (i.e., when clustering contigs of a sample into bins, the read coverage of these contigs across all samples were also considered) [53]. Then, we conducted a systematic evaluation of the tools at the assembly level and the binning level. The quality metrics of viral contigs and bins, taxonomy classification analysis and phylogenetic status were included in the process (Fig. 1).
After assembly and viral genome identification, we first compared the numbers of vOTUs obtained from the assemblers. We found that MEGAHIT, FALCON and IDBA-hyb generated the highest number of vOTUs among the NGS, TGS and hybrid assembler groups, respectively. However, when considering only the high-quality vOTUs (hq-vOTUs) with > 90% genome completeness and no “contig > 1.5x longer than expected genome length” and “high kmer _ freq may indicate large duplication” warning information according to CheckV [4] (Methods), MEGAHIT, metaFlye, and hybridSPAdes performed the best within their respective assembler categories (Fig. 2A). Notably, assemblers that were not optimized for metagenomic data, such as canu and wtdgb2 generated significantly less vOTUs (Fig. 2A).
We observed that > 97% of the vOTUs by all assemblers contained < 5% contaminations (Fig. 2B, S1). This is because CheckV only counted bacterial genes at the ends of the assembled contigs as contaminations [4]. We thus did not consider the contamination levels as a key measurement of the vOTUs.
Finally, we compared the assembly length metrics of the vOTUs, including the lengths of the longest contig, total contigs and N50. For the NGS assemblers, MEGAHIT generated contigs with the longest total length and the highest N50, while metaSPAdes achieved the longest contigs (Fig. 2C). Among the TGS assemblers, Hifiasm-meta had the largest total length and the largest contig length. However, it is noteworthy that metaFlye, despite having the highest N50, did not significantly lag behind Hifiasm-meta in terms of total length and the largest contig length (Fig. 2D). Among the hybrid assemblers, hybridSPAdes achieved the largest contig length and was comparable to IDBA-hyb and metaViralSPAdes in terms of total lengths and N50 values, with only marginal differences in these metrics (Fig. 2E).
Overall, our results suggest that MEGAHIT, metaFlye, and hybridSPAdes stand out as the best tools in the NGS, TGS and hybrid assembler categories, respectively, featuring the identification of more and longer vOTUs with higher quality (Fig. 2F-H).
Complementarity of different assemblers in recovering high quality viral genomes
We next examined the overlaps and differences in the detected vOTUs across assemblers. We focused on the hq-vOTUs with CheckV completeness > 90% and no “contig > 1.5x longer than expected genome length” and “high kmer _ freq may indicate large duplication” warning information to avoid mis-evaluation due to genome incompleteness. Combining all such vOTUs from all assemblers and dereplicated at a 95% threshold using cd-hit (Methods), we obtained a combined set of 17,931 non-redundant hq-vOTUs (Table S1). Surprisingly, we found that more than half (54.5%, 9771) of them were assembler-specific (Figure S2). We also examined the overlaps among the three assembler groups (NGS, TGS, HYB) and found that few hq-vOTUs were recovered by all three groups (n = 1478, 8.24% out of 17931) or by two groups (i.e., n = 442 between TGS and NGS, n = 843 between TGS and HYB). We did find a significant overlap between the NGS- and HYB-groups (4297), likely because the pre-assembly step of these hybrid assemblers using NGS reads during the assembly [41, 43]. Additionally, the TGS-group derived the highest number of unique hq-vOTUs (n = 4725, 26.4%), followed by the hybrid (n = 3191, 17.8%) and NGS (n = 2955, 16.5%) (Fig. 3A). These results suggest that in addition to the choice of tools, the type of sequencing data, i.e., short- vs long-reads, also significantly influences the vOTU identification results, highlighting the necessity of using both the long- and short- reads for a complete gut virome characterization.
When compared with individual assemblers, we found that the combined set significantly expanded the numbers of the hq-vOTUs compared with individual assemblers, from 4.83 fold increase for MEGAHIT to 21.7 fold increase for metaSPAdes (Fig. 3B); of note, we excluded canu and wtdgb2 from this and subsequent comparisons because they generated much less hq-vOTUs and were not optimized for metagenomic analysis [15, 16] (Table 1). These results indicate significant complementarity of different assemblers in recovering high quality viral genomes.
Assembler-specific metagenome-assembled genomes can be error-prone, we thus adopted a phylogenetic approach to further validate the quality of these hq-vOTUs from different assemblers. We annotated the large terminase genes in hq-vOTUs and used the protein sequences for phylogenetic analysis. The dsDNA virus terminal enzyme gene, often employed as a marker gene for phylogenetic analysis, encodes a crucial enzyme involved in DNA replication and repair processes [64]. About 16% of the hq-vOTUs encoded the large terminase (Figure S3). We built multiple sequence alignments using the large terminase proteins and constructed a maximum-likelihood tree (Methods). As shown in Fig. 3C, we observed a significant concordance between the tree clades and the phage families annotated by PhaGCN_newICTV ([58]; see also Methods). Specifically, genomes belonging to different phage families formed discrete clades on the phylogenetic tree, each with well-defined boundaries (Fig. 3C; outer ring). Notably, within each clade (family), we often found non-redundant vOTUs derived from multiple assemblers (Fig. 3C). For example, the Autographiviridae family contained four hq-vOTUs from the NGS assembler MEGAHIT (Fig. 3D), while other assemblers contributed 37 more hq-vOTUs to this family (Fig. 3E). More importantly, the terminase proteins from these genomes showed significant sequence divergence (Figure S4), which was also evident from the long branch lengths on the phylogenetic tree (Fig. 3E). These results together indicate that our multi-assembler approach could indeed expand the gut virome identification by contributing assembler-specific and high-quality viral genomes.
Biases of different assemblers in recovering vOTUs at higher taxonomic levels
Next, we examined the overlaps in the identified viral contigs at higher taxonomic levels among all the assemblers. We annotated the hq-vOTUs into known viral families using PhaGCN_newICTV [58], resulting in 8%~43% of annotation rates across the assemblers, with an average of ~ 16% (Figure S5). A total of 19 viral families were annotated. All NGS assemblers were able to detect members of all families. So were all the hybrid assemblers except the metaViralSPAdes, which did not detect any members of the Rountreeviridae family (Fig. 4). Conversely, we observed significant performance variations among the TGS assemblers. Specifically, metaFlye and Hifiasm-meta could recover all families except the Rountreeviridae, while falcon additionally did not recover the Zobellviridae. Furthermore, wtdbg2 and canu missed majority of the families and recovered fewer family members when they did. Interestingly, all the TGS assemblers did not recovery any members of the Rountreeviridae family; further study should be implemented to determine whether it is because of the fewer members presented in the human gut, or its unique sequence and/or abundance characteristics.
Within each assembler category, we observed little difference in the performance of the three NGS assemblers in recovering viral families (Fig. 4). With the TGS assemblers, Hifiasm-meta and metaFlye assembling a fuller range of viral families and increasing the N50 values of several families. HybridSPAdes enabled the assembly of all families as well as being the most numerous in terms of contigs within the hybrid assemblers.
Together, our results indicate biases of different assemblers in recovering viral contigs at higher taxonomic levels, especially those of the TGS assemblers.
Figure 4. Evaluation of Taxonomic annotation of vOTUs assembled by different assemblers. The performance of each assembler in assembling non-redundant contigs of each virus family, the size of the dots represents the N50 of non-redundant contigs of that family of viruses assembled by that type of tool, and the color of the dots represents the classification of the assembler, and a bar in the above representation represents the number of the contigs of each virus family assembled by that tool.
Different binners exhibit markedly distinct behaviors in the binning of vOTUs
We also evaluated the performance of four binning tools on vOTUs, namely CONCOCT [54], MetaBAT2 [55], AVAMB [56] and vRhyme [57]. AVAMB consistently produced a greatest number of bins on all assemblers (Fig. 5A). Consequently, we found that bins created by CONCOCT contained a significantly high number of contigs (median 154) than those by other binners (MetaBAT2: median 8, AVAMB: median 1, vRhyme: median 2; p < 0.0001, Wilcoxon Test; Fig. 5B and Figure S6).
Subsequently, we applied the CheckV tool to assess the completeness and quality of the bins derived from the 12 assemblers and the four binners (Methods). Of note, CONCOCT produced 95 oversized bins comprising thousands of contigs, which exceeded the capacity of CheckV for completeness evaluation. We thus excluded these oversized bins from further analysis.
Overall, we observed that all binning methods significantly improved the completeness of viral genomes when we compared the completeness of the bins to the member contigs with the highest completeness values (Fig. 5C; p < 0.01 in CONCOCT and < 0.0001 in others, Wilcoxon Rank Sum Test). AVAMB achieved the greatest improvement in completeness among all the binning tools (the average increased completeness per bin for AVAMB, CONCOCT, MetaBAT2 and vRhyme was 17.6, 3.04, 10.7 and 17.3 respectively, Table S2). We observed the same trends across almost all assemblers (Figure S7-S10). Additionally, AVAMB consistently generated a greater number of HQ bins (i.e., those having > 90% completeness and no “contig > 1.5x longer than expected genome length” and “high kmer _ freq may indicate large duplication” warning information) compared to other binners (Fig. 5D).
We proceeded to compare the consistency of taxonomy annotation results for contigs within bins. Strikingly, among the 2515 multi-contig bins generated by CONCOCT and were taxonomically annotated, more than half (52.7%, 1326) contained vOTUs that were annotated to different viral families (Fig. 5E). In contrast, 97.8% of the multi-vOTU bins by MetaBAT2 showed consistent annotations results within the same family (Fig. 5F). Notably, only 6.1% of AVAMB-generated bins contained more than one vOTUs. However, within these multi-bins, a high level of taxonomic annotation consistency was observed, with 94.0% (3025 multi-bins) displaying consistent taxonomic classifications (Fig. 5G). Conversely, all bins generated by vRhyme contained multiple vOTUs and exhibited high consistency (350 bins, 96.7%) in taxonomy annotations (Fig. 5H). These findings indicate that MetaBAT2, AVAMB and vRhyme exhibits superior taxonomy annotations consistency than CONCOCT, while the latter tended to be more inclusive and cluster vOTUs from varying taxonomic levels.