Unassembled read statistics
We sequenced DNA extracted from bacterioplankton collected in the 0.2-3 µm size fraction at four time points during the spring of 2020 in the North Sea. The sampling points comprised one pre-bloom and three bloom events, part of a time-series sampling campaign during 2020[14]. The Illumina SR approach recovered an average of 36 Gbp per sample (~240 bp paired-end reads), while the PacBio CCS LR approach recovered an average of 12 Gbp per sample (~6 kbp CCS reads) (Table S1). SR metagenomic samples therefore had, on average, ~3 times more sequenced base pairs than LR samples, which is within the expected output for single sequenced samples[8]. Additionally, LR metagenomic samples had a slightly higher GC content (0.43) than their SR counterparts (0.38) when all reads were considered (Table S1). The average coverage of the microbial community, as determined with Nonpareil[16], was overall similar between SR and LR metagenomes (not significantly different; Fig. S1). The main difference was that the LR sample from 2020-04-30 had a comparatively lower sequencing depth than the other LR samples. Similarly, when sequence diversity was determined using a combined measure of richness and evenness (total diversity), LR metagenomes had a higher average sequence diversity, although not significantly different from that of SR metagenomes (Fig. S1). We mapped SR onto the LR of the corresponding sample to assess the degree of sequence overlap between metagenomic samples (Fig. S2). Of the total number of SR per sample, an average of 80% had a match in the corresponding LR time point. Similarly, LR with an SR match represented an average of 93% across the four time points. Thus, both sequencing technologies recovered equivalent microbial community fractions and sequence diversity, although LR-based metagenomics captured a sequence space with higher GC content.
Comparison of assemblies from SR and LR metagenomic samples
The total number of assembled base pairs from SR was, on average, more than four times that from LR (1.7 vs. 0.42 Gbp; see Table S2). However, contigs from LR metagenomes had a much higher N50 (86 kbp) than SR-derived contigs (1.2 kbp) (Fig. S3a and Table S2). About 84% of SR and 66% of LR mapped back to their respective contigs (Table S2). As expected for SR metagenomes, the sequencing depth of contigs reached higher values than for LR on all dates (avg. ~2,400x vs. ~500x). However, average values were higher for LR than for SR (~10x vs. 4.4x; Fig. S3b), indicating a more even distribution of sequencing depth values for the LR technology. Thus, assembled LR provide longer contigs than those from SR metagenomes.
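For readers less familiar with the N50 statistic used to compare the assemblies, the following is a minimal sketch of how it can be computed from a list of contig lengths. The function name `n50` and the toy input are illustrative assumptions and are not taken from the pipeline used in this study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly, not data from this study.
toy_contigs = [100, 200, 300, 400]
toy_n50 = n50(toy_contigs)
```

In the toy example the total is 1,000 bp, so the two longest contigs (400 + 300 bp) are enough to pass the 500 bp midpoint and the N50 is 300 bp. A few long contigs can dominate this statistic, which is why it is reported here alongside the total assembled length.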
Gene predictions in unassembled and assembled long-read metagenomes
Among the most notable advantages of LR-based metagenomics is the preservation of long genomic regions on a single read. Nonetheless, indels can cause frameshifts in protein sequences predicted from LR[40], especially when using unassembled sequences. Thus, we compared the lengths of predicted protein sequences from unassembled LR metagenomes to those of reference proteins in a comprehensive database. Commonly used tools, such as Prodigal[37], yielded a higher fraction of shorter predicted protein sequences (length of predicted protein / length of reference protein; median = 0.93, IQR = 0.525, Fig. S4a). Gene prediction tools that can better handle sequencing errors, such as FragGeneScan[36], yielded a tighter distribution of predicted protein lengths (median = 1, IQR = 0.164, Fig. S4a). Thus, error-aware gene prediction tools are advantageous for handling indels in unassembled CCS reads. Prodigal was preferred when working with assembled reads (Fig. S4b,c). The distribution of predicted protein lengths relative to the reference lengths was more spread in SR (median = 0.983, IQR = 0.45) than in LR metagenomic samples (median = 1, IQR = 0.052).
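The length-ratio summary reported above (median and IQR of predicted length / reference length) can be sketched as follows. This is only an illustration: the function name `length_ratio_summary` and the toy lengths are assumptions, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) may differ from the one used to produce the published values.

```python
import statistics


def length_ratio_summary(predicted_lengths, reference_lengths):
    """Summarise predicted/reference protein length ratios.

    Returns (median, IQR). A ratio near 1 means the predicted protein
    is about as long as its reference; ratios below 1 suggest
    truncated predictions, for example after a frameshift.
    """
    if len(predicted_lengths) != len(reference_lengths):
        raise ValueError("both lists must have the same length")
    ratios = [p / r for p, r in zip(predicted_lengths, reference_lengths)]
    # Quartile conventions vary between tools; "inclusive" is an assumption.
    q1, _, q3 = statistics.quantiles(ratios, n=4, method="inclusive")
    return statistics.median(ratios), q3 - q1


# Hypothetical protein lengths (amino acids), not data from this study.
toy_predicted = [50, 100, 150, 200, 250]
toy_reference = [100, 100, 100, 100, 100]
toy_median, toy_iqr = length_ratio_summary(toy_predicted, toy_reference)
```

A narrow IQR around a median of 1, as reported for FragGeneScan on unassembled reads, indicates that most predicted proteins span the full length of their reference.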
Comparing SR and LR technologies for recovering MAGs from the same DNA sample
The main goal of this study was to compare the capabilities of SR and LR sequencing technologies for the recovery of bacterial and archaeal genomes (i.e., MAGs). First, we summarize the parallel comparisons of MAGs recovered from the same DNA extractions sequenced using SR and LR technologies. A total of 341 and 254 MAGs were reconstructed from the four SR and LR metagenomic samples, respectively (Fig. 1a). Remarkably, 88% of the MAGs recovered from LR metagenomic samples carried at least one copy of the 16S rRNA gene, in contrast with only 23% of the SR MAGs (Fig. 1b). Nonetheless, more 16S rRNA genes could be expected if a higher 2.5 kbp contig cutoff is used for the binning of contigs, although at the expense of higher MAG fragmentation. No significant differences in completion were observed between the recovered LR and SR MAGs. However, LR MAGs were less contaminated than SR MAGs (Table S3; p < 0.05). Consistently, 35% (89/254) of the LR-derived MAGs met the MIMAGs[41] high-quality criterion, whereas only 3.5% (12/341) of the SR MAGs fell under the same quality thresholds. The genome-based taxonomic composition was similar for both sequencing technologies (Fig. 1a). For instance, Bacteroidia (~47%) and Gammaproteobacteria (~25%) were the dominant groups in MAGs from both sequencing approaches, in agreement with the taxonomic profile of blooms from previous years[13].
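As an illustration of why the presence of 16S rRNA genes matters for the quality tiers mentioned above, the sketch below encodes the MIMAG draft-quality categories as commonly stated: high quality requires > 90% completeness, < 5% contamination, the 5S, 16S, and 23S rRNA genes, and at least 18 tRNAs; medium quality requires ≥ 50% completeness and < 10% contamination. The class `MagQuality`, the function `mimag_tier`, and the label `below-standard` are hypothetical names, and whether the rRNA and tRNA requirements were applied in exactly this way in the present study is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MagQuality:
    completeness: float   # percent, 0-100
    contamination: float  # percent, 0-100
    has_5s: bool = False
    has_16s: bool = False
    has_23s: bool = False
    trna_count: int = 0


def mimag_tier(mag):
    """Assign a MIMAG-style quality tier to a MAG (illustrative sketch)."""
    has_rrnas = mag.has_5s and mag.has_16s and mag.has_23s
    if (mag.completeness > 90 and mag.contamination < 5
            and has_rrnas and mag.trna_count >= 18):
        return "high"
    if mag.completeness >= 50 and mag.contamination < 10:
        return "medium"
    if mag.contamination < 10:
        return "low"
    # Hypothetical label for MAGs with >= 10% contamination.
    return "below-standard"


# Hypothetical MAGs, not taken from Table S3.
complete_with_rrnas = MagQuality(95.0, 1.0, True, True, True, 20)
complete_without_16s = MagQuality(95.0, 1.0, True, False, True, 20)
```

Under these rules a near-complete, low-contamination MAG that lacks a 16S rRNA gene is capped at medium quality, which is one plausible reason why fragmented SR MAGs rarely reach the high-quality tier.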
To compare genomic statistics of MAGs representing the same populations recovered from the same sample using both sequencing platforms, we generated pairs of genomes based on ANI ≥ 99% (Fig. 2). Phylogenetic reconstructions derived from SR and LR MAGs were congruent and showed a similar topology (Fig. 2a). MAGs sharing ANI ≥ 99% were, in almost all cases, placed in similar positions in the tree. LR MAGs were composed of ~10 times fewer contigs than their SR counterparts at ANI ≥ 99% (median = 8 vs. 83, p < 0.05; Fig. 2b). N50 values were also more than 11 times higher for LR MAGs (median ~350 kbp vs. ~30 kbp, p < 0.05). Interestingly, MAGs recovered from LR metagenomes were slightly longer than those from SR (median = 1.98 Mbp vs. 1.85 Mbp, p < 0.05). LR-derived MAGs also carried more predicted genes than their paired SR MAGs (median = 1,872 vs. 1,704, p < 0.05; Fig. S5a). However, the differences in the number of predicted genes were not associated with particular taxonomic groups. This difference is likely a consequence of the longer LR contigs being built from reads that span complex regions, which are harder to reconstruct using SR approaches and are therefore not represented in SR MAGs. Nonetheless, other statistics such as GC content, completion, and contamination were similar between the paired MAGs.
Differences in abundance and diversity in SR- and LR-derived MAGs
Overall, the relative abundances determined for the paired MAGs were correlated between sequencing technologies (Fig. S6). Nonetheless, divergences from the expected abundances for the paired MAGs were evident when inspected at each time point, especially for the three samples obtained during the bloom (2020-04-14, 2020-04-30, and 2020-05-06). Unlike for the March sample, the linear regression for the samples in April and May was less predictive, and a higher abundance for LR MAGs was evident (Fig. S6). These differences are primarily due to the increased representation (i.e., sequencing depth) in LR metagenomes of populations with higher GC content, such as Gammaproteobacteria, compared to Bacteroidia (Fig. 3a and Fig. S5b). However, a similar relative abundance for the main taxonomic groups was observed between SR and LR MAGs, both when all MAGs and when only those detected in both technologies were considered (Fig. S5c,d). The exception was the higher abundance of Bacteroidia MAGs in SR metagenomic samples at the last time point, in agreement with the observations at the MAG level. Thus, genomes with high GC content, such as those of Gammaproteobacteria and Acidimicrobiia, have increased sequencing depth in LR metagenomes. In contrast, Bacteroidia MAGs (with lower GC content) have a comparatively higher sequencing depth in SR metagenomic samples. Discrepancies in relative abundance between technologies due to GC content cannot be ruled out for other groups that were likely not well captured in this analysis.
A critical aspect of ecological inferences based on recovered MAGs is determining whether both technologies can retrieve comparable units of diversity or species. To reduce the redundancy of MAGs belonging to the same populations captured at each time point, we de-replicated and selected representative MAGs at 99% ANI. Although 99% ANI is a high cutoff[42], de-replication resulted in 16% more MAGs from SR metagenomic samples (216 SR and 187 LR MAGs). This higher diversity is due to the higher sequencing effort (i.e., the number of base pairs) obtained from SR compared to LR sequencing runs. When comparing the taxonomic affiliation of MAGs, 28 and 11 species were detected only in SR and LR MAGs, respectively. Interestingly, among these uniquely recovered species, LR MAGs had a higher average GC content than SR MAGs (0.51 vs. 0.38; Fig. 3b). These differences likely reflect the inherent genomic characteristics of the groups enriched within each sequencing technology[43]. Most SR-only species belonged to Bacteroidia, Alphaproteobacteria, and Gammaproteobacteria, whereas for LR-only species, Acidimicrobiia and Verrucomicrobiae were the two major classes. At the genus level, ~79% (22/28) and ~64% (7/11) of the unique MAGs belonged to genera only recovered in SR and LR metagenomes, respectively. At the phylum level, the exception was a Gemmatimonadota MAG with 63% GC content only recovered in LR metagenomes (PACB-20200310-m34, Fig. 3b). This Gemmatimonadota MAG had an abundance of 0.04% of the total community (truncated average sequencing depth, TAD80 = 3.77x, breadth of coverage = 88.34% at 1x TAD80) in the corresponding SR metagenome (2020-03-10). These results suggest that SR approaches failed to recover this MAG due to the low sequencing depth and high GC content of the target genome.
To test the effect of low sequencing depth and breadth of coverage on the recovery of unique species, we performed a cross-platform mapping of SR and LR to the MAGs representing unique species recovered with the opposite technology (Fig. S7). For the most part, the cross-mapping of SR and LR on unique MAG species resulted in low sequencing depth (median = 6.4 vs. 2.8) and breadth of coverage (median = 96% vs. 88.3%) for SR and LR, respectively. Thus, uniquely detected species in each dataset are likely due to a combined effect of differences in GC content[43] and sequencing depth between technologies.
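The two mapping statistics used in this section, TAD80 and breadth of coverage, can be sketched from a vector of per-position sequencing depths as shown below. This is a minimal illustration under stated assumptions: TAD80 is taken as the mean depth after discarding the 10% lowest and 10% highest positions, and breadth as the fraction of positions covered at or above a minimum depth. The function names and toy depths are hypothetical and do not reproduce the software used in this study.

```python
def truncated_average_depth(depths, keep=0.80):
    """Mean depth of the central `keep` fraction of positions (TAD80 by default)."""
    if not 0 < keep <= 1:
        raise ValueError("keep must be in (0, 1]")
    ordered = sorted(depths)
    n = len(ordered)
    # Number of positions dropped from each tail; round() avoids
    # floating-point truncation such as 0.9999999 becoming 0.
    trim = round(n * (1.0 - keep) / 2.0)
    core = ordered[trim:n - trim]
    if not core:
        raise ValueError("not enough positions for the requested truncation")
    return sum(core) / len(core)


def breadth_of_coverage(depths, min_depth=1):
    """Fraction of positions with depth >= min_depth."""
    if not depths:
        raise ValueError("at least one position is required")
    return sum(1 for d in depths if d >= min_depth) / len(depths)


# Hypothetical per-position depths for a short region, not real data.
toy_depths = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]
toy_tad80 = truncated_average_depth(toy_depths)
toy_breadth = breadth_of_coverage(toy_depths)
```

In the toy example the uncovered position and the 100x spike are both discarded, so the truncated average (4.5x) is far less affected by the outlier than the plain mean (13.6x), while the breadth is 0.9. This robustness to highly conserved or repeated regions is the usual motivation for reporting a truncated rather than a plain average depth.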
Other considerations when choosing LR technologies
Currently, PacBio LR shotgun metagenomics costs more per Gbp than SR (~2.4 times more for our project, see Table S1). The cost per Gbp of Nanopore is currently between that of Illumina and PacBio. Nanopore technologies offer affordability and the benefit of recovering longer reads, as well as the possibility of including short-read technologies for the better recovery of high-quality MAGs[4, 10]. Nonetheless, sequencing error and read lengths should be considered when selecting between LR technologies[4]. Despite the cost differences, the results presented here can guide researchers in deciding whether LR metagenomics would be beneficial over SR approaches.
The current set of algorithms and approaches for LR metagenomics is still limited compared to the large toolbox available for SR technologies. While the methodology used here reflects the most appropriate tools and algorithms available at the time, we recommend that future studies critically assess newer approaches[12] when using LR techniques. The dataset presented here can also serve as a reference for testing and comparing algorithms and approaches for shotgun LR metagenomic sequencing.