Length distribution of the FVE-novel scaffolds
We applied the FVE-novel pipeline to the ocean virome sample [31] (HOE Legacy II diel viral-size metagenome: Station 70). This sample contains 18,471,506 paired-end reads with read length 151 bps. Using the GOV database containing 24,411 contigs as reference and the reads from the station 70 as input, FVE-novel generated 268 scaffolds. Figure 2 shows the length distribution of the scaffolds, ranging from 2,026 bps to 193,112 bps, with a median length of 4,561 bps. There are 66 scaffolds longer than 10 kb.
Comparison between the FVE-novel scaffolds and their reference sequences
As the scaffolds are originally assembled from the reads that are mapped to a set of contigs or genomes, these contigs or genomes can be considered the “reference contigs/genomes” of the scaffolds. Among the 268 scaffolds, 59 are longer than their corresponding reference sequences. Figure 3 shows the lengths of the 59 scaffolds with respect to those of their reference genomes, indicating that FVE-novel can generate considerably longer scaffolds than the original references. Moreover, some scaffolds have high ANIs (e.g., >95%) to their references, suggesting that parts of these scaffolds are present in the reference database and FVE-novel successfully extended those partial references. Some scaffolds have low ANIs to their references, indicating that these scaffolds are potentially novel.
Comparison of the FVE-novel scaffolds against several databases
To examine the quality of the scaffolds generated by FVE-novel, we analyzed the longest ten scaffolds in detail. The ten scaffolds range from 72,532 bps to 193,112 bps, with average depths of coverage ranging from 48.5x to 338x.
As the scaffolds are assembled using the GOV database as the input reference database, it is expected that the scaffolds have sequence similarity to the GOV database sequences. Table 1 shows the BLAST results against the GOV sequences, with average ANIs (%) ranging from 89.21 to 97.78 and alignment percentage from 6.25-99.98. Figure 4, generated using Artemis [32], shows how the scaffolds align to their corresponding references in the GOV database and the sequence similarity, further supporting that FVE-novel successfully extended the reference genomes.
To see how the FVE-novel scaffolds compared to the 483 scaffolds reported previously in the original study [31] from where we collected our read sample, we used BLASTN [33] to compare FVE-novel generated scaffolds with the 483 assembled scaffolds. Table 1 shows the best hit result. Nine FVE-novel scaffolds have very high ANIs to their best hits (95.25 to 99.73%). This is not surprising as the same read sample was used to generate both the FVE-novel scaffolds and also some of the 483 scaffolds. Interestingly, scaffold S6 does not have any hits in the 483 scaffolds, suggesting that FVE-novel was able to reconstruct a viral scaffold that was not recovered by the original study [31]. Scaffold 6 has a best hit to the GOV sequence with 97.78% ANI and 93.09% alignment length. Notably, all nine scaffolds generated by FVE-novel are longer than the scaffolds from the original study.
We also used BLASTN to compare the ten scaffolds against the nr nucleotide database and found that four scaffolds (S0, S1, S2, and S6) have no significant hits and the other six scaffolds (S3, S4, S5, S7, S8, and S9) have similarity to Prochlorococcus phage P-SSM4 (ANIs ranging from 86.46 to 92.68%) and Prochlorococcus phage P-HM2 (ANI 84.89%). This result is consistent with the original study [31] where parts of the Prochlorococcus phage genomes were also recovered from the data.
Comparison within the FVE-novel scaffolds
The observation that some of the ten scaffolds have similar BLAST hits suggests that their sequences might be similar. Therefore we also compared the scaffolds against each other (Figure 5). Four groups of scaffolds are observed: group 1 containing S0, S1, and S6, group 2 containing only S2 with no significant similarity to any other nine scaffolds, group 3 containing S3, S4, S5, S7, and S9, and group 4 containing only S8. As FVE-novel removed all scaffolds except one with identity greater than 95%, these scaffolds have similarity less than 95% and could be different viral species or divergent strains of the same virus. Note that 95% identity has been commonly used in the literature as the cutoff score for clustering/binning viral contigs into viral operational taxonomic units (vOTUs) [9, 34, 35]. Group 3 scaffolds (S3, S4, S5, S7, and S9) all have similarity to Prochlorococcus phage P-SSM4 (Table 1), thus likely represent different strains of Prochlorococcus phage P-SSM4. In the following, we analyzed the four groups of scaffolds in detail.
Group 1 scaffolds (S0, S1, S6) S0 is the longest scaffold in group 1 and has a length 193,112 bps. It is generated from the station 70 sample, thus we checked if this scaffold is also present in the other 11 viral metagenomic samples. Figure 6(a) shows the depth of coverage of S0 in all 12 samples. The average depth of coverage of S0 in station 6, 14, 18, 22, 28, 32, 37, 52, 56, 61, 67, and 70 samples are 95.26, 28.39, 32.24, 94.78, 52.08, 73.75, 17.71, 20.18, 82.55, 148.61, 85.54, and 93.02 respectively, with station 37 having the lowest coverage (17.71) and station 61 (148.61) the highest coverage. The pairwise Pearson correlation coefficient of per base coverage of S0 between any two samples ranges from 0.77 to 0.94, and therefore even though the abundances of S0 differ among stations, the per base read coverages along the scaffold are highly correlated among samples.
Figure 6(a) also shows that coverage dropped greatly after 150 kb for all samples, which means fewer reads got mapped to this region. We further analyzed this region for sequence composition and did not find any homopolymer or short repeats, and thus the lower mapping rate is not caused by repeat sequences. Another possible cause of coverage drop is that this part of the scaffold was the misassembly from a less abundant strain of the same virus. Here, we explored this possibility by reassembling the scaffold. Specifically, we used the “Map to Reference” algorithm implemented in Geneious 11.0.4 [36] to examine whether there are multiple viral strains and if there are, whether a complete assembly of the dominant strain can be generated. All of the metagenome reads in the station 70 sample were aligned to scaffold S0 using the “Low sensitivity/Fastest” settings allowing for 10% mismatches. The alignment showed a large number of Single Nucleotide Polymorphisms (SNPs), revealing the existence of two or more strains for this phage as well as distinct regions of high and low coverages consistent with the suspected chimeric assembly of these strains. To recover the dominant strain, the consensus sequence from the alignment was segmented into contigs with high coverage > 40X. These contigs were binned into lists of contigs with similar coverage for further assembly. Next, contigs in each bin were iteratively extended using Geneious by mapping reads to the contig ends with high stringency. Specifically, all of the paired-end reads that were previously mapped to the contigs were aligned to these high coverage contigs using “Map to Reference” with stringent “Custom Sensitivity” settings allowing no more than 1% “Mismatches per Read” and 1% “Gaps per Read” and requiring that both of the paired-end reads map to the new consensus sequence. This process was iterated for each contig using Geneious’ “Fine Tuning” settings up to “100 times”. This process was continued until the extended contigs merged together, maintaining approximately uniform coverage, and could no longer be extended or closed into a circular genome sequence. Finally, we recovered a 153 kb scaffold from scaffold S0 with uniform coverage across all 12 samples (Figure 6(b)). Comparison of this 153 kb scaffold with S0 reveals that the middle 90 kb of S0 (starting at 60 kb and ending at 150 kb) are exactly the same as the 153 kb strain assembled by Geneious (Suppl. Figure 1[1] [2]). But the first 60 kb of S0 has around 80% similarity to the 153 kb strain, so this part of S0 could be from a different strain and the last 40 kb of S0 (staring at 150 kb and ending at 193 kb) is the result of assembly artifact (Suppl. Figure 1). Comparison of the 153 kb strain with S1 (155 kb) shows that FVE-novel recovered this dominant strain successfully in S1 (Suppl. Figure 2). As scaffold S1 is a correct representation of the dominant strain of this novel virus, from figure 7, we can see that S1 has a uniform coverage across all 12 samples. In scaffold S6, our tool captured the 80 kbp of the 153 kbp dominant strain (Suppl. Figure 3) and S6 also has a uniform depth of coverage across all 12 samples (Suppl. Figure 4). In order to find out if this virus has multiple strains present in station 70 sample, we sub-sampled the 153 kb strain into 10 pieces of 20 kb long and applied TenSQR [37], a viral quasispecies reconstruction tool, to each piece. For each piece, TenSQR reported two or three strains, consistent with our observation that multiple strains of this virus are present in the station 70 sample (Suppl. Table 2).
Group 2 scaffold (S2) Scaffold S2 is present in all samples with varying abundances but highly correlated depth of coverage among the samples (Figure 8(a)). With the same aforementioned procedure, we generated a 151 kb scaffold that most likely represents a complete novel virus genome recovered from S2. Figure 8(b) shows that this 151 kb scaffold has a uniform depth of coverage across all 12 samples. Comparison of this scaffold to S2 shows that our tool successfully recovered most of this virus except a 15 kb long portion of it (Suppl. Figure 5). We also checked if multiple strains of this virus are present in the station 70 sample using TenSQR. Results show that multiple strains of this virus are also present in this sample (Suppl. Table 3).
Group 3 scaffolds (S3, S4, S5, S7, S9) As scaffolds S3, S4, S5, S7, and S9 have similarity to Prochlorococcus phage P-SSM4, we chose to analyze in detail the longest scaffold S3 with length 132,604 bp. Figure 9(a) shows consistent depth of coverage of S3 across all 12 samples, but coverage dropped dramatically in all samples from 50 kb to 65 kb. We then aligned all the reads from station 70 sample to scaffold S3 using Geneious and observed the presence of multiple strains. Using the same procedure as above, we were able to recover a 177 kb scaffold representing the dominant strain of Prochlorococcus phage P-SSM4 present in station 70 sample, with a uniform coverage across all 12 samples (Figure 9(b)). Comparison of this 177 kb scaffold to S3 shows that S3 matches the dominant strain with about 100% similarity for all except the 15 kb segment (the 50-65 kb part in S3) where the similarity dropped to 80% (Suppl. Figure 6). This result implies that our tool successfully captured about 117 kb of the dominant strain except from a piece of length 15 kb where the assembly switched to a different strain with lower coverage. Consistently, analysis with TenSQR suggests the presence of three to seven strains in the sample (Suppl. Table 4). Interestingly, alignment of the 177 kp scaffold to Prochlorococcus phage P-SSM4 (length 178,249 bp) shows similar as well as more divergent regions (ANI 82.9%) (Suppl. Figure 7).
Group 4 scaffold (S8) The depth of coverage of the scaffold S8 is quite uniform across all 12 samples along the region except the beginning 6 kb and the last 10 kb where coverage is higher than the remaining region, which can be caused by a different strain (Figure 10(a)). According to the blast result, S8 has similarity to Prochlorococcus phage P-HM2. We then used Geneious to recover the dominant strain from S8 and obtained a 183 kb scaffold. Figure 10(b) shows that the 183 kb scaffold has uniform coverage across all 12 samples. Comparison to S8 shows that in S8 we recovered about 60 kb of this dominant strain (Suppl. Figure 8). Analysis with TenSQR suggests the presence of two or three strains in this sample. Alignment of the 183 kb scaffold to Prochlorococcus phage P-HM2 (length 183,806 bp) also shows many similar regions and some dissimilar regions (ANI 86.56%) (Suppl. Figure 9).
Function annotation of the four complete scaffolds
We annotated genes from the four completed scaffolds, (a) the 153 kb dominant strain of a novel virus recovered from S0, (b) the 151 kb dominant strain of a novel virus recovered from S2, (c) the 177 kb dominant strain of Prochlorococcus phage P-SSM4 recovered from S3, and (d) the 183 kb dominant strain of Prochlorococcus phage P-HM2 recovered from S8. The four scaffolds have 195, 160, 220, and 231 proteins respectively predicted by Prodigal with metagenomics mode (i.e., Prodigal option: -p meta) [38]. We then annotated those proteins using eggNOG-mapper with the virus database and HMMER option [39]. For all the four scaffolds, eggNOG-mapper annotated some viral structural proteins such as capsid, baseplate, virion, and so on, indicating that they are potential virus genomes. Figure 11 shows the protein annotation of the two novel genomes recovered from S0 and S2. Figure 12 shows the protein annotation of the two Prochlorococcus phage strains recovered from S3 and S8 respectively. As both phages are Prochlorococcus phage strains, as expected, the arrangement of viral structural proteins in these phage genomes are similar to that of Prochlorococcus.