2.1 Viral contig summary
To demonstrate Virseqimprover’s utility in correcting and extending viral contigs, we took several metagenomic samples (under NCBI accession number SRX2912986 [21], SRX7079549), and ran assembly programs including FVE-novel [22], metaSPAdes [2], and MEGAHIT [5] to generate contigs. We then applied Virseqimprover to the seven longest contigs generated by the tools for contig extension and correction. Table 1 shows the length information of the seven contigs before and after applying Virseqimprover, indicating that most of the contigs got extended. The next several sections provide detailed comparison of the contigs generated by the popular assembly tools with the contigs refined by Virseqimprover.
2.2 Contigs S0, S1, S2
For contigs S0, S1, and S2, the longest contig was contig S0 with length 193,112 bp. Figure 1 shows that for S0, coverage from around 24,500 bp to 25,500 bp is lower than the average coverage and varies a lot after 150 kb. Virseqimprover checked for the uniformity of the coverage and extracted the longest region with uniform coverage, which was a region with length 127,423 bp from 25,567 bp to 152,989 bp. This longest non-suspicious region went through the iterative extension and error correction steps. When the iterative extension step was done, the output contig (length 163,662 bp) was checked for circularity. When the circular region was trimmed and Pilon was applied to the output contig, an improved contig (length 152,707 bp, denoted as S0’) was generated and became the final output of Virseqimprover as shown in Figure 1. Some missing part on the left was due to the trimming of the circular region. Besides, the drops in coverage at the ends of the sequences are due to the fact that the number of reads (especially paired-end reads) mapping at the edge regions is less than in the middle regions.
Similarly, after checking the depth of coverage of contig S1, we found that there was a non-uniform coverage region or suspicious region from 133,376 bp to 134,543 bp. Virseqimprover then extracted the longest non-suspicious region which was a region of length 133,375 bp from 1 bp to 133,375 bp and then applied the iterative extension and error correction steps to generate an extended contig. When all these steps were finished, the circularity of this contig was checked and Pilon was applied to the contig. The final output was a contig with length 152,362 bp. Figure 2 shows the depth of coverage of both original (S1) and final sequences (S1’).
For contig S2 with length 80,620 bp, the per depth of coverage was checked and according to Virseqimprover it had no non-uniform coverage region or suspicious region. Hence this contig directly went through the extension and error correction steps iteratively. Finally, after using Pilon to improve the assembly, a contig with length 151,828 bp was generated. Figure 3 shows that after applying Virseqimprover, we extended the original contig on both ends and nearly doubled the total length of the original contig to get a greatly extended contig (S2’) which has a uniform depth of coverage along the length.
The improved versions of contigs, S0’, S1’, and S2’, were compared with the 153 kb strain of a novel uncultured virus. After predicting the genes and visualizing the gene cluster comparison, as shown in Figure 4, we can see that S1’ and S2’ are very similar to the 153 kb reference strain, whereas S0’ is a bit different from all other contigs. This shows that Virseqimprover has correctly recovered the whole virus sequences. Additionally, since some regions in S0’ do not match with any of the regions of all the other contigs, it could be an indication that S0’ is a different strain of the same virus species.
2.3 Contig S3
In the same manner, the contig S3, with length 132,604 bp was checked for the uniformity of the coverage and found that it had a region with low coverage at around 49 kb to 66 kb as shown in Figure 5. BLAST search shows that this region aligns best with Cyanophage P-RSM3 and Prochlorococcus phage P-SSM4 whereas the other parts of the contig aligns best with Cyanophage P-RSM1 and Synechococcus phage metaG-MbCM1, indicating that this low coverage region is probably a misassembly. Virseqimprover flagged all the suspicious regions and extracted the longest region with uniform coverage from 66,390 bp to 132,603 bp. Then this 66 kb region was extended through iterative extension and error correction and a contig with length 160,821 bp was generated. After checking for the circularity of this contig and applying Pilon to this contig, an improved contig (denoted as S3’) with length 160,744 bp was produced which was the final output of Virseqimprover. Figure 5 shows that the non-uniform regions in the original contig are filtered out so that the final corrected and extended contig has a uniform depth of coverage along its length. Figure 6 shows the protein annotation of the final sequence. Using BLAST search, we identified four phages that show the highest sequence similarity to S3’. Based on the gene cluster comparisons, as shown in Figure 7, we can see that the improved contig S3’ does have some similar proteins with these phages. However, DNA sequence alignment of S3’ with these genomes also reveals some dissimilar regions, with great sequence identity variation along the entire sequence, ranging from 75.38% to 82.72%. Hence the improved S3’ might be from a novel phage species.
Figure 8 shows the similarity of S3’ (length 160,744 bp) to the viral sequence (177,631 bp) recovered by the semi-automated assembly process in Geneious. The protein identity threshold is 30%, which means that two proteins are considered to belong to the same group if their protein identity value is above 30%. Apart from the beginning part in the 177 kb strain that does not have many alignments in S3’, a small region in the middle of the sequence also shows difference between these two sequences (colored as the gray arrows). We thus further compared S3’ with the 177 kb strain to find out the difference in this specific region. Based on the pairwise DNA sequence alignment by EMBOSS Stretcher [23], the comparison of these two contigs reveals that in the 177 kb strain, a 1,282 bp region from 56,831 bp to 58,112 bp does not have many matches with S3’. In this part instead of this 1,282 bp region, S3’ contains a 1,342 bp region from 22,163 bp to 23,504 bp. Analysis of the depth of coverage of these two sequences in the area where they are different reveals that those areas have a relatively lower depth of coverage (about 150x) compared to the average depth of coverage (about 300x), as shown in Figures 9 and 10. Moreover, based on the BLASTP search, the specific protein sequence corresponding to this area in contig S3’ aligns best with Synechococcus phage metaG-MbCM1, whereas the 177 kb strain aligns better with Synechococcus phage S-SM2. The differences between S3’ and the 177 kb strain suggest that they may represent different strains of the same phage.
2.4 Contig S4
Contig S4 has 136,254 bps. After Virseqimprover’s contig extension, error correction, and circularity check, S4 was extended to 151,190 bps. Virseqimprover indicates the final extended contig S4’ is not circular. Figure 11 shows that the depth of coverage of the original S4 is rather uniform, and was extended for both sides of the sequence, with both left and right ends of S4’ showing higher coverage than nearby regions, which indicates the presence of repeats. Closer examination of the end sequences reveals that the regions indeed are repeats.
2.5 Contigs S5, S6
Contig S5 is generated by MEGAHIT and contig S6 is generated by metaSPAdes. These two contigs both have a 99% identity to the marine virus with ID AFVG 25M466, covering 12% and 40% of the viral genome, respectively. After applying Virseqimprover, S5 got extended to 14,374 bp, and S6 extended to 22,526 bp, as shown in Figure 12 and Figure 13. The extended S5’ covers 43% of the marine virus genome, and the extended S6’ covers 68%. After doing the pairwise alignment between the extended sequences and the marine virus genome, we find that the extended parts are identical to the marine virus genome, and therefore Virseqimprover can accurately extend the sequence in this sample.
2.6 Contig S7
Contig S7 is generated by metaSPAdes. It has a 99% identity to the marine virus with ID AFVG 25M409. After applying Virseqimprover, S7 was extended from 23,114 bp to 32,035 bp, which covers 98% of the AFVG 25M409 viral genome (32,812 bp). Figures 14 and 15 show that Virseqimprover successfully extends the original fragmented contig to nearly complete genome with high accuracy.
2.7 Evaluation of Virseqimprover
We used CheckV [25] as an independent source to further evaluate and completeness and quality of the contigs refined by Virseqimprover. CheckV determines the completeness and quality of assembled contigs by comparing them to a large database of complete virus genomes and has been used widely to evaluate the quality of assembly. Figure 16 shows the completeness of both the original and refined sequences. Based on CheckV, only one of the eight contigs generated by either FVE-novel, metaSPAdes or MEGAHIT is complete, in contrast, four contigs recovered or refined by Virseqimprover are complete, thus achieving an overall of 30% improvement over these assembly tools. Remarkably, contigs S2 improved from 48.19% to 100% in completeness, S4 from 60.72% to 100%, and S7 from 68.89% to 100%. Contig S5 also shows a significant improvement of completeness, from 13.18% to 44.92%. This shows that Virseqimprover is effective in extending contigs and improving the completeness of contigs generated by other assemblers. Table 2 shows the sequence quality assessed by CheckV. CheckV has four categories of quality evaluation, low, medium, high, and complete. Except contig S5, contigs refined/recovered by Virseqimprover all have improved quality over the original contigs generated by other assemblers. Only two of the original eight contigs fall into the low or medium quality category, whereas six of the eight contigs from Virseqimprover fall into either the high or complete category, suggesting that Virseqimprover is useful for improving the quality of contigs generated by other assemblers. Taking together, Virseqimprover significantly enhances the quality of the contigs, enabling the extension of fragmented sequences into more complete sequences or full genomes.