Virus classification for viral genomic fragments using PhaGCN2

doi:10.21203/rs.3.rs-1658089/v1

Background: Viruses are the most ubiquitous and diverse entities in the biome. Due to the rapid growth of newly identified viruses, there is an urgent need for accurate and comprehensive virus classification, particularly for novel viruses.

Results: Here, we present PhaGCN2, which can rapidly classify the taxonomy of viral sequences at family level and supports the visualization of the associations of all families. We evaluate the performance of PhaGCN2 and compare it with the state-of-the-art virus classification tools, such as vConTACT2, CAT, and VPF-Class, using the widely accepted metrics. The results show that PhaGCN2 largely improves the precision and recall of virus classification, increases the number of classifiable virus sequences in the Global Ocean Virome dataset (v2.0) by 4 times, and classifies more than 90% of the Gut Phage Database.

Conclusions: PhaGCN2 rapidly classifies viral sequences at the family level and enables high-throughput, automatic expansion of the database of the International Committee on Taxonomy of Viruses.

PhaGCN2

Virus classification

machine learning

bioinformatics

genomics

taxonomy

novel virus

ICTV

As the most abundant biological entities on Earth, viruses can hijack organisms from every branch of the tree of life. They play critical roles in host mortality, metabolism, physiology, and evolution, impacting marine biogeochemical cycling and shaping the Earth’s microbiomes (1–5). Culture-independent next-generation sequencing technologies have recently been used to explore the tremendous diversity of the virosphere from multiple samples (6–8). With the rapid expansion of viral genome databases, these advances have led the International Committee on Taxonomy of Viruses (ICTV) to present a consensus statement suggesting a shift from the “traditional” classification criteria (for example, virion morphology and single- or multiple-gene phylogenies) toward a genome-centered, and perhaps one day largely automated, viral taxonomy(9). At present, virus classification relies mainly on manual classification and definition by virologists, which is too slow to keep pace with millions of viral genome sequences. For example, despite the millions of virus sequences in IMG/VR, the ICTV 2021 report (hereafter ICTV2021) contains only about 10,550 classified viruses. Therefore, there is an urgent need for a virus classification method that can rapidly and accurately classify these new viral genome sequences and align computational classifications with ICTV-ratified taxa(10).

In our previous work, we presented a semi-supervised machine learning model named PhaGCN(11), which is based on a graph convolutional network (GCN). PhaGCN has two main components: a convolutional neural network (CNN) encoder and a GCN classifier. First, the CNN encoder encodes contigs of different lengths into 256-dimensional embedding vectors. Each vector represents the motif-related patterns captured from the DNA sequences. Second, a knowledge graph is built to connect known phages in the RefSeq database with the test phages. Each node in the graph represents a phage, and the edges between phages represent similarity based on sequence and protein composition. We use the embedding vectors output by the CNN encoder as the node features and use protein organization and protein similarity to define the edges. Finally, the semi-supervised GCN is applied to the knowledge graph so that both known phages and test phages contribute to training. However, PhaGCN can only classify viruses under Caudovirales(11). More importantly, ICTV frequently adjusts its taxonomy criteria as research progresses, for example by deleting old families, adding new families, and moving members from one family to another. The continuous change of the reference and the emergence of novel viruses impede the accuracy and sensitivity of automatic prediction. In particular, most learning-based models must specify the label set (e.g. family labels) in advance and therefore cannot accommodate viruses from new families. Thus, a method that can recognize possible new families is needed to support automatic virus classification.
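To make the graph-convolution step more concrete, the following minimal sketch applies one propagation layer in the widely used form ReLU(D^-1/2 (A + I) D^-1/2 X W), where A is the adjacency matrix of the knowledge graph, X holds the node features, and W is a weight matrix. The two-node toy graph, the one-dimensional features, and the weights are assumptions for illustration only; the actual PhaGCN architecture and hyperparameters are described in (11).

```python
import math


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right)))
         for j in range(len(right[0]))]
        for i in range(len(left))
    ]


def gcn_layer(adj, features, weights):
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so that each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalization by the square roots of the node degrees.
    a_norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j])
               for j in range(n)] for i in range(n)]
    projected = matmul(matmul(a_norm, features), weights)
    return [[max(value, 0.0) for value in row] for row in projected]


# Toy graph (assumed): one reference phage connected to one test contig.
adjacency = [[0.0, 1.0],
             [1.0, 0.0]]
node_features = [[1.0], [3.0]]   # stand-in for the 256-dimensional embeddings
layer_weights = [[1.0]]          # stand-in for learned weights
print(gcn_layer(adjacency, node_features, layer_weights))
```

After one layer, each node's representation is a degree-normalized average of its own features and those of its neighbors, which is how label information from reference phages can reach connected test contigs.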

Here, we present PhaGCN2 to align computational classifications with ICTV-ratified taxa by automatically upgrading the database. PhaGCN2 can predict the taxonomy of viral sequences at the family level and identify members of novel virus families that have not yet been defined in ICTV. We compare PhaGCN2 with the state-of-the-art virus classification tools (vConTACT2, CAT, and VPF-Class) using widely accepted metrics such as precision, recall, and required computing resources. The experimental results show that PhaGCN2 achieves higher precision, recall, and F1-score than these tools on the ICTV2021 sequences.

Improvements of PhaGCN2

In summary, PhaGCN2 contains three major improvements compared with the previous version (Table S1): (1) updating with ICTV and using Prodigal to build a reference database covering the entire virus realm (Table S2), (2) using network topology to assist outlier recognition, and (3) assigning outlier nodes to family_like. Improvements (2) and (3) enable PhaGCN2 to automatically suggest new families, which removes the limitation of a fixed label set in commonly used supervised learning models. These improvements allow PhaGCN2 to obtain more accurate predictions than the original version, with the precision (Eq. (1)) increased from 73.19% to 83.91%, the recall (Eq. (2)) increased from 87.92% to 89.30%, and F₁ (Eq. (3)) increased from 79.88% to 86.52% (Table S3). Detailed descriptions can be found in the following sections.

$$\text{Precision}=\frac{TP}{TP+FP} \tag{1}$$

$$\text{Recall}=\frac{TP}{TP+FN} \tag{2}$$

$${F}_{1}=2\times\frac{\text{Recall}\times\text{Precision}}{\text{Recall}+\text{Precision}} \tag{3}$$

Here, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
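As a quick check of Eqs. (1)–(3), the short sketch below implements the three metrics and recomputes F₁ from the precision and recall values reported above for PhaGCN and PhaGCN2. The function names are illustrative and are not part of the PhaGCN2 code base.

```python
def precision(tp, fp):
    """Eq. (1): fraction of predictions that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Eq. (2): fraction of true members that are recovered."""
    return tp / (tp + fn)


def f1_score(precision_value, recall_value):
    """Eq. (3): harmonic mean of precision and recall."""
    return 2 * recall_value * precision_value / (recall_value + precision_value)


# Precision and recall as reported in the text (PhaGCN vs. PhaGCN2).
f1_phagcn = f1_score(0.7319, 0.8792)
f1_phagcn2 = f1_score(0.8391, 0.8930)
print(f"PhaGCN F1 = {f1_phagcn:.2%}, PhaGCN2 F1 = {f1_phagcn2:.2%}")
```

Both recomputed values agree with the reported F₁ scores of 79.88% and 86.52%.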

Database construction. The PhaGCN protein database was constructed by manually downloading protein sequences from the National Center for Biotechnology Information (NCBI). This approach has two potential disadvantages. First, the number of proteins is limited by the update cycle of the RefSeq protein database. Second, users need to map the proteins to their original genomes sequence by sequence, which is tedious and error-prone. To establish a faster and more user-friendly pipeline for constructing the database, we apply Prodigal(12) to conduct gene finding and protein translation on the up-to-date ICTV database, with the latest ICTV2021 containing 10,550 viruses. PhaGCN2 with the Prodigal-built database was compared with the original PhaGCN database using 8,760 virus sequences (length > 8,000 bp) from DOV (Dataset of Oyster Virome)(13). The results reveal that 98.46% of the predictions are consistent, indicating that using Prodigal to establish a protein database is reliable (Table S2). Users can now keep computational classifications aligned with ICTV-ratified taxa by retraining the virus classification database in PhaGCN2.
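The gene-finding and translation step can be sketched as a thin wrapper around the Prodigal command line. The file names are hypothetical, and the choice of the metagenomic procedure (`-p meta`) is an assumption made for this illustration; the parameters used inside PhaGCN2 should be taken from its source code.

```python
import shutil
import subprocess


def build_prodigal_command(genome_fasta, protein_fasta, gene_gff):
    """Assemble a Prodigal call that predicts genes and writes translated proteins."""
    return [
        "prodigal",
        "-i", genome_fasta,   # input nucleotide FASTA (reference genomes)
        "-a", protein_fasta,  # output FASTA of translated proteins
        "-o", gene_gff,       # output file with gene coordinates
        "-f", "gff",          # coordinate output format
        "-p", "meta",         # metagenomic mode (assumed here)
    ]


def run_prodigal(genome_fasta, protein_fasta, gene_gff):
    """Run Prodigal if it is installed; raise a clear error otherwise."""
    if shutil.which("prodigal") is None:
        raise FileNotFoundError("prodigal executable not found on PATH")
    subprocess.run(
        build_prodigal_command(genome_fasta, protein_fasta, gene_gff),
        check=True,
    )


# Hypothetical file names, shown without running the external program.
print(" ".join(build_prodigal_command("ictv2021.fna", "ictv2021.faa", "ictv2021.gff")))
```

Because Prodigal derives each protein identifier from the identifier of its source contig, the protein-to-genome mapping is available without manual bookkeeping.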

Network visualization. Similar to vConTACT2, PhaGCN2 can output the virus family clustering network. This gives an intuitive view of the relationships between different virus families and their members. In addition to visualizing family relationships, we use the network topology to identify possible new families, which appear as subgraphs that are only weakly connected to nodes from ICTV. First, we identify outliers, which are test viruses (nodes) not connected to any viruses from ICTV (Figure S1A, red dots). These outliers often come from new families, but they were assigned to the predefined families (Figure S1B, green dots) because of the design limitation of the supervised learning algorithm.

Family-like prediction. To support the automatic identification of new families, we label these outliers as family_like, meaning that they probably belong to another family that is close to a reference family. For instance, if a node is predicted to be Lipothrixviridae_like, the node is close to Lipothrixviridae, but clustering it into the same family is not recommended. To verify the feasibility of predicting outliers as family_like, we use the ICTV2020 viruses to build a protein database and use the viruses newly added in ICTV2021 (2,636 viral reference genomes after filtering) as the test data. Detailed prediction results are shown in Table S3. The precision and recall for each family after integrating this function are shown in Table S4.
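The outlier rule and the family_like labeling described in the two paragraphs above can be sketched as follows. The sketch assumes that "not connected" means having no direct edge to any reference node; the node names and the function name are hypothetical, and the actual PhaGCN2 implementation may apply a different connectivity criterion.

```python
def label_family_like(edges, reference_nodes, predictions):
    """Append '_like' to the predicted family of every outlier test node.

    edges: iterable of (node_a, node_b) pairs from the knowledge graph.
    reference_nodes: nodes that come from the ICTV reference database.
    predictions: mapping of test node -> predicted family.
    """
    neighbors = {}
    for node_a, node_b in edges:
        neighbors.setdefault(node_a, set()).add(node_b)
        neighbors.setdefault(node_b, set()).add(node_a)

    reference = set(reference_nodes)
    labeled = {}
    for node, family in predictions.items():
        # Assumed criterion: an outlier has no direct edge to a reference node.
        is_outlier = not (neighbors.get(node, set()) & reference)
        labeled[node] = f"{family}_like" if is_outlier else family
    return labeled


# Toy graph: contig_1 touches a reference genome, contig_2 does not.
toy_edges = [("Geminiviridae_ref", "contig_1"), ("contig_2", "contig_3")]
toy_predictions = {"contig_1": "Geminiviridae", "contig_2": "Geminiviridae"}
print(label_family_like(toy_edges, {"Geminiviridae_ref"}, toy_predictions))
```

In this toy case, contig_1 keeps the label Geminiviridae, whereas contig_2 becomes Geminiviridae_like because it only connects to other test contigs.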

Among the 2,636 newly added viruses, 339 belong to families that are not defined in ICTV2020, so their labels do not exist in our training data. PhaGCN2 assigned 204 viruses as family_like in total. Among these, 167 test sequences are members of genuinely novel families in ICTV2021 or of families not included in the ICTV2020 training. Therefore, the precision of the family_like label is 81.86% (167/204), and the recall is 49.26% (167/339). Among the 167 true family_like labels, 153 viruses are defined in ICTV2021 as Genomoviridae (a novel family in ICTV2021), but PhaGCN predicted them as Geminiviridae (a family in the same order, Geplafuvirales, as Genomoviridae). PhaGCN2 now predicts them as Geminiviridae_like, which means that these viruses probably belong to a family closely related to Geminiviridae. The other 37 test sequences were mistakenly annotated as family_like, because ICTV2021 places them in families that are already in the ICTV2020 list. For example, some viruses are Myoviridae in ICTV2021 but were predicted as Drexlerviridae (a family in the same order, Caudovirales, as Myoviridae) by PhaGCN. PhaGCN2 now recognizes them as Drexlerviridae_like. Notably, although they are classified under Myoviridae according to the ICTV2021 criteria, they belong to a new genus within the family, which has no edges to the members of Myoviridae in ICTV2020. In fact, most of the 37 test sequences are classified into a new genus in ICTV2021.
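The precision and recall of the family_like label follow directly from the three counts given above. The short sketch below reproduces the arithmetic; the function name is illustrative.

```python
def family_like_metrics(assigned, correct, novel_total):
    """Return (precision, recall) of the family_like label.

    assigned: sequences labeled family_like by the classifier.
    correct: assigned sequences that truly belong to novel families.
    novel_total: all test sequences that belong to novel families.
    """
    return correct / assigned, correct / novel_total


# Counts reported in the text for the ICTV2020 -> ICTV2021 experiment.
like_precision, like_recall = family_like_metrics(
    assigned=204, correct=167, novel_total=339
)
print(f"precision = {like_precision:.2%}, recall = {like_recall:.2%}")
print(f"false family_like assignments = {204 - 167}")
```

The 37 false assignments in the last line are the sequences discussed at the end of the paragraph above.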

Comparison with the state-of-the-art tools

To evaluate PhaGCN2 comprehensively, we compare it with vConTACT2, CAT, and VPF-Class using six widely used metrics: precision (Eq. (1)), recall (Eq. (2)), F1-score (balanced score, Eq. (3)), consistency (Eq. (4)), computing speed, and peak memory.

$$\text{Consistency}=\frac{\text{number of viruses given the same prediction by both tools}}{\text{number of viruses predicted by both tools}} \tag{4}$$

To compare the consistency of the predictions made by vConTACT2, PhaGCN2, and CAT, we take the ICTV2021 data (9,604 viral genome sequences, all known reference viruses) as test data. As shown in Fig. 1A, 2,248 viruses are predicted by both vConTACT2 and PhaGCN2, and 1,494 of these predictions are identical, giving a consistency of 66.46% ((739 + 755)/(1199 + 1049)) (detailed information is listed in Table S5). PhaGCN2 and CAT both predict 6,752 viruses, of which 5,090 are identical, giving a consistency of 75.39% ((739 + 4351)/(1199 + 5553)). vConTACT2 and CAT both predict 1,266 viruses, of which 777 are identical, giving a consistency of 61.37% ((739 + 38)/(1199 + 67)). A total of 1,199 sequences are predicted by all three tools, and 739 of them receive the same prediction, giving a consistency of 61.63% (739/1199). We then take GOV2.0 (482,522 virus genome sequences, most of them from novel viruses) as the test data. The GOV2.0 paper provides ready-to-use vConTACT2 predictions(6), so we only ran PhaGCN2 and CAT on GOV2.0 (VPF-Class is not included in this test because of its slow computation). vConTACT2 produced only 47,839 predictions (9.91%), CAT produced 170,200 (35.27%), and PhaGCN2 produced 199,833 (41.41%). As shown in Fig. 1B, 20,287 viruses are predicted by both vConTACT2 and PhaGCN2, and 16,958 of these predictions are identical, giving a consistency of 83.59% ((3205 + 13753)/(5441 + 14846)) (detailed information is listed in Table S6). PhaGCN2 and CAT both predict 13,996 viruses, of which 5,694 are identical, giving a consistency of 40.68% ((3205 + 2489)/(5441 + 8555)). vConTACT2 and CAT both predict 10,780 viruses, of which 5,893 are identical, giving a consistency of 54.67% ((3205 + 2688)/(5441 + 5339)). A total of 5,441 sequences are predicted by all three tools, and 3,205 of them receive the same prediction, giving a consistency of 58.90% (3205/5441). These results show that the tools have similar consistency for known viruses. For unknown viruses, however, alignment-based classification methods such as CAT have lower consistency with the other tools.
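The consistency values above can be reproduced from the region counts quoted in parentheses (the three-way overlap plus each pairwise-only overlap of Fig. 1). The sketch below applies Eq. (4) to those counts; the dictionary layout is only an illustrative way of organizing them.

```python
def consistency(same_predictions, predicted_by_both):
    """Eq. (4): share of jointly predicted viruses that receive the same label."""
    return same_predictions / predicted_by_both


# (identical predictions, jointly predicted viruses), as quoted in the text.
ictv2021_counts = {
    "vConTACT2 vs PhaGCN2": (739 + 755, 1199 + 1049),
    "PhaGCN2 vs CAT": (739 + 4351, 1199 + 5553),
    "vConTACT2 vs CAT": (739 + 38, 1199 + 67),
    "all three tools": (739, 1199),
}
gov2_counts = {
    "vConTACT2 vs PhaGCN2": (3205 + 13753, 5441 + 14846),
    "PhaGCN2 vs CAT": (3205 + 2489, 5441 + 8555),
    "vConTACT2 vs CAT": (3205 + 2688, 5441 + 5339),
    "all three tools": (3205, 5441),
}

for dataset, counts in (("ICTV2021", ictv2021_counts), ("GOV2.0", gov2_counts)):
    for pair, (same, both) in counts.items():
        print(f"{dataset} | {pair}: {consistency(same, both):.2%}")
```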

As mentioned above, when using the newly added viruses from ICTV2021 as the test data, the recall and precision of PhaGCN2 are 89.30% and 83.91%, respectively (Table S1). Here, we further tested PhaGCN2, vConTACT2, CAT, and VPF-Class on all the ICTV2021 sequences collected in the PhaGCN2 database and compared the predictions with the ICTV2021 classification (Table 3). The tools rank as PhaGCN2 > CAT > vConTACT2 > VPF-Class for precision (Table 1), and as PhaGCN2 > CAT > VPF-Class > vConTACT2 for both recall and F1-score. The results show that PhaGCN2 largely improves the precision and recall of virus classification over the state-of-the-art tools. The detailed results are shown in Table S5.

In addition, we recorded the elapsed time and peak memory of PhaGCN2, vConTACT2, and CAT. We randomly selected 1,000, 5,000, and 10,000 sequences from GPD for testing (Fig. 2). PhaGCN2 is faster than vConTACT2 but slower than CAT. vConTACT2 has high memory usage in the step that calculates similarity networks, whereas PhaGCN2 and CAT consume less memory.

Analysis of the sequences without predictions

As mentioned above, when using the newly added viruses from ICTV2021 as the test data, 1,492 sequences receive predictions and 1,142 sequences do not. Of the 1,142 sequences without predictions (Table S7), 992 are from families newly added in ICTV2021 and thus cannot be predicted by PhaGCN2. Of the remaining 150 sequences, 80 belong to new genera under known families. We speculate that these new genera cannot be predicted because the different genera in these families have low similarity to each other. Another 49 sequences are missed even though they do not belong to new genera, because their genera were not included in training: the sample size of these genera in the 2020 training set was too small (genus members < 8). For the remaining 21 sequences, we cannot determine the cause for the time being. However, compared with the total of 2,634 test sequences, this number is acceptable.

Furthermore, we used Diamond blastx to examine the protein-level similarity between the reference genomes (ICTV2020 training data) and the sequences newly added in ICTV2021, with and without predictions, and compared the similarity distributions of the two groups. As shown in Figure S2, the protein sequence identity distributions differ significantly between the two groups. Virus sequences with relatively low variability and an identity of about 54.8% are likely to be predicted by PhaGCN2, whereas highly variable sequences with an identity lower than 37.4% have a low probability of being predicted. Detailed results are shown in Table S8.

Possibility of genus-level prediction

Like vConTACT2, PhaGCN2 can draw the network diagram. We use about 1,700 genomes from a metagenomic compendium of human gut microbiome DNA viruses(14) as the test data and map the network with the results of PhaGCN2. Because of space limitations, we only show the results for the 10 largest families in the database (Fig. 3A). Virus nodes of the same family clearly cluster closely. To visualize clusters at the genus level, we investigated the genera in the most abundant family, Siphoviridae. The 16 largest genera in Siphoviridae are shown in different colors in Fig. 3B. Some genera, such as Pahexavirus, Skunavirus, and Ceduovirus, form their own clusters. However, other genera (such as Triavirus, Phietavirus, Bioseptimavirus, Dubowvirus, and Peeveelvirus) are mixed together (Fig. 3B). This suggests that they are not different enough for PhaGCN2 to predict them as different genera.

Investigation of public data using PhaGCN2

GPD and GOV2.0 represent two completely different viral habitats. In this section, we use PhaGCN2 to classify the sequences in the GPD and GOV2.0 databases. After removing ineligible sequences, 142,333 of 142,809 GPD sequences and 328,173 of 482,522 GOV2.0 sequences remain. As shown in Fig. 4, the overall recall for GPD and GOV2.0 is 91.9% and 40.8%, respectively. The proportion of unknown viruses is far higher in GOV2.0 than in GPD, indicating that viruses in the ocean have not been fully explored and that a large portion remains hidden, like the submerged part of an iceberg. Considering only the classified categories (excluding unknown), Siphoviridae and Myoviridae account for 54.5% of GPD, and Siphoviridae_like and Myoviridae_like account for 31.1%. In GOV2.0, by contrast, Siphoviridae and Myoviridae account for 28.9%, and Siphoviridae_like and Myoviridae_like account for 40.4%. If other families under Caudovirales, such as Podoviridae and Herelleviridae, are included, 99.16% of the phages in the human gut and 94.8% of those in the ocean are Caudovirales. Thus, Caudovirales dominates both GPD and GOV2.0 at the order level, but the two datasets differ considerably at the family level. Detailed results are shown in Table S9 and Table S5.

We further applied PhaGCN2 to classify 2,202 qualified RNA virus genomes from studies of invertebrate and vertebrate viromes(15, 16). Predictions were obtained for 1,094 sequences, and only six virus genomes are predicted to be non-RNA viruses. The top three families are Marnaviridae, Dicistroviridae, and Nodaviridae, which account for 18.7% in total. However, up to 52.5% of the viruses cannot be assigned to a known viral family, which shows that our understanding of the RNA virosphere is still very limited. The detailed results are shown in Table S10 and Figure S3.

Furthermore, using the family-level classification and site information of GOV2.0, we plotted abundance distribution maps of Myoviridae and Siphoviridae at different sites and depths (Fig. 5). As shown in Fig. 5, the proportion of Myoviridae is higher closer to the equatorial region and in the upper ocean. In contrast, the proportion of Siphoviridae is higher at the two poles than at the equator. This suggests that viruses from different families may have evolved unique adaptations to different niches over a long period. The detailed longitude, latitude, and content data are shown in Table S11.

vConTACT2 is a widely recognized tool for virus classification that combines ClusterONE(17), hierarchical clustering(18), and protein clusters generated by the Markov cluster algorithm (MCL)(19). The advantage of this method is that it can accurately predict the genome classification of large DNA phages with multiple ORFs and frequent recombination. However, its performance deteriorates for phage contigs that contain fewer protein clusters. PhaGCN integrates the protein-cluster-based features into a more powerful machine learning model based on a graph convolutional network and thus achieves higher accuracy with fewer computing resources. However, PhaGCN is restricted to phages, which limits its utility for comprehensive virus taxonomic classification. PhaGCN2 removes this limitation by augmenting the learning model and the reference database. PhaGCN2 can be applied to all types of viral metagenomic data and automatically produces family-level taxonomic classification of both DNA and RNA viruses. In addition, it can suggest new viral families based on the network topology. Alignment-based classification methods such as CAT or comprehensive BLAST(20) rely only on the alignment result and simply infer the classification of a sequence from majority votes. Although CAT is the second most accurate tool for identifying known viruses (Table 3), alignment-based tools are not optimized for classifying novel or highly diverged viruses.

PhaGCN2 is designed and trained to make predictions at the family level. Although the method can be extended to genus-level prediction, many genera have too few members to train a generalized learning model. Another challenge is that some genera under the current ICTV standard are too similar to be distinguished effectively (Fig. 3B). However, with the continuous growth of the ICTV reference data set and the adjustment of closely related genera by ICTV, prediction at the genus level will become more feasible.

Like other learning-based models, PhaGCN2 relies on the quality of its training data. Because of sequencing bias, current training data do not systematically cover different taxonomic groups. Although PhaGCN2 leverages network topology to suggest novel families, its ability to predict new families is limited. The detection rate of unknown virus sequences with identity below 37.4% is usually very low (Figure S2). One possible strategy to enhance the classification of new viral families is to conduct iterative prediction using PhaGCN2. First, we can run PhaGCN2 on all viral genome data (such as IMG/VR(8)). Then, we can add the newly predicted family_like members to the training data to increase the capacity of PhaGCN2 to identify more members of new families. This iterative training and searching is likely to improve the ability of PhaGCN2 to detect new families. We will investigate this in future work.

However, for "dark matter" sequences with no or very low similarity to known viruses, de novo viral classification may be impossible. First, we cannot evaluate the accuracy of the predictions. Second, without any homologs, it is difficult to characterize the structure or function of their genomes. No matter how many such sequences are identified, they remain "dark matter".

Finally, because PhaGCN2 does not predict whether an input sequence belongs to a virus or a host cell, we strongly recommend using only viral sequences as input. In other words, virus identification tools (such as DIAMOND(21) and VirSorter2(22)) should be used to remove non-viral sequences before PhaGCN2 is applied.

Here, we present PhaGCN2, which can rapidly classify the taxonomy of viral sequences at family level and supports the visualization of the associations of all families. We evaluate the performance of PhaGCN2 and compare it with the state-of-the-art virus classification tools, such as vConTACT2, CAT, and VPF-Class, using the widely accepted metrics. The results show that PhaGCN2 largely improves the precision and recall of virus classification, increases the number of classifiable virus sequences in the Global Ocean Virome dataset (v2.0) by 4 times, and classifies more than 90% of the Gut Phage Database. PhaGCN2 makes it possible to conduct high-throughput and automatic expansion of the database of the International Committee on Taxonomy of Viruses.

Datasets and benchmarked tools

The main datasets and tools used or evaluated in this paper are listed as follows.

Sequences preprocessing before building the protein database

When training the CNN(25) model, each family must contain enough sample sequences, so small families are removed before building the database. The filter conditions are length ≥ 1,700 bp (to make sure that the sequence contains enough information), family members ≥ 8 (to ensure that each family contains at least seven training sequences and one validation sequence), and ACGT-only contigs (contigs with non-ACGT characters are skipped).
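The three filter conditions can be sketched as a small function. The record layout (identifier, family, sequence), the order in which the filters are applied (per-sequence checks first, family size second), and the case-insensitive ACGT check are assumptions made for this illustration, not a description of the PhaGCN2 source code.

```python
from collections import Counter

MIN_LENGTH = 1700       # minimum sequence length in bp
MIN_FAMILY_SIZE = 8     # at least seven training and one validation sequence


def filter_training_sequences(records):
    """Keep sequences that pass the length, ACGT, and family-size filters.

    records: list of (sequence_id, family, sequence) tuples.
    """
    # Per-sequence filters: long enough and free of non-ACGT characters.
    valid = [
        (seq_id, family, seq)
        for seq_id, family, seq in records
        if len(seq) >= MIN_LENGTH and set(seq.upper()) <= set("ACGT")
    ]
    # Family-level filter: drop families that are too small after cleaning.
    family_sizes = Counter(family for _, family, _ in valid)
    return [rec for rec in valid if family_sizes[rec[1]] >= MIN_FAMILY_SIZE]


# Toy data: eight valid members of one family, plus records that should be dropped.
toy_records = [(f"a{i}", "FamilyA", "ACGT" * 425) for i in range(8)]
toy_records += [
    ("b0", "FamilyB", "ACGT" * 425),      # family too small
    ("a_short", "FamilyA", "ACGT" * 10),  # shorter than 1,700 bp
    ("a_n", "FamilyA", "ACGN" * 425),     # contains a non-ACGT character
]
print(len(filter_training_sequences(toy_records)))
```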

Statistical information

We randomly selected subsets of sequences to quantify the usage of computing resources. The sequences were chosen using a random number generator in Python. Run time was measured with the “/usr/bin/time” command available in Linux. Peak memory was measured with the “/usr/bin/free -h” command available in Linux. The knowledge graph network was visualized with Gephi(26) (v.0.9.2; https://gephi.org/). The other figures were drawn with R.
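A minimal sketch of the random subsampling step is shown below, using the subset sizes from the benchmark (1,000, 5,000, and 10,000 sequences). The fixed seed and the placeholder identifiers are assumptions added for reproducibility of the example; the text does not state whether a seed was used.

```python
import random


def sample_sequence_ids(sequence_ids, sample_size, seed=0):
    """Draw sample_size distinct identifiers without replacement."""
    rng = random.Random(seed)  # local generator; the seed is an assumption
    return rng.sample(list(sequence_ids), sample_size)


# Placeholder identifiers standing in for the GPD sequence names.
all_ids = [f"sequence_{i}" for i in range(142809)]
subsets = {size: sample_sequence_ids(all_ids, size) for size in (1000, 5000, 10000)}
for size, subset in subsets.items():
    print(size, len(set(subset)))
```

Sampling without replacement guarantees that each benchmark subset contains no duplicate sequences.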

Method optimization of PhaGCN and PhaGCN2

Besides being restricted to Caudovirales, PhaGCN has other important limitations that have not been solved. ICTV frequently adjusts its taxonomy criteria as research progresses, for example by deleting old families, adding new families, and moving members from one family to another. The continuous change of the reference and the emergence of novel viruses impede the accuracy and sensitivity of automatic prediction. In particular, most learning-based models must specify the label set (e.g. family labels) in advance and therefore cannot accommodate viruses from new families. In view of this, we have made the following improvements to PhaGCN: (1) updating with ICTV and using Prodigal to build a reference database covering the entire virus realm, (2) using a network graph to show the clustering relationships among family members, and (3) predicting novel viral families (family_like) based on the topology of the network (outlier nodes). Because the PhaGCN protein database was constructed by manually downloading protein sequences from NCBI, building it was a tedious and error-prone process. Improvement (1) enables PhaGCN2 to automatically translate the predicted genes and build an up-to-date protein database. Improvements (2) and (3) enable PhaGCN2 to automatically suggest new families, which removes the limitation of a fixed label set in commonly used supervised learning models.

Availability of data and materials

The source code of PhaGCN2 is available via: https://github.com/KennthShang/PhaGCN2.0.

Acknowledgements

Not applicable.

Funding

This project was supported by the Natural Science Foundation of China (nos. 31872499 and 31972847) to Yuan LH and Jiang JZ; the Central Public-Interest Scientific Institution Basal Research Fund, CAFS (nos. 2020TD42 and 2021SD05) to Jiang JZ; the Guangdong Provincial Special Fund for Modern Agriculture Industry Technology Innovation Teams (no. 2019KJ141) to Jiang JZ. The funders had no role in the study design, data collection, and analysis, decision to publish, or manuscript preparation.

Ethical Approval and Consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JJZ: Conceptualization, Methodology, Writing - Original Draft, Writing - Review & Editing, Supervision, Project administration, and Funding acquisition; YWG: Methodology, Software, Validation, Formal analysis, Data Curation, Writing - Original Draft, and Visualization; SJY: Methodology, Software, and Writing - Review & Editing; YLL, LM, ZP, and SYH: Resources, Data Curation, and Investigation; JT: Conceptualization and Resources; SYN: Conceptualization, Methodology, Writing - Review & Editing, and Supervision; YLH: Conceptualization, Supervision, Project administration, and Funding acquisition.

All authors read and approved the final manuscript.

Gelderblom, H.R. (1996) Structure and Classification of Viruses. Medical Microbiology.
Suttle, C.A. (2007) Marine viruses - major players in the global ecosystem. Nature Reviews Microbiology, 5, 801-812.
Geoghegan, J.L. and Holmes, E.C. (2017) Predicting virus emergence amid evolutionary noise. Open Biol, 7.
Asokan, G.V. and Kasimanickam, R.K. (2013) Emerging Infectious Diseases, Antimicrobial Resistance and Millennium Development Goals: Resolving the Challenges through One Health. Cent Asian J Glob Health, 2, 76.
Grant, W.B. (2008) Hypothesis--ultraviolet-B irradiance and vitamin D reduce the risk of viral infections and thus their sequelae, including autoimmune diseases and some cancers.
Gregory, A.C., Zayed, A.A., Conceicao-Neto, N., Temperton, B., Bolduc, B., Alberti, A., Ardyna, M., Arkhipova, K., Carmichael, M., Cruaud, C. et al. (2019) Marine DNA Viral Macro- and Microdiversity from Pole to Pole. Cell, 177, 1109-1123 e1114.
Camarillo-Guerrero, L.F., Almeida, A., Rangel-Pineros, G., Finn, R.D. and Lawley, T.D. (2021) Massive expansion of human gut bacteriophage diversity. Cell, 184, 1098-1109 e1099.
Roux, S., Paez-Espino, D., Chen, I.A., Palaniappan, K., Ratner, A., Chu, K., Reddy, T.B.K., Nayfach, S., Schulz, F., Call, L. et al. (2021) IMG/VR v3: an integrated ecological and evolutionary framework for interrogating genomes of uncultivated viruses. Nucleic Acids Res, 49, D764-D775.
Simmonds, P., Adams, M.J., Benko, M., Breitbart, M., Brister, J.R., Carstens, E.B., Davison, A.J., Delwart, E., Gorbalenya, A.E., Harrach, B. et al. (2017) Consensus statement: Virus taxonomy in the age of metagenomics. Nat Rev Microbiol, 15, 161-168.
Dutilh, B.E., Varsani, A., Tong, Y., Simmonds, P., Sabanadzovic, S., Rubino, L., Roux, S., Munoz, A.R., Lood, C., Lefkowitz, E.J. et al. (2021) Perspective on taxonomic classification of uncultivated viruses. Curr Opin Virol, 51, 207-215.
Shang, J., Jiang, J. and Sun, Y. (2021) Bacteriophage classification for assembled contigs using graph convolutional network. Bioinformatics, 37, i25-i33.
Hyatt, D., Chen, G.L., Locascio, P.F., Land, M.L., Larimer, F.W. and Hauser, L.J. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119.
Jing-Zhe, J., Yi-Fei, F., Hong-Ying, W., Ying-Xiang, G., Li-Ling, Y., Tao, J., Mang, S., Shao-Kun, S., Meng, W., Tuo, Y. et al. (2021) Dataset of Oyster Virome and the Remarkable Virus Diversity in Filter-Feeding Oysters. Research Square.
Nayfach, S., Páez-Espino, D., Call, L., Low, S.J., Sberro, H., Ivanova, N.N., Proal, A.D., Fischbach, M.A., Bhatt, A.S., Hugenholtz, P. et al. (2021) Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nature Microbiology, 6, 960-970.
Shi, M., Lin, X.-D., Tian, J.-H., Chen, L.-J., Chen, X., Li, C.-X., Qin, X.-C., Li, J., Cao, J.-P., Eden, J.-S. et al. (2016) Redefining the invertebrate RNA virosphere. Nature, 540, 539-543.
Shi, M., Lin, X.-D., Chen, X., Tian, J.-H., Chen, L.-J., Li, K., Wang, W., Eden, J.-S., Shen, J.-J., Liu, L. et al. (2018) The evolutionary history of vertebrate RNA viruses. Nature, 556, 197-202.
Nepusz, T., Yu, H. and Paccanaro, A. (2012) Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods, 9, 471-472.
Lima-Mendez, G., Van Helden, J., Toussaint, A. and Leplae, R. (2008) Reticulate representation of evolutionary and functional relationships between phage genomes. Mol Biol Evol, 25, 762-777.
Bin Jang, H., Bolduc, B., Zablocki, O., Kuhn, J.H., Roux, S., Adriaenssens, E.M., Brister, J.R., Kropinski, A.M., Krupovic, M., Lavigne, R. et al. (2019) Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat Biotechnol, 37, 632-639.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool.
Buchfink, B., Xie, C. and Huson, D.H. (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods, 12, 59-60.
Guo, J., Bolduc, B., Zayed, A.A., Varsani, A., Dominguez-Huerta, G., Delmont, T.O., Pratama, A.A., Gazitua, M.C., Vik, D., Sullivan, M.B. et al. (2021) VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome, 9, 37.
von Meijenfeldt, F.A.B., Arkhipova, K., Cambuy, D.D., Coutinho, F.H. and Dutilh, B.E. (2019) Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biol, 20, 217.
Pons, J.C., Paez-Espino, D., Riera, G., Ivanova, N., Kyrpides, N.C. and Llabres, M. (2021) VPF-Class: Taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics.
Shang, J. and Sun, Y. (2020) CHEER: hierarCHical taxonomic classification for viral mEtagEnomic data via deep leaRning. Methods.
Bastian, M., Heymann, S. and Jacomy, M. (2009) Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media.

Table 1

Datasets

Dataset	Year	Habitat	Description
ICTV2021	2021		The 2021 ICTV Virus Metadata Resource (https://talk.ictvonline.org/taxonomy/vmr/m/vmr-file-repository/12323)
ICTV2020	2020		The 2020 ICTV Virus Metadata Resource (https://talk.ictvonline.org/taxonomy/vmr/m/vmr-file-repository/10312)
GPD (Gut Phage Database)	2021	Human gut	Lawley et al. (2021) created the Gut Phage Database (GPD), a collection of 142,809 non-redundant viral genomes (length>10 kb) obtained by mining 28,060 globally distributed human gut metagenomes and 2,898 reference genomes of cultured gut bacteria(7). (http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/gut_phage_database/)
GOV2.0 (Global Ocean DNA Virome dataset)	2019	Ocean	Gregory et al. (2019) established an ~12-fold expanded global ocean DNA virome dataset (GOV2.0) of 195,728 viral populations, now including the Arctic Ocean, and validated that these populations form discrete genotypic clusters(6). (https://data.iplantcollaborative.org/dav/iplant/commons/community_released/iVirus/GOV2.0/)
MGV (Metagenomic Gut Virus)	2021	Human stool	Nayfach et al. (2021) assembled the Metagenomic Gut Virus (MGV) catalogue, which comprises 189,680 viral genomes from 11,810 publicly available human stool metagenomes(14). (https://github.com/snayfach/MGV)
DOV (Dataset of Oyster Virome)	2021	Oyster	Jiang et al. (2021) established a Dataset of Oyster Virome (DOV) that contains 728,784 nonredundant viral operational taxonomic unit (vOTU) contigs and 3,473 high-quality viral genomes, enabling the first comprehensive overview of viral communities in oysters(13). (https://ngdc.cncb.ac.cn/gsub/submit/bioproject/subPRO010366/overview)
Test RNA database	2016 and 2018	Invertebrates and vertebrates	Shi et al. (2016)(15) profiled the transcriptomes of over 220 invertebrate species sampled across nine animal phyla and reported the discovery of 1,445 RNA viruses, including some that are sufficiently divergent to comprise new families. In 2018, using a large-scale meta-transcriptomic approach, they discovered 214 vertebrate-associated viruses in reptiles, amphibians, lungfish, ray-finned fish, cartilaginous fish and jawless fish(16).

Table 2

The benchmarked tools

Tools	vConTACT2	CAT	VPF-Class
Year	2019	2019	2021
Authors	Sullivan, M. B. et al.	Dutilh, B. E. et al.	Pons, J. C. et al.
Description	vConTACT2 is a tool for taxonomic classification of viral genomic sequence data. It is designed to cluster viral metagenomic sequencing data and provide its taxonomic context(19). It is widely used for virus classification. (https://bitbucket.org/MAVERICLab/vcontact2/)	CAT is a comparison-based species classification tool for metagenomic contigs. It first conducts gene calling, then maps the predicted ORFs against the nr protein database, and finally classifies entire contigs based on the classification of the individual ORFs(23). (https://github.com/dutilh/CAT)	VPF-Class is a comparison-based metagenomic contig annotation tool that can classify viruses and predict their hosts(24). (https://github.com/biocom-uib/vpf-tools)

Table 3

Comparison of PhaGCN2 with the state-of-the-art virus classification tools

Tools	PhaGCN2	vConTACT2	CAT	VPF-Class
Test Data	9604¹ ICTV2021 genomes (including 3189 RNA viruses²)
True Positive	8379	1616	6928	3840
False Positive	260	773	825	3683
False Negative	965	4026	1852	2080
Precision	96.99%	67.64%	89.36%	51.04%
Recall	89.67%	28.64%	78.91%	64.86%
F1-score	93.19%	40.24%	83.81%	57.13%
¹Virus genomes longer than 1700 bp in ICTV2021 were used as the test data for the evaluation of all tools. ²RNA virus genomes were excluded from the vConTACT2 evaluation because it was designed to classify DNA viruses only.
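
The precision, recall, and F1-score rows of Table 3 can be recomputed from the true positive, false positive, and false negative counts. The following is a minimal sketch that assumes the standard definitions of these metrics (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = harmonic mean of the two). The function names `metrics` and `as_percent` are illustrative and are not part of PhaGCN2.

```python
# Counts copied from Table 3: (true positives, false positives, false negatives).
COUNTS = {
    "PhaGCN2": (8379, 260, 965),
    "vConTACT2": (1616, 773, 4026),
    "CAT": (6928, 825, 1852),
    "VPF-Class": (3840, 3683, 2080),
}


def metrics(tp, fp, fn):
    """Return (precision, recall, F1) under the standard definitions (assumed)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def as_percent(value):
    """Format a fraction as a percentage with two decimals, as in Table 3."""
    return f"{100 * value:.2f}%"


if __name__ == "__main__":
    for tool, (tp, fp, fn) in COUNTS.items():
        p, r, f = metrics(tp, fp, fn)
        print(f"{tool}\t{as_percent(p)}\t{as_percent(r)}\t{as_percent(f)}")
```

Running the script prints one tab-separated line per tool, which can be compared directly against the corresponding columns of Table 3.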

No competing interests reported.
