Improvements of PhaGCN2
In summary, PhaGCN2 contains three major improvements comparing to the previous version (Table S1), including (1) updating with ICTV and using prodigal to build reference database under the entire virus realm (Table S2), (2) using network topology to assist outlier recognition, and (3) assigning outlier nodes to family_like. The improvements in (2) and (3) enable PhaGCN2 to automatically suggest new families, which removes the limitation on fixed set of labels in commonly used supervised learning models. These improvements allow PhaGCN2 to obtain more accurate predictions than the original version, with the precision (Eq. (1)) increased from 73.19–83.91%, the recall (Eq. (2)) increased from 87.92–89.30%, and F1 (Eq. (3)) increased from 79.88–86.52% (Table S3). The detailed descriptions can be found in the following sections.
$$Precision=\frac{TP\left(True Positive\right)}{TP\left(True Positive\right)+FP\left(False Positive\right)}$$
1
$$Recall=\frac{TP\left(True Positive\right)}{TP\left(True Positive\right)+FN\left(False Negative\right)}$$
2
$${F}_{1}=2*\frac{Recall*Precision}{Recall+Precision}$$
3
Database construction. The PhaGCN protein database is constructed by manually downloading protein sequences from National Center for Biotechnology Information (NCBI). There are two potential disadvantages to use the old database. First, the number of proteins is limited by the update of the RefSeq protein database. Second, users need to map the proteins to their original genomes sequence-by-sequence, which is tedious and error-prone. To establish a faster and more user-friendly pipeline to construct the database, we apply Prodigal(12) to conduct gene finding and protein translation based on the up-to-date ICTV database, with the latest ICTV2021 containing 10,550 viruses. PhaGCN2 with the database constructed by Prodigal was compared with the original PhaGCN database using 8,760 virus sequences (length > 8000bp) in DOV (Dataset of Oyster Virome)(13). The results reveal that 98.46% of the predictions are consistent, indicating that using Prodigal to establish a protein database is reliable (Table S2). Now, users can align computational classifications with ICTV-ratified taxa by the function of training virus classification database in PhaGCN2.
Network visualization. Similar to vConTACT2, PhaGCN2 can also output the virus family clustering network. This gives us an intuitive understanding of the relationship between different virus families and family members. In addition to visualizing the family relationship, we also use the network topology to identify possible new families, which consist of subgraphs with weak connection with nodes from ICTV. First, we identify outliers, which are test viruses (nodes) not connected to any viruses from ICTV (Figure S1A, red dots). Often these outliers are from new families but they were assigned to the predefined families (Figure S1B, green dots) due to the design limitation of the supervised learning algorithm.
Family-like prediction. To support the automatic identification of new families, we assign these outliers as family_like (probably belong to another family which is close to a reference family). For instance, if a node is predicted to be Lipothrixviridae_like, it means that this node is close to Lipothrixviridae, but it is not recommended to be cluster it into the same family. To verify the feasibility of predicting outlier as family_like, we use the ICTV2020 virus to build a protein database, and use the newly added viruses from ICTV2021 (including 2,636 viral reference genomes after filtrating) as the test data. Detailed prediction results are shown in Table S3. The precision and recall after integrating this function for each family is shown in Table S4.
Among the 2,636 newly added viruses, 339 of them belong to families that are not defined in ICTV2020 and thus their labels do not exist in our training data. PhaGCN2 assigned 204 viruses as family_like in total. Among these sequences, 167 test sequences are members of real novel families of ICTV2021 or the families not included during ICTV2020 training. Therefore, the precision of family_like label is 81.86% (167/204), and the recall is 49.26% (167/339). Among the 167 true family_like labels, 153 viruses are defined in ICTV2021 as Genomoviridae (a novel family in ICTV2021), but they were predicted as Geminiviridae (the same order under Geplafuvirales with Genomoviridae) in PhaGCN. Now, PhaGCN2 predicts them as Geminiviridae_like, which means these viruses probably belong to a family closely related to Geminiviridae. The other 37 test sequences were mistakenly annotated as family_like, as they are family members in the ICTV2020 list according to ICTV2021. For example, some viruses are Myoviridae in ICTV2021, but were predicted as Drexlerviridae (the same order under Caudovirales with Myoviridae) by PhaGCN. Now, PhaGCN2 recognize them as Drexlerviridae_like. Notably, although they are classified under Myoviridae according to the ICTV2021 criteria, they belong to a new genus under the family, which have no edges to the members of Myoviridae in ICTV2020. In fact, most of the 37 test sequences are classified a new genus in ICTV2021.
Comparison with the state-of-the-art tools
In order to have a comprehensive evaluation of PhaGCN2, we compare PhaGCN2 with vConTACT2, CAT, and VPF-Class using six widely used metrics, precision (Eq. (1)), recall (Eq. (2)), F1-score (Balanced Score, Eq. (3)), consistency (Eq. (4)) computing speed, and peak memory.
$$consistency=\frac{\text{T}\text{h}\text{e} \text{s}\text{a}\text{m}\text{e} \text{p}\text{r}\text{e}\text{d}\text{i}\text{c}\text{t}\text{i}\text{o}\text{n} \text{b}\text{y} \text{t}\text{w}\text{o} \text{t}\text{o}\text{o}\text{l}\text{s}}{\text{t}\text{h}\text{e} \text{n}\text{u}\text{m}\text{b}\text{e}\text{r} \text{o}\text{f} \text{v}\text{i}\text{r}\text{u}\text{s}\text{e}\text{s} \text{p}\text{r}\text{e}\text{d}\text{i}\text{c}\text{t}\text{e}\text{d} \text{b}\text{y} \text{b}\text{o}\text{t}\text{h} \text{t}\text{o}\text{o}\text{l}\text{s}}$$
4
To compare the consistency of the prediction made by the three tools, we take the ICTV2021 data (9604 viral genomes sequence, known reference viruses) as test data. As show in Fig. 1A, the number of viruses of predicted by both vConTACT2 and PhaGCN2 are 2248, and 1494 of them are identical, with a consistency value of 66.46% ((739 + 755)/(1199 + 1049)) (Detailed information is listed in Table S5). The number of viruses predicted by both PhaGCN2 and CAT are 6752, and 5090 of them are identical, with a consistency value of 75.39% ((739 + 4351)/(1199 + 5553)). The number of viruses predicted by both vConTACT2 and CAT are 1266, and 777 of them are identical, with a consistency value of 61.37% ((739 + 38)/(1199 + 67)). There are 1199 sequences predicted by all three tools with 739 having the same prediction, leading to a consistency value of 61.63% (739/1199). Then we further take GOV2.0 (including 482,522 virus genome sequences and most of them are novel viruses) as the test data. The paper of GOV2.0 provided the ready-to-use results of vConTACT2 prediction(6). Thus, we only ran PhaGCN2 and CAT to predict the GOV2.0 (VPF-Class is not included in the test as its slow calculation). vConTACT2 only acquired 47,839 predictions (9.91%), CAT predicted 170,200 viruses (35.27%), and PhaGCN2 acquired 199,833 predictions (41.41%). As shown in Fig. 1B, the number of viruses predicted by both vConTACT2 and PhaGCN2 are 20,287, and 16,958 of them are identical, with a consistency value of 83.59% ((3205 + 13753)/(5441 + 14846)) (Detailed information is listed in Table S6). The number of viruses predicted by both PhaGCN2 and CAT are 13,996, and 5,694 of them are identical, with a consistency value of 40.68% ((3205 + 2489)/(5441 + 8555)). The number of viruses predicted by both vConTACT2 and CAT are 10780, and 5,893 of them are identical, with a consistency value of 54.67% ((3205 + 2688)/(5441 + 5339)). There are 5,441 sequences predicted by all three tools, and 3,205 sequences have the same results, with a consistency of 58.90% (3205/5441). These results show that these tools have similar consistency for known viruses. But when focusing on unknown viruses, alignment-based classification methods such as CAT has lower consistency with other tools.
As mentioned above, when using the newly added viruses from ICTV2021 as the test data., the recall and precision of PhaGCN2 are 89.30% and 83.91%, respectively (Table S1). Here, we further tested PhaGCN2, vConTACT2, CAT, and VPF-class on all the ICTV2021 sequences collected in the PhaGCN2 database, and compared the obtained predictions with the classification of ICTV2021 (Table 3). For precision, PhaGCN2 > CAT > vConTACT2 > VPF-Class (Table 1), for Recall, PhaGCN2 > CAT > VPF-Class > vConTACT2. And for F1-score, PhaGCN2 > CAT > VPF-Class > vConTACT2. The results show that PhaGCN2 largely improves the precision and recall of virus classification over the state-of-the-art tools. The detailed results are shown in Table S5.
In addition, we recorded the elapsed time and peak memory of the three tools. We randomly selected 1000, 5000, and 10000 sequences from GPD for testing (Fig. 2). PhaGCN2 is faster than vConTACT2 but slower than CAT. vConTACT2 has a high memory usage in the step of calculating similarity networks, while PhaGCN2 and CAT consumes less memory.
Analysis of the sequences without predictions
As mentioned above, while using the newly added virus from ICTV2021 as the test data, there are 1,492 sequences with predictions and 1,142 sequences without prediction. In our analysis of 1142 sequences without predictions (Table S7), 992 of them are from newly added families by ICTV2021 and thus cannot be predicted by PhaGCN2. Of the remaining 150 sequences, 80 are new genera under known families. We speculate that these new genera cannot be predicted because the different genera in these families are of low similarity. In addition, 49 sequences are missed. Although they are not new genera, they are not trained by PhaGCN2 because the sample size of this genus in the 2020 training set was too small (genera member < 8). For the remaining 21 sequences, we cannot determine the cause for the time being. However, compared with the total 2,634 test sequences, the number is acceptable.
Furthermore, we examined the protein-level similarity between newly added sequences in ICTV2021 with and without predictions against the reference genomes (ICTV2020 training data) using Diamond blastx, and compared their similarity distributions. As shown in Figure S2, the protein sequence identity distributions are significantly different between the two groups. Among them, virus sequences with relatively low variability and identity about 54.8% are likely to be predicted by PhaGCN2. However, highly variable sequences with identity lower than 37.4% have a low probability of prediction. Detailed results are shown in Table S8.
Possibility of genus-level prediction
Same as vConTACT2, PhaGCN2 can also draw the network diagram. We use the metagenomes of about 1700 human gut microbiome DNA viruses(14) as the test data and map the network with the results of PhaGCN2. Due to the space limitation, we only show the results of the 10 largest families in the database (Fig. 3A). It is obvious that virus nodes of the same families cluster closely. To visualize clusters at genus-level, we investigated the genera in the most abundant family—Siphoviridae. Again, the top 16 genus members in Siphoviridae were visualized using different colors in Fig. 3B. We can see that some genera, Pahexavirus, Skunavirus, and Ceduovirus, were clustered within themselves. However, some genera (such as Triavirus, Phietavirus, Bioseptimavirus, Dubowvirus, and Peeveelvirus) were mixed together (Fig. 3B). This suggests that they are not different enough for PhaGCN2 to predict them as different genus.
Investigation of public data using PhaGCN2
GPD and GOV2.0 represent two completely different viral habitats. In this section, we use PhaGCN2 to classify the GPD and GOV2.0 database. After removing the ineligible sequences, they are left with 142,333 (in all 142809) and 328,173 (in all 482522), respectively. As shown in Fig. 4, the overall recall of GPD and GOV2.0 is 91.9% and 40.8% respectively. The higher proportion of the unknown virues in GOV2.0 is far more than GPD, indicating that viruses in the ocean has not been fully explored, with a large portion still under the iceberg. When only focusing on the classified categories (without unknown), Siphoviridae, and Myoviridae account for 54.5%, and Siphoviridae_like and Myoviridae_like account for 31.1% in GPD. In contrast to GPD, Siphoviridae, and Myoviridae account for 28.9%, and Siphoviridae_like and Myoviridae_like account for 40.4% in GOV2.0. If other families under Caudovirales, such as Podoviridae and Herelleviridae, are included, 99.16% of the phages in the human gut are Caudovirales, while 94.8% in the ocean. It means that Caudovirales is the majority of both GPD and GOV2.0 at the order level, but GPD and GOV2.0 is quite different at the family level. Detailed results are shown in Table S9 and Table S5.
We further applied PhaGCN2 to classify 2202 qualified RNA virus genomes from the study of invertebrate and vertebrate viromes(15, 16). There are 1094 sequences with predictions, and only six virus genomes are predicted to be non-RNA viruses. The top 3 families are Marnaviridae, Dicistroviridae, and Nodaviridae, and they account for 18.7% in total. However, there are up to 52.5% of viruses cannot be taxonomically classified to a known viral family, which shows that our understanding of RNA virosphere is still very limited. The detailed results are shown in Table S10 and Figure S3.
Furthermore, according to the classification and site information of GOV2 at the family level, we ploted the distribution abundance maps of Myoviridae and Siphoviridae at different sites and depths (Fig. 5). As shown in Fig. 5, the closer to the equatorial region and upper ocean, the higher the proportion of Myoviridae is. In contrast, the proportion of Siphoviridae in the two poles is higher than in the equator. This means that viruses from different families may have evolved unique adaptations to the different niches over a long period. The detailed longitude, latitude, and content data are shown in Table S11.