Nucleocytoviricota dataset preparation: open reading frame (ORF) vs intergenic regions’ GC% content
In this study, viral genomes were carefully selected to construct our main dataset. Only viruses with complete genomes available in GenBank (NCBI) were included. Representative genomes of the main viral families of the Nucleocytoviricota were selected, for a total of sixty-one complete viral genomes: Phycodnaviridae (n = 4), Mimiviridae (n = 13), Ascoviridae (n = 1), Iridoviridae (n = 5), Marseilleviridae (n = 5), Asfarviridae (n = 1), Poxviridae (n = 23), “Pithoviridae” (n = 2; and 1 pithoviridae-like), “Pandoraviridae” (n = 1), as well as extended Asfarviridae (n = 3), and unclassified DNA viruses (n = 2), according to the taxonomy proposed by the ICTV in 2020 (Walker et al., 2020). We also included the recently discovered Yaravirus (Yaraviridae) in our analysis. Although it is not classified within the Nucleocytoviricota, phylogenomic analyses indicate a close relationship between Yaravirus and nucleocytoviruses (Miranda Boratto et al., 2022). Therefore, we considered it appropriate to include this small amoeba virus in our final dataset. Host organisms for all viruses were assigned using manually curated data from the Virus-Host Database and are indicated within the dataset (Table 1), and the genomic GC% content of the host organisms was extracted. Since lymphocystis disease virus 1 has multiple probable hosts among fish families, such as Osmeridae, Percichthyidae, Percidae, Soleidae, Clupeidae, and others, a single representative with an available complete genome was selected for each family where possible, based on NCBI’s taxonomy. In addition, the genomic GC% content of the host organisms of the following viruses was not available: Heterosigma akashiwo virus 1, Phaeocystis globosa virus group 1, Chrysochromulina ericina virus, Tetraselmis viridis virus, invertebrate iridescent virus 3, sea otterpox virus, pteropox virus, salmon gillpox virus, Anomala cuprea entomopoxvirus, and Diachasmimorpha longicaudata entomopoxvirus.
Table 1
Average GC% content of viral genomes, viral ORFs, and associated host organism genomes. Overall GC% content of Nucleocytoviricota viruses (Phycodnaviridae, Mimiviridae, Ascoviridae, Iridoviridae, Marseilleviridae, Asfarviridae, Extended Asfarviridae, Poxviridae, “Pithoviridae”, Yaraviridae, “Pandoraviridae”, and unclassified DNA viruses).
Family | Virus | GenBank Accession | Genome size (kb) | Viral genomic GC% | Viral ORF GC% | Host organism | Host genomic GC% |
Phycodnaviridae | Paramecium bursaria Chlorella virus 1 | NC_000852.5 | 330.6 | 40 | 40.6 | Chlorella variabilis | 65.5 |
Emiliania huxleyi virus 86 | NC_007346.1 | 407.3 | 40.2 | 40.6 | Emiliania huxleyi | 64.5 |
Ectocarpus siliculosus virus 1 | NC_002687.1 | 335.6 | 51.7 | 52.2 | Ectocarpus siliculosus | 53.49 |
Heterosigma akashiwo virus 1 | NC_038553.1 | 274.7 | 30.4 | 30.8 | Heterosigma akashiwo | N/A |
Mimiviridae | Cafeteria roenbergensis virus | NC_014637.1 | 617.4 | 23.3 | 23.4 | Cafeteria roenbergensis | 70.44 |
Acanthamoeba polyphaga mimivirus | NC_014649.1 | 1,181.5 | 28 | 28.8 | Acanthamoeba castellanii | 58.35 |
Acanthamoeba polyphaga | 58.7 |
Samba virus | KF959826.2 | 1,181.5 | 28 | 28.7 | Acanthamoeba castellanii | 58.35 |
Megavirus chiliensis | NC_016072.1 | 1,246.1 | 25.3 | 26.3 | Acanthamoeba castellanii | 58.35 |
Acanthamoeba polyphaga | 58.7 |
Acanthamoeba griffini | N/A |
Moumouvirus australiensis | MG807320.1 | 1,098 | 25.1 | 26 | Acanthamoeba polyphaga | 58.7 |
Tupanvirus deep ocean | MF405918.2 | 1,439.5 | 29.4 | 30.5 | Acanthamoeba castellanii | 58.35 |
Vermamoeba vermiformis | 42.48 |
Tupanvirus soda lake | KY523104.2 | 1,516.2 | 29.1 | 30.2 | Acanthamoeba castellanii | 58.35 |
Vermamoeba vermiformis | 42.48 |
Aureococcus anophagefferens virus | NC_024697.1 | 370.9 | 28.7 | 29.3 | Aureococcus anophagefferens | 69.92 |
Bodo saltans virus | MF782455.1 | 1,385.8 | 25.3 | 25.7 | Bodo saltans | 51.6 |
Phaeocystis globosa virus group I | NC_021312.1 | 459.9 | 32 | 33.4 | Phaeocystis globosa | N/A |
Chrysochromulina ericina virus | NC_028094.1 | 473.5 | 25.4 | 26 | Haptolina ericina | N/A |
Tetraselmis viridis virus | KY322437.1 | 668 | 41.2 | 40.6 | Tetraselmis viridis | N/A |
Cotonvirus japonicus | AP024483.1 | 1,476.5 | 25.3 | 26.6 | Acanthamoeba castellanii | 58.35 |
Ascoviridae | Spodoptera frugiperda ascovirus 1a | NC_008361.1 | 156.92 | 49.2 | 50.3 | Spodoptera frugiperda | 36.37 |
Iridoviridae | Lymphocystis disease virus 1 | NC_001824.1 | 102.65 | 29.1 | 29.2 | Osmeridae (Hypomesus transpacificus **) | 44.5 |
Percichthyidae (Maccullochella peelii **) | 40.5 |
Percidae (Etheostoma cragini **) | 40.5 |
Sparus aurata | 41.94 |
Centrarchidae | N/A |
Pleuronectidae | N/A |
Pleuronectoidei | N/A |
Soleidae (Brachirus orientalis **) | 39 |
Clupeidae (Alosa alosa**) | 42.5 |
Frog virus 3 | NC_005946.1 | 105.9 | 55.1 | 57.1 | Notophthalmus viridescens | N/A |
Lithobates pipiens | N/A |
Dryophytes versicolor | N/A |
Lithobates sylvaticus | N/A |
Oophaga pumilio | 26.9 |
Invertebrate iridescent virus 3 | NC_008187 | 191.1 | 47.9 | 50.4 | Mosquitoes (Aedes taeniorhynchus **) | N/A |
Decapod iridescent virus 1 | MF599468.1 | 165.8 | 34.6 | 29.8 | Penaeus vannamei | 36.5 |
Invertebrate iridescent virus 6 | NC_003038.1 | 212.4 | 28.6 | 35 | Acheta domesticus | 38.5 |
Gryllus bimaculatus | 38.6 |
Spodoptera frugiperda | 36.37 |
Choristoneura fumiferana | 38.1 |
Chilo suppressalis | 35.7 |
Marseilleviridae | Marseillevirus marseillevirus | NC_013756.1 | 368.4 | 44.7 | 45 | Acanthamoeba castellanii | 58.35 |
Acanthamoeba polyphaga | 58.7 |
Brazilian marseillevirus | KT752522 | 362.2 | 43.3 | 43.9 | Acanthamoeba castellanii | 58.35 |
Golden marseillevirus | NC_031465.1 | 360.6 | 43.1 | 43.7 | Acanthamoeba castellanii | 58.35 |
Limnoperna fortunei | 33.6 |
Lausannevirus | NC_015326.1 | 346.7 | 42.9 | 43 | Acanthamoeba castellanii | 58.35 |
Tunisvirus | NC_038511.1 | 380 | 43 | 43.6 | Acanthamoeba castellanii | 58.35 |
Asfarviridae | African swine fever virus | NC_001659.2 | 189.3 | 38.6 | 38.7 | Chlorocebus aethiops | 40.9 |
Sus scrofa | 41.6 |
Phacochoerus africanus | 40.48 |
Extended Asfarviridae | Pacmanvirus S19 | MZ440852.1 | 418.5 | 33.2 | 34.3 | Acanthamoeba castellanii | 58.35 |
Faustovirus e12 | KJ614390 | 465.9 | 37.7 | 36.9 | Vermamoeba vermiformis | 42.48 |
Kaumoebavirus | MT334784.1 | 362.5 | 43.1 | 43.4 | Vermamoeba vermiformis | 42.48 |
Poxviridae | Fowlpox virus | NC_002188.1 | 291 | 31.2 | 31.3 | Gallus gallus | 42 |
Meleagris gallopavo | 41.1 |
Sheeppox virus | NC_004002.1 | 149.9 | 25 | 25.3 | Meleagris gallopavo | 41.1 |
Capra hircus | 42.1 |
Ovis aries | 43.2 |
Yokapox virus | NC_015960.1 | 175.7 | 25.6 | 26.2 | Mus musculus | 41.95 |
Mule deerpox virus | AY689437.1 | 170.5 | 27 | 27.6 | Odocoileus virginianus | 41.5 |
Nile crocodilepox virus | NC_008030.1 | 190 | 61.9 | 62.4 | Crocodylus niloticus | N/A |
Crocodylus porosus | 43.85 |
Crocodylus johnsoni | N/A |
Myxoma virus | NC_001132.2 | 162.4 | 43.5 | 43.5 | Oryctolagus cuniculus | 43.97 |
Eastern kangaroopox virus | MF467281.1 | 170.1 | 54 | 54.3 | Macropus giganteus (eastern gray kangaroo*) | 44.3 |
Molluscum contagiosum virus | MH646551.1 | 192.1 | 64.3 | 63.9 | Homo sapiens | 40.4 |
Sea otterpox virus | NC_037656.1 | 127.8 | 31.3 | 31.6 | Enhydra lutris | N/A |
Vaccinia virus | NC_006998.1 | 182.5 | 33.4 | 34.6 | Homo sapiens | 40.4 |
Bos taurus | 41.92 |
Cotia virus | KM595078.1 | 185.1 | 23.6 | 24.4 | Chlorocebus aethiops | 40.9 |
Mus musculus | 41.95 |
Orf virus | NC_005336.1 | 139.9 | 63.8 | 64.3 | Homo sapiens | 40.4 |
Capra hircus | 42.1 |
Ovis aries | 43.2 |
Pteropox virus | NC_030656.1 | 133.4 | 33.8 | 34 | Pteropus scapulatus | N/A |
Salmon gillpox virus | NC_027707.1 | 241.5 | 37.5 | 37.1 | Salmo salar | N/A |
Squirrelpox virus | NC_022563.1 | 148.8 | 66.7 | 67 | Sciurus vulgaris | 39.26 |
Swinepox virus | NC_003389.1 | 146.4 | 27.4 | 27.8 | Sus scrofa | 41.6 |
Eptesipox virus | NC_035460.1 | 176.6 | 23.6 | 23.9 | Eptesicus fuscus | 43.5 |
Chlorocebus aethiops aethiops | N/A |
Yaba monkey tumor virus | NC_005179.1 | 134.7 | 29.8 | 30.1 | Erythrocebus patas | 41.05 |
Papio hamadryas | 40.9 |
Homo sapiens | 40.4 |
Anomala cuprea entomopoxvirus | NC_023426.1 | 245.7 | 20 | 20.5 | Anomala cuprea | N/A |
Amsacta moorei entomopoxvirus | NC_002520.1 | 232.3 | 17.8 | 18.2 | Lymantria dispar | 38.55 |
Melanoplus sanguinipes entomopoxvirus | NC_001993.1 | 236.1 | 18.3 | 19 | Locusta migratoria | 41 |
Schistocerca gregaria | 42.55 |
Diachasmimorpha longicaudata entomopoxvirus | KR095315.1 | 252.9 | 30.1 | 31 | Diachasmimorpha longicaudata | N/A |
“Pithoviridae” | Pithovirus sibericum | NC_023423.1 | 610 | 35.8 | 40.2 | Acanthamoeba castellanii | 58.35 |
Cedratvirus A11 | NC_032108.1 | 589 | 42.7 | 43 | Acanthamoeba castellanii | 58.35 |
Pithoviridae-like | Orpheovirus | NC_036594.1 | 1,473.5 | 25 | 28.1 | Vermamoeba vermiformis | 42.48 |
Yaraviridae | Yaravirus brasiliensis | MT293574.1 | 44.9 | 58 | 58 | Acanthamoeba castellanii | 58.35 |
"Pandoraviridae" | Pandoravirus quercus | NC_037667.1 | 2,077.2 | 60.7 | 64.4 | Acanthamoeba castellanii | 58.35 |
Unclassified DNA virus | Medusavirus | AP018495.1 | 381.2 | 61.7 | 61.5 | Acanthamoeba castellanii | 58.35 |
Mollivirus sibericum | NC_027867.1 | 651.5 | 60.1 | 60.2 | Acanthamoeba castellanii | 58.35 |
As previously described for different models, sequence composition may vary along the length of genomes, especially between intergenic and intragenic regions (Bernaola-Galván et al., 2004; Bohlin et al., 2017; Vinogradov, 2003; Wen-Hua et al., 2016). In our study, we found that coding sequences and intergenic regions tend to have similar sequence composition, given that whole-genome GC% values were similar to the mean ORF GC% (Table 1). Contrary to what one might expect, this may indicate that there is no differential selective pressure on these distinct parts of the viral genome, or at least none associated with GC%. Nevertheless, some exceptions are worth mentioning, including decapod iridescent virus 1, invertebrate iridescent virus 6, and Pithovirus sibericum, which showed differences between the whole-genome GC% and the mean ORF GC% (ranging from 4.4 to 6.4 percentage points). In addition, codon-usage analysis was performed and compared with the GC% content profiles. A clear relationship was observed between GC% and the use of codons rich in C and G.
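The comparison between whole-genome GC% and mean ORF GC% can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy sequence and ORF coordinates are hypothetical, and the mean ORF GC% is taken here as the unweighted mean of per-ORF values, which may differ from the exact procedure used to build Table 1.

```python
def gc_percent(seq: str) -> float:
    """Return the G+C percentage of a sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = gc + sum(seq.count(base) for base in "AT")
    return 100.0 * gc / acgt if acgt else 0.0


def mean_orf_gc(genome: str, orfs: list[tuple[int, int]]) -> float:
    """Unweighted mean GC% over ORFs given as 0-based, end-exclusive (start, end)."""
    values = [gc_percent(genome[start:end]) for start, end in orfs]
    return sum(values) / len(values)


# Toy genome (hypothetical): two ORFs separated by an AT-rich spacer.
genome = "ATGGCGCCCTAA" + "ATATATAT" + "ATGAAATTTTAG"
orfs = [(0, 12), (20, 32)]

genome_gc = gc_percent(genome)
orf_gc = mean_orf_gc(genome, orfs)

# The difference is expressed in percentage points, as in the text.
print(f"genome {genome_gc:.2f}%, ORF mean {orf_gc:.2f}%, diff {orf_gc - genome_gc:.2f}")
```

Because G+C content is identical on both strands, ORFs annotated on the reverse strand can be sliced from the forward sequence without reverse-complementing.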
Nucleocytoviricota coding sequence (CDS)/ORF GC% variation
The GC% content of the nucleocytovirus coding sequences (CDS)/ORFs assessed spans a notable range, with a minimum of 8.13% (Amsacta moorei entomopoxvirus; CDS NP_065034.1) and a maximum of 83.91% (Orf virus; CDS NP_957782.1), both annotated as hypothetical proteins of representatives of the Poxviridae family (Fig. 1 and supplementary table 1). Poxviridae is therefore the family with the widest GC% range within Nucleocytoviricota, and thus the one with the most influence on the phylum’s GC% profile.
In terms of CDS/ORF GC% variation and mean within viral families, the following ranges were observed: (i) 18.61–64.63% in Phycodnaviridae, with Ectocarpus siliculosus virus 1 and Heterosigma akashiwo virus 1 presenting the overall maximum and minimum GC% content, respectively; (ii) 15.15–69.09% in Iridoviridae, with both frog virus 3 and invertebrate iridescent virus 3 presenting notably higher GC% content than other family members; (iii) 27.88–57.36% in Marseilleviridae, with no representative standing out; (iv) 19.66–62.96% in Asfarviridae (and extended Asfarviridae), with kaumoebavirus and pacmanvirus S19 presenting the overall maximum and minimum GC% content, respectively; (v) 9.18–57.83% in “Pithoviridae” (and pithoviridae-like), with orpheovirus presenting notably lower GC% content; (vi) 9.68–62.55% in Mimiviridae, with Tetraselmis viridis virus and Cafeteria roenbergensis virus presenting the overall maximum and minimum GC% content, respectively; and (vii) 8.13–83.91% in Poxviridae, with the lower GC% values found in Entomopoxvirinae (Figs. 2 and 3).
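Per-family ranges such as those listed above amount to grouping per-CDS GC% values by family and taking the minimum, maximum, and mean. A minimal sketch follows; the extreme values for Poxviridae and Marseilleviridae are the ones reported in the text, while the remaining values and the record layout are hypothetical placeholders.

```python
from collections import defaultdict

# (family, CDS GC%) records. Extremes come from the text; 33.4 and 44.7
# are placeholder values added only so that each family has a third entry.
records = [
    ("Poxviridae", 8.13),
    ("Poxviridae", 83.91),
    ("Poxviridae", 33.4),
    ("Marseilleviridae", 27.88),
    ("Marseilleviridae", 57.36),
    ("Marseilleviridae", 44.7),
]


def family_ranges(rows):
    """Group GC% values by family and summarise min, max, mean, and span."""
    by_family = defaultdict(list)
    for family, gc in rows:
        by_family[family].append(gc)
    return {
        family: {
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
            "span": max(values) - min(values),
        }
        for family, values in by_family.items()
    }


ranges = family_ranges(records)
for family, stats in ranges.items():
    print(f"{family}: {stats['min']:.2f}-{stats['max']:.2f}% (span {stats['span']:.2f})")
```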
When looking specifically at CDS/ORF GC% variation within individual viral genomes, we observed how GC% varies among coding sequences along the length of each genome, which may allow the identification of possible hotspots for HGT or duplication events (Fig. 4). Interestingly, we also observed no family-specific pattern of GC% variation among the coding sequences of the nucleocytoviruses assessed (supplementary Figs. 1–7). Thus, representative genomes of each family were selected to illustrate the main aspects of ORF GC% distribution along the genome (Fig. 4).
To assess whether the analyzed groups differed statistically, we performed the Kruskal-Wallis test (p < 0.05) followed by Dunn’s multiple comparison test for pairwise comparisons between isolates within a family, subfamily, or group. This evaluation showed significant differences among isolates within viral families, subfamilies, and groups (supplementary Fig. 8). The result held even for isolates of the Marseilleviridae, which showed the smallest GC% variation among all nucleocytovirus families assessed, indicating that the five genomes analyzed differ significantly (p < 0.0001) in GC% content (supplementary Fig. 8).
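For readers unfamiliar with the test, the Kruskal-Wallis H statistic can be computed from pooled ranks as sketched below. This is a plain-Python illustration with toy numbers, not the software used in the study; it omits the p-value (obtained from a chi-square distribution with k − 1 degrees of freedom) and Dunn’s post hoc comparisons. In practice a library routine such as `scipy.stats.kruskal` would normally be used.

```python
from itertools import chain


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic using average ranks and the tie correction."""
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)

    # Assign each distinct value the average of the 1-based ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                       # number of tied observations
        rank_of[pooled[i]] = (i + 1 + j) / 2
        tie_term += t**3 - t
        i = j

    rank_sum_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    correction = 1 - tie_term / (n**3 - n)  # equals 1 when there are no ties
    return h / correction


# Toy per-ORF GC% values for three hypothetical isolates.
h_stat = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h_stat, 3))
```

The function raises a division error if every observation is identical, since the tie correction is then zero; real data would need that case handled explicitly.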
Virus-host GC% similarities and HGT analysis
Virus-host coevolution is a key factor in the sharing of features between the genomes of viruses and their hosts. Studies have demonstrated how correlations in sequence composition and nucleotide frequency between viruses and hosts can be associated with the dynamics of viral adaptability, or even be used as reliable metrics for inferring virus-host linkage (Lobo et al., 2009; Monier et al., 2007; Roux et al., 2015). When comparing the GC% content of nucleocytoviruses to that of the host organisms included in this study, contrary to what one might expect, values were not similar in most cases (Table 1). The few exceptions were: Ectocarpus siliculosus virus 1 (51.7%) and host Ectocarpus siliculosus (53.49%); decapod iridescent virus 1 (34.6%) and host Penaeus vannamei (36.5%); invertebrate iridescent virus 6 (35% ORF GC mean) and hosts Spodoptera frugiperda (36.37%) and Chilo suppressalis (35.7%); African swine fever virus (38.6%) and hosts Chlorocebus aethiops (40.9%), Sus scrofa (41.6%), and Phacochoerus africanus (40.48%); kaumoebavirus (43.1%) and host Vermamoeba vermiformis (42.48%); myxoma virus (43.5%) and host Oryctolagus cuniculus (43.97%); and Yaravirus (58%) and host Acanthamoeba castellanii (58.35%). Although few patterns of GC% similarity between viruses and hosts were observed in this dataset, it is important to consider that many of these viruses can infect more than one host organism, which would imply different selective pressures on sequence composition. Likewise, many host organisms did not have complete genome sequences available for GC% calculation, so their GC% content cannot currently be compared with that of their viruses.
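The virus-host comparison can be expressed as an absolute difference in percentage points. The sketch below uses a handful of rows from Table 1; the 3-point cutoff is an assumption chosen only to make the example concrete, as no numeric threshold for "similar" is stated in the text.

```python
# (virus, viral genomic GC%, host, host genomic GC%) from Table 1.
# None marks hosts whose genomic GC% is listed as N/A.
pairs = [
    ("Ectocarpus siliculosus virus 1", 51.7, "Ectocarpus siliculosus", 53.49),
    ("Cafeteria roenbergensis virus", 23.3, "Cafeteria roenbergensis", 70.44),
    ("Myxoma virus", 43.5, "Oryctolagus cuniculus", 43.97),
    ("Yaravirus brasiliensis", 58.0, "Acanthamoeba castellanii", 58.35),
    ("Sea otterpox virus", 31.3, "Enhydra lutris", None),
]

MAX_DIFF = 3.0  # assumed cutoff in percentage points (not from the study)


def similar_pairs(rows, max_diff=MAX_DIFF):
    """Return (virus, host, |difference|) for pairs within max_diff points."""
    selected = []
    for virus, virus_gc, host, host_gc in rows:
        if host_gc is None:          # host GC% unavailable, skip the pair
            continue
        diff = abs(virus_gc - host_gc)
        if diff <= max_diff:
            selected.append((virus, host, round(diff, 2)))
    return selected


matches = similar_pairs(pairs)
for virus, host, diff in matches:
    print(f"{virus} vs {host}: {diff} points")
```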
Although little evidence of direct correlation was observed between the GC% content of the viruses and hosts analyzed in this work, the influence of host genomic characteristics was not ruled out as a further explanation for the CDS/ORF GC% variation observed in viral genomes. Genome expansion through acquisition of genes from host organisms by HGT has been broadly debated for nucleocytoviruses, and the literature has shown that different viral families within Nucleocytoviricota differ in their tendency to undergo HGT events (Filée, 2009). Overall, Nucleocytoviricota viruses are considered to have some propensity to acquire host genes by HGT, and such events may play an important role in shaping the content diversity and size of these viruses’ genomes (Filée, 2009). In fact, because NCLDVs encode homologs of conserved genes commonly found across the domains Bacteria, Archaea, and Eukarya, it has been hypothesized that the phylum could even be considered a fourth domain of life, with a common ancient ancestor from which these shared sequences would have been inherited. However, this hypothesis has not been supported by multiple phylogenetic analyses, which instead reinforce the view of multiple independent acquisitions from various cellular lineages (Mönttinen et al., 2021; Raoult et al., 2004; Williams et al., 2011; Woese, 1998).
Considering this, HGT was hypothesized as one of the possible causes of gene GC% variation in different nucleocytovirus genomes (Maumus and Blanc, 2016), especially in cases where groups of neighboring genes presented GC% content distant from the overall viral GC% mean. For instance, this was observed for Emiliania huxleyi virus 86 (Ehv86) (Fig. 4I), in which a cluster of CDS/ORFs with discrepant GC% values (positions 290 to 315) was identified. Considering that HGT of entire metabolic pathways has been previously described between Ehv86 and its host Emiliania huxleyi (Monier et al., 2009), we explored the possibility of identifying this content in the protist’s genome using BLAST alignment (data not shown). However, no hits were found for any of these CDS/ORFs when aligned to the host genome, leaving open the questions of why these sequences are so different in GC% composition and how they originated. Likewise, a discrepant GC% CDS/ORF cluster was identified within the genome of Chrysochromulina ericina virus (ChreV) (positions 422 to 442) (Fig. 4H), and this observation can be linked to ChreV’s remarkable genomic characteristics, such as abundant mobile genetic elements, complex gene evolution, and host gene acquisition, among other features (Gallot-Lavallée et al., 2017).
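One simple way to flag clusters of neighboring ORFs with discrepant GC% is to mark ORFs whose GC% lies far from the genome-wide ORF mean and then keep runs of consecutive marked ORFs. The sketch below is a hypothetical heuristic, not the procedure used to identify the clusters discussed above; the z-score cutoff and minimum run length are arbitrary assumptions.

```python
from statistics import mean, pstdev


def discrepant_runs(gc_values, z_cutoff=1.5, min_run=3):
    """Return inclusive (start, end) index pairs of runs of consecutive ORFs
    whose GC% deviates from the mean by more than z_cutoff standard deviations.
    Both parameters are illustrative assumptions."""
    mu = mean(gc_values)
    sd = pstdev(gc_values)
    if sd == 0:
        return []
    flagged = [abs(value - mu) / sd > z_cutoff for value in gc_values]

    runs, start = [], None
    # A trailing False acts as a sentinel that closes a run ending at the last ORF.
    for index, is_outlier in enumerate(flagged + [False]):
        if is_outlier and start is None:
            start = index
        elif not is_outlier and start is not None:
            if index - start >= min_run:
                runs.append((start, index - 1))
            start = None
    return runs


# Toy profile: 24 ORFs in genome order, with a GC-rich block at indices 10-13.
profile = [40.0] * 10 + [60.0] * 4 + [40.0] * 10
print(discrepant_runs(profile))
```

Since a large cluster inflates the standard deviation itself, a more robust variant might use the median and median absolute deviation; this version also does not distinguish GC-rich from GC-poor runs.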
After extended data curation, we carefully selected 30 genes from our target CDS/ORF dataset, based on the three maximum and three minimum GC% values observed for each viral genome (supplementary table 1). We then proceeded to phylogeny assessment and evaluation of HGT events, according to the methodology described by Irwin et al. (2022). Among all analyzed targets, 14 potential HGT events were identified, such as the “Histone H2B/H2A fusion protein” (AMQ10945.1) of the Brazilian marseillevirus (Supplementary Figs. 9–35; Fig. 5A). Interestingly, a probable HGT from virus to host was observed (Fig. 5B) when evaluating the targeted sequence of the “Papain-like cysteine peptidase” (YP_009310305.1) of Golden marseillevirus. Moreover, it is important to point out that these findings agree with previously described HGT events in Marseilleviridae members (Bertelli and Greub, 2012; Boyer et al., 2009).
Lastly, other analyzed sequences were considered inconclusive or did not indicate potential HGT events, as was the case for the targets “CD47-like protein” (NP_659700.1) of Sheeppox virus (Fig. 5C), “Putative replication factor and/or DNA binding/packing protein” (NP_078747.1) of Lymphocystis disease virus 1, and “NAD-dependent DNA ligase” (ATE87064.1) of Decapod iridescent virus 1 (both inconclusive with respect to HGT inference). This does not exclude the possibility that such events could be observed more frequently when assessing all the identified outliers for GC% content variation among the CDS/ORFs of viral genomes, and other viral isolates, in future studies. Therefore, HGT events remain a possible cause of GC% variation within viral genomes, yet to be further explored.
Potential paralogs and gene duplication events
Another hypothesis for the gene GC% variation observed in this study was the presence of duplicated ORFs. In this scenario, if a given ORF had one or more copies along the virus genome, one of the copies would be expected to remain conserved, whereas the other copies could evolve under different selective pressures (Gabaldón and Koonin, 2013; Gao et al., 2017; He and Zhang, 2005; Magadum et al., 2013; Shackelton and Holmes, 2004), likely allowing for GC% content variation. Gene duplication is one of the currently known ways in which genome evolution can be accelerated, allowing for the emergence of new genes with different functions (Gabaldón and Koonin, 2013; Magadum et al., 2013). Moreover, gene duplication has already been described as a major component of genome expansion and genetic diversity among giant virus genomes (Machado et al., 2023).
Further evidence that gene duplication could be associated with GC% variation comes from a remarkable characteristic of the Poxviridae family, the inverted terminal repeats (ITRs). ITRs are terminal regions of the genome that contain inverted duplicated sequences and are known to harbor most of the variable genetic content of poxvirus genomes, whereas conserved genes are mainly found within the central genome region (Brennan et al., 2023; Wittek, 1982; Wittek et al., 1978). This is consistent with most of the poxvirus gene GC% variation along the genome observed in the present work (supplementary Fig. 6).
Among all selected ORFs of maximum and minimum GC% analyzed (n = 336), we identified 60 genes with at least one potential duplication, according to the threshold used of 40% coverage and 40% identity. Of these ORFs, 46 presented 2 potential copies, 5 presented 3 potential copies, and 9 presented 4 or more potential copies. The maximum number of probable copies identified for an ORF was 17, for the targeted “Putative ankyrin repeat protein” (AMK61738.1) of Samba virus. The majority (70%) of the identified ORFs are annotated as hypothetical proteins, whereas the remaining 30% are miscellaneous. Underrepresented miscellaneous groups of probable paralogs were “MHC-like TNF binding protein” (3.3%), “EFc gene family protein” (3.3%), “Putative ankyrin repeat protein” (3.3%), “Chemokine-binding protein” (1.7%), and “Collagen and repeat containing protein” (1.7%) (Supplementary table 1).
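The 40% coverage and 40% identity thresholds can be applied as a simple filter over pairwise alignment hits. The sketch below is illustrative only: the hit records are hypothetical, coverage is assumed to mean aligned length divided by query length, the thresholds are assumed to be inclusive, and self-hits are discarded. The actual search tool and coverage definition used in the study may differ.

```python
from collections import defaultdict

MIN_IDENTITY = 40.0  # percent identity threshold stated in the text
MIN_COVERAGE = 40.0  # percent coverage threshold stated in the text

# Hypothetical hits within one genome:
# (query ORF, subject ORF, % identity, aligned query length, query length)
hits = [
    ("orf1", "orf1", 100.0, 300, 300),  # self-hit, ignored
    ("orf1", "orf7", 62.5, 240, 300),   # passes both thresholds
    ("orf1", "orf9", 45.0, 90, 300),    # coverage 30%, rejected
    ("orf2", "orf5", 35.0, 280, 300),   # identity too low, rejected
    ("orf3", "orf4", 40.0, 120, 300),   # exactly at both thresholds, kept
]


def candidate_paralogs(rows):
    """Map each query ORF to the set of other ORFs passing both thresholds."""
    result = defaultdict(set)
    for query, subject, identity, aligned_len, query_len in rows:
        if query == subject:
            continue
        coverage = 100.0 * aligned_len / query_len
        if identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE:
            result[query].add(subject)
    return dict(result)


paralogs = candidate_paralogs(hits)
for query, subjects in paralogs.items():
    print(query, "->", sorted(subjects))
```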
Of the 60 identified ORFs, 55% belong to the family Poxviridae, followed by Mimiviridae (13.3%), Phycodnaviridae (8.3%), Iridoviridae and “Pithoviridae” (both with 6.7%), and Marseilleviridae and Asfarviridae (both with 5%) (supplementary table 1). It is worth noting that a higher number of duplicated genes is to be expected in poxvirus genomes because of the ITRs. Moreover, the identification of probable gene duplication events based on GC% variation among genomes has yielded interesting targets for future ortholog studies, since further investigation of gene duplication and orthology in nucleocytovirus genomes is needed to better understand these events.
Perspectives in sequence composition studies of viral genomes
When studying genomes and sequence composition, multiple approaches can be considered, one of which is the evaluation of GC% content. However, the proportion of guanines and cytosines in a DNA/RNA sequence is just one of the many nucleotide ratios that can be measured. A different metric, still related to G + C content, is the frequency of CpG dinucleotides, which differs from GC% content in that it measures specifically cytosines followed by guanines on the same strand and linked by a phosphate (Cytosine-Phosphate-Guanine). The CpG dinucleotide is known to be associated with gene regulation through methylation, cancer-inducing factors, and even increased virulence, since certain antiviral proteins specifically bind to CpG (Bergbauer et al., 2010; Fernandez et al., 2009; Willis and Granoff, 1980; Xia, 2020).
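To make the distinction between GC% and CpG content concrete, the sketch below computes a commonly used observed/expected CpG ratio, (number of CpG × sequence length) / (number of C × number of G). This is one standard normalization chosen here for illustration, not a metric used in the present analysis, and the sequences are toy examples.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # Count every position where a C is immediately followed by a G.
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    return cpg * n / (c_count * g_count)


# Both toy sequences are 100% G+C, yet their CpG content differs completely.
print(cpg_obs_exp("CGCG"))
print(cpg_obs_exp("GGCC"))
```

A ratio near 1 indicates that CpG occurs about as often as expected from the base composition alone, while values well below 1 indicate CpG depletion.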
Besides CpG, the composition of other dinucleotides in genomic sequences can provide important information about viruses and host organisms. In fact, dinucleotide relative abundances can reflect the chemistry of dinucleotide stacking energies and base-step conformational tendencies of an organism, as well as species-specific properties of DNA modification, replication, and repair mechanisms (Karlin and Burge, 1991). Beyond dinucleotides, trinucleotide and tetranucleotide composition should also be considered relevant metrics for retrieving biological information from the genome sequences of viruses and hosts (Perry and Beiko, 2010; Pride et al., 2006).
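Di-, tri-, and tetranucleotide compositions are all instances of k-mer frequency profiles. A minimal sketch of how such a profile could be computed is given below; the function name and the choice to skip windows containing ambiguous bases are assumptions made for illustration.

```python
from collections import Counter


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of each overlapping k-mer made only of A, C, G, T."""
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid   # skip windows with ambiguous bases
    )
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()} if total else {}


# Toy sequence: five overlapping tetranucleotide windows, two of them "ACGT".
tetra = kmer_frequencies("ACGTACGT", 4)
print(tetra)
```

Profiles of this kind from a virus and a candidate host can then be compared with any vector distance, which is the general idea behind composition-based virus-host linkage.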
Furthermore, considering how assessing the GC% content profile of the phylum Nucleocytoviricota has provided an insightful view of large viral genomes, we consider other metrics of sequence composition analysis to be promising, especially since a plethora of information remains to be unveiled from nucleocytoviruses’ sequence composition features.