Phages constitute the most abundant group of biological agents on the planet, with conservative estimates suggesting numbers of 10^30 around the globe1. As a consequence of their sheer abundance, they have significant impacts across all microbial systems and are therefore of utmost importance in the study of the microbial world.
Phage impacts depend on the mode of bacteriophage infection, which can be lytic, lysogenic, or chronic2. Lytic infection consists of viral particle production followed by host lysis to release the viral particles, whereas the lysogenic mode of infection consists of phage integration into the host cell genome. The integrated phage genome is known as a prophage; it is replicated along with the host genome without causing lysis of the host because lytic functions are repressed2,3. Phages capable of lysogeny are known as temperate viruses and can switch between the lytic and lysogenic cycles. The switch from the lysogenic to the lytic cycle is known as induction and can be triggered by different factors, including antibiotics4, UV rays5, reactive oxygen species6, changes in temperature7, changes in pH8, and bacterial metabolites and products of host physiology9; it can also occur spontaneously10. In addition, in polylysogeny, in which multiple prophages reside in a single host, prophages can encode noncanonical induction pathways to outcompete each other11. The switch between modes of infection leads to changes in microbial communities that extend beyond virus-host interactions. Prophages are commonly present within the genomes of bacteria and archaea, and in some cases they can make up as much as 20% of the genome12. In addition, most bacteria are polylysogens11; a well-studied case is the E. coli O157:H7 strain Sakai, which has 18 prophages13. Typically, however, a prophage genome represents only about 1% of the host's genome14. Finally, in the chronic mode of infection, viral particles are continuously produced without lysis of the host2.
Prophages can impact their respective microbial systems in different ways. They can manipulate the host’s gene expression and function, affecting the host’s cellular processes; they can also alter the host’s physiological functions or introduce new functions2. At a broader level, prophages impact the structure, function, ecology, and evolution of microbial systems2,10,15,16,17. Specifically, by lysing competing microbes, prophages can prompt shifts in community dynamics2. In host-associated systems, such as the human gut microbiome, they have the potential to impact the physiology and health of the animal or human host10,18. For example, prophages can change the genomes and phenotypes of their bacterial hosts, which can lead to the emergence of strains and to diversification; this has implications for bacterial virulence and antibiotic resistance19, which in turn could affect the overall health of hosts such as humans.
Phages can contain auxiliary metabolic genes (AMGs), which are involved in numerous metabolic processes and can alter host metabolism. AMGs are host-derived genes that are normally acquired by the phage through recombination and that can be expressed during infection to improve phage fitness20. Their presence is highly relevant because these phage-encoded genes are involved in host metabolism and respond rapidly to environmental cues21. Additionally, through the incorporation of AMGs, prophages have the potential to influence ecosystem biogeochemistry, thereby impacting global biochemical processes. However, much remains to be unraveled before the extent of their impact is fully understood22. Overall, our knowledge of prophages, their AMGs, and their relationships with their hosts remains poor. Although certain relationships are known, such as prophages being more frequent in pathogenic and fast-growing bacteria23 and viral lifestyle being a major driver of AMG composition24, a more universal understanding of such relationships is still needed.
Current limitations in prophage identification arise from the fact that viruses are polyphyletic, meaning they have multiple evolutionary origins rather than a single common ancestor. Consequently, they lack universal markers that would ease their identification in datasets, and searches that rely on databases identify only a few of them, which limits the discovery of novel viruses. Metagenomic studies have provided extensive insights into viruses, giving rise to the concept of 'viral dark matter.' This term refers to the unknown identities and functions of viruses and their proteins, the majority of which have no known function12,25. Chevallereau et al. defined 'viral dark matter' as viral species that have not been characterized but whose existence has been revealed by metagenomic sequencing, or phage genes that have no assigned functions3. Surprisingly, in certain datasets, 'viral dark matter' may constitute up to 90% of sequences. Studies have identified three factors contributing to the existence of 'viral dark matter': the divergence and length of virus sequences, limitations in alignment-based classification, and the inadequate representation of viruses in reference databases26. As a result, a considerable number of viruses resist taxonomic classification or association with a bacterial or archaeal host, which underscores the need to expand our knowledge in order to fully comprehend their potential effects on their respective microbial systems.
Several methods have been developed to circumvent the limitations of relying exclusively on viral hallmark genes and homology-based methods, and many take a combined approach that complements sequence similarity comparisons. Notable examples include machine learning and the search for composition patterns such as k-mer frequency. Tools that use a combined approach include DEPhT14, geNomad27, PHASTEST28, PhiSpy19, VIBRANT20, VirSorter229, and VirFinder30. Although the methods used by these tools have been effective in identifying phages, much remains undiscovered. Further discovery can be facilitated by the creation of standardized databases that drive the identification of novel phage signatures.
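To make the idea of a composition pattern concrete, the following is a minimal sketch of a k-mer frequency profile, assuming plain DNA strings as input. The function names `kmer_frequencies` and `profile_distance`, the default k = 4, and the use of Euclidean distance are illustrative assumptions; this is not the method implemented by any of the tools listed above.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the normalized frequency of every DNA k-mer in `sequence`.

    Hypothetical helper for illustration only. Windows containing
    characters other than A, C, G, or T (e.g. N) are ignored.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all 4**k possible k-mers so every profile has the same keys.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two k-mer profiles with identical keys."""
    return sum((profile_a[kmer] - profile_b[kmer]) ** 2 for kmer in profile_a) ** 0.5


if __name__ == "__main__":
    # Toy sequences, not real genomic data.
    region = kmer_frequencies("ACGTACGTTTGACA", k=2)
    flank = kmer_frequencies("GGGCGCGCCGGGCC", k=2)
    print(profile_distance(region, flank))
```

In principle, a region whose profile differs markedly from that of the surrounding host genome could be flagged as a candidate for closer inspection; in practice such a signal would be combined with other evidence, as in the combined approaches described above.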
Despite the numerous methods and tools that have been developed for the study of phages, the number of databases lags behind. For the most part, current databases are associated with software, and their purpose is to serve as a reference for homology-based comparisons. Only a small subset of databases exist as catalogues that can be used to explore the phages they contain, and of this subset only a few are comprehensive, because several focus on a single microbial system, such as the gut microbiome or marine environments. Examples of comprehensive prophage studies or datasets focused on specific microbial systems include the Marine Temperate Viral Genome Dataset (MTVGD)31 and Human Gut-derived Bacterial Prophages32. Examples of comprehensive databases that contain prophage sequences are the Microbe Versus Phage database33, intended for exploring the relationships between phages and their hosts; IMG/VR v434, which contains a significant number of prophages and metadata such as environment; and PhageScope35, a bacteriophage database encompassing temperate and lytic viruses.
Here, we present Prophage-DB, a comprehensive database of prophages that will serve as a standardized resource to facilitate viral diversity and ecology studies.