We carried out a comparative genomic analysis to characterize the prophages of Salmonella enterica. Both dsDNA and ssDNA viruses were represented in our collection of 142 phage genomes. The four ssDNA phages in our collection belonged to the family Inoviridae. In contrast, the dsDNA phages were spread over four of the five known families of the order Caudovirales, i.e., Myoviridae, Podoviridae, Siphoviridae and the rare Ackermannviridae. Within these four families, a total of 27 different phage genera were represented (Table 1). Earlier studies using core gene analysis indicated that Salmonella phages could be classified into five groups, namely P27-like, P2-like, lambdoid, P22-like, and T7-like [9, 10], all of which were present in our prophage collection. From our classification, we identified two new members of Cluster D, namely ST64T and ST104, which are related to the previously described P22-like group. We have described an additional 13 members in this group (Table S1). Similarly, we detected the P2-like PSP3 phages and were able to cluster them with an additional 12 dsDNA phages to make up Cluster K. In addition, three lambdoid phages, namely Gifsy 1, Gifsy 2 and lambda, were assigned to the lambdoid phage group, Cluster M (Table S1). This work extends published observations by identifying additional members of previously described, albeit small, groupings, and achieves a more discriminative and extensive characterization of Salmonella prophage sequences.
An earlier genomic comparison of tailed phages reported 337 fully sequenced lytic and temperate phages across the entire Enterobacteriaceae family [28]; based on this observation, a large number of diverse phages could potentially infect Salmonella. We observed the same phages in different bacteria, but whether this results from a shared location or from relatedness among hosts cannot be ascertained at this time. It is possible that both phylogeny (the relatedness among hosts, such as belonging to the same family) and occupation of the same niche (e.g., the gastrointestinal tract) facilitate the presence of the same prophages in different hosts. As examples, we observed phages X29 and KSF-1phi in Salmonella, which were first found in Vibrio cholerae according to Virus-Host DB [TAX:666; https://www.genome.jp/virushostdb/]. On the other hand, 38 other phages known to infect Vibrio cholerae have not been reported in S. enterica; given that the two organisms belong to different orders, this suggests that host phylogeny rather than co-location plays the primary role in determining whether prophages are shared among hosts. Nevertheless, it is difficult to entirely discount the role of a shared niche, since the virus still has to encounter the new host before infection can take place. Furthermore, 33 phages analyzed here were observed to have originated from Escherichia coli strains [TAX:562] (Table S1). Enterobacteria phage fiAA91-ss is also able to infect at least two more hosts, namely Shigella sonnei [TAX:624] and Escherichia coli O157:H7 [TAX:83334]. Haemophilus phage Aaphi23 can also infect Aggregatibacter actinomycetemcomitans [TAX:714] and Haemophilus [TAX:724]. The species A. actinomycetemcomitans was renamed Haemophilus actinomycetemcomitans by Potts et al. (1985) [38].
Based on our observations, studies of phage host range should not be restricted to a specific species but should include as many different host genera as possible to capture all available information, even if the focus is a particular host species. This will provide a broader perspective on the distribution of phages and help clarify how they contribute to the evolution of the host.
The occurrence of the same phage sequences in different hosts may also imply horizontal viral gene transfer among hosts belonging to different genera. Genome clustering facilitates the identification of genes that are in greatest genetic flux and are more likely to have been exchanged horizontally during a relatively recent evolutionary time. Such viral sequence exchanges may help a phage increase its fitness to invade a new host, and evade selective pressure such as anti-phage defense mechanisms [11]. Given the biological arms race between bacteria and phages, and in order to thrive in most environments, phages have evolved multiple tactics to avoid, circumvent or subvert bacterial anti-phage mechanisms [21]. Ironically, these viral sequences once established in Salmonella may help the host to thrive in specific ecological niches, including the gut [39].
Diverse phage genomes were identified in our Salmonella phage collection. As shown in Fig. 2, the largest number of matching prophages were named after the genus Escherichia (n=53), while Salmonella ranked second (n=34). Regarding the lineage of their original known hosts, three phyla (Firmicutes, Proteobacteria and Cyanobacteria), four classes (Bacilli, Betaproteobacteria, Alphaproteobacteria and Gammaproteobacteria) and 24 unique genera could be identified (Table 1). Such a wide host span provides further evidence of the diversity of the Salmonella prophages analyzed in this study. In a study of prophages integrated in a single host species, Mycobacterium smegmatis, a threshold of 50% nucleotide identity was used for genome cluster assignment [24]. The threshold was slightly reduced (45%) for clustering Pseudomonas phages, because phages infecting a genus would be expected to show greater variation in genome sequence than those infecting a single species [23]. Among the 56 phage clusters reported for the Enterobacteriaceae family, sequence similarity between clusters was substantially lower [28], indicating a higher degree of variation and justifying a lower threshold of nucleotide identity for certain clusters of Salmonella phages, a large proportion of which may infect or have previously infected other hosts.
It should be noted that nucleotide identity is not the only parameter for assessing genome properties, because thousands of homologous proteins do not produce significant nucleotide alignments yet are clearly homologous based on statistically significant protein structural similarity or strong sequence similarity to an intermediate sequence [40]. Thus, there may not be a linear relationship between sequence identity and function [41]. In our set of phage genomes, Clusters B, L and M showed a lower pairwise ANI of about 41%, whereas the other clusters, including Clusters E (59%), F (75%) and J (57%), displayed higher nucleotide identity (Table S3). Their assignment to each of these clusters was supported by the results of dot plot analysis, Kalign genome alignment and gene content analysis. For instance, the dot plot (Fig. 4) and Kalign analyses grouped the members of Clusters B, G and N, even though some of their respective nucleotide identities were as low as 40.7%, 42.2% and 42.3% (Fig. 4 and Figure S1). A similar pattern was observed for Cluster L, whose members belong to the same P2 virus group yet show a nucleotide identity of 41.3%. Differences in the output of the different tools should not be surprising given their distinct underlying algorithms. While Kalign focuses more on analyzing larger genomes in general, MUMmer focuses more on identifying similar DNA fragments. Despite the high degree of diversity in our prophage collection, we were still able to cluster related isolates using congruent results from at least two bioinformatics analyses.
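The consensus criterion described above (accepting a grouping only when at least two analyses agree) can be sketched in Python. This is a minimal illustration and not the procedure used in this study: the genome names, the per-method pair lists and the function name cluster_by_consensus are hypothetical, and each method is assumed to have been reduced beforehand to a list of genome pairs that it considers related.

```python
from collections import Counter

def cluster_by_consensus(genomes, method_pairs, min_support=2):
    """Group genomes whose pairing is supported by >= min_support methods.

    genomes: iterable of genome names (every name used in a pair must be listed).
    method_pairs: dict mapping a method name to a list of (genome, genome) pairs.
    Returns (clusters, singletons), both sorted for reproducible output.
    """
    # Count, for each unordered pair, how many methods report it (once per method).
    support = Counter()
    for pairs in method_pairs.values():
        unique = {frozenset(p) for p in pairs}
        support.update(k for k in unique if len(k) == 2)

    # Union-find: merge genomes linked by a sufficiently supported pair.
    parent = {g: g for g in genomes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for pair, n in support.items():
        if n >= min_support:
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for g in genomes:
        groups.setdefault(find(g), []).append(g)
    clusters = sorted(sorted(m) for m in groups.values() if len(m) > 1)
    singletons = sorted(m[0] for m in groups.values() if len(m) == 1)
    return clusters, singletons

# Hypothetical toy input: names and pairs are invented for illustration only.
GENOMES = ["phiA", "phiB", "phiC", "phiD", "phiE"]
METHOD_PAIRS = {
    "ani": [("phiA", "phiB")],
    "dotplot": [("phiA", "phiB"), ("phiB", "phiC")],
    "gene_content": [("phiB", "phiC"), ("phiD", "phiE")],
}

if __name__ == "__main__":
    clusters, singletons = cluster_by_consensus(GENOMES, METHOD_PAIRS)
    print("clusters:", clusters)
    print("singletons:", singletons)
```

In this toy input, phiD and phiE are linked by only one method and therefore remain singletons, which mirrors how a low nucleotide identity alone would neither establish nor rule out cluster membership.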
The genome size ranges of the prophages documented for the different bacterial genera are fairly similar: Salmonella (6.4 - 358.7 kb), Pseudomonas (3.0 - 316.0 kb), Staphylococcus (15.6 - 138.7 kb), Gordonia (17.1 - 103.4 kb), Bacillus (14.3 - 497.5 kb) and Mycobacterium (41.9 - 164.6 kb). The ranges of GC content showed less overlap: Salmonella (35.5 to 65.4%), Pseudomonas (37.0 to 66.0%), Staphylococcus (29.3 to 38.0%), Gordonia (47.0 to 68.8%), Bacillus (29.9 to 49.9%) and Mycobacterium (56.3 to 69.1%) [23-27]. Salmonella and Pseudomonas both belong to the class Gammaproteobacteria, and their phages share very similar ranges of genome size and GC content. Despite the similarities between the phages of Pseudomonas and Salmonella, the former appear to display a better clustering pattern (fewer singletons), with 100 out of 130 phages grouped [23] compared to 90 out of 142 Salmonella phages, leaving 52 singletons. However, as the Pseudomonas bacteriophages were collected using only “Pseudomonas” as the host term for the database search [23], the set most likely did not represent the full complement of viruses capable of infecting Pseudomonas and integrating into its genome, and would have excluded bacteriophages of this group that were first found or described in another bacterial host. We expect that more diverse prophage patterns would be obtained for Pseudomonas and other bacterial hosts if a more comprehensive search of bacterial genomes were carried out with a tool such as PHASTER [34].
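To make the range comparison above concrete, the overlap between two reported ranges can be expressed as a single number. The sketch below uses the ranges quoted in the text together with an interval Jaccard index (shared length divided by combined length). This index is an illustrative choice and not a statistic taken from refs [23-27]; it reflects only the extremes of each range, not the distribution of genomes within it, and the function names are hypothetical.

```python
# Reported (min, max) ranges as quoted in the text.
GENOME_SIZE_KB = {
    "Salmonella": (6.4, 358.7),
    "Pseudomonas": (3.0, 316.0),
    "Staphylococcus": (15.6, 138.7),
    "Gordonia": (17.1, 103.4),
    "Bacillus": (14.3, 497.5),
    "Mycobacterium": (41.9, 164.6),
}
GC_PERCENT = {
    "Salmonella": (35.5, 65.4),
    "Pseudomonas": (37.0, 66.0),
    "Staphylococcus": (29.3, 38.0),
    "Gordonia": (47.0, 68.8),
    "Bacillus": (29.9, 49.9),
    "Mycobacterium": (56.3, 69.1),
}

def interval_jaccard(a, b):
    """Shared length of two closed intervals divided by their combined length."""
    shared = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    combined = (a[1] - a[0]) + (b[1] - b[0]) - shared
    return shared / combined if combined > 0 else 0.0

def overlap_with(reference, ranges):
    """Jaccard overlap (rounded to 3 decimals) of every genus with the reference genus."""
    ref = ranges[reference]
    return {
        genus: round(interval_jaccard(ref, r), 3)
        for genus, r in ranges.items()
        if genus != reference
    }

if __name__ == "__main__":
    for label, ranges in (("genome size", GENOME_SIZE_KB), ("GC content", GC_PERCENT)):
        scores = overlap_with("Salmonella", ranges)
        for genus, score in sorted(scores.items(), key=lambda kv: -kv[1]):
            print(f"{label:12s} Salmonella vs {genus:15s} {score:.3f}")
```

With the ranges quoted above, Pseudomonas gives the largest overlap with Salmonella for both genome size (about 0.87) and GC content (about 0.93), while the Staphylococcus GC range barely overlaps (about 0.07).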
The diversity of Salmonella prophage genomes was also reflected in the total number of phamilies for the ORFs in the analyzed prophage genomes: 5796. One phamily, Pham 2217, was dominant and was present in 49 prophages (34.5% of the 142 phages), whereas 4330 phamilies were each present in a single prophage, which makes it challenging to select genes conserved across all 142 prophage genomes. Clustering of the viral genomes was useful in establishing the relatedness of Salmonella bacteriophages. Within each assigned cluster, some conserved phamilies (containing different ORFs) are present. For example, Pham 180 (portal protein), Pham 2012 (recombination protein) and Pham 2217 (endopeptidase) are commonly present in Cluster D; Pham 321 (phage head-tail connector protein), Pham 415 (terminase large subunit) and Pham 1522 (terminase small subunit) in Cluster E; Pham 1995 (lysozyme), Pham 2370 (terminator) and Pham 1332 (attachment invasion locus protein precursor) in Cluster F; Pham 27 (phage tail protein), Pham 519 (phage portal protein) and Pham 1717 (assembly protein) in Cluster H; Pham 528 (major capsid protein), Pham 297 (terminase large subunit) and Pham 666 (tail protein) in Cluster J; and Pham 963 (base plate assembly protein) in Cluster K (Document S3). In addition, some proteins are unique to one cluster, for example, four hypothetical-protein phamilies (Pham 4878, Pham 1893, Pham 2968 and Pham 2849) in Cluster E. These may be good markers for characterizing prophage members of the different clusters (Document S3, Table S5).
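The phamily statistics quoted above (how many genomes carry each Pham, which Phams occur in a single genome, and which are restricted to one cluster) all follow from a simple presence table. The sketch below assumes a hypothetical mapping from prophage name to the set of Pham numbers it carries; the toy genomes, Pham numbers, cluster labels and function names are invented for illustration and are not drawn from the Phamerator database.

```python
from collections import Counter

def pham_prevalence(genome_phams):
    """Count the number of genomes in which each Pham occurs."""
    counts = Counter()
    for phams in genome_phams.values():
        counts.update(set(phams))  # a Pham counts once per genome
    return counts

def single_genome_phams(genome_phams):
    """Phams found in exactly one genome."""
    return {p for p, n in pham_prevalence(genome_phams).items() if n == 1}

def cluster_specific_phams(genome_phams, cluster_of, cluster):
    """Phams shared by every member of `cluster` and absent from all other genomes."""
    inside = [set(p) for g, p in genome_phams.items() if cluster_of.get(g) == cluster]
    if not inside:
        return set()
    outside = set().union(
        *[set(p) for g, p in genome_phams.items() if cluster_of.get(g) != cluster]
    )
    return set.intersection(*inside) - outside

# Hypothetical toy data (not real Pham assignments).
GENOME_PHAMS = {
    "p1": {1, 2, 3},
    "p2": {1, 2, 4},
    "p3": {1, 5},
    "p4": {6},
}
CLUSTER_OF = {"p1": "X", "p2": "X", "p3": "Y"}  # p4 is left unclustered

if __name__ == "__main__":
    pham, n = pham_prevalence(GENOME_PHAMS).most_common(1)[0]
    share = 100 * n / len(GENOME_PHAMS)
    print(f"most widespread Pham: {pham} in {n} genomes ({share:.1f}%)")
    print("single-genome Phams:", sorted(single_genome_phams(GENOME_PHAMS)))
    print("Cluster X markers:", sorted(cluster_specific_phams(GENOME_PHAMS, CLUSTER_OF, "X")))
```

The same percentage calculation applied to the figures in the text gives 100 × 49 / 142 ≈ 34.5%, in agreement with the value reported for Pham 2217. Requiring a marker to be present in every member of a cluster is a strict choice; a looser definition (present in any member, absent elsewhere) would only change the intersection to a union.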
The observations reported in this study are relevant to the application of bacteriophages as antibacterial agents and to cloning vector construction. Our list of Salmonella bacteriophages can be used to screen a novel candidate bacteriophage identified as a potential antibacterial agent for Salmonella or any host described in this study. Because the bacteriophages in our collection establish lysogeny, a lysogenized bacterial host will be immune to infection or lysis by the same bacteriophage; a bacteriophage on our list will therefore likely not be an effective antibacterial agent for the hosts identified in this study. Thus, a distinct bacteriophage may be a better antibacterial candidate than one on our list. Similarly, the Salmonella prophage database in Phamerator can be used to evaluate a candidate antibacterial agent even if it is distinct from the members on our list. Because bacteriophages are prone to recombination leading to a mosaic profile, the protein components can be used to assess relatedness, with the goal of choosing a candidate antibacterial agent that is phylogenetically distant from any of the isolates in our collection to increase the chance of success. In the same vein, knowledge from our collection can be used in strategies to design phage vectors. For example, λ cloning vectors require a lytic cycle, and their ability to package large foreign DNA fragments has relied on the removal of lysogenic genes from the vectors. Thus, the removal of lysogenic fragments from a temperate phage can probably divert the life cycle into a lytic path, making such phages more relevant for vector construction, especially if the bacteriophage has signature genetic markers that can be exploited for selection or vector purification, e.g., antibacterial resistance genes or a target for a widely used ligand.