Bacterial pathogens face difficult conditions because of the numerous unpredictable, often abrupt, and dynamic changes that occur in the host environment or during transmission from one host to another (Bayliss 2009). Bacterial adaptation to the host relies either on a system for sensing and responding to environmental changes or on selection of mutation-induced variants within contingency loci (Moxon, et al. 2006). These loci allow bacterial populations to adapt to or survive selective pressures by generating and spreading genetic variants that are "fitter", that is, better adapted to a particular environment than the majority of the population. Comparative genomics now makes it possible to identify genetic variation across entire genomes, to tie those differences to biological function, and to learn more about selective patterns of gene transfer or loss and about evolutionary pressures, especially with respect to virulence in pathogenic organisms (Fitzgerald and Musser 2001). We analysed the frequency and distribution of long SSRs in sequenced species of the human pathogenic genera Staphylococcus, Streptococcus, and Enterococcus. Among the three, Staphylococcus had a relatively higher frequency of long repeats than the others. To explain the higher frequency of repeats in Staphylococcus, we examined G+C content. In our previous reports on fungi, we observed a positive correlation between G+C content and the frequency of SSRs (Mahfooz, et al. 2017, Mahfooz, et al. 2016). In this study, however, we found a statistically significant negative correlation (r = -0.83, p = 0.0001) between the two. Hence, we hypothesise that higher A+T content is a good predictor of the frequency of repeats among these human pathogenic bacteria. S. epidermidis harboured the highest frequency of long repeats among all the pathogenic bacteria examined. A close comparison revealed that S. epidermidis carries a higher percentage (~9%) of genomic elements (genome islands and Staphylococcus cassette chromosome-like elements, insertion sequence elements, integrated prophages, integrated plasmids, and composite transposons) than its closest relative, S. aureus (~7%) (Hiramatsu, et al. 2001). We speculate that this could explain the higher frequency of long SSRs in S. epidermidis.
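The relationship between G+C content and SSR frequency discussed above can be checked with a Pearson correlation across genomes. The following is a minimal sketch, not the analysis pipeline used in this study: the function names `gc_content` and `pearson_r` are hypothetical, and the per-genome values are invented placeholders rather than measured data.

```python
from math import sqrt


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sd_x * sd_y)


# ASSUMPTION: placeholder values for illustration only, NOT data from this study.
gc_percent = [32.0, 33.0, 35.5, 37.5, 39.0, 41.5]   # genome G+C content (%)
ssr_per_mb = [60.0, 52.0, 47.0, 40.0, 31.0, 25.0]   # long SSRs per Mb

r = pearson_r(gc_percent, ssr_per_mb)
print(f"r = {r:.2f}")  # negative when SSR frequency falls as G+C rises
```

A negative coefficient corresponds to the pattern reported here (more repeats in A+T-rich genomes); a significance test would additionally require the number of genomes compared.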
Among the classes of SSRs, tetranucleotide SSRs were abundant in the intergenic regions, whereas trinucleotide SSRs dominated the genic regions. Tetranucleotide repeats in intergenic regions are reported to modulate transcription factor binding and consequently gene expression (Martin, et al. 2005). The presence of trinucleotide repeats in genic regions is expected, because changes in their repeat number do not cause frameshift mutations, and these triplets could code for amino acid runs that may have specific functions in the protein structure (Metzgar, et al. 2000). Further analysis of repeat classes at the whole-genome level revealed that in most of the bacterial species, the RA and RD of tetranucleotide SSRs were the highest. Their occurrence in intergenic regions is not unexpected, as they may regulate gene expression by affecting RNA polymerase binding. However, their presence in genic sequences is surprising, as a change in repeat number could alter the open reading frame. It is believed that rearrangement within these tetrameric repeats could work as an on/off switch in phase-variable genes (De Bolle, et al. 2000).
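The frameshift argument above reduces to simple arithmetic: gaining or losing repeat units changes the sequence length by the motif length times the number of units, and the reading frame is preserved only when that change is a multiple of three. The sketch below illustrates this; the function name `shifts_reading_frame` is hypothetical, and the motifs are illustrative examples only.

```python
def shifts_reading_frame(motif: str, units_changed: int) -> bool:
    """Return True if gaining (+) or losing (-) `units_changed` repeat units
    of `motif` inside a coding sequence would shift the reading frame."""
    if not motif:
        raise ValueError("motif must not be empty")
    if units_changed == 0:
        raise ValueError("units_changed must be non-zero")
    # The frame is kept only if the length change is a multiple of 3.
    return (len(motif) * units_changed) % 3 != 0


# Illustrative motifs: tri-, tetra-, penta- and hexanucleotide.
for motif in ("gtg", "caat", "gagca", "aacgtt"):
    for delta in (-1, 1, 3):
        effect = "frameshift" if shifts_reading_frame(motif, delta) else "in frame"
        print(f"({motif}) {delta:+d} unit(s): {effect}")
```

Under this arithmetic, tri- and hexanucleotide repeats stay in frame for any change in repeat number, whereas a tetranucleotide repeat shifts the frame unless the number of units gained or lost is a multiple of three, which is consistent with its proposed role as an on/off switch.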
Further analysis of the data showed that hexanucleotides constituted the longest SSRs in most of the species, which is expected, as they contribute the highest number of repeats compared with the other classes of SSRs. We observed an unexpectedly high repeat number (25) for a pentanucleotide repeat (gagca) located in a gene coding for a hypothetical protein in S. suis. We hypothesise that S. suis could have acquired this repeat through horizontal gene transfer (Perna, et al. 2001).
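A perfect tandem repeat such as the 25-unit pentanucleotide tract mentioned above can be located with a back-referencing regular expression. This is a minimal sketch under stated assumptions, not the SSR detection tool used in this study: the function name `find_ssrs` and its parameters are hypothetical, only perfect repeats over the bases a, c, g and t are considered, and the test sequence is synthetic.

```python
import re


def find_ssrs(seq: str, unit_len: int, min_repeats: int):
    """Return (start, motif, repeat_count) for each perfect tandem repeat
    with a motif of `unit_len` bases repeated at least `min_repeats` times."""
    if unit_len < 1 or min_repeats < 2:
        raise ValueError("unit_len must be >= 1 and min_repeats >= 2")
    seq = seq.lower()
    # One motif copy captured, followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(r"([acgt]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        # Skip motifs that are a shorter unit repeated (e.g. "aa", "atat"),
        # because those tracts belong to a lower SSR class.
        is_compound = any(
            motif == motif[:k] * (unit_len // k)
            for k in range(1, unit_len)
            if unit_len % k == 0
        )
        if is_compound:
            continue
        hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return hits


# Synthetic example: a (gagca)25 tract flanked by two bases on each side.
example = "tt" + "gagca" * 25 + "cc"
print(find_ssrs(example, unit_len=5, min_repeats=10))
```

Dedicated SSR finders also handle imperfect repeats and overlapping motif phases; the regular expression here only reports the leftmost perfect tract at each position.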
We further examined the trinucleotide SSRs that are likely to be read as codons and translated in frame into amino acid repeats. Isoleucine was the most abundant amino acid, followed by lysine and glutamic acid. In bacterial physiology, branched-chain amino acids such as isoleucine are reported to play a variety of roles, from supporting protein synthesis to signalling and fine-tuning the adaptation to amino acid deficiency. In some pathogenic bacteria, the response to amino acid deficiency includes activation of virulence gene expression. As a result, isoleucine aids not just infection but also evasion of host defences (Kaiser Julienne, et al.). The second most abundant amino acid, lysine, is used for protein synthesis and in the peptidoglycan layer of Gram-positive bacterial cell walls. The importance of lysine for bacterial cell survival is further highlighted by the existence of multiple biosynthetic routes for lysine in bacteria (Gillner, et al. 2013). The abundance of glutamic acid in bacteria appears to be linked mostly to tolerance of acidic environments. Foodborne pathogens and spoilage bacteria that develop acid resistance are able to grow on acidic foods. This feature is also a virulence factor, as it permits pathogens to pass through the very acidic conditions of the stomach barrier (Feehily and Karatzas 2013). We next explored whether any association could be found among the species based on the amino acids encoded by trinucleotide SSRs. Principal component analysis (PCA) showed a genus-wide clustering of Staphylococcus and Streptococcus species. This clustering was expected because, in the course of evolution, positive selection tends to favour those amino acids that are required for the growth and survival of the organism (Loewe and Hill 2010). The distant clustering of Enterococcus species could be due to the marked difference in the amino acids encoded by their trinucleotide SSRs.
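The in-frame translation of a trinucleotide SSR into an amino acid run can be sketched as follows. This is an illustration rather than the procedure used in this study: the names `CODON_TABLE`, `translate_repeat` and `frame_variants` are hypothetical, the standard codon assignments are assumed, and the motifs att, aaa and gaa are chosen only because they encode isoleucine, lysine and glutamic acid, not because they are the motifs found here.

```python
# Standard codon assignments in the usual compact T, C, A, G ordering.
BASES = "tcag"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_repeat(motif: str, repeats: int) -> str:
    """Translate a trinucleotide SSR read in frame into one-letter amino acids."""
    motif = motif.lower()
    if len(motif) != 3 or motif not in CODON_TABLE:
        raise ValueError("motif must be three bases drawn from a, c, g, t")
    return CODON_TABLE[motif] * repeats


def frame_variants(motif: str):
    """Amino acid encoded by the repeat in each of the three reading frames."""
    motif = motif.lower()
    if len(motif) != 3:
        raise ValueError("motif must be three bases long")
    rotations = [motif[i:] + motif[:i] for i in range(3)]
    return {codon: CODON_TABLE[codon] for codon in rotations}


for motif in ("att", "aaa", "gaa"):  # illustrative motifs (I, K, E)
    print(motif, translate_repeat(motif, 4), frame_variants(motif))
```

Because the same tract encodes a different residue in each frame, the amino acid assigned to an SSR depends on its position relative to the annotated start codon of the gene.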
We then performed functional annotation of the genes containing these long SSRs to determine whether the SSRs were linked to any particular biological process. SSRs were found in the majority of the important biological pathways. This raises the risk of non-functional proteins arising from frameshifts, implying that these species may have evolved SSRs in coding regions to promote phase variation (Lin and Kussell 2012). Notably, SSRs were located in pathogenicity genes in Staphylococcus and Streptococcus species, which is direct evidence of their involvement in pathogenicity.
SSRs were also found in housekeeping genes in our study; the majority were trinucleotide motifs, with a few tetra- and hexanucleotide motifs. These housekeeping genes are linked to a variety of biological functions. The existence of SSRs in housekeeping genes is surprising because SSRs are known to be mutation hotspots (Lin and Kussell 2012), and any mutation in a housekeeping gene would be fatal. However, because most of these SSRs are trinucleotides, their chances of causing phase variation are low. A previous study reported a significant difference in the 5'-UTR when comparing the densities and repeat types of SSRs between housekeeping and tissue-specific genes. According to that report, the GC content of trinucleotide SSRs in the 5'-UTRs of housekeeping genes is higher than in those of tissue-specific genes (Lawson and Zhang 2008). The motif (gtg)4 was conserved in the rplD gene in eight species of Staphylococcus and Streptococcus. GTG is also an alternative start codon in bacteria (Hwang, et al. 2005), but its presence as a repeated motif rules out any such role and points instead to its translation into valine, which has a multifaceted physiological role in bacterial survival and pathogenicity (Kaiser Julienne, et al.).
We further attempted to construct a phylogenetic tree based on the presence of SSRs in selected housekeeping genes, as these genes are more conserved and under minimal selection pressure. The highest and lowest polymorphism levels were both obtained among ribosomal 50S subunit genes. The 50S ribosomal subunit is a common target of various antibiotics (Champney, et al. 2003), but with repetitive sequences present, the bacteria may evade the binding of these antibiotics. The SSR-based phylogeny grouped Enterococcus species with Streptococcus, whereas the 16S ribosomal region-based phylogeny grouped Enterococcus and Staphylococcus together. Until 1984, Enterococcus species were classified as Streptococcus (Andrewes and Horder 1906). However, with the advent of techniques such as DNA-DNA hybridisation and 16S rRNA sequencing, the new genus Enterococcus was established (Schleifer and Kilpper-Bälz 1984). The resolving power of the SSR-based phylogeny was low, as it could not differentiate between five species of Streptococcus that the 16S ribosomal region-based phylogeny readily separated.