Alleleomes characterize the survivors of 3.5 billion years of bacterial evolution

doi:10.21203/rs.3.rs-3168663/v1


This work is licensed under a CC BY 4.0 License

Version 1


Bacteria are thought to have appeared on Earth some 3.5 billion years ago. Widespread sequencing has uncovered the set of surviving genetic alleles (termed the alleleome) for tens of thousands of bacterial strains. Here, we characterize over 1.3 billion mutations across 54,191 sequenced genomes that define the alleleomes of 184 bacterial species. The alleleomes are surprisingly conserved, and even the most variable codons encode only a few alternate amino acids with predictably benign consequences on protein function. Furthermore, the evolutionary stabilities of amino acids are shared across species. Lastly, the global ratio of nonsynonymous-to-synonymous mutations (dN/dS) is 0.32. Notably, human pathogens exhibit the most variation and the highest dN/dS ratios, suggesting that their genes are under increasingly positive selection. As more genome sequences become available, alleleomes provide a context to study sequence diversity across the phylogenetic tree and can reveal data-driven insights into the genetic basis for natural selection in bacteria.

Biological sciences/Microbiology/Microbial genetics/Bacterial genes

Biological sciences/Computational biology and bioinformatics/Sequence annotation

Biological sciences/Computational biology and bioinformatics/Phylogeny

Biological sciences/Evolution/Evolutionary genetics

Big data, computation, and complex algorithms are changing biology¹. With the growing number of publicly available genome sequences^2–5, we can now obtain enough alleles of genes to conduct an assessment of their phenotypic consequences⁶. The open reading frame (ORF) alleleome of a bacterial species has been formally defined as all the alleles of all genes that are found in the genome sequences of all strains of the species⁷. We now have enough genome sequences across numerous species of bacteria to assess the alleleome at the phylogenetic scale. At this scale, we can view every genome of modern bacteria as a successful outcome of the natural selection process and perform a large-scale assessment of the characteristics of alleles that have survived billions of years⁸ of bacterial evolution.

There are several mechanisms that generate allelic diversity. Point mutations, observed as single nucleotide polymorphisms (SNPs), are the most common source of variation. Because the genetic code is redundant (61 sense codons specify 20 amino acids, a ratio of 20/61), a SNP in an ORF can be either synonymous (S) or nonsynonymous (N). Other, less frequent, sequence changes include small-scale insertions and deletions (known as indels), which can be in-frame (i.e., involve multiples of three nucleotides and lead to the loss or gain of a whole number of amino acids in a protein). The addition or removal of any one of the three stop codons leads to gene truncation or elongation, respectively.
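The distinction between synonymous and nonsynonymous changes can be made concrete with a short sketch. The example below builds the standard codon table (translation table 11, the bacterial code, shares these codon-to-amino-acid assignments) and labels a codon change; the function name `classify_codon_change` and the category labels are illustrative assumptions, not part of our workflow.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases in TCAG order).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Label a codon change (hypothetical helper; labels are illustrative)."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"      # same amino acid (identical codons also land here)
    if alt_aa == "*":
        return "stop_gained"     # premature stop, i.e. truncation
    if ref_aa == "*":
        return "stop_lost"       # read-through, i.e. elongation
    return "nonsynonymous"


sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
n_amino_acids = len({CODON_TABLE[c] for c in sense_codons})
print(len(sense_codons), n_amino_acids)      # 61 sense codons, 20 amino acids
print(classify_codon_change("CTG", "CTA"))   # Leu -> Leu
print(classify_codon_change("GCT", "GTT"))   # Ala -> Val
```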

Evolution acts upon an organism by providing a selection pressure that favors the propagation of certain alleles. A nonsynonymous mutation changes the amino acid sequence of an allele and may provide the bacteria with a selective advantage in one environment, but is more often deleterious when the environment changes^9–12, leading to its elimination over time¹³.

The relative strength and mode of selection pressures is measured by the ratio of nonsynonymous-to-synonymous mutations (dN/dS) across the genomes^14–15. When this ratio is greater than one, a positive selection pressure that favors allelic diversity is presumed to be acting upon the bacteria¹⁶. Traditionally, the strength and mode of evolutionary selection pressures on bacteria are determined by dN/dS values calculated across multiple, short sequence alignments of a select group of homologous ORFs between closely-related strains. Thus, dN/dS values are estimates of selection pressures acting upon the organism’s genome.

As the alleleome characterizes the totality of all sequence variation within a species, the occurrence and characteristics of sequence diversity can now be assessed across the bacterial phylogenetic tree and can be used to gain insight into the natural selection process.

In this study, we present the ORF alleleomes of 184 species of bacteria across ten phyla from the whole genome sequences of 54,191 strains. This first multi-phyla-scale analysis gives us insights into the physio-chemical basis of the evolutionary landscape in surviving alleles. For instance, we reveal quantifiable differences in the alleleomic variation between pathogenic and non-pathogenic species. Furthermore, we are able to provide a genome-scale measurement (rather than an estimate) of the relative strength and mode of evolutionary selection pressures acting upon each species, both at the genome-level and at the gene-level. Taken together, the ORF alleleomes presented in this study provide the foundation on which evolutionary theories can be tested against observable data, and the evolutionary trajectories taken by surviving alleles can be understood on a case-by-case basis across the phylogenetic tree of bacteria.

Multi-phyla alleleomes of 184 bacterial species from 54,191 genome sequences

To quantify the natural sequence diversity in bacterial genomes, we generated the ORF alleleomes of 184 bacterial species across 10 phyla from a collection of whole genome sequences of 54,191 strains (Fig. 1a, Fig. S1, Dataset S1). For each codon position in an open reading frame (ORF), the dominant (most frequently observed) codon/amino acid was determined, and thus a “consensus sequence” was defined for each ORF present within a species. The wild-type occurrence (allele frequency) of the dominant codon (Fig. S2a) and dominant amino acid (Fig. 1b) are shown, illustrating the extent of conservation (on a per codon or a per amino acid basis) in the genomes of species. Extensive conservation (> 96% of amino acids and > 90% of codons, see Fig. S2b) is a hallmark of the alleleomes across all species and does not seem to be influenced by sample size (i.e., the number of genomes studied, Fig. 1c). Likewise, we plot the wild-type occurrence of unique non-dominant amino acids (variants, i.e., the sequence variation) to identify the number of significant variants across genomes of a species (Fig. 1d, codon variants in Fig. S2c) and find a logarithmic relationship between sample size and a lower occurrence of any one specific variant (Fig. 1e, Fig. S2d). Thus, as more genome sequences become available, the relative abundances of non-significant variants decrease, elucidating significant branch points in the diversity of existing alleles.

Pathogens exhibit wider alleleomic diversity

We next determined the range of available alternate amino acid and codon substitutions that define the sequence diversity of each genomic position in all 184 species. In total, we identify 1.3 billion mutations across the dataset. We define the range of allelic variation as the dominant (D + i, where i = 0) and non-dominant (D + i, where i ≥ 1) codons (or amino acids) present across all strains of a species at a specific genomic position (Fig. 2a). The redundancy of the genetic code reduces the diversity observed in the codon sequence to that observed in the amino acid sequence. Both at the DNA and at the protein level, the majority of genomic positions across all bacterial species constitute a ‘narrow’ (D + i where i ≤ 1) diversity (Figs. 2b-c).

The urgent-threat ESKAPEE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter spp., and Escherichia coli) constitute a subset of organisms that are responsible for a majority of nosocomial infections and exhibit an alarming rate of antibiotic resistance¹⁷. Here, we find that ESKAPEE pathogens exhibit a wider alleleomic diversity (Fig. 2b-c, cyan) in comparison to other species. When considering only the fraction of the ORF alleleome that is invariant (i.e., 100% conserved), we find that ESKAPEE pathogens represent a subset of species that are the most variant (Fig. 2d), suggesting a link between pathogenicity and genome variability. To investigate this link further, we stratify the codon-invariance data by bacterial phyla and find that human pathogens (see Dataset S2) often display the least codon-invariance (greatest variance) in each phylum (Fig. S3).

Evolutionary landscape of amino acid substitutions

The observed alleleomes of 184 species include hundreds of millions of amino acid substitutions occurring at specific positions across the genomes of strains within a species. We find that the average Grantham score¹⁸ of all nonsynonymous mutations is 63, indicative of a moderately-benign amino acid substitution (Fig. S4a). We also show that the majority of amino acid substitutions are less severe than predicted by a null-hypothesis model of point mutation⁷ (Fig. S4b-c).

We examine all variable amino acid positions across all strains of a species to determine the frequency (mutant fraction) and average predicted severity (Grantham score) of all amino acid substitutions stemming from the same amino acid (Fig. 3a). The observed distributions of mutant fraction and mutation severity for each amino acid form discrete point-clouds in a 2D space. These distributions combined represent 312 million amino acid substitutions across the 54,191 strains studied. We define a probability function (see Methods) that represents the evolutionary landscape of amino acid substitutions (Fig. 3b, z-axis). The sharp peaks in the probability function confirm that the Grantham scores and mutant fractions observed for nonsynonymous mutations stemming from the same amino acid are distinct, even if they appear to overlap when plotted in 2D (i.e., Fig. 3a).

The prevalence of specific amino acids in the open reading frames across strains of a species can influence the rate of observed amino acid substitutions. That is, amino acids that are rarely found in the genome are also less likely to represent a significant fraction of the observed mutations. We thus define an amino acid’s proclivity for nonsynonymous mutation as the ratio of its amino acid substitution rate to its genome usage rate (i.e., its genome fraction). Within each species, we calculate the mutation proclivity for all 20 amino acids and find a clear pattern of constitutive enrichment of amino acid substitutions in Ala, His, Asn, Asp, Val, Thr, and Ser residues and a constitutive reduction in Trp, Gly, Tyr, Leu, Phe and Cys residues (shown for Firmicutes in Fig. 3c, and all species in Fig. S5).
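The proclivity ratio defined above can be sketched in a few lines. The example assumes, for illustration only, that the consensus proteins of a species are available as strings and that substitutions are given as (original, new) amino acid pairs; the function name `mutation_proclivity` is hypothetical. Values above 1 indicate an amino acid that is substituted more often than its usage alone would suggest, and values below 1 the opposite.

```python
from collections import Counter


def mutation_proclivity(consensus_proteins, substitutions):
    """Ratio of each amino acid's share of substitutions to its share of genome usage.

    consensus_proteins: iterable of amino acid strings (assumed input format).
    substitutions: iterable of (original_aa, new_aa) pairs (assumed input format).
    """
    usage = Counter(aa for seq in consensus_proteins for aa in seq)
    substituted = Counter(original for original, _ in substitutions)
    total_usage = sum(usage.values())
    total_substituted = sum(substituted.values())
    if total_usage == 0 or total_substituted == 0:
        return {}
    return {
        aa: (substituted[aa] / total_substituted) / (usage[aa] / total_usage)
        for aa in usage
    }


# Toy data: Ala and Trp are used equally often, but Ala is substituted three times as often.
toy_proclivity = mutation_proclivity(
    ["AAWW"],
    [("A", "V"), ("A", "S"), ("A", "T"), ("W", "C")],
)
print(toy_proclivity)
```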

Observable evidence of natural selection

The ratio of non-synonymous-to-synonymous mutations (dN/dS) is used as a metric to evaluate the relative strength and mode of the evolutionary selection pressure acting on bacterial genomes^14–15. This ratio reflects the difference between a reference genome and a closely related strain. Values greater than 1 are associated with positive selection (although we note that positive selection has been observed^16,19 at dN/dS < 1). Historically, this ratio is estimated at the genome-level by averaging the measured dN/dS values across multiple, short sequence alignments of related strains. Using the alleleomic consensus sequence as a reference, we measure the genome-scale dN/dS values for 184 species of bacteria from our collection of 54,191 whole genome sequences (Fig. 4a). Thus, the scale of the data set allows reference strain-agnostic assessment. Notably, we find that the species with the largest dN/dS values (dN/dS > 0.8) are human pathogens.
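As a rough illustration of counting mutations against a consensus reference, the sketch below tallies nonsynonymous and synonymous codon differences between a consensus ORF and the corresponding ORF in each strain. It is a simplification under stated assumptions: sequences are taken to be already aligned codon by codon, codon changes involving a stop codon are skipped, and the ratio is a raw count ratio without normalization by the number of synonymous and nonsynonymous sites. The function name `count_n_and_s` is hypothetical and this is not presented as the exact procedure behind Fig. 4.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_n_and_s(consensus_dna, strain_dnas):
    """Count nonsynonymous (N) and synonymous (S) codon differences from consensus."""
    ref_codons = [consensus_dna[i:i + 3] for i in range(0, len(consensus_dna) - 2, 3)]
    n_count = s_count = 0
    for dna in strain_dnas:
        for index, ref in enumerate(ref_codons):
            alt = dna[3 * index:3 * index + 3]
            ref_aa = CODON_TABLE.get(ref)
            alt_aa = CODON_TABLE.get(alt)
            # Skip identical codons, incomplete or ambiguous codons, and stop codons.
            if ref == alt or ref_aa is None or alt_aa is None or "*" in (ref_aa, alt_aa):
                continue
            if ref_aa == alt_aa:
                s_count += 1
            else:
                n_count += 1
    return n_count, s_count


consensus = "ATGGCTCTG"                                # Met-Ala-Leu
strains = ["ATGGCCCTG", "ATGGTTCTA", "ATGGCTCTG"]      # toy strain alleles
n_count, s_count = count_n_and_s(consensus, strains)
ratio = n_count / s_count if s_count else float("nan")
print(n_count, s_count, ratio)
```

Because every strain is compared with the same consensus, the result does not depend on which sequenced strain would otherwise have been picked as the reference.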

The alleleome also allows us to accurately measure dN/dS values at the gene-level to identify genes under the strongest selection pressure (776,054 ORFs in Fig. 4b). Thus, for any gene(s) in any one (or across multiple) species of bacteria, the alleleome can measure the relative strength and mode of the selection pressures acting upon any gene or subset of genes. As an example, AMR-conferring genes²⁰ are shown in Fig. S6a-b.

Additional types of mutations (e.g., mutations in terminating codons and “gap” mutations – small in-frame deletions or insertions) have been observed in bacteria^21–22, and various studies^23–24 have examined the impact of these mutations and the mechanisms by which they are created. As the alleleome describes the totality of all sequence variation within a species, we identified these additional mutant types and found that they make up less than 2% of all mutations for a majority of the 184 species (Fig. 4c). The occurrence of the various sequence changes is shown in Fig. 4c, and single nucleotide polymorphism is by far the most common source of sequence variation.

The rapid growth in the number of sequenced bacterial genomes¹ offers us a broad readout of the result of selection pressures that operate on genomes and the amino acid sequence variation in the proteins they encode. The availability of sequenced genomes for numerous strains of many bacterial species allowed us to perform an initial global assessment of the hallmarks of the alleles of open reading frames that have survived eons of natural selection.

In this study, we present the alleleomes of 184 species of bacteria across 54,191 strains whose sequence variation consists of 1.3 billion mutations, including 312 million amino acid substitutions. Our alleleomic workflow generates a ‘consensus’ sequence that reflects the most abundant codon at each position across all open reading frames in every strain of a species (Fig. 1). Traditionally, mutations are studied in the context of a reference strain. We emphasize that the large scale of our alleleomic analysis establishes strain-agnostic consensus sequences for a species, eliminating biases caused by the choice of a single reference strain.

Analysis of the 184 bacterial open reading frame alleleomes reveals key characteristics: 1) alleleomes are highly conserved, 2) the range of amino acid substitutions in variable codons is narrow, 3) the predicted consequences of observed amino acid substitutions on protein function are predominantly mild, 4) the evolutionary stability of specific amino acids is relatively conserved across all species, and 5) the observed dN/dS ratio across all strains is 0.32 (close to the ratio of the redundancy of the genetic code – 20 amino acids for 61 codons). In addition, the new scale of sequence that we analyze allows us to explore deeper characteristics of evolutionary sequence variation, such as the low relative abundance of gap mutations and terminal codon mutations within a species.

The high conservation (i.e., mutations are found in relatively few genomic positions, Fig. 1) and narrowness (i.e., sequence variation in these positions is described by few unique variants, Fig. 2) of the alleleome help focus the search for causal mutations in an organism. In fact, ‘significant’ variants (positions where the allele frequency is evenly distributed between 2–3 variants, Fig. 1d) comprise only 0.1-1% of codons in bacterial genomes, providing for a manageable view of the sequence variation underlying natural selection.

The scale of our dataset (1.3 billion mutations, Fig. S1) allows us to reveal previously unknown features of natural sequence variation. For instance, in comparing the relative substitution (nonsynonymous mutation) rate to the relative usage rate (in the coding region) of amino acids within a species, we found global trends (shown in Firmicutes in Fig. 3c, and across all species in Fig. S5) in the evolutionary stability (i.e., proclivity for substitution) of specific amino acid residues (e.g., Trp, Gly, Tyr, Leu, Phe and Cys are under-substituted in nearly all species). Additionally, we were able to define a probability function that describes the expected severity (i.e., Grantham score) and frequency of nonsynonymous mutations originating from a specific amino acid, and observed distinct peaks in the probability landscape that are suggestive of constraints (in the predicted severity and observed frequency) on the mutations originating from each residue (Fig. 3a-b). Furthermore, we find that the Grantham scores associated with observed amino acid substitutions are indicative of moderately-benign mutations, and are lower than predicted by a null-hypothesis model⁷ of point mutations (see Methods and Fig. S4).

The observed alleles are the survivors of the evolutionary pressures experienced by the ancestors of present-day strains. In comparing nonsynonymous and synonymous mutations, we were able to provide observable – rather than estimated – genome-wide indicators (dN/dS ratios) for selection pressures across hundreds to thousands of strains of a species (Fig. 4a). Importantly, we also calculated dN/dS ratios for all 776,054 genes identified in this study (Fig. 4b), allowing for the direct measurement of the evolutionary pressures acting at the gene-level. As gene-centricity is a hallmark of natural selection²⁵, the measurement of gene-level selection pressures may aid in identifying cooperativity between groups of related genes (e.g., AMR-conferring genes, Fig. S6a-b) with similar evolutionary trajectories. Furthermore, we were able to provide the locations and relative abundances of less-studied mutation types (e.g., “gap” and terminal codon), providing an impetus for future analyses (Fig. 4c). Finally, we note that the organisms with the highest genome-wide dN/dS scores are human pathogens (Fig. 4a).

As the alleleome describes codon-level wild-type sequence variation within a species, it provides a basis to determine the ‘novelty’ of new mutations and other sequence variation. For example, we previously analyzed⁷ over 33,000 unique mutations acquired in laboratory strains^26–27 of E. coli and found that the majority of them had no natural homologs. Thus, the alleleome can help assess the patentability²⁸ of specific engineered (or laboratory-evolved) mutants. Since the alleleome calculates the frequency (Fig. 2b,d) and severity (Fig. 3a-b) of sequence variation at each position across all open reading frames (Fig. 4b), it may provide contextual background for bacterial microbial forensics^29–30. For example, a newly sequenced strain whose sequence variation deviates significantly from the normal alleleomic diversity of its species may be indicative of a concerning evolutionary phenomenon, or of a non-natural evolutionary trajectory. Finally, the alleleome’s positional accounting of the relative abundances of all types of sequence variation may have implications for protein engineering. Combined with the ability to generate a 3D structure of every protein complex coded by an organism’s genome³¹, mapping the alleleomic variation onto the protein structure may reveal the tolerances of specific protein folds to various types of sequence variation and guide protein design efforts²⁴.

The large number of available sequenced bacterial genomes^2–5 has enabled big data analysis of the sequence variants of genes, and led to the definition of the alleleome⁷. In contrast, pan-genome analyses are based on presence/absence calls for individual genes in a set of genomes for strains of a species^32–38. Thus, the alleleome is a fine-grained version of the pangenome and illuminates new information about sequence variation and selection. Undoubtedly, as the number of available sequenced genomes increases, and additional alleleome workflows are created, phylo-scale alleleome analyses will uncover additional features of sequence diversity and generate new and informative insights into the process of natural evolution.

Acknowledgments

We would like to thank Marc Abrams for help with manuscript editing. This work was funded by Novo Nordisk Foundation (Grant Number NNF20CC0035580) (E.A.C. and B.O.P.) and NIH (Grant R01 GM057089) (B.O.P.).

Author Contributions

E.A.C. and B.O.P. contributed to Conceptualization. E.A.C. contributed to Methodology, Data Curation, Investigation, Formal Analysis, Validation and Visualization of the results. J.C.H. contributed to Data Acquisition and Data Curation. E.A.C. and B.O.P. prepared the Original Draft. All authors were involved in Review and Editing. B.O.P. contributed to Funding Acquisition and Project Supervision. The author(s) read and approved the final manuscript.

Competing Interest

The authors declare no conflict of interest.

Materials & Correspondence

All correspondence and material requests should be addressed to Bernhard O Palsson ([email protected]) or Edward Catoiu ([email protected]).

Data Availability

All data used in this study is publicly available online. Genome IDs used in this study are provided in the supplementary datasets.

Land, M. et al. Insights from 20 years of bacterial genome sequencing. Functional & Integrative Genomics 15, 141–161 (2015).
Federhen, S. The NCBI Taxonomy database. Nucleic Acids Research 40, D136–D143 (2011).
Federhen, S. Type material in the NCBI Taxonomy Database. Nucleic Acids Research 43, D1086–D1098 (2014).
Benson, D. A. et al. GenBank. Nucleic Acids Research 41, D36–D42 (2012).
Snyder, E. E. et al. PATRIC: The VBI PathoSystems Resource Integration Center. Nucleic Acids Research 35, D401–D406 (2007).
Kavvas, E. S. et al. Machine learning and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic signatures of antibiotic resistance. Nature Communications 9, (2018).
Catoiu, E. A., Phaneuf, P. V., Monk, J. M. & Palsson, B. O. Whole-genome sequences from wild-type and laboratory-evolved strains define the alleleome and establish its hallmarks. PNAS 120, (2023).
Schopf, J. W. & Packer, B. Early Archean (3.3-Billion to 3.5-Billion-Year-Old) Microfossils from Warrawoona Group, Australia. Science 237, 70–73 (1987).
Utrilla, J. M. E. et al. Global Rebalancing of Cellular Resources by Pleiotropic Point Mutations Illustrates a Multi-scale Mechanism of Adaptive Evolution. Cell Systems 2, 260–271 (2016).
Travisano, M. & Lenski, R. E. Long-Term Experimental Evolution in Escherichia coli. IV. Targets of Selection and the Specificity of Adaptation. Genetics 143, 15–26 (1996).
Noda-Garcia, L. et al. Chance and pleiotropy dominate genetic diversity in complex bacterial environments. Nature Microbiology 4, 1221–1230 (2019).
Kinsler, G., Geiler-Samerotte, K. & Petrov, D. A. Fitness variation across subtle environmental perturbations reveals local modularity and global pleiotropy of adaptation. eLife (2020).
Chen, P. & Zhang, J. Antagonistic pleiotropy conceals molecular adaptations in changing environments. Nature Ecology & Evolution 4, 461–469 (2020).
Kimura, M. Recent development of the neutral theory viewed from the Wrightian tradition of theoretical population genetics. PNAS 88, 5969–5973 (1991).
Jeffares, D. C., Tomiczek, B., Sojo, V. & dos Reis, M. A Beginners Guide to Estimating the Non-synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome. Methods Mol Biol 65–90 (2014) doi:https://doi.org/10.1007/978-1-4939-1438-8_4.
Kryazhimskiy, S. & Plotkin, J. B. The Population Genetics of dN/dS. PLoS Genet 4, e1000304–e1000304 (2008).
Rice, L. B. Federal Funding for the Study of Antimicrobial Resistance in Nosocomial Pathogens: No ESKAPE. The Journal of Infectious Diseases 197, 1079–1081 (2008).
Grantham, R. Amino Acid Difference Formula to Help Explain Protein Evolution. Science 185, 862–864 (1974).
Rahman, S., Pond, K., Webb, A. G. & Hey, J. Weak selection on synonymous codons substantially inflates dN/dS estimates in bacteria. PNAS 118, (2021).
Hyun, J. C., Monk, J. M., Szubin, R., Hefner, Y. & Palsson, B. Global pathogenomic analysis identifies known and novel genetic antimicrobial resistance determinants in twelve species. bioRxiv (2023) doi:https://doi.org/10.1101/2023.05.26.542542.
Mira, A., Ochman, H. & Moran, N. A. Deletional bias and the evolution of bacterial genomes. Trends in Genetics 17, 589–596 (2001).
Gregory, T. Ryan. Insertion–deletion biases and the evolution of genome size. Gene 324, 15–34 (2004).
Long, H., Miller, S. F., Williams, E. & Lynch, M. Specificity of the DNA Mismatch Repair System (MMR) and Mutagenesis Bias in Bacteria. Molecular Biology and Evolution 35, 2414–2421 (2018).
Savino, S., Desmet, T. & Franceus, J. Insertions and deletions in protein evolution and engineering. Biotechnology Advances 60, 108010 (2022).
Dawkins, R. The Selfish Gene. (Oxford University Press, 1976).
Phaneuf, P. V., Gosting, D., Palsson, B. O. & Feist, A. M. ALEdb 1.0: a database of mutations from adaptive laboratory evolution experimentation. Nucleic Acids Research 47, D1164–D1171 (2018).
Tenaillon, O. et al. Tempo and mode of genome evolution in a 50,000-generation experiment. Nature 536, 165–170 (2016).
U.S. Supreme Court. Diamond v. Chakrabarty, 447 U.S. 303 (1980). Justia Law (2023).
Chen, X., Pasternak, Z., Mason, C. E. & Eran Elhaik. Forensic Applications of Microbiomics: A Review. Front. Microbiol. 11, (2021).
Schmedes, S. E. & Budowle, B. Microbial Forensics. Encyclopedia of Microbiology 134–145 (2015) doi:https://doi.org/10.1016/b978-0-12-801238-3.02483-1.
Catoiu, E. A., Mih, N., Lu, M. & Palsson, B. O. Establishing comprehensive quaternary structural proteomes from genome sequence. (In Review).
Eisenstein, M. Every base everywhere all at once: pangenomics comes of age. Nature 616, 618–620 (2023).
Norsigian, C. J., Fang, X., Palsson, B. O. & Monk, J. M. Pangenome Flux Balance Analysis Toward Panphenomes. in The Pangenome: Diversity, Dynamics and Evolution of Genomes (eds. Tettelin, H. & Medini, D.) 219–232 (Springer, 2020). doi:https://doi.org/10.1007/978-3-030-38281-0_10.
Udaondo, Z., Molina, L., Segura, A., Duque, E. & Pablo, J. Analysis of the core genome and pangenome of Pseudomonas putida. Applied Microbiology International 18, 3268–3283 (2015).
Corredor, M., Patiño-Salazar, J. D., Castaño, D. C. & Muñoz-Gómez, A. The Pangenome of Pseudomonas aeruginosa. in Pseudomonas aeruginosa - New Perspectives and Applications (IntechOpen, 2023). doi:https://doi.org/10.5772/intechopen.108187.
Norsigian, C. J. et al. Systems biology approach to functionally assess the Clostridioides difficile pangenome reveals genetic diversity with discriminatory power. PNAS 119, (2022).
Hyun, J. C., Monk, J. M. & Palsson, B. O. Comparative pangenomics: analysis of 12 microbial pathogen pangenomes reveals conserved global structures of genetic and functional diversity. BMC Genomics 23, (2022).
Omkar Satyavan Mohite, Lloyd, C. J., Monk, J. M., Weber, T. & Palsson, B. O. Pangenome analysis of Enterobacteria reveals richness of secondary metabolite gene clusters and their associated gene sets. Synthetic and Systems Biotechnology 7, 900–910 (2022).
Edgar, R. C. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113–113 (2004).
Zhu, Q. et al. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nature Communications 10, (2019).
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology 36, 996–1004 (2018).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research 25, 1043–1055 (2015).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, (2010).
Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

Extracting alleles from WGS of 54,191 strains of bacteria.

The alleleomes used in this study were adapted from our previous work²⁰ and constructed as follows. First, genomes from the Web of Life collection⁴⁰ were filtered based on the following criteria: 1) GTDB species classification (release 202) is available⁴¹, 2) CheckM⁴² contamination < 10% and completeness > 80%, 3) number of contigs is within three times the median number of contigs for all assemblies for the genome’s species, and 4) total assembly length is within three standard deviations from the mean of all assemblies for the genome’s species. After filtering, 184 bacterial species defined by GTDB were identified with at least 50 genomes passing all criteria, totaling 54,191 genomes. The genome IDs and metadata are provided in Dataset S1 of this study. The quality metrics of the genomes can be found in Dataset S1 of Hyun et al. 2023²⁰.
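The four filtering criteria can be sketched as follows. This is an illustration under assumptions rather than the code used to build the dataset: the record fields (`species`, `contamination`, `completeness`, `contigs`, `length`) and the function name `filter_genomes` are hypothetical, the per-species median and mean are computed over all assemblies of that species before filtering, and the sample standard deviation is used.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def filter_genomes(genomes, min_per_species=50):
    """Apply the quality criteria per species; keep species with enough passing genomes."""
    by_species = defaultdict(list)
    for genome in genomes:
        if genome["species"]:                       # 1) species classification is available
            by_species[genome["species"]].append(genome)

    kept = {}
    for species, members in by_species.items():
        median_contigs = median(g["contigs"] for g in members)
        lengths = [g["length"] for g in members]
        mean_length = mean(lengths)
        sd_length = stdev(lengths) if len(lengths) > 1 else 0.0
        passing = [
            g for g in members
            if g["contamination"] < 10 and g["completeness"] > 80      # 2) CheckM thresholds
            and g["contigs"] <= 3 * median_contigs                     # 3) contig count
            and abs(g["length"] - mean_length) <= 3 * sd_length        # 4) assembly length
        ]
        if len(passing) >= min_per_species:
            kept[species] = passing
    return kept


def toy_genome(species, contamination=1.0, completeness=99.0, contigs=10, length=5_000_000):
    return {"species": species, "contamination": contamination,
            "completeness": completeness, "contigs": contigs, "length": length}


toy = [toy_genome("sp1"), toy_genome("sp1"), toy_genome("sp1"),
       toy_genome("sp1", contigs=100),          # too fragmented
       toy_genome("sp1", contamination=15.0),   # too contaminated
       toy_genome("sp2")]                       # species with too few genomes
kept = filter_genomes(toy, min_per_species=3)   # threshold lowered for the toy data
print({species: len(members) for species, members in kept.items()})
```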

Next, open reading frames (ORFs) for each genome were identified using Prodigal⁴³ v2.6.3 with parameters '-c', '-m', '-g', '11', '-p', 'single', '-q', based on those used by the Prokka annotation platform⁴⁴. For each species, “genes” were defined using the approach previously described³⁷: all unique ORF protein sequences were clustered using CD-HIT⁴⁵ v4.6 with minimum identity 80% and minimum alignment length 80%, and each cluster was treated as a gene. Finally, the nucleic acid and amino acid alleles of each gene were defined as the set of nucleotide and protein sequences, respectively, corresponding to each ORF in the gene’s cluster.

Muscle alignment of allele sequences reveals the consensus allele and wild-type variants

As previously described⁷, we generate the consensus sequence of a species from the whole genome sequences of all strains within a species. Briefly, all alleles (gene-variants) for all genes are identified within a species. Alleles that contain early terminations resulting in greater than 20% gene loss are removed. Alleles that are more than 2 standard deviations shorter than the mean allele length are also removed. Genes with remaining allele representation in less than 5% of strains are not considered (“low-frequency genes”, Fig. 1a).
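A minimal sketch of these allele filters for a single gene is given below. Several details are assumptions made for illustration: alleles are passed as a mapping from strain ID to protein sequence, "gene loss" from early termination is approximated as a length shortfall relative to the median allele length, and the function name `filter_gene_alleles` is hypothetical.

```python
from statistics import mean, median, stdev


def filter_gene_alleles(alleles_by_strain, n_strains,
                        max_loss=0.20, n_sd=2.0, min_strain_fraction=0.05):
    """Drop truncated or unusually short alleles, then drop low-frequency genes."""
    lengths = [len(seq) for seq in alleles_by_strain.values()]
    if not lengths:
        return {}
    reference_length = median(lengths)           # assumed proxy for the full-length gene
    mean_length = mean(lengths)
    sd_length = stdev(lengths) if len(lengths) > 1 else 0.0
    kept = {
        strain: seq
        for strain, seq in alleles_by_strain.items()
        if len(seq) >= (1 - max_loss) * reference_length   # at most 20% gene loss
        and len(seq) >= mean_length - n_sd * sd_length      # not > 2 SD shorter than mean
    }
    # A gene whose remaining alleles cover fewer than 5% of strains is not considered.
    if len(kept) < min_strain_fraction * n_strains:
        return {}
    return kept


toy_alleles = {f"strain{i}": "M" * 100 for i in range(10)}
toy_alleles["strain10"] = "M" * 50                          # truncated allele
kept_alleles = filter_gene_alleles(toy_alleles, n_strains=11)
rare_gene = filter_gene_alleles({"strain0": "M" * 100}, n_strains=100)
print(len(kept_alleles), rare_gene)
```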

MUltiple Sequence Comparison by Log-Expectation (MUSCLE³⁹) is used to align all amino acid (AA) alleles of a gene. For each position across the length of the gene, the most common amino acid residue (dominant allele) is chosen to represent the allele consensus. Positions that are represented in less than 5% of strains (e.g., insertions at the termini of genes) are removed (“low-frequency terminal insertions”, Fig. 1a). The consensus sequence of each gene is used to characterize all deviations (non-dominant alleles, “mutations”, “variants”) from consensus. The collection of consensus sequences for all genes is used to define the alleleome of a species at genome-scale. We keep track of each amino acid allele and its mRNA transcript such that the alleleome can be reverse-translated and described at the codon-level (as in Fig. S2). Unless otherwise specified, we hereafter use the term ‘allele’ to describe the sequence variation at the level of individual amino acids or codons, rather than at the gene-level.
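The column-wise construction of a consensus can be sketched as below, assuming the alignment is supplied as equal-length strings with '-' marking gaps (one string per allele). The function name `consensus_from_alignment` is hypothetical, coverage is measured against the number of aligned alleles, and ties are resolved in favor of the residue encountered first; these choices are assumptions of the sketch.

```python
from collections import Counter


def consensus_from_alignment(aligned_alleles, min_coverage=0.05):
    """Pick the dominant residue per column; drop columns with too little coverage."""
    n_alleles = len(aligned_alleles)
    consensus = []
    for column in zip(*aligned_alleles):
        residues = [residue for residue in column if residue != "-"]
        if len(residues) < min_coverage * n_alleles:
            continue                                 # low-frequency insertion column
        dominant, _ = Counter(residues).most_common(1)[0]
        consensus.append(dominant)
    return "".join(consensus)


toy_alignment = ["MKV-L", "MKV-L", "MRV-L", "MKVAL"]
# The coverage threshold is raised for this four-sequence toy example so that
# the single-strain insertion in column 4 is removed.
toy_consensus = consensus_from_alignment(toy_alignment, min_coverage=0.5)
print(toy_consensus)
```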

Alleleometrics quantify conservation and sequence variation across the genome of a species

For each position across the consensus genome sequence of a species, we find the allele frequency of the dominant allele (amino acid & codon) across all strains from which the position is defined. The frequency of the dominant allele is equivalent to the conservation strength of the genomic position and is plotted for all positions across the genome of a species (Fig. 1b). This allows us to define regions of the genome that are variable (low consensus strength) or conserved (high consensus strength).

The sequence variation is defined by the allele frequency of all variants across all positions of the genome (Fig. 1d). Because there can be multiple variants at the same genome position, we refer to variants as “significant” if they are present in a large fraction of strains and as “rare” if they are only found in a minute fraction of strains. For the purposes of this study, these terms are qualitative.
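For a single aligned position, the conservation strength and the variant frequencies described above can be computed as in the following sketch. The input is assumed to be the residues observed at that position across strains, with gaps excluded from the denominator; the function name `position_frequencies` is hypothetical.

```python
from collections import Counter


def position_frequencies(column):
    """Return (dominant allele, its frequency, {variant: frequency}) for one position."""
    counts = Counter(residue for residue in column if residue != "-")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("position is not represented in any strain")
    ranked = counts.most_common()                  # alleles rank-ordered by count
    dominant, dominant_count = ranked[0]
    variants = {allele: count / total for allele, count in ranked[1:]}
    return dominant, dominant_count / total, variants


dominant, conservation, variants = position_frequencies("KKKR")
print(dominant, conservation, variants)
```

Whether a variant counts as "significant" or "rare" is left qualitative here, as in the text; no cutoff is implied by the sketch.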

Alleleome ‘narrowness’ is defined by non-dominant alleles at each genomic position

We rank-order all alleles at each genome position by their allele frequencies. The dominant (most common) allele and any additional non-dominant (less common) alleles can be used to describe the sequence diversity at each genomic position. Positions whose sequence diversity can be described by the dominant allele and up to 1 additional non-dominant allele are considered to have a ‘narrow’ sequence diversity. Positions whose sequence diversity can only be described after considering the dominant allele and at least 3 additional non-dominant alleles are considered to have a ‘wide’ sequence diversity. Trivially, positions with no non-dominant alleles are invariant. In this manner, we describe the consensus genomes of 184 species of bacteria (Fig. 2) and show that pathogenicity is often associated with lower invariance (Fig. S3).
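The classification above can be expressed as a small function. This sketch makes two assumptions that the text does not spell out: diversity is measured simply by counting the distinct non-dominant alleles at a position, and positions with exactly two non-dominant alleles (which fall between the stated ‘narrow’ and ‘wide’ definitions) receive the hypothetical label `intermediate`.

```python
def classify_position(column):
    """Label one position by its number of distinct non-dominant alleles.

    `column` holds one allele per strain in which the position is defined.
    """
    if not column:
        raise ValueError("position is not defined in any strain")
    n_variants = len(set(column)) - 1  # distinct alleles minus the dominant one
    if n_variants == 0:
        return "invariant"
    if n_variants == 1:
        return "narrow"
    if n_variants >= 3:
        return "wide"
    return "intermediate"  # assumed label; not defined in the text


if __name__ == "__main__":
    for col in ("LLLLLL", "KKKKKR", "AAASTG"):
        print(col, classify_position(list(col)))
```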

The evolutionary landscape of amino acid substitutions

Across all species, from the set of all mutations (deviations from consensus sequence, 1.3 billion), those that change the amino acid sequence were identified (312 million, Fig. S1). Mutations stemming from the same original amino acid were grouped together. For each amino acid, the frequency with which it is substituted (i.e., its ‘nonsynonymous mutant fraction’ normalized across all amino acid substitutions in the species) and the average predicted severity (as estimated by the Grantham score¹⁸) of all mutations in the group were plotted for each species (20 amino acids × 184 species, Fig. 3a).

The distributions of mutant fractions and Grantham scores for each amino acid mutation group were used to determine the probability function that describes the evolutionary landscape of amino acid substitutions. Briefly, we sample points across the range of observed mutant fractions (x) and Grantham scores (y) (shown in Fig. 3a). For each point (X,Y) in the sample space, we calculate the probability (P_C, p-value) associated with that point belonging to each point-cloud (C). We assign point (X,Y) to the point-cloud associated with the highest p-value (i.e., the nearest point-cloud in the probability-space). Shown in Fig. 3b, the evolutionary landscape of amino acid substitutions is described by a probability function with the equation:

Across the sample space, the sharp peaks (z-axis, Fig. 3b) that we observe above each amino acid point-cloud in Fig. 3a confirm that the distributions of observed Grantham scores and mutant fractions are quite distinct, even if they seem to overlap when plotted in 2D (i.e., Fig. 3a).
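The assignment of a sampled point to its most probable point-cloud can be illustrated as follows. The text does not specify how the probability P_C is computed, so this sketch substitutes an assumption: each point-cloud is summarized by an axis-aligned (uncorrelated) bivariate Gaussian fitted to its points, and the Gaussian density stands in for P_C. The toy coordinates and cloud labels are invented for the example and are not values from the study.

```python
import math
from statistics import mean, pstdev


def fit_cloud(points):
    """Summarize a point-cloud by per-axis mean and standard deviation."""
    xs, ys = zip(*points)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        raise ValueError("cloud needs spread along both axes")
    return mean(xs), sx, mean(ys), sy


def density(params, x, y):
    """Axis-aligned bivariate Gaussian density (assumed stand-in for P_C)."""
    mx, sx, my, sy = params
    zx, zy = (x - mx) / sx, (y - my) / sy
    return math.exp(-0.5 * (zx * zx + zy * zy)) / (2 * math.pi * sx * sy)


def assign_point(clouds, x, y):
    """Return (label, score) of the cloud with the highest density at (x, y)."""
    scores = {name: density(fit_cloud(pts), x, y) for name, pts in clouds.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Hypothetical (mutant fraction, mean Grantham score) points per amino acid.
    clouds = {
        "L": [(0.09, 55.0), (0.10, 60.0), (0.11, 65.0)],
        "W": [(0.010, 100.0), (0.012, 110.0), (0.014, 120.0)],
    }
    print(assign_point(clouds, 0.10, 61.0)[0])
```

Evaluating `assign_point` over a grid of (X, Y) values and plotting the returned score on the z-axis gives a surface of the same general kind as the one described for Fig. 3b, under the stated Gaussian assumption.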

Proclivity for mutation of amino acids

The mutant fraction of each group of mutations stemming from the same amino acid was normalized by the fraction of the genome occupied by the amino acid across all species of bacteria. Thus, we can determine the enrichment of mutations in specific amino acids by the ratio of the mutant fraction to the genome fraction of the amino acid. We use seaborn.clustermap (‘single’ linkage, ‘euclidean’ metric) to plot the clustered heatmap of mutation enrichment/reduction values for each amino acid. We observe distinct (column) clusters across the range of amino acids, suggesting a physicochemical basis for the evolutionary stability of specific amino acids. Furthermore, clustering the data also shows the conservation of mutation proclivity of amino acids across families of bacteria (row clusters). The results of this analysis are shown in Fig. 3c (for Firmicutes) and Fig. S4 (across all phyla).
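A minimal sketch of the enrichment calculation and the clustering call is given below. The input dictionaries, the function names, and the decision to skip amino acids absent from the genome are assumptions for illustration; only the use of `seaborn.clustermap` with single linkage and the Euclidean metric comes from the text. The plotting helper imports pandas and seaborn lazily so that the enrichment function can be used on its own.

```python
def mutation_enrichment(mutant_counts, residue_counts):
    """Ratio of mutant fraction to genome fraction for each amino acid.

    mutant_counts:  amino acid -> number of substitutions away from it
    residue_counts: amino acid -> number of consensus positions it occupies
    Values above 1 indicate enrichment; values below 1 indicate reduction.
    """
    total_mut = sum(mutant_counts.values())
    total_res = sum(residue_counts.values())
    if total_mut == 0 or total_res == 0:
        raise ValueError("counts must not be all zero")
    enrichment = {}
    for aa, n_res in residue_counts.items():
        if n_res == 0:
            continue  # amino acid absent from the genome: ratio undefined
        mutant_fraction = mutant_counts.get(aa, 0) / total_mut
        genome_fraction = n_res / total_res
        enrichment[aa] = mutant_fraction / genome_fraction
    return enrichment


def plot_enrichment(table):
    """Clustered heatmap; `table` maps species -> {amino acid: enrichment}."""
    import pandas as pd
    import seaborn as sns

    # Rows are species, columns are amino acids.
    frame = pd.DataFrame.from_dict(table, orient="index")
    return sns.clustermap(frame, method="single", metric="euclidean")


if __name__ == "__main__":
    # Toy counts for three amino acids in one hypothetical species.
    print(mutation_enrichment({"A": 30, "L": 10, "W": 10},
                              {"A": 40, "L": 50, "W": 10}))
```

Whether the plotted values are raw ratios or a transformation of them (for example a logarithm) is not stated in the text; the sketch uses raw ratios.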

Cross-phyla measurement of dN/dS at the genome and gene-level

For the consensus genome of each species, the set of all mutations was identified (see above, Fig. S1). The total numbers of synonymous and non-synonymous mutations in each species are graphed (Fig. 4a). A global dN/dS value (of 0.32) was determined using the total counts of non-synonymous and synonymous mutants across all strains for all species. For each gene within each species, the numbers of synonymous and non-synonymous mutations are counted. We draw a contour map to visualize the distribution of gene-level dN/dS values observed (Fig. 4b).
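As a worked check of the global value, the sketch below treats dN/dS as the plain ratio of non-synonymous to synonymous mutation counts, which is how the paragraph above reads. It assumes that the synonymous count is the total minus the non-synonymous count, and it uses the rounded totals quoted earlier in the Methods (1.3 billion mutations, 312 million of them amino-acid-changing), so the result is approximate. The function name `dn_ds` is hypothetical.

```python
def dn_ds(nonsynonymous, synonymous):
    """Count-based dN/dS; returns None when there are no synonymous mutations."""
    if synonymous == 0:
        return None  # ratio undefined for this gene or genome
    return nonsynonymous / synonymous


if __name__ == "__main__":
    total_mutations = 1.3e9   # rounded total from the text
    nonsyn = 312e6            # rounded non-synonymous count from the text
    syn = total_mutations - nonsyn  # assumed: all other mutations are synonymous
    print(round(dn_ds(nonsyn, syn), 2))  # 0.32

    # The same function applies per gene, given per-gene counts (toy values).
    gene_counts = {"geneA": (12, 40), "geneB": (3, 0)}
    print({g: dn_ds(n, s) for g, (n, s) in gene_counts.items()})
```

With these rounded inputs the ratio is 312/988 ≈ 0.316, which agrees with the reported 0.32. Genes without any synonymous mutation are returned as `None` here; the source does not say how such genes are treated in the contour map.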

We previously identified 69,607 gene-variants of 7,710 AMR-conferring genes²⁰. Across all alleles of all strains, we find 43,042 direct sequence matches to this set of gene-variants, representing 3,757 unique AMR genes across 141 species. Of these, 2,549 AMR genes are present in the alleleome (> 5% of strains in a species) and are plotted. We note that an AMR-allele may have a direct sequence match to alleles in multiple species; thus, multiple dN/dS values may be calculated for the same AMR gene (Fig. 4c). Furthermore, since we used a direct-sequence-match (to previously identified AMR genes) criterion to determine AMR genes in our dataset, it is likely that we fail to detect a large fraction of AMR genes that share high sequence similarity with (but are not identical to) our previously identified AMR allele sequences. We expect this effect to be more pronounced in species that are distantly related to the 12 species for which AMR genes were previously identified²⁰.

There is NO Competing Interest.

CatoiuAlleleomeMSDatasetS1genomeIdsandmetadata.xlsx
DATA SET S1
CatoiuAlleleomeMSDatasetS2speciesIddesignationandalleleomegsproperties.xlsx
DATA SET S2
CatoiuAlleleomeMSSIfiguresanddatasetlegends.pdf
SUPPLEMENTARY INFORMATION
