Extracting alleles from WGS of 54,191 strains of bacteria.
The alleleomes used in this study were adapted from our previous work20, constructed as follows. First, genomes from the Web of Life collection 40 were filtered based on the following criteria: 1) GTDB species classification (release 202) is available41, 2) CheckM42 contamination < 10% and completeness > 80%, 3) number of contigs is within three times the median number of contigs for all assemblies for the genome’s species, and 4) total assembly length is within three standard deviations from the mean of all assemblies for the genome’s species. After filtering, 184 bacterial species defined by GTDB were identified with at least 50 genomes passing all criteria, totaling 54,191 genomes. The genome IDs and metadata are provided in Dataset S1 of this study. The quality metrics of the genomes can be found in Dataset S1 of Hyun et. al. 202320).
Next, open reading frames (ORFs) for each genome were identified using Prodigal43 v2.6.3 with parameters '-c', '-m', '-g', '11', '-p', 'single', '-q', based on those used in by the Prokka annotation platform44. For each species, “genes” were defined using the approach previously described37: all unique ORF protein sequences were clustered using CD-HIT45 v4.6 with minimum identity 80% and minimum alignment length 80% and each cluster was treated as a gene. Finally, the nucleic acid and amino acid alleles of each gene were defined as the set of amino acid and protein sequences corresponding to each ORF in the gene’s cluster, respectively.
Muscle alignment of allele sequences reveals the consensus allele and wild-type variants
As previously described7, we generate the consensus sequence of a species from the whole genome sequences of all strains within a species. Briefly, all alleles (gene-variants) for all genes are identified within a species. Alleles that contain early terminations resulting in greater than 20% gene loss are removed. Alleles that are more than 2 standard deviations shorter from the mean allele length are also removed. Genes with remaining allele representation in less than 5% of strains are not considered (“low-frequency genes”, Fig. 1a).
MUltiple Sequence Comparison by Log- Expectation (MUSCLE39) is used to align all amino acid (AA) alleles of a gene. For each position across the length of the gene, the most common amino acid residue (dominant allele) is chosen to represent the allele consensus. For positions that are represented in less than 5% of strains (e.g., insertions at the termini of genes) are removed (“low-frequency terminal insertion, Fig. 1a). The consensus sequence of each gene is used to characterize all deviations (non-dominant alleles, ”mutations”, “variants”) from consensus. The collection of consensus sequences for all genes is used to define the alleleome of a species at genome-scale. We keep track of each amino acid allele and its mRNA transcript such that the alleleome can be reverse-translated and described at the codon-level (as in Fig. S2). Unless otherwise specified, we hereafter use the term ‘allele’ to describe the sequence variation at the level of individual amino acids or codons, rather than at the gene-level.
Alleleometrics quantify conservation and sequence variation across the genome of a species
For each position across the consensus genome sequence of a species, we find the allele frequency of the dominant allele (amino acid & codon) across all strains from which the position is defined. The frequency of the dominant allele is equivalent to the conservation strength of the genomic position and is plotted for all positions across the genome of a species (Fig. 1b). This allows us to define regions of the genome that are variable (low consensus strength) or conserved (high consensus strength).
The sequence variation is defined by the allele frequency of all variants across all positions of the genome (Fig. 1d). Because there can be multiple variants at the same genome position, we refer to variants as “significant” if they are present in a large fraction of strains and as “rare” if they are only found in a minute fraction of strains. For the purposes of this study, these terms are qualitative.
Alleleome ‘narrowness’ is defined by non-dominant alleles at each genomic position
We rank-order all alleles at each genome position by their allele frequencies. The dominant (most common) allele and any additional non-dominant (less common) alleles can be used to describe the sequence diversity at each genomic position. Positions whose sequence diversity can be described by the dominant allele and up to 1 additional non-dominant allele are considered to have a ‘narrow’ sequence diversity. Positions whose sequence diversity can only be described after considering the dominant allele and at least 3 additional non-dominant alleles are considered to have a ‘wide’ sequence diversity. Trivially, positions with no non-dominant alleles are invariant. In this manner, we describe the consensus genomes of 184 species of bacteria (Fig. 2) and show that pathogenicity is often associated with lesser invariance (Fig. S3).
The evolutionary landscape of amino acid substitutions
Across all species, the set of all mutations (deviations from consensus sequence, 1.3 Billion) that change the amino acid sequence were identified (312 Million, Fig. S1). Mutations stemming from the same original amino acid were grouped together. For each amino acid, the frequency that it is substituted (i.e., its ‘nonsynonymous mutant fraction’ normalized across all amino acid substitutions in the species) and the average predicted severity (as estimated by the Grantham score18) of all mutations in the group were plotted for each species (20 amino acids x 184 species, Fig. 3a).
The distributions of mutant fractions and Grantham scores for each amino acid mutation group were used to determine the probability function that describes the evolutionary landscape of amino acid substitutions. Briefly, we sample points across the range of observed mutant fractions (x) and Grantham scores (y) (shown in Fig. 3a). For each point (X,Y) in the sample space, we calculate the probability (PC, p-value) associated with that point belonging to each point-cloud (C). We assign point (X,Y) to the point-cloud associated with the highest p-value (i.e., the nearest point-cloud in the probability-space). Shown in Fig. 3b, the evolutionary landscape of amino acid substitutions is described by a probability function with the equation:
Across the sample space, the sharp peaks (z-axis, Fig. 3b) we observe above each amino acid point-cloud in Fig. 3a confirms that the distributions of observed Grantham scores and mutant fractions are quite distinct, even if they seem to overlap when plotted in 2D (i.e., Fig. 3a).
Proclivity for mutation of amino acids
The mutant fraction of each group of mutations stemming from the same amino acid was normalized by the fraction of the genome occupied by the amino acid across all species of bacteria. Thus, we can determine the enrichment of mutations in specific amino acids by the ratio of mutation fraction and genome fraction of the amino acid. We use seaborn.clustermap (‘single’ linkage, ‘euclidean’ metric) to plot the clustered heatmap of mutation enrichment/reduction values for each amino acid. We observe distinct (column) clusters across the range of amino acids, suggesting a physio-chemical basis for evolutionary stability of specific amino acids. Furthermore, clustering the data also shows the conservation of mutation proclivity of amino acids across families of bacteria (row clusters). The results of this analysis are shown in Fig. 3c (for Firmicutes) and Figure S4 (across all phyla).
Cross-phyla measurement of dN/dS at the genome and gene-level
For the consensus genome of each species, the set of all mutations was identified (see above, Fig. S1). The total number of synonymous and non-synonymous mutations in each species are graphed (Fig. 4a). A global dN/dS value (of 0.32) was determined using the total counts of non-synonymous and synonymous mutants across all strains for all species. For each gene within each species, the number of synonymous and non-synonymous mutations are counted. We draw a contour map to visualize the distribution of gene-level dN/dS values observed (Fig. 4b).
We previously identified 69,607 gene-variants of 7,710 AMR-conferring genes20. Across all alleles of all strains, we find 43,042 direct sequence matches to this set of gene-variants, representing unique 3,757 AMR genes across 141 species. Of these, 2,549 AMR genes are present in the alleleome (> 5% of strains in a species) and are plotted. We note that an AMR-allele may have a direct sequence match to alleles in multiple species, thus multiple dN/dS values may be calculated for the same AMR gene (Fig. 4c). Furthermore, since we used a direct-sequence-match (to previously identified AMR genes) criteria to determine AMR genes in our dataset, it is likely that we underestimate a large fraction of AMR genes that share high sequence similarity with (but are not identical to) our previously identified AMR allele sequences. We expect this effect is more pronounced in species that are distantly related to the 12 species for which AMR genes were previously identified20.