Dataset overview
We established a large-scale GAS genomics dataset of 2,598 GAS strains from public databases and published literature (Additional File 1). The dataset encompasses more than 25 pathogenic phenotypes and 152 different emm serotypes from 29 isolation countries. The clinical information and geographical distribution of the top 14 phenotypes of the GAS strains was presented in Fig. 1. Three notable features were observed: (i) no single phenotype was dominated by strains of a single emm serotype (Fig. 1a), (ii) the majority of the strains were collected in the last two decades when high-throughput sequencing technologies became available (Fig. 1b), and (iii) the strains are broadly distributed across global geographic locations (Fig. 1c). Those features make the current GAS genomics data suitable for systematic analysis of statistical association between phenotypes and genotypes.
We identified the core genome of GAS using a selected set of 165 strains with complete genomes and high sequence quality (Additional File 2), yielding 1,566,429 core genomic sites distributed in 1584 core genes. The majority of the core genes have a high level of mapping coverage among the GAS strains and 138 genes have a low level of coverage (the number of strains whose gene coverage less than 50% > 7) (Figure S1). We excluded 104 of the 138 genes from association tests due to the low mapping coverage (Additional File 3).
Mutation detection
We selected a training cohort of 790 strains with balanced emm serotype distributions and full-range coverages of isolation time points for mutation detection (Additional File 4) and focused on the top eight phenotypes containing a sufficient number of strains, i.e., invasive (INV), pharyngitis (PHY), superficial soft tissue infection (SSTI), superficial soft tissue infection (SSTI), deep soft tissue infection (DSTI), rheumatic fever (RF), sepsis (SP), scarlet fever (SF), and pneumonia (PM), for model fitting and association tests (Fig. 1d-1e). The phylogenetic structure based on the GAS core genome shows that the strains do not cluster by phenotypes, emm serotypes, or isolation year, and therefore the population stratification or epidemiological factors will not significantly influence the statistic association test (Figure S2). Furthermore, our method assesses the relative allele frequencies only between paired phenotypes, and therefore the overall population stratification will not confound the association test as it does for conventional case-control tests.
A total of 153,348 bi-allelic SNPs and 15,468 InDels were detected across the core genome for the training cohort GAS strains. Among them, 138,907 (82.3%) fall into coding regions and 63,321 (37.5%) cause translational changes (non-synonymous SNPs and frame-shift InDels) (Figure S3a). The gene-wise profiling of the mutations reveals a quasi-normal distribution across the core genes with an average number of 80 mutations per gene, corresponding to an average mutation ratio of 8.6% per gene (Figure S3b). The majority of the mutations have a minor allele frequency (MAF) < 10% in all phenotypes, but the distribution pattern of MAF varies among the phenotypes (Figure S3c). Further inspection of the pairwise correlation of MAF between phenotypes shows that DSTI is significantly different from other phenotypes, and therefore, was removed from the association test (Figure S3d-S3i). We used the remaining seven phenotypes (INV, PHY, SSTI, RF, SF, SP and PM) for downstream analysis. We then took a strategy of case-case association analysis to identify genetic signatures differentiating pair-wise phenotypes of GAS by constructing ten binary comparison groups for association tests, i.e., INV-PHY, INV-SSTI, INV-PHYSSTI, INV-RF, PHY-SSTI, PHY-RF, RF-SSTI, PM-SF, PM-SP, and SF-SP.
New InDel-based and CDS-based GWAS methods enable detection of missing association between GAS phenotypes and genotypes
In order to perform association tests for a broader range of genetic variants in addition to SNPs, we developed new methods IndelWinScan for InDel-based GWAS and CDS2GWAS for CDS-based GWAS (Fig. 2a), which allowed association tests for InDel mutational status of nonCDS fragments and the coding status of CDS, respectively. Our methods treated nonCDS or CDS as genetic units and evaluated the overall genetic status of the units caused by integrative effects of SNPs and InDels within the units, and then transformed the genetic status to categorical genotypes, i.e., intact vs. inactivated or intact vs. mutated (see Methods & Materials). The transformed genotypes were finally used for testing associations with the phenotypes of interest. For those genes without being inactivated, we employed the traditional SNP-based GWAS analysis to identify the associations between SNPs in those genes and phenotypes of interest.
Using these newly developed methods, we performed GWAS analysis in three modalities across the whole genomes of GAS, namely, SNP-based GWAS, InDel-based GWAS, and CDS-based GWAS for each of the ten paired groups of phenotypes we constructed. The association results for the ten groups are summarized in Fig. 2b. It is seen that the highest number of genotype-phenotype associations was detected from the coding SNPs in all tested groups, indicating that SNPs in the coding regions are still the major source of genetic variations. The InDel mutational fragments in nonCDS regions are the second leading source of genetic variations with phenotypic associations, highlighting the importance of investigating mutations in these regions. The truncated CDS genes account for the lowest number of associations.
In addition to the genomic region-specific distribution of the variants, a clear phenotype-dependent pattern was also observed. For instance, the INV-containing comparison groups (INV-PHY, INV-SSTI and INV-PHYSSTI) yield the highest number of truncated CDS significantly associated with phenotype differentiation, suggesting frequent alterations of gene coding status in invasive GAS strains in comparison with non-invasive strains.
To further assess the loci associated with the phenotypes, we took the comparison groups INV-SSTI as an example to visualize the variants across the genome with the Manhattan plot (Fig. 3, the full set of Manhattan plots can be found in Figure S4) and then further validated the most significant candidates in the validation cohort of GAS strains (Fig. 4). The most significant variants from the SNP-based GWAS were detected in the gene nga, encoding NAD glycohydrolase. Seven of the variants were confirmed between INV-SSTI in the validation cohort, including two non-coding SNPs in the promoter spacer region (T-22G at 150429 and C-18T at 150433) and four non-synonymous SNPs in the coding region (Ala99Val, Arg103His, Gly136Arg, Met195Ile, and Leu199Ile) with high association significance (P < 10− 3, Fig. 4a). The two SNPs in the nga promoter at -22 and − 18 have been reported to be related with enhanced virulence in animal models of pharyngitis and necrotizing fasciitis [32].
For InDel-based GWAS, 244 mutated non-coding fragments showed significant association in the testing for the group INV-SSTI. Those fragments are located at the 5’ upstream regions of 37 genes (with an average distance of 198 bp) or the 3’ downstream regions of 32 genes (with an average distance of 214 bp) (Fig. 3b and Figure S5). Among these fragments, the two most significant fragments associated with INV are in the promoter region of the small subunit ribosomal RNA (SSU rRNA) located at around 1.59 Mb (AP53_1607) and 16 kb (AP53_15) of the reference genome, respectively (P < 10^-4) (Fig. 4b and 4c). The two fragments were further verified in the validation cohort, which contain multiple InDels, including one in the UP element of the SSU rRNA promoter region [33]. Previous studies showed that the UP element can enhance rRNA transcription upon bound by the RNA polymerase α subunit in both Escherichia coli and Bacillus subtilis, although to different extents [34, 35].
Notably, for the CDS-based GWAS, two known virulence regulator gene, covS and rocA, were identified to be among the most significant genes whose inactivated coding status is associated with INV relative to SSTI (Fig. 3c). Although the inactivation genotype of covS has been experimentally determined to be associated with hypervirulence and invasiveness in several GAS isolates [13–16], our study first identifies the association on the population level by leveraging the advantage of our new CDS-based GWAS method, which allows testing the gene coding status regardless of the specific variant forms in the gene. This association was confirmed in the validation cohort (Fig. 4d-e). Specifically, we detected 12 different forms of allelic variants, including SNPs, insertions, and deletions in the covS coding region independently occurred in 26 GAS strains (Fig. 4d). The 12 variants all led to covS inactivation and when tested collectively as a gene-inactivation genotype, showed highly significant association with GAS invasiveness (β = 2.3, P = 0.002). Similarly, we detected three different forms of inactivating mutations in the gene rocA from seven GAS strains. By treating them collectively as a gene-inactivation genotype, we identified the significant association between the inactivated rocA with GAS invasiveness (β = 2.1, P = 0.017) (Fig. 4e). Previous experiments demonstrated that truncated rocA or suppressed expression of rocA resulted in enhanced GAS survival in human blood in M18, M3, and M89 GAS isolates [36–38]. The successful identification of the association of covS/rocA inactivation with GAS invasiveness demonstrates the capability of our new method of GWAS in detecting key associations which have been missed by other methods.
The variants in different genomic partitions are involved in distinct biological functions
In order to characterize the pattern of genotype-phenotype associations at different genomic regions, we made genomic partitioning for the associated variations detected in the three GWAS modalities into four categories based on their genic locations and functional impacts, i.e., the 5’ upstream non-coding regions of genes (5’-nonCDS), the 3’ downstream non-coding regions of genes (3’-nonCDS), intact genes without being truncated (intactCDS), truncated CDS by any kind of mutations (trunCDS) (Fig. 5a). To reduce the influence of redundant signals, we only selected the genetically independent and functionally effective variants using a pruning and thresholding strategy based on linkage disequilibrium (LD) and functional annotation [39] (Figure S6, See Methods & Materials). As a result, we obtained 1,361 variants linked with 783 genes across the four partitions (Additional File 5). On the one hand, 251 (32.0%) genes are linked with variants from more than one partition and 496 (63.3%) have variants significant in more than one binary-phenotype tests (Fig. 5b). On the other hand, all 10 binary-phenotype tests detected associated variants from at least two partitions and at least two genes (Figure S7). The genotype-phenotype association relationships clearly reveal a pleiotropic and polygenic pattern.
To obtain functional insights of the variants significant in the association tests in distinct genomic partitions, we classified the 783 genes linking with associated variants into 30 functional classes (Fig. 5c and Additional File 6). The distribution of the genes in the functional classes is highly dependent on the genomic partitions of the variants and the binary-phenotype tested. Specifically, the variants significant in the phenotype pairs INV-SSTI, INV-PHYSSTI, RF-SSTI, PM-SF, and SF-SP are more frequently localized in the genes involved in metabolism, such as “Carbohydrates”, “DNA Metabolism”, “Protein Metabolism”, and “RNA Metabolism” (P < 0.05), whereas the variants significant in the phenotype pairs INV-RF, PHY-RF, and PM-SP are sporadic in the metabolic functional classes, suggesting higher metabolic similarities between the three phenotype pairs. The associated variants in the partition of 5’-nonCDS are more significantly enriched in the genes involved in “Virulence, Disease and Defense”, and the associated SNPs in the partition of intactCDS are enriched in several categories, such as “Virulence, Disease and Defense”, “Protection System”, “Iron Acquisition and Metabolism”, and “Carbohydrates”, indicating multi-faceted influences of the coding SNPs on a wide range of functions in GAS. The translational truncation of coding genes (trunCDS) avoids most of the functional classes but is predominantly enriched in the category “Regulation and cell signaling”. The truncated genes in this category include not only the well-known virulence regulator covS, but also several other less-studied regulators, such as rocA, rivR, mutR, and luxR. It indicates that the transcriptional shift caused by transcriptional regulators is a common mechanism for GAS to switch the pathogenicity or phenotypes [40, 41].
Genetic architecture of GAS phenotypes is a hybrid model
The polygenic nature of GAS phenotypes prompted us to investigate the detailed genetic architecture of the phenotypes. By examining the relationship between MAF and effect size of the genetic variants detected in GWAS, our data supports a multifaceted genetic architecture of GAS phenotypes, which is manifested as a combination of a long tail of common variants (MAF ≥ 5%) with low effect sizes and a small proportion of low-frequency variants (1% ≤ MAF < 5%) with large effect sizes (Fig. 6a-d). Of particular interest is the partition of trunCDS, where the genetic architecture is dominated by low-frequency variants of large effect size, signifying a trend of lower population frequency and elevated impact of gene inactivation on phenotypic differentiation of GAS relative to the non-inactivating variants. A notable example in this partition is rivR, a regulator of virulence factor expression, which has an effect size as high as e3.3~27.1 and MAF ~ 2.1%. Interestingly, the covS inactivation in this partition also has a high effect size of e3.0~20.1, but the MAF ~ 7.8% at a modest level. It provides the evidence for the presence of high-effect variants segregating at an appreciable frequency in polygenic phenotypes of GAS.
Overall, the genetic architecture of the complex phenotypes of GAS is characterized by highly polygenic variants covering a wide spectrum of allelic frequency and effect size along with a few close-to-common variants with large effect sizes.
To further delineate the functional implications of the polygenic genetic architecture of GAS, we evaluated the association on the gene level and function level. We first constructed the bipartite network graph between the eight phenotypes and the genes in the four genomic partitions (Fig. 6e-h). A phenotype and a gene are connected by a positive/negative edge if that gene is linked with variants positively/negatively associated with that phenotype. Each of the eight phenotypes is associated with multiple genes across the four partitions and the gene number varies in a wide range from 6 to 148 depending on the specific phenotypes (Fig. 6i and Additional File 7). Regardless of the genomic partitions, PHY has the most positively associated genes and SSTI has the most negatively associated genes, indicating substantial genetic differences between these two major symptomatic phenotypes. In the partition of trunCDS, the phenotype INV is associated with multiple inactivated genes in both positive and negative direction with covS the most significant one (P = 5.0 x 10− 5).
By integrating the contributions of multiple genes involved in the same functional category, we then constructed the bipartite network between the eight phenotypes and the functional categories (Figure S8). The overall network reflects the complex multifaceted etiology of GAS phenotypes and pathogenesis. The phenotypes are specifically associated with certain functional categories in each partition. For instance, the phenotype INV is most significantly associated with the functional category “Virulence, Disease and Defense” via the SNPs in the partition of intactCDS (P = 5.1 x 10− 16, weight = 15.3) and via the variants in the partition of 5’-nonCDS (P = 4.0 x 10− 18, weight = 17.4). Its association with the category “Regulation and Cell signaling” is prominent in the partition of trunCDS (P = 2.2 x 10− 7, weight = 6.65). The strong association between INV and these two functional categories highlights the key functional modifications and the way the functions were modified underlying GAS invasiveness. For PHY and SSTI which have the highest number of positive and negative gene-level associations, respectively, the two phenotypes exhibit highly significant and opposite associations with several common functional categories (P < 10− 8), such as “Carbohydrates” (P < 2.5 x 10− 11), “DNA Metabolism” (P < 4.0 x 10− 9), “RNA Metabolism” (P < 3.1 x 10− 9), “Protein Metabolism” (P < 10− 8), “Virulence, Disease and Defense” (P < 1.6 x 10− 12), “Cofactors, Vitamins, and Prosthetic Groups” (P < 1.3 x 10− 5) in the partition of intactCDS or 5’-nonCDS. Given that GAS strains of PHY and SSTI phenotype have been known to diverge in two virulence-related gene clusters, i.e., mga regulon and pilus assembly, our data here provides complementary genetic factors contributing to the differences between these two highly prevalent GAS phenotypes. The data supports a paradigm of phenotypic differentiation of GAS determined by a combination of high divergence in several key genes and many small-effect variants in genes responsible for a wide spectrum of cellular functions, which together shape final pathogenesis, survival and adaptation capacity of GAS.
Pleiotropicity of the associated genes
As a reciprocal side of polygenicity, we next investigated the pleiotropicity of the associated genes across the eight phenotypes. It is manifested by an average of ~ 38.6% of the genes associated with at least two phenotypes and 5.3% associated with at least four phenotypes (Fig. 7a-d and Additional File 7). Further functional enrichment analysis showed that the genes with pleiotropicity ≥ 4 are more significantly enriched in the category “Virulence, Disease and Defense” (in the genomic partition of intactCDS and 5’-nonCDS) and “Iron Acquisition and Metabolism” (in the partition of intactCDS), in clear contract to the functional enrichment pattern for the genes with pleiotropicity < 4, which expands a wide spectrum of categories, such as “Regulation and Cell signaling”, “Iron Acquisition and Metabolism”, and “Cofactors, Vitamins, and Prosthetic Groups” (Fig. 7e) [42, 43]. The distinct functional enrichment pattern of polygenicity groups highlights the critical roles of the genes in the two functional categories “Virulence, Disease and Defense” and “Iron Acquisition and Metabolism” in diversifying GAS phenotypes or pathogenesis.
The pleiotropicity of the associated genes is further illuminated by 26 genes harboring distinct non-synonymous SNPs within the same genes but associated oppositely with the same phenotype (Additional File 7). Among them, the most notable gene is nga which is associated with all eight phenotypes via ten distinct variants from three genomic partitions, i.e., the 5’-nonCDS, 3’-nonCDS, and intactCDS (Fig. 7f and Additional File 8).
The genetic variants from three GWAS modalities explained a large proportion of GAS phenotypic variance
Next, we assessed the phenotypic variance explained by the polygenic variants detected in each phenotype-pair test. That will help to estimate the potential of the variants for GAS phenotype differentiation or disease-causal prediction. In order to minimize the influence of co-linear variants, we first selected the variants without mutual co-linearity within partitions using the LASSO regression method (Figure S9 and Additional File 9), and then calculated the contribution of the variants to the phenotypic variance using the coefficients of determinant measure (R2) on the liability scale [31, 44]. The relative proportions of variance explained by the variants from different GWAS modalities are highly dependent on the phenotypes tested (Fig. 8a). For the phenotype pairs involving INV, i.e., INV-PHY, INV-PHYSSTI, INV-SSTI, and INV-RF, the variants from the SNP-based GWAS (SNP-coding + SNP-nonCoding) explained 34%-48% of the pair-wise phenotypic variance and the gene inactivations from the CDS-based GWAS explained an additional 10% (4%-15%) of the variance, supporting the roles of gene inactivation for GAS invasiveness transition. Again for the four phenotype pairs, the variants from the InDel-based GWAS explained another 0%-4% of the variance. The combination of the gene inactivation from CDS-based GWAS and InDels from InDel-based GWAS together contributed additional 7%-16% of the phenotypic variance, pushing the total average explained variance above 50% (47%-57%). The results show that our new method of GWAS is able to significantly narrow the gap between GAS phenotypes and genotypes.
An exception is the phenotype pair PHY-RF, where all variants only explained a total of 27% variance, implying that many other factors may contribute to the differences between PHY and RF, such as epigenetic modifications and host interactions. For the other phenotype pairs PM-SF, SF-SP, and PM-SP, the explained variance did not increase by additional types of variants. This scenario is partially due to the small sample size of the three phenotypes which biased the estimation of effect sizes and association significance.
Development of a GWAS polygenic score model for phenotypic differentiation of GAS
The large proportion of explained variance by the variants from multiple GWAS modalities motivated us to explore the capability of the variants as genetic markers for GAS phenotype stratification or disease prediction. The polygenic nature of the phenotypes enabled us to develop a GWAS polygenic score (GPS) as a surrogate of multiple genetic markers. GPS is a composite measure of genetic mutation burden over polygenic loci for single individuals and has been frequently used for human disease prediction or risk stratification [45–47]. Here, by leveraging the polygenic nature of GAS phenotypes, we developed GPS-based prediction models for GAS phenotype differentiation by summing over the independent variants from three GWAS modalities (Figure S9 and Additional File 9 (see Methods & Materials). We then evaluated the performance of the GPS models in discriminating GAS phenotypes in the validation cohort of 710 GAS strains using the Receiver Operating Characteristic (ROC) analysis (Additional File 10). The strains of the validation cohort have the similar emm serotype distribution and time coverage to those of the training cohort (Fig. 8b and 8c). The ROC analysis showed that the GPS calculated from each of the three GWAS modalities alone (i.e., SNP-based, CDS-based, and InDel-based GWAS) generated models with an average AUC of 77.5%, 64.4%, and 69.4%, respectively (Fig. 8d). Noticeably, incorporating the CDS-based GWAS made AUC increase by 3.8% and combining all three GWAS modalities made AUC increase by 4.4%, leading to a high discriminatory level above 80% (Fig. 8d). Our results demonstrate the considerable improvement in the performance of the GPS prediction models by adding CDS-based and InDel-based GWAS markers. This high level of discriminatory ability of the integrated GPS score facilitates its promising role in phenotype stratification for this highly versatile organism.
A low-end exception in the AUC values occurred for the phenotype pair PHY-RF, which has an AUC of 74.4%. This modest value of AUC for PHY-RF agrees well with the small proportion of explained variance (27%) by all variants between the two phenotypes, further supporting the possibility of other contributing factors than genetic mutations to the phenotypic difference between PHY-RF.