Metagenomic studies of the human microbiome enable the characterization of the microbial and functional diversity in health and disease1. Advances in metagenome assembly and various clustering methods enabled the generation of metagenome species2-6. Most of these studies focused on unveiling new uncultured genomes, while only few focused on investigating the functional potentials and dynamic changes of the gut microbiome7-9. Understanding the functional and temporal behaviour of the microbiome may have great implications for the identification of its global signature in health and disease10-12. Additionally, short-term perturbations may trigger gut microbiota dysbiosis and changes at compositional and functional levels. Specifically, the negative selective microbe-microbe and host-microbe interactions, in the context of metabolism or antimicrobial machinery, could be the main mechanism underlying microbial dysbiosis13. Large-scale integration of microbiome functional changes and their associations with clinical data may provide novel information on temporal changes in the microbiome and host physiology14.
Herein, we integrated publicly available data from a large number of studies across different countries from healthy and diseased individuals. The analysis is presented in an open-access Human Gut Microbiome Atlas (www.microbiomeatlas.org), allowing researchers to explore for the first time an integrative analysis on composition, functional, richness, diseases and region signatures for the gut microbiota across 19 geographical regions and 20 diseases. This analysis was followed by investigating the gut microbiome of healthy Swedish individuals with four times sampling across one year to study the longitudinal variability of the microbiota. This revealed the tendency of disease-associated species to decrease in abundance. In contrast, the health-associated species tended to increase in abundance. We suggest that this dynamic contributes to the maintenance of gut microbial homeostasis in healthy individuals. These findings were further validated by follow up sampling from same cohort with additional two time points across 6 months.
Human Gut Microbiome Atlas; Pan-metagenomics study on compositional and functional changes of human gut microbiome
We performed large-scale integrative analysis of 4,880 publicly available shotgun metagenomics stool samples, with at least 10 million high-quality sequencing reads from healthy and diseased cohorts from 19 different countries across five continents (Fig. 1a-b and Supplementary Table S1). We rarefied all metagenomic samples into 10 million reads per sample, which enable comparative analysis across different cohorts. We created the Human Gut Microbiome Atlas (HGMA) using quantitative analysis of shotgun metagenomics based on microbial genomes assembled using Metagenomic Species Pan-genomes (MSPs) (Fig. 1c). The MSP number was increased from 1,661 (previous release5) to 1,989 (average number of genes 1,894 ± 1,616) (Methods), and their taxonomy was updated. We generated gene counts and MSP abundances for all the samples using the 10.4 million gene catalogue15. We also characterized the functions and phenotype of the MSPs in 7 different categories (KO, PFAM, CAZyme, Mustard, JGI-GOLD phenotype, PATRIC virulence factor, and antiSMASH biosynthetic gene clusters) and identified co-conserved functional clusters across species (7,763 clusters) (Methods). This information was completed with 344 newly sequenced longitudinal samples from 86 Swedish individuals, described in detail in a subsequent section (Fig. 1d). All the data are freely available in the HGMA, without restrictions, in the public open access database (www.microbiomeatlas.org) that is part of the Human Protein Atlas program (https://www.proteinatlas.org). All MSPs and functions are highlighted together with the 5,224 samples across 19 countries with disease and healthy cohorts.
Using the 3,039 samples obtained from healthy individuals across 18 countries, including westernized and non-westernized regions, we uncovered the geographical distribution of the healthy gut microbiome. To this end, we applied the unsupervised clustering method, monocle, to MSP abundance profiles of the 3,039 samples (Methods)16,17. We observed that there were two distinct ordinations of non-westernized and European samples of healthy subjects (Fig. 1e and Supplementary Fig. 1). Based on comparative analysis across different regions, we also identified 783 MSPs specifically enriched in certain countries (See Methods, Extended Fig 1a-d, Supplementary Fig. 2 and Supplementary Table S2). Functional annotation-based analysis across geographical clusters, revealed an enrichment of CAZymes for degrading host mucins and storage carbohydrates in westernized population, where antimicrobial resistance (AMR) and virulence factors were also more prevalent (Fig. 1f). Comparison of functions of region-enriched MSPs in European countries and in US/China/Japan revealed that fosfomycin and aminoglycoside resistance were the significant AMR, as deduced from the explained variances of ANOVA (Methods). Among the biosynthetic genes encoding the production of secondary metabolites, resorcinol, lanthipeptide, bacteriocin and homoserine lactone were enriched in the European and US/China/Japan populations. These secondary metabolites, together with AMR, appeared to be the key feature in the westernized countries.
To distinguish diseased and healthy microbiome from multiple cohorts, we performed a pan-metagenomics association study (Pan-MGAS) of multiple disease cohorts (18 diseases across 28 cohorts from 11 westernized countries). We determined the enriched and depleted species in diseased compared to healthy control samples, with an effect size > 0.3 (Fig. 1g and Supplementary Table S3). Specific species were either enriched or depleted in a certain disease(s), regardless of geographical differences. For instance, in individuals with fatty liver disease, Gordonibacter urolithinfaciens and Allisonella histaminifomans were depleted and enriched, respectively. We also found that species associated with low gene richness18, such as Clostridium bolteae, were enriched in several diseases. Similarly, Parvimonas micra was enriched in colorectal cancers, regardless of geographical differences. Furthermore, Akkermansia muciniphila was often depleted in several diseases. A. muciniphila is associated with improved intestinal barrier function and its depletion suggests intestinal barrier disruption in these different diseases19. The analysis of the frequency of the enriched/depleted MSPs among diseased cohorts showed that there were common species that could initially disrupt the microbiome balance and cause gut dysbiosis or foster microbial symbiosis that promotes health (top-left and top-right boxes in Fig. 1h, Supplementary Table S4).
Interestingly, we observed that MSPs commonly depleted in diseases were highly country-specific, while commonly enriched MSPs were usually shared by several diseases and were less related to geographical origins (Extended Fig. 1e-i and Supplementary Table S4). Moreover, we observed that MSPs commonly enriched or depleted in diseases were associated with different temporal behaviours among healthy individuals, as detailed in the following sections.
Dynamic changes of gut microbiome composition; inflow and outflow species
We investigated the temporal changes in microbiome composition at the individual level by applying Markov chain models (MCMs) to the MSPs identified in the longitudinal cohort of 86 healthy Swedes (Methods). This analysis enabled us to estimate the transition probability of individual MSPs from presence to absence (outflow probability) and vice versa (inflow probability) across different sampling times. We identified two groups of MSPs preferably transiting from presence to absence or from absence to presence; for brevity, we term them “outflow species (OFS)” and “inflow species (IFS)” respectively (Fig. 2a, Extended Fig. 2a-f, Supplementary Table S5). Clearly, declaring a species absent or present depends on the detection threshold, which is in turn determined by sequencing depth. We performed the analysis at three depth levels for 15, 10, and 5 million reads, and observed largely concordant results (Extended Fig. 3). For instance, 35 IFS (90%) were detected at both 10 and 15 million reads levels, while 4 and 6 species were detected only at former and latter, respectively. Similar results were observed for OFS: 447 (88%) were detected at both levels, while 62 and 27 species were detected at 10 and 15 million reads only. Overall, inflow and outflow probabilities were highly correlated at three different depth levels, with a slight reduction for outflows at 5 million reads (Supplementary Table S5).
To determine the robustness of these findings, we compared species-retaining probability (Kaplan-Meier estimates) with outflow probability, expected to be inversely proportional (Extended Fig. 4a and Methods). We observed that the species that had lower outflow probability, such as Bacteroides vulgatus and Prevotella copri, indeed had the highest retaining probability, whereas those of higher outflow probability (e.g. Veillonella infantium and Streptococcus parasanguinis) had reduced retaining probability. The species retaining probabilities were correlated with their mean abundance (Extended Fig. 4b), even if the associations did not appear significant for any individual species based on Cox regression (p-values > 0.1, Extended Fig. 4c).
We observed that the changes of OFS abundances among 86 individuals (ΔOFS) were inversely correlated with those of IFS (ΔIFS), i.e. suggesting competition between IFS and OFS (Spearman’s correlation = -0.334, p-value = 4.6 × 10-8; Fig. 2b-c, Extended Fig. 2g). A higher abundance of OFS increased the gut microbiome imbalance between different visits, as the similarity of the gut microbiome compositions between the visits was decreased in 31 OFS-enriched individuals compared to 37 OFS-depleted individuals (Wilcoxon one-sided test p-value < 0.01; Fig. 2d). IFS-enriched individuals maintained their gut microbial composition such that it was similar between different time points (Extended Fig. 2h). Interestingly, increasing abundance of IFS was associated with increasing gene richness, known to be related to better health18, suggesting that IFS may be beneficial (Spearman’s correlation = 0.206, p-value = 9.0 × 10-4; Fig. 2e and Extended Fig. 2i).
We hypothesised that IFS and OFS may differ in their growth rates, the former outgrowing the latter, and tested this hypothesis in three ways. First, we estimated species growth rates from metagenomic samples by Growth Rate InDex (GRiD) analysis 20 (Methods). For that, we stratified individuals into groups, enriched in IFS or OFS and found that in both groups GRiD scores of IFS were higher than OFS; they were the highest among IFS-enriched group (Fig. 2f-g). Second, we assessed species growth rates in bioreactors inoculated with healthy human stool samples, via GRiD analysis (Fig. 2h-i, Methods). We observed that the growth of IFS increased significantly over 24 hours, whereas that of OFS did not change, demonstrating that IFS could outgrow the OFS. Third, we used genome scale modelling to simulate species growth rates21-24 (Methods) on four different putative diets (high fibre and high protein for plant- and animal-based diets) for highly prevalent 34 OFS and 30 IFS. The predicted growth rates of the IFS were significantly higher than of OFS (Fig. 2j-k, Supplementary Table S6). Furthermore, the reaction essentiality analysis indicated that the outflow GEMs were significantly dependent on the substrate, often displaying amino acid auxotrophy (Supplementary Figure 3 and 4). We suggest that the differences in growth rates and substrate dependence between IFS and OFS could underlie the directionality of the gut microbiome dynamics we report.
To test whether the inflow and outflow assignments of species deduced from the analysis of the four time points in our cohort persist over time, we collected and analysed two additional time points with 6 months intervals from the same individuals (Fig. 2l-m). Furthermore, to examine whether the assignments defined from a Swedish study are also found in other, geographically different regions, we analysed two publicly available cohorts, from Italy and US9,25(Fig. 2n-q). In all cases, for both IFS and OFS, the transition probabilities were well correlated with those found for the first time points of our longitudinal cohort (Spearman's correlation coefficients > 0.56) for all comparisons). We conclude that IFS and OFS are largely conserved and are thus a global feature of the human gut microbiome.
Enrichment of IFS and OFS species determines richness, dysbiosis, and host physiology
To further explore links between gene richness and IFS and OFS, we determined the gene richness distribution of HGMA samples and grouped healthy samples as either high or low gene richness (HGR and LGR) based on the top and bottom 25% gene counts of HGMA samples, respectively (Extended Fig. 5a-c). We then identified species enriched in HGR and LGR by comparing MSP abundances: total 759 MSPs were enriched in HGR and 95 MSPs were enriched in LGR (|Log2 fold change| > 2, Wilcoxon rank two-sided test adjusted p-value < 10-3) (Supplementary Table S7). Interestingly, LGR-enriched MSPs were significantly enriched in OFS (i.e. higher outflow probability), while no enrichment was observed for HGR-enriched MSPs (Fig. 3a-b). Low gene richness was previously associated with a risk of developing chronic diseases related to the metabolic syndrome; the enrichment of disease associated OFS in LGR individuals was thus coherent with that observation18. We observed similar species enriched with low gene richness and associations with metabolic phenotypes (Extended Fig. 5d).
To investigate the impacts of IFS and OFS species on host physiology, we next examined the association of IFS and OFS with clinical parameters among healthy individuals (Supplementary Table S8). We traced the changes in 40 clinical parameters of 86 Swedish individuals (Supplementary Table S9) and linked them with the abundances of IFS and OFS species using linear mixed effect models (Methods). We found that IFS abundances were associated with muscle composition (p-value = 0.018, Fig. 3c), thus showing associations with microbial core metabolism such as amino acid metabolism, whereas OFS abundances were associated with a more diverse spectrum of clinical parameters, such as blood glucose, urate, B-type natriuretic peptide (BNP), ApoA1, and erythrocyte particles (p-value < 0.05). Interestingly, the OFS abundances were positively associated with BNP as a key heart failure marker (Extended Fig. 6c-g, Supplementary Table 10).
Next, we associated the common depleted and enriched species in diseases (Fig. 1h), from Pan-MGAS analysis of 18 diseases, to the IFS and OFS. The commonly depleted MSPs had a greater tendency to be IFS (Fig. 4a-b). On the contrary, the commonly enriched species (top-right, Fig. 1h) were likely to be OFS, compared to the depleted species across all diseases. Out of the 23 commonly enriched species in diseases (enriched in at least 3 different disease cohorts), 14 (61%) were OFS species, significantly more than the OFSs species within all species detected in the cohorts (36%) (Chi-square test, p-value = 0.015, Supplementary Table S4). In addition, among the 14 OFS species, 5 (36%) were opportunistic pathogens reported previously 26 (Chi-square test, p-value = 10-9). These observations support the view of microbiome dynamics lowering the abundance of potentially harmful species in healthy adults.
Functional understanding of region-enriched species, IFS and OFS
Our functional analysis indicated that the IFS species were enriched in core metabolism, essential for energy homeostasis and for biosynthesis of macromolecules (i.e., amino acids, carbohydrates and fatty acids; Fig. 4c and Supplementary Table S11 and S12, Methods). They were also enriched in: (i) processes associated with increased survival, such as sporulation, cobalamin biosynthesis (CobS), and sirohydrochlorin cobaltochelatase (CbiK); (ii) secondary metabolites (bacteriocin); (iii) proteins related to starch and plant-based fibre use (CAZymes GT5, GH13, GH51); (iv) anaerobic phenotypes (Supplementary Table S11). By contrast, the OFS species were enriched in accessory metabolism, such as biodegradation of xenobiotics (benzene, toluene, ethylbenzene, and xylenes - BTEX), paralleled by that of ABC transporters, possibly involved in the import of xenobiotics, suggesting that exposure to pollutants may promote their appearance (Fig. 4d, Extended Fig. 6a-b, and Supplementary Table S11). They were also enriched in (i) active sugar transport (i.e., phosphotransferase system (PTS); (ii) virulence factors (VFs) and trigger factors; (iii) putative competence protein ComGF and type IV secretion systems, the latter two being important mechanisms for horizontal gene transfer27. Finally, those microbes tended to be facultative anaerobes and microaerophiles.
Outflow enriched-functional clusters showed distinct links to gut microbiome dysbiosis
To better understand potential impact of IFS and OFS at the functional module level, we identified co-conserved functional clusters of microbiome by applying an unsupervised clustering approach on MSPs (Fig. 4e, Extended Fig. 7 and Methods). This analysis provided a better representation of microbial functions than single annotations or known pathway definitions (e.g. KEGG) (Extended Fig. 8). From the community detection algorithm, we identified 7,763 functional clusters, 6,297 singletons, and 591 representative clusters (Methods, Supplementary Table 13). For example, antimicrobial resistance and secondary biosynthetic genes were found to be singletons and not co-conserved with other functional genes. After excluding singletons and unreliable functional clusters detected in less than three species, we retained 591 representative clusters of microbial functions. One of the two largest clusters (CL-12 in Supplementary Table 13, named “comm-cluster” herewith) was over-represented among many commensal species, while the other (CL-10, named “patho-cluster”) was enriched in a few pathobionts, such as Klebsiella spp., Enterobacter spp., and E. coli. Interestingly, the comm-cluster was enriched with genes involved in the biosynthesis of amino acids indicative of functions enriched in IFS species. In contrast, the patho-cluster was enriched in functions associated with the uptake of several substrates, again indicative of transporters enriched in OFS species. These included siderophore, ion, amino acid, and vitamin transport, thus competing with host and commensal bacteria. We also found other enriched-functional clusters, such as butyrate metabolism, propionate metabolism, vitamin B12, coenzyme metabolism, chemotaxis, ATPase, and mobile genetic elements (i.e., integrase and transposase) and the CRISPR-cas system (Fig. 4e); a number of these were correlated with phylum-level taxonomy (Extended Fig. 7c).
We next projected the functional clusters on enriched/depleted MSPs in HGMA disease cohorts (Fig. 4f and Supplementary Fig. 5: hypergeometric tests, p-value < 10-3). We found that many of disease-enriched functional clusters were enriched in the OFS species, for example, isoprenoid biosynthesis, competence proteins for DNA transformation, key signatures of OFS species, virulence factors, and nutrient uptake (e.g. ascorbate and mannose). It has been previously shown that isoprenoid biosynthesis initiates the majority of secondary metabolism28. However, we found a few functional clusters associated with species depleted in diseases, such as the CRISPR-cas system (i.e., the bacterial immune system) and teichoic acid transport.