An Extensive Chinese Population Dataset Integrating Microbiome, Phenotypic, and Genomic Information
To perform a thorough investigation of the human vaginal microbiome, we established the Peacock project in China and recruited 10,281 Chinese participants, from whom we collected cervicovaginal samples during two phases of sampling (Fig. 1a, Supplementary Table 1). During Phase I (spanning 2018–2019), samples were procured from three separate urban centers, encompassing 3,307 samples from women in Suzhou engaged in routine cervical and breast cancer screenings (SU-CCS), 1,263 samples from women in Shenzhen attending annual health check-ups (SZ-4D), and 2,409 samples from women visiting the gynecology clinic at Peking Union Medical College Hospital in Beijing (BJ-GC). The samples yielded an average of 22.6 Gb raw sequencing data. In Phase II (conducted in 2021), 3,571 samples were obtained from women in SU-CCS, with samples generating an average of 34.1 Gb raw sequencing data. The extensive phenotypic data compiled showcased a broad age spectrum within our cohort, ranging from 20 to 85 years (Supplementary Fig. 1b), and encompassed both healthy individuals and those afflicted with various medical conditions, including bacterial vaginosis (BV), HPV infection, uterine fibroids, and uterine prolapse (Supplementary Fig. 1d). Furthermore, the predominance of host DNA in the vaginal metagenomic sequencing data, exceeding 90%, indicates that our Phase I cohort’s human genomic data achieved exceed 5X coverage, whereas the Phase II cohort’s data attained a 10.27X genomic coverage (Supplementary Fig. 1c).
Construction of microbial genomes from global female vaginal metagenomes
To characterize the vaginal microbiome at the genome level, we incorporated 6,979 metagenomes from the Peacock dataset (Phase I) to perform de novo assembly and genome reconstruction within each sample (Fig. 1b, Supplementary Fig. 2). This effort yielded 15,704 prokaryotic metagenome-assembled genomes (MAGs) were generated, each with a minimum completeness of 50% and less than 10% contamination. We further included 1,817 publicly accessible vaginal metagenomes from seven datasets, including 1,506 samples from the USA’s VIRGO dataset21 and 311 samples from six other transcontinental countries25–30, to broaden the scope of our genomic resources on a global scale (Fig. 1b, Supplementary Fig. 1a, Supplementary Table 2). This additional analysis produced 9,425 more prokaryotic MAGs, each meeting the same quality standards of ≥ 50% completeness and < 10% contamination. Our collection was further enriched with 314 genomes from in-house isolated vaginal samples and all vagina-derived genomes available in the NCBI RefSeq database, totaling 1,216 isolated genomes and 30 assembled genomes21. An assessment of the 26,689 vaginal prokaryotic genomes—comprising 25,159 MAGs and 1,530 isolated genomes—categorized as 57.45% high-quality and 42.54% medium-quality (Supplementary Fig. 3a, Supplementary Table 3). The median completeness across all genomes was an impressive 94.04%, with a very low contamination rate of 0.59%. In addition, we identified a single fungal genome, Nakaseomyces glabratus, within the non-prokaryotic fraction of the vaginal metagenomes. Virus detection within the metagenome-assembled contigs yielded 20,216 highly credible viral sequences, all with at least 50% completeness (Supplementary Table 4). Of these, 12.07% were estimated to be complete viral genomes, 30.48% were high-quality (≥ 90% completeness), and 57.44% were medium-quality (50–90% completeness) (Supplementary Fig. 3b). Altogether, our catalog of vaginal microbial genomes (VMG) encompasses 46,906 genomes, making it, to the best of our knowledge, the most extensive genomic repository of the human vaginal microbiome currently available.
We organized the 26,689 prokaryotic genomes from the VMG catalog into 913 species-level genome bins (SGBs) using a 95% average nucleotide identity (ANI) threshold, representing distinct vaginal species, and over two-thirds of these SGBs featured at least one high-quality MAG (Fig. 2a and 2b, Supplementary Table 5). Taxonomic classification using the Genome Taxonomy Database Toolkit (GTDB-Tk) and phylogenetic calibration identified 905 bacterial (spanning 17 phyla, 23 classes, 56 orders, 97 families, and 296 genera) and 8 archaeal species (Fig. 2b). Notably, 217 SGBs (23.8%) were classified as “unknown SGBs” (uSGBs) due to the absence of corresponding sequences in public databases, while the remaining 696 (76.2%) were recognized as known species (kSGBs) (Fig. 2b). Most uSGBs were assignable to known high-level taxa (spanning 12 phyla, 16 classes, 35 orders, 56 families, and 101 genera), except 16 uSGBs that further can be grouped into 16 novel genera and 2 novel families (Supplementary Table 5). Notably, the uSGBs were widely distributed across various phyla, particularly within the four dominant ones—Bacteroidota, Firmicutes_A, Firmicutes, and Actinobacteriota—comprising 19–24% of total SGBs for each phylum (Fig. 2c). Additionally, 84% of SGBs within the Patescibacteria phylum were unknown (Fig. 2c). We also discovered that many species previously thought to be “singular” in nature actually encompass a multitude of SGBs (henceforth referred to as “genomospecies”), aligning with findings from prior research31,32. A notable example included Bifidobacterium vaginale (formerly known as Gardnerella vaginalis) and Fannyhessea vaginae (formerly known as Atopobium vaginae)33, which comprised 16 (13 known SGBs and 3 uSGBs) and 13 (3 known SGBs and 10 uSGBs) genomospecies, respectively (Supplementary Fig. 4).
The majority of prevalent SGBs were known: of the 46 SGBs with over 100 genomes, 42 were kSGBs and 4 were uSGBs (Fig. 2d, Supplementary Table 5). Lactobacillus iners and Lactobacillus crispatus were the most prevalent species in the vagina, constituting 14.2% and 8.1% of prokaryotic genomes, respectively. Other common species included those previously associated with reproductive tract diseases, such as B. vaginale genomospecies (including k55, k48 and k44; these represent unique species IDs for each SGB in this study), F. vaginae genomospecies k94, Lachnospiraceae-UBA629 sp005465875 (known as BVAB1, an uncultured bacterial vaginosis-associated bacterium), and Mageebacillus indolicus (known as BVAB3), as well as species with limited prior information, like Megasphaera-28L spp. (k601, 953 MAGs; k602, 459 MAGs), Fastidiosipilaceae-KA00274 sp902373515 (k552, 763 MAGs), Eggerthellaceae-DNF00809 sp001552935 (k109, 687 MAGs). The prevalent uSGBs included members from the Fannyhessea (u24; these represent unique species IDs for each SGB in this study, 410MAGs) and Prevotella (u35, 176 MAGs; u36, 370 MAGs).
In investigating the vaginal virome, we classified 20,216 viral sequences into 3,763 viral operational taxonomic units (vOTUs) using a 95% ANI threshold for species-level clustering (Supplementary Table 6). Cross-referencing with human gut and oral viral catalogs and the NCBI RefSeq database, we found 75.1% of vOTUs unique to vaginal metagenomes, highlighting a substantial expansion of the viral content (Fig. 2e). A phylogenetic tree of these vOTUs displayed a diverse viral taxonomies and host relationships (Supplementary Fig. 3c). Over 61% of vOTUs were linked to known viral families, encompassing 2,245 vOTUs distributed across 14 prokaryotic virus families (dominating by Siphoviridae and Myoviridae) and 66 vOTUs across 6 eukaryotic virus families (Fig. 2f, Supplementary Table 6). Most eukaryotic vOTUs were Papillomaviridae family, commonly referred to as human papillomaviruses (HPV), with 53 identified as known types and 6 potentially new HPV homologs (Supplementary Fig. 3d, Supplementary Table 7). Collectively, these insights accentuate the pivotal role of vaginal viruses as foundational references for future research.
Functional repository of the female vaginal microbiota
To enhance our comprehension of the roles played by the vaginal microbiota, we utilized existing databases to annotate the gene functions within both prokaryotic and viral genomes. Notably, 13.0% (23,250 out of 178,904) of viral genes were confidently mapped to KEGG (Kyoto Encyclopedia of Genes and Genomes) orthologs, with 50.9% of these annotated genes involving in genetic information processing and 17.7% identified as viral auxiliary metabolic genes, including those related to peptidoglycan metabolism (Supplementary Fig. 5c-5d). For prokaryotic species, our annotations encompassed 1941 KEGG modules, 406 families in carbohydrate-active enzymes (CAZymes), and 1,654 non-redundant gene cluster families (GCFs) for biosynthetic gene clusters (BGCs) (Supplementary Table 8–9, Supplementary Fig. 5a-5b).
Next, we delved into a range of functional modules implicated in BV (e.g. sialidases, vaginolysin, and biogenic amines), as well as those integral to the metabolism of short-chain fatty acids (e.g. lactate, succinate, and butyrate) (Fig. 3a, Supplementary Table 10–11)34,35. L. crispatus exhibited a simple functional profile, predominantly focused on the biosynthesis of lactic and propionic acids. It exhibited minimal activity of sialidase and cytolysin, enzymes collectively known to disrupt the vaginal epithelium, with sialidase compromising mucin glycosylation chains, thus paving the way for cytolysin to lyse host cells and mobilize resources. These findings suggested that L. crispatus did not participate in the degradation of the host’s protective mucus, effectively safeguarding the critical barrier of the vaginal epithelium. In contrast, 90% of the genomes from L. iners were found to possess functional orthologs for cytolysin synthesis, with L. iners known to express higher levels of this enzyme, particularly within diverse communities at low to moderate relative abundances36. Notably, B. vaginale genomospecies and Prevotella species stood out as significant producers of sialidases and cytolysins, despite marked variability in the distribution of these functions among different SGBs. For example, over 93% of genomes within B. vaginale C and Prevotella bivia contained genes for the synthesis of both sialidase and cytolysin. On the other hand, Prevotella (u35) and B. vaginale G predominantly exhibited cytolysin synthesis (> 90%) with scant sialidase synthesis (< 2%). Prevotella sp001553265 and Prevotella sp00047900 were primarily characterized by their sialidase synthesis capability. Furthermore, RC9 (u68), Mobiluncus curtisii, and Streptococcus agalactiae were identified as having extensive sialidase functions. These organisms appear to synergistically harness these metabolic functions to exploit a wider range of host-derived resources. We also explored the biogenic amine profile of the vaginal microbiota, recognizing that these compounds were culprits in vaginal odor and additionally possess the ability to prolong the lag phase or hinder the growth of vaginal lactobacilli. Peptoniphilus A vaginalis, Peptoniphilus A lacrimalis, and BVAB3 were identified as producers of cadaverine, while multiple Prevotella SGBs, along with Porphyromonas sp001552775 and BVAB3, could synthesize N-acetylputrescine. Bulleidia (u86) and Anaerococcus tetradius were capable of generating trimethylamine, associated with the characteristic fishy odor of BV. Prevotella species also demonstrated a common ability to produce butyrate, a short-chain fatty acid potentially involved in modulating the immune response in the vagina.
Expanding on the detailed annotation of prokaryotic BGCs, we explore the biosynthetic potential of the microorganisms within the VMG (Supplementary Fig. 6). Caulobacteraceae CAISGS01 u199 (4.2 Mb), classified within the Proteobacteria phylum and showing close phylogenetic kinship to Phenylobacterium zucineum according to SGB analysis, showcased a remarkable repertoire of GCFs and a rich tapestry of biosynthetic diversity among VMG species (Fig. 3b, Supplementary Fig. 6b). Probing its biosynthetic capabilities, u199 emerged as a prolific producer of homoserine lactones, rivaled only by Pseudomonas aeruginosa, with BGC1248, BGC1252, and BGC1258 making notable contributions (Supplementary Fig. 6c, Supplementary Table 12). Acyl-homoserine-lactones, quintessential signaling molecules in quorum sensing (QS), play a pivotal role in directing microbial congregation and biofilm genesis37, reinforcing our prior findings of u199’s exceptional biofilm-forming capabilities (Fig. 3a). Furthermore, u199 exhibited a pronounced capacity to produce triacsins, which were notable as potent acyl-CoA synthetase inhibitors in lipid metabolism, possessing broad-spectrum antimicrobial activity38. BVAB1 also exhibited distinct BGC profiles among VMG species (Supplementary Fig. 6c). It displayed pronounced toxicological properties, underscored by BGC0012 and BGC0051 (both harboring adhE, which enhances toxicity through redox imbalance; and speA, associated with biogenic amine synthesis and pH modulation), BGC0050 (Enterocin A Immunity, belonging to bacteriocins that target closely-related competitor species), and BGC0013 (lagD, involved in the transport of bacteriocin Lactococcin G) (Supplementary Fig. 6c, Supplementary Table 12). Intriguingly, these BGC modules were significantly more prevalent in the American demographic, hinting at a population-specific variation in BVAB1’s pathogenic potential (Fig. 3c, Supplementary Table 13). Additionally, BGC0012 and BGC0051 were distinguished as representatives of Ranthipeptides, particularly SCIFF-Derived Ranthipeptides (Supplementary Table 13), which have been implicated in quorum sensing mechanisms39.
Intraspecies Phylogenetic Analysis Highlights Strain Divergence Across Chinese and North American Populations
Our extensive VMG catalog opens a window for investigating the taxonomic and functional disparities across the global vaginal microbiome at the genomic level. We observed notable differences in species relative abundance and MAG counts between Chinese and North American populations, with these populations contributing to 90% of the cataloged MAGs (Supplementary Fig. 7, Supplementary Table 14). Specifically, CAISGS01 u199 (41 MAGs), a homoserine lactone producer described above, was exclusively found in Chinese samples. Conversely, SGBs represented solely by MAGs from North American samples were predominantly associated with B. vaginale and F. vaginae genomospecies, particularly the latter (Supplementary Fig. 4, Supplementary Fig. 7). Among 13 genomospecies, F. vaginae A (k96, 117 MAGs) and eight Fannyhessea uSGBs (from u17 to u23, totaling 48 MAGs) were uniquely reconstructed from American samples. The phylogenetic patterns within F. vaginae genomospecies indicated a further population-specific specialization at the subspecies level, with North American genomes forming a distinct cluster separated from their Chinese counterparts, including F. vaginae (k94) and F. vaginae A (k95), which were not listed as North American-specific (Supplementary Fig. 4f). For B. vaginale genomospecies, B. vaginale C (k51) and Bifidobacterium sp002884815 (k44) were also unique to non-Chinese samples (Supplementary Fig. 4a). These patterns demonstrate the VMG’s ability to detect intraspecies diversity and suggest geographic influences on the vaginal microbiome. Moreover, SGBs like Parvimonas sp001552895, BVAB3, and BVAB1 were more prevalent in North Americans, with BVAB1 showing a 2.30-fold MAG increase and higher relative abundance in North American women (averages of 5.09% in North American versus 0.31% in Chinese) (Supplementary Fig. 7, Supplementary Table 14).
Delving deeper into the phylogenetic intricacies of population-specific genomes, we found that a significant number of SGBs (50 out of 64 with over 50 MAGs each) exhibited substantial geographic genetic variation based on the patristic distance calculated from the phylogenetic tree for each SGB (PERMANOVA, FDR < 0.05) (Fig. 4a, Supplementary Table 15). Notably, BVAB1 exhibited the second strongest population-specific genetic signature, surpassed only by Arcanobacterium A sp000758825 (Fig. 4a-4c). Furthermore, several common vaginal microbes, including L. iners, L. crispatus, F. vaginae A (k95), and B. vaginale (k48), exhibited significant genetic variation, with their genomes forming two primary phylogenetic clades closely aligned with either Chinese or North American populations (Fig. 4a and 4b, Supplementary Fig. 8). These insights underscore the pronounced species-specific prevalence and geographic genetic diversification. Functional disparities were also evident, as exemplified by BVAB1, which was subjected to KEGG module and pan-genome analyses (Fig. 4d-4e). BVAB1 from Chinese samples showed a significant preference for transposase genes involved in replication and repair (K07483), while the North American BVAB1 strains tended to possess genes associated with transport and signalling processes, such as the vitamin B12 transporter (K16092), iron complex outer-membrane receptor proteins (K02014), and trimeric autotransporter adhesin (K21449)—the latter being implicated in biofilm formation and cellular adhesion (Fig. 4d). Intriguingly, the North American-specific BVAB1 strains exhibited an enrichment of ATP synthase (atpH_2 gene), suggesting a potential for enhanced energy production to fuel robust signalling and cellular functions (Fig. 4e). In contrast, genes related to cell wall organization (mnaA), alkyl hydroperoxide reductase (ahpF_1), and the positive regulation of adenylate cyclase activity (CdaR) were more common in the Chinese BVAB1 genomes, indicating a tendency towards self-protective mechanisms, including safeguarding against DNA damage and maintaining cell wall stability (Fig. 4e). These findings were in line with those of the previous section, which indicated that BGC modules with potent virulence traits were more likely to be prevalent in the North American cohort (Fig. 3c), suggesting that the potential pathogenicity may exhibit population-specific variations.
Upon analyzing the SNPs of BVAB1, we identified 20,852 significant SNPs from 1,025 genes that exhibited significant differences between the two populations (Fig. 4f, Supplementary Table 16). Notably, 675 of these genes (1 ~ 213 SNPs within each gene) had specific function annotations rather than being annotated as hypothetical proteins. Genes with a high number of SNPs were associated with key metabolic functions, including DNA replication (dnaE, polC, and polA, etc.), DNA damage response and repair (addA, mfd, and addB, etc.), glycoside hydrolases (glgX and fruA, etc.), protein biosynthesis (valS, mupB, and alaS_1, etc.), and lipid metabolism (ppc) (Fig. 6f). Additionally, we also identified significant SNPs in antibiotic resistance-related genes, which encompass four mechanisms: inhibition of bacterial cell wall synthesis (mrcA and pbpD with penicillin resistance), interaction with the cell membrane (drrA with daunorubicin/doxorubicin resistance), disruption of protein synthesis (mupB with mupirocin resistance), and inhibition of DNA replication and transcription (rpoB with rifampin)40 (Fig. 6f. Significant SNPs were also observed in genes involved in notable regulations, such as flagellar filament for mucin adhesion (fliD)41, polyamine biosynthesis accompanied by pH regulation (speA), and spore germination (prkC) (Fig. 4f). These findings highlight the strain divergence between Chinese and North American populations. BVAB1 serves as an example of the rich resource available in the VMG, which can aid researchers in selecting species of interest and facilitate in-depth exploration of strain variability.
Host genetics strongly associated with vaginal bacteria
We next assessed the impact of host genetic variants on the vaginal microbiome through M-GWAS. The SU-CCS (Phase II) cohort, comprising Zones S (n = 1,919) and D (n = 1,217), featured host data with approximately 10X genomic coverage (Supplementary Fig. 1c). Accordingly, Zone-S was designed as the Discovery cohort, and Zone D as Validation cohort 1. In addition, we incorporated the SU-CCS (Phase I) cohort, encompassing 3,226 individuals with lower sequencing depth, and a subset of 531 individuals from the BJ-GC (Phase I) cohort with higher sequencing depth, to serve as Validation cohorts 2 and 3, respectively (Supplementary Fig. 1c). Host genetic data were acquired by aligning sequencing reads to the human reference genome (hg38). After employing a greedy algorithm to filter out highly correlated features, we isolated 56 distinct vaginal microbial taxa from the Discovery cohort (Supplementary table 17, r2 < 0.995, Methods). The M-GWAS examined 7.5 million human genetic variants (MAF ≥ 1%), adjusting for confounders such as age, sequencing read counts, and the top ten host principal components (PCs). This analysis yielded 83 independent associations spanning 73 loci (within a 1Mb distance and r2 < 0.2) and 19 taxa that reached genome-wide significance (p < 5×10− 8; Fig. 5a and Supplementary Table 17). Employing a more stringent Bonferroni-corrected significance threshold of 9.09 × 10− 10 (= 5 × 10− 8/56 for 56 M-GWAS tests), we pinpointed 31 genomic loci associated with 7 vaginal microbial taxa. Among them, 13 of these associations were robustly replicated across two independent cohorts, with 29 confirmed in Validation cohort 1 and 14 in Validation cohort 2, including the most significant association of Prevotella u35 with loci on 22p11.2 (beta = 1.51, p = 6.826×10− 38), 1q12.1 (beta = 1.22, p = 1.786×10− 32) and 1q12 (beta = 1.13, p = 2.721×10− 29), which were also consistently replicated in Validation cohort 1 and Validation cohort 2 with corresponding betas and p-values indicating strong statistical significance (Fig. 5a and 5b, Supplementary Table 17). These loci were also associated with an increase in BMI in the Chinese 4D-SZ cohort42, corroborating the link between Prevotella and BMI as suggested by prior studies43. The second most significant association was observed with rs9328781, near the CWH43 gene, where the minor allele C (MAF = 0.371) was inversely associated with the presence/absence phenotype of the species Aerococcus christensenii (beta=-0.83, p = 2.05×10− 23) and family Aerococcaceae (beta=-0.72, p = 2.862×10− 18), an association that was also replicated in these two validation cohorts (Fig. 5a, 5c, and 5d, Supplementary Table 17). This SNP was also associated with an increase in the Albumin-globulin-ratio in the Chinese 4D-SZ cohort42 (Supplementary Table 17). In parallel, Bifidobacterium dentium was positively associated with host SNP rs71202526, located near the CBWD5 gene, where the minor allele T (MAF = 0.347, beta = 1.09, p = 2.792×10− 22) was implicated. This SNP was also linked to elevated blood copper levels (beta = 0.05, p = 0.03). Moreover, to enhance the credibility of our results, we focused on validating loci with p-values below 5×10− 8 for the previously mentioned four most significant bacterial species in Validation cohort 3, successfully confirming 46–60% of these loci. Although the precise mechanisms behind many of these associations are currently unclear, our research underscores the significant role of host genetics in shaping the vaginal microbiome and calls for further research to unravel the underlying processes.