Insufficient coverage of Patescibacteria by commonly used 16S rRNA gene primer pairs and design of a new primer pair
In silico coverage data of the three most widely used primer pairs for 16S rRNA gene amplicon sequencing (Table S1), targeting the V1-V3, V3-V4, or V4 hypervariable region, were compared for nineteen bacterial phyla and one archaeal phylum with the highest database representation. In these analyses, we focused on perfect matches (0 mismatches) between primers and 16S rRNA gene sequences in the SILVA SSU rRNA gene reference database [23] and the WWTP-specific MiDAS 16S rRNA gene database [5], but also included a coverage analysis allowing for a single mismatch with one of the two primers (Table S2-S5).
We first evaluated the coverage of these primers against the SILVA SSU rRNA gene database. As expected, the V1-V3 and V3-V4 primers showed low coverage of archaea, since these primers were designed to target only bacterial 16S rRNA genes. The V1-V3 primer pair had the lowest overall in silico coverage, with less than 50% cumulative coverage across all bacterial phyla and only 5.3% for all Patescibacteria. The V4 primer pair, designed to target both bacterial and archaeal 16S rRNA genes, had the best overall coverage. Over 85% of sequences from most phyla showed no mismatches to this primer pair, but it still covered only 19.6% of Patescibacteria. The V3-V4 primer pair showed better but still only moderate coverage of Patescibacteria (57.6%) and much poorer coverage of some other bacterial phyla (e.g. Chloroflexi (40.8%) and Armatimonadota (31.5%)) than the V4 primers (Table S2). Within the Patescibacteria, both the V3-V4 and V4 primer pairs had mismatches to nearly all Microgenomatia and Dojkabacteria sequences. The V4 primer pair also showed low coverage of Saccharimonadia (4.7%), Parcubacteria (6.3%), ABY1 (0.5%) and WWE3 (0%). The V3-V4 primer pair showed better coverage of ABY1 (75.6%), Parcubacteria (55.6%) and Saccharimonadia (84.8%) than the V1-V3 primer and V4 primer pairs (Table S3).
In silico coverage of the 16S rRNA gene sequences from the MiDAS 4.8.1 database was also evaluated for the V3-V4 and V4 primer pairs. The V1-V3 primer pair (27F/534R) could not be evaluated against MiDAS 4.8.1 because sequences from this database were trimmed after the 27F primer binding site [5]. For sequences included in MiDAS 4.8.1, the V4 primer set showed > 70% coverage of the 20 most represented phyla, except for Patescibacteria (15.2%) and Chloroflexi (60.9%) (Table S4). For Patescibacteria, the V4 primer pair targeted only 8% of the Saccharimonadia, 0.5% of the Microgenomatia, 0.2% of the ABY1, and 0% of the Parcubacteria 16S rRNA gene sequences in MiDAS 4.8.1, whereas it covered 93.4% of gracilibacterial 16S rRNA gene sequences retrieved from WWTP systems. The V3-V4 primers showed better coverage of the WWTP Patescibacteria (67.0%) than the V4 primers, covering more Saccharimonadia (88.8%), Parcubacteria (71.8%), ABY1 (87.2%), and a similar fraction of Gracilibacteria (84.0%) 16S rRNA gene sequences (Table S5). But consistent with the results obtained using the SILVA database, the V3-V4 primer pair showed poorer coverage of other, non-patescibacterial phyla in the WWTP database.
We additionally evaluated a recently published V4-V5 primer pair (515Yp-min/926Rp-min), modified specifically to capture patescibacterial 16S rRNA gene sequence diversity in marine environments [8, 24] using the SILVA and MiDAS 4.8.1 databases. While being well suited for the analysis of Patescibacteria in marine samples, this primer pair shows only overall moderate coverage (37.1%) of Patescibacteria in the SILVA database, and likewise only a moderate coverage of 30.8% of Patescibacteria sequences in the WWTP-specific MiDAS 4.8.1 database (Table S2-S5).
Based on the high general coverage of bacterial and archaeal 16S rRNA gene sequences, and the high overall coverage of WWTP derived 16S rRNA gene sequences by the V4 primer pair (515F-806R) in the in silico analysis, this primer pair was chosen for modification to increase its coverage of Patescibacteria. Another reason for choosing the V4 primer pair for further improvement was that this primer pair is predicted to produce shorter amplicons than the V3-V4 primers. Currently, most 16S rRNA gene amplicon studies use short-read Illumina platforms to generate sequence data, which balances quality and expense to produce useful community profiles [25]. Shorter amplicons have also been shown to recover community structure more accurately than longer amplicons [26].
The original V4 primer pair was modified by adding degeneracy bases at 5 positions (8/11/12/13/18) of the forward primer (515F) and 4 positions (1/7/10/14) of the reverse primer (806R). Additionally, we changed the 9th position of the forward primer from M to A (Table S6). These modifications of the V4 primer set resulted in a significantly improved full-match in silico coverage against the SILVA database for Chloroflexi (from 56.7–76.0%), Spirochaetota (from 71.7–79.6%), and Patescibacteria (from 19.6–88.9%). Notably, coverage was strongly increased for several Patescibacteria classes including the Microgenomatia (90.9%), Parcubacteria (80.8%) and Dojkabacteria (85.5%) which were poorly covered (1.2%, 6.3% and 0%, respectively) with the original primers. However, due to the change in the ninth position in the original forward primer, the in silico coverage of the modified primer set for the Euryarchaeota decreased (Table S2; Table S4).
A global collection of wastewater treatment plant samples
The samples (Fig. 1) and metadata (Table S7) used for our analysis were collected by the MiDAS global consortium. This sample set represented 565 WWTPs from 6 continents, 30 countries and 380 cities. Most of the samples (448/79.3%) were from activated sludge plants, but samples from biofilters, moving bed bioreactors (MBBR), membrane bioreactors (MBR), and granular sludge were also included (Fig. 1). Among the activated sludge samples, most (n = 194) were from plants designed for carbon removal with nitrification and denitrification (C, N, DN; 43.3%), 92 were from plants designed for carbon removal only (C; 20.5%), 90 were from plants designed for carbon removal with nitrogen and enhanced biological phosphorus removal (C, N, DN, P; 20.1%), and 46 were from plants designed for carbon removal with nitrification (C, N; 10.3%).
Modified primers reveal a more complete picture of Patescibacteria diversity and abundance in wastewater treatment plants
Consistent with the in silico primer coverage predictions, 16S rRNA gene amplicon datasets generated from the global activated sludge sample collection using our modified V4 primers detected significantly higher Patescibacteria ASV richness and diversity than 16S rRNA gene sequencing datasets generated with the original V4 primers (Fig. 2a-d). At the phylum level, the average relative abundance of Patescibacteria in the activated sludge samples increased dramatically from 1.5% ± 1.4% (mean ± sd) to 18.5% ± 11.1% (Supplementary Fig. 1), and the cumulative number of patescibacterial ASVs increased from 5.9–23.8% across all samples. The diversity of the total microbial communities in datasets generated using the modified V4 primers was not significantly different compared to datasets obtained using the original V4 primers (Fig. 2e-f). After removing reads classified as Patescibacteria, a linear regression was performed comparing the ASV-based richness of non-patescibacterial taxa detected in each sample by the modified and original primer sets, showing a slightly decreased richness detected by the modified primer sets (Supplementary Fig. 2). We performed a phylum level regression analysis to investigate how the ASV richness of non-patescibacterial phyla was affected by the primer modification. Most phyla showed only a slight difference in observed richness, with a predicted slope of ASV richness of original vs. modified primers between 0.6 and 1.6 (Supplementary Fig. 3). However, as expected, Euryarchaeota were strongly underestimated by the modified primers, which only detected 15% of the ASV richness detected by the original primers, likely due to the change from M(A/C) to A in the modified forward primer (Table S7). On the other hand, we observed an increased richness of several non-patescibacterial phyla, i.e. Chloroflexi and eight additional phyla in the dataset obtained with the modified primers (Supplementary Fig. 3).
Next, the ASV-based richness within each genus with > 0.01% average relative abundance across all samples detected by both the original and modified primer pairs was examined to determine whether the modified V4 primers systematically detected more or less intra-genus diversity than the original primers. It is important to note that the term “genus” used in our study refers to a 94.5% 16S rRNA gene sequence identity threshold [27, 28]. Higher ASV richness of some non-patescibacterial genera was detected by the modified primers, for example a 4.27 times higher ASV richness for the genus Neochlamydia within the phylum Verrucomicrobiota or a 2.45 higher ASV richness for the genus Ca. Villigracilis within the phylum Chloroflexi. The phylum Chloroflexi encompasses most genera (n = 29) for which we detected a higher ASV richness with the modified primers (Table S8).
Novel genus-level diversity of Patescibacteria in wastewater treatment plants detected by the modified V4 targeted primers
To compare the genus level diversity of Patescibacteria detected with the two V4 targeted primer pair versions, we selected genera that were detected with at least one primer pair version in at least one sample with more than 0.1% relative abundance. Genera detected at a lower relative abundance were excluded from this analysis. In total, we detected ASVs affiliated with 959 patescibacterial genera with the modified primers and 224 genera with both primers. ASVs affiliated with 7 genera of Patescibacteria were only detected in amplicon datasets generated with the original primers. These 7 genera reached a maximum cumulative relative abundance of 0.005% of the total community and 0.39% of Patescibacteria in the dataset generated with the original primer pair. Differential abundance analysis of ASVs affiliated with the 224 genera detected with both primer pairs resulted in 62 (27.7%) genera being significantly more relatively abundant in datasets generated with the modified primer pair (Wald test p-adj < 0.05) (Fig. 3).
Novel amplicon sequence variants revealed by the modified primer pair
16S rRNA genes of Patescibacteria are highly divergent and have thus far likely been undersampled by amplicon studies [10]. To evaluate whether the modified V4 primers enable the detection of previously unknown lineages of Patescibacteria, we determined the proportion of ASVs from each sample that could not be mapped to existing databases (MiDAS 4.8.1 and SILVA r138.1).
We found that within individual samples, significantly more patescibacterial ASVs generated with the modified primers (25.0% ± 11.5%) could not be mapped to the MiDAS database with a high identity (> 99%), when compared to patescibacterial ASVs generated with the original primers (17.5% ± 13.9%), which reflects a higher level of phylogenetic novelty detected by the modified primers (Fig. 4a). It is also noteworthy that as many as 83.1% ± 7.4% of patescibacterial ASVs within individual samples generated by the modified primers and 59.6% ± 16.8% generated by the original primers could not be mapped as high identity hits to the SILVA database (Fig. 4c). For non-Patescibacteria ASVs, a similar fraction of ASVs obtained with the modified primer pair and the original primer pair within individual samples could be mapped to the two databases (Fig. 4b, 4d). Cumulatively, out of all the 9,197 patescibacterial ASVs detected with the modified primers, 4,927 (53.5%) could not be mapped to the MiDAS database with a high identity (> 99%), while this was the case for only 655 (36.1%) of the 1,816 patescibacterial ASVs generated with the original primers. For the SILVA database, as many as 8,221 (89.3%) and 1,501 (82.7%) of the patescibacterial ASVs generated with the modified and original primers, respectively, could not be mapped with a high identity (> 99%). These higher cumulative unmapped patescibacterial ASV fractions compared to individual sample-based unmapped ASV fractions result from the higher prevalence of ASVs with high identity hits in both databases across all samples (Supplementary Fig. 4).
Then, we compared the percentage of ASVs within individual samples generated with the modified primer pair that could not be mapped to the MiDAS 4.8.1 and the SILVA r138.1 database at the genus level threshold (94.5%). Using the MiDAS 4.8.1 database and taxonomy framework, 14.0% ± 9.0% of Patescibacteria ASVs and 3.0% ± 1.8% of non-Patescibacteria ASVs could not be mapped at the genus level (Fig. 4e, 4f). To the SILVA database, 34.1% ± 12.2% of Patescibacteria ASVs and 5.1% ± 2.6% of non-Patescibacteria ASVs in each individual sample could not be mapped at the genus level (Fig. 4g, 4h). We further explored the taxonomic affiliation of ASVs that could not be classified at genus level by the MiDAS 4.8.1 database. Most of these ASVs were from the classes Microgenomatia (n = 837), Parcubacteria (n = 472) and Saccharimonadia (n = 188). Furthermore, 260 patescibacterial ASVs that could not be classified at class level (in the MiDAS 4.8.1 database) were also detected (Supplementary Fig. 5). At the genus level threshold (94.5%), 2,868 (31.1%) of all patescibacterial ASVs could not be mapped to the MiDAS database, while this was the case for only 3,345 (11.3%) of all 29,430 non-patescibacterial ASVs. For the SILVA database, 4,966 (53.9%) of the patescibacterial and 4,339 (14.7%) of the non-patescibacterial ASVs could not be mapped at the genus level (Fig. 4).
Introns in the 16S rRNA gene are frequently detected in Patescibacteria, but do not frequently occur in the V4 hypervariable region [10]. Yet, using the modified V4 primers we detected 17 ASVs with ≥ 300 bp amplicons that were classified as Patescibacteria, and found that three of those ASVs can be fully mapped to MAGs (metagenome-assembled genomes) from Danish WWTPs [29] (Table S9).
Core genera of Patescibacteria in wastewater treatment plants
Core taxa are characteristic microbial community members of a given environment. Identifying and characterising such taxa is essential for a comprehensive understanding of the microbiology of a system. For WWTPs, core community members have been defined using a relative abundance threshold of 0.1% and a set of prevalence thresholds [30]. According to this classification system, “strict core” community members occur in more than 80% of WWTPs, “general core” members in 50%, and “loose core” members in 20% of the plants [30]. Activated sludge systems harbor a notable loose core community of genera that occur in at least 20% of all plants, constituting more than 50% of the reads in 16S rRNA gene amplicon datasets [5]. In addition to core genera which are globally prevalent, conditionally rare or abundant taxa (CRAT) describe genera that exist in at least one sample with > 1% relative abundance, but are not part of the core taxa. Combined, core and CRAT genera can constitute up to 80% of the reads in amplicon datasets from WWTPs [5]. Given the results of the in silico primer coverage analysis presented above, the prevalence and relative abundance of Patescibacteria in WWTP has likely been significantly underestimated in previous studies. Consistently, no patescibacterial genera were previously described to be part of WWTP core communities [5].
We compared the richness and relative abundance of Patescibacteria in the 16S rRNA gene amplicon dataset generated with the modified V4 primers across different plant types and we found the activated sludge systems shared similar Patescibacteria abundance and richness with other plant types except biofilters (Supplementary Fig. 6). For identification of patescibacterial core genera, we focused on the activated sludge system because it was the most deeply sampled system. We identified 23 patescibacterial genera in the four main process types of activated sludge systems from the global WWTP sample set, which were detected in at least 20% of samples with > 0.1% relative abundance, fitting the definition of “loose” core taxa (Fig. 5). In addition to their frequent occurrence across activated sludge plants, these genera show a global distribution (Table S10). Among these genera, midas_g_2215 is the most abundant genus by average relative abundance (0.72% average, 15.27% highest abundance), followed by the genera midas_g_67 (0.48% average, 9.26% highest abundance), midas_g_4375 (0.37% average, 11.09% highest abundance), midas_g_363 (0.36% average, 7.12% highest abundance) and Ca. Saccharimonas (0.35% average, 8.46% highest abundance). Interestingly, only two of the 23 loose core genera have a given genus name (Ca. Saccharimonas and TM7a/Ca. Mycosynbacter). Members of the genus Ca. Saccharimonas are known for their fermentative sugar metabolism [32]. Recently, a member of TM7a has been characterised as a predator of Gordonia, which is an infamous foam producer in WWTPs worldwide [21]. Using the GTDB database (r214) as well as metagenomic datasets from Danish WWTPs [29], we assigned 16S rRNA genes from all MAGs in these two datasets to a MiDAS taxonomy to evaluate the MAGs representation of the core WWTP taxa of Patescibacteria. This analysis revealed that 30.4% (7/23) of the core patescibacterial genera are not represented by the current genome databases (Table S11).
We further identified patescibacterial core genera for each of the four main process types of the activated sludge systems, using the same thresholds as described above. 11, 19, 32 and 30 genera were identified as core genera for each process type (C / C & N / C & N & DN / C & N & DN & P) respectively (Table S12-S15). In this more refined analysis, one, three, three, and five genera were identified as WWTP-type specific core genera of each process type (C / C, N / C, N, DN / C, N, DN, P) respectively (Fig. 5). Among these core genera, we found one core genus present in the C, N, DN process type and five core genera specific to EBPR plants exceeding 0.1% abundance in > 50% samples, which were identified as “general core” members of these two process types. (Table S14, Table S15). These WWTP function-specific enriched patescibacterial genera may have a potential impact on the respective nutrient removal processes.
We also identified 310 CRAT patescibacterial genera in activated sludge samples. The CRAT genera are not part of the core genera defined above but occasionally show high abundance. While most of the CRAT genera (170) were only detected in one sample with > 1% relative abundance, eight genera were detected in more than ten samples with > 1% relative abundance (Table S16).
Potential host-Patescibacteria pairs revealed by co-occurrence network analysis
Patescibacteria have been predicted to lead a symbiotic lifestyle based on their small cell and genome sizes, and their limited metabolic capability [19]. Several studies have successfully applied network-based methods for the inference of symbiotic relationships which were subsequently validated by experimental evidence [33, 34]. Here, we performed network analysis at both the ASV and genus levels to answer the questions: (i) which bacteria or archaea are correlated with Patescibacteria in activated sludge systems, (ii) whether different genera of Patescibacteria correlate with the same potential host, and (iii) whether a similar correlation pattern is observed for different ASVs within the same genera.
Genus level network analysis was performed with 395 samples of the four main process types of activated sludge system for which we obtained > 5K reads by the modified primer pair. This analysis identified 10 genera of Patescibacteria, all belonging to either activated sludge or process-specific patescibacterial core genera, that significantly correlate with genera from Actinobacteriota, Bacteroidota, and Chloroflexi (Fig. 6; Table S17). Although many of these potential hosts belong to uncharacterized taxa (with MiDAS placeholder names), we also identified some potential host taxa for Patescibacteria with important roles in WWTPs. For example, we observed a strong correlation between the genus Tetrasphaera and four other genera, including three patescibacterial genera of Saccharimonadia (Ca. Saccharimonas, midas_g_67, and midas_g_363) and the genus Ca. Accumulibacter. Interestingly, genera midas_g_67 and midas_g_363 are both identified as EBPR specific general core genera (exceeding 0.1% abundance in at least 50% EBPR plants) (Table S15). Tetrasphaera and Ca. Accumulibacter are both abundant polyphosphate accumulating organisms (PAO) in WWTPs across the world, mainly thrive in EBPR plants and are thus expected to be correlated with each other. Another patescibacterial genus - midas_g_2020 of Saccharimonadia - shows an association with the genus midas_g_399 from the class Actinobacteria, which was also found enriched in EBPR plants, and might represent PAO or glycogen-accumulating organisms [35].
Another important group of microbes that was found to be associated with Patescibacteria is filamentous bacteria of the genera Ca. Microthrix and Ca. Sarcinithrix, the former one known to cause severe foaming problems in WWTPs. In our network analysis, Ca. Microthrix is connected with three Saccharimonadia genera (midas_g_3760, midas_g_8524 and midas_g_1533), while Ca. Sarcinithrix is connected with midas_g_3319 of the patescibacterial class Microgenomatia. In addition, several correlations between uncharacterized genera and a patescibacterial genus were also detected. For example, midas_g_370 (class Microgenomatia) correlates with midas_g_72 (class Anaerolineae), which is a core member of anaerobic digester communities and enriched in EBPR plants [5].
We further performed an ASV level network analysis to test if the same correlation patterns could be observed (Supplementary Fig. 7; Table S18), and explored potential host-symbiont relationships at lower taxonomic levels. Consistent to the network analysis performed at genus level, a correlation was observed between Tetrasphaera and members of the Saccharimonadia at the ASV level (two ASVs from midas_s_5 are correlated with two ASVs of midas_s_67 and midas_g_363, respectively). Notably, the midas_s_5 was recently characterised as the most abundant group of PAOs in Danish and global WWTPs and it was proposed to rename it as “Candidatus Phosphoribacter” [29]. Furthermore, associations between midas_g_72 and midas_g_370, as well as Ca. Microthrix and midas_g_8524 detected in the genus level network were also found in the ASV level network. Additionally, many connections were newly detected, e.g. a connection between ASVs of the genus Ca. Villigracilis, which encompasses filament formers that have been proposed to be structurally important for activated sludge flocs [36] and the patescibacterial midas_g_2215 (class Microgenomatia).