Filtering threshold for handling spurious sequences
We first used bacterial communities of known composition (simplified communities) to assess the occurrence of spurious taxa and to determine at which relative abundances they start to appear. To propose a cutoff potentially applicable to different 16S rRNA gene amplicon studies, we included reference data obtained with different variable regions and sequencing pipelines and originating from both in vitro an in vivo communities varying in number and type of species (max. 58) (Table 1 and Table 2).
Without any filtering, sequence clustering generated an average of 508 ± 355 OTU (min. 52; max. 1,081) per mock community (10 to 58 target species in theory) and 105 ± 50 OTU (min, 55; max. 215) per gnotobiotic community (4 to 12 target species in theory). Up to 87% of these OTUs were spurious (i.e. they did not match the expected classification of species contained in the corresponding artificial community) (Fig. 2a). On average, the proportion of spurious OTU in both mock communities and samples from gnotobiotic mice was slightly lower after removing singletons, without reaching significance (50.8 vs. 64.3%, p = 0.227; 57.5% vs. 65.7%, p = 0.70). Interestingly, the proportion of spurious OTUs was higher in gnotobiotic mice independent of filtering (p < 0.001), suggesting that the matrix in which members of the defined communities are (here fecal material) influences the outcome. Besides the goal of removing spurious taxa, it is of course important to include as many true molecular species as possible into the analysis. Even without any cutoff, not all target species could be detected: the percentage of positive hits was 94.9 and 92.3% for mock communities and gnotobiotic mice, respectively (Fig. 2b).
To determine a filtering threshold that allowed exclusion of most spurious taxa, we recorded the relative abundance of the first spurious OTU occurring in each of the reference community datasets (Fig. 2c). Median values of approx. 0.12% relative abundance were observed (Fig. 2d). Besides one outlier in the Mock communities (0.44% relative abundance), all values were below 0.25%, a threshold that was selected for all further analyses. The distribution of spurious OTUs and positive hits obtained after applying this new cutoff on all reference communities is depicted in Fig. 2e. Whilst the number of spurious taxa decreased drastically (4.0 vs. 50.8% for mock communities and 1.0 vs. 57.0% for gnotobiotes; p ≤ 0.01), the number of positive hits was not affected significantly (87.2 vs. 93.7% for mock communities and 82.4 vs. 88.7% for gnotobiotes; p > 0.50) by the 0.25%-cutoff vs. singletons removal, respectively (Fig. 2e). Note that the diversity of reference communities in the gnotobiotic mice was relatively low (4-12 members; Table 2), resulting in a proportionally marked drop in the percentage of positive hit (8-25%) even if only one true member is excluded after filtering due to a low relative abundance (which is an expectable event considering a classical exponentially decreasing distribution of species occurrence in gut environments).
Next, we used another bioinformatic pipeline for ASV analysis to confirm the results aforementioned. Processing of the very same simplified communities generated a total number of 42 ± 25 ASVs (min. 16; max. 98) for mock communities (10 to 58 target species in theory) and 14 ± 8 ASVs (min. 4; max. 25) for gnotobiotes (4 to 12 target species in theory). Altogether, a marked decrease in spurious taxa was observed compared with OTU clustering, with an average of 8.6% ± 11.8 and 4.4% ± 6.4 spurious sequences after singletons removal for Mock and gnotobiotic communities, respectively (Fig. 2f compared with Fig. 2a). Of note, the DADA2 pipeline used for the ASV approach does not infer sequence variants only supported by a single read (singletons), due to a lack of confidence in their existence relative to sequencing errors. Hence, data corresponding to no filtering with the OTU-based approach were not generated. On average, the first spurious ASV occurred at a relative abundance of 0.10% ± 0.32. Applying the cutoff of 0.25% relative abundance completely removed spurious sequences (except for three outlying samples), albeit with a slight drop in positive hits for both Mock and gnotobiotic communities (Fig. 2f).
Ecology of spurious OTUs
We then investigated the diversity and origin of spurious taxa, i.e. those not matching any sequences of the reference species in the defined communities. Therefore, their taxonomic information and occurrence in >100,000 IMNGS-derived amplicon datasets [4] were combined and depicted in a dendrogram (Fig. 3a). Approximately half of the 678 non-redundant spurious OTUs belonged to the phylum Firmicutes followed by Bacteroidetes and Proteobacteria. Most of these OTUs were characterized by highest prevalence in human- and mouse-derived datasets, with values reaching up to 40 % in the thousands of tested samples (Fig. 3a). The majority (>20%) of spurious OTUs detected in human and mouse samples were exclusive to this respective category of habitats (Fig. 3b). This distribution implies that the type of samples multiplexed with target samples within a given sequencing run (in our case mouse and human gut samples) greatly influence the occurrence of spurious OTUs in target samples. Interestingly, >600 of the 678 spurious OTUs occurred in less than five of the ten sequencing runs tested, with approximately 450 of them occurring in only one run (Fig. 3c). This indicates that the majority of spurious taxa are sporadic cross-contaminations rather than generalist artifacts across sequencing runs. Whilst most of the spurious taxa were characterized by relative abundances between 0.25 and 2% in the IMNGS-amplicon datasets tested, they represented very dominant populations in a few samples (Fig. 3d).
Loose taxa filtering inflates alpha-diversity and increases heterogeneity
Spurious OTUs, as looked at in the present study, are per se low abundant: the cumulative relative abundance of spurious OTUs in the reference communities used above was approx. 1% on average. Hence, whilst spurious OTUs are not expected to substantially influence composition data, they can have a major influence on diversity (e.g. richness and evenness for alpha-diversity and between-sample distances for beta-diversity), as presented in the next sections. To assess the impact of filtering thresholds on analysis outcomes, we used recently published amplicon data from two comprehensive studies that included a substantial number of samples analyzed by Illumina sequencing of 16S rRNA gene amplicons and for which raw datasets could be retrieved from public repositories. The study by Flores et al. [12] (hereon referred to as Study-1) focused on dynamics of human body microbiomes over time, collecting samples weekly from 85 college-age adults over a 3-month period; we focused only on gut samples in the present work. The second study published by Halfvarson et al. [11] (hereon referred to as Study-2) focused on shifts in the human fecal microbiota over time in patients with inflammatory bowel diseases vs. controls, including 683 fecal samples from 137 individuals. We emphasize again that the purpose of the present study is not to confirm or refute data from the literature, but rather to draw attention to an analysis parameter that can have a profound impact on results. In all following analyses, outcomes after the most commonly used approach of singletons filtering following de novo OTU clustering was compared with the 0.25%-cutoff introduced above (i.e. keeping only those OTUs occurring at a minimum relative abundance of 0.25% in at least one sample).
In both Study-1 and Study-2, filtering OTUs using the 0.25%-cutoff led to a significant decrease in richness by approx. two-fold, resulting in an average number of about 200 observed species per sample (Fig. 4a). More interestingly, when looking at individual-specific variations in richness by plotting inter-quartile ranges (IQR) across the different time points analyzed in the studies, the 0.25%-cutoff was associated with a significantly lower heterogeneity in richness (Study-1: IQR = 28.0 ± 17.8 vs. 70.6 ± 34.1, p < 0.001; Study-2: IQR = 17.0 ± 3.2 vs. 49.0 ± 10.4, p = 2.5 x 10-13) (Fig. 4a). Another helpful readout of alpha-diversity is the Shannon effective count, which takes into account the evenness of species distribution and can be, simply speaking, considered as a proxy for the number of most dominant species [21, 28]. Altogether, the trend observed for richness (less heterogeneity after 0.25%-filtering) was similar when considering Shannon effective counts (data not shown). However, lower effective counts after stringent filtering (0.25%) were not significantly different for Study-2, showing that Shannon effective counts can be useful to alleviate the impact of low abundant species.
In addition to these two literature studies, which focused on the analysis of distinct samples (different individuals with several time points in both studies), we also analyzed triplicates of six fecal samples from healthy human adults sequenced several times in-house. This dataset that consisted of the very same samples sequenced in seven different runs was ideal to test reproducibility depending on filtering thresholds. Across all runs, the coefficient of variations (CVs) calculated on richness values between the triplicates of each sample within a run were on average <5%, and lowest when applying the 0.25%-cutoff (Fig. 4b). In contrast, CVs of the richness within samples across sequencing runs increased to 20% on average with a peak at 40% when applying singletons filtering, which dropped to approx. 10% (average) and 30% (maximum) when applying the 0.25%-cutoff (Wilcoxon-Mann-Whitney test, p = 0.004) (Fig. 4b). This clearly indicates that 16S rRNA gene amplicon sequencing, at least as performed in our study, generates richness values that vary markedly between sequencing runs for the same sample, especially when following a loose OTU filtering strategy.
Between-sample comparisons are impacted by filtering strategies
We next assessed how filtering influenced beta-diversity analyses. To obtain reference data to which filtering strategies after de novo OTU clustering could be compared, a closed-reference protocol was also used, as in the published studies selected [11, 12].
In Study-1, the median unweighted distance across all individuals was approximately 0.5 after using reference-picking, including a broad range of within-host temporal variations (some individuals being characterized by more stable profiles than others) (Fig. 4c; left panel), as observed in the original study [12]. As expected, the strongest effect of filtering strategies was observed when using unweighted UniFrac distances: singletons removal was characterized by a higher overtime variation in profiles (median value of approx. 0.6 vs. 0.3 for the 0.25%-cutoff) (Fig. 4c; middle panel). Interestingly, whilst using generalized UniFrac distances leveled the difference between filtering approaches, it widened the range of individual-specific overtime variability around the median, potentially enhancing the discriminatory power between ‘stable’ and ‘variable’ individuals (Fig. 4c; right panel).
In Study-2, one of the main findings in the original work was that volatility (i.e. variations overtime within individuals) was highest in patients suffering from Crohn’s disease with an ileal phenotype that underwent ileocecal resection (ICD-r) [11]. We confirmed this finding by using reference-based picking and unweighted distances, as done in the published manuscript (Fig. 4d; left panel). However, whenever applying de novo clustering, this difference could only be observed when using the 0.25%-cutoff in combination with unweighted distances (Fig. 4d; middle panel). The absence of a significant differences when using unweighted distances after singletons removal may indicate the confounding effect of spurious OTUs (Fig. 4d; middle panel). The absence of difference when applying generalized distances in general (Fig. 4d; right panel) suggests that individual-specific overtime variations in this study are bound to the presence/absence of taxa rather than to changes in composition.
Validation studies
In order to confirm utility of the 0.25%-cutoff inferred from data generated at the Core Facility Microbiome of the ZIEL Institute for Food and Health (TU Munich, Germany), additional samples were processed and analyzed independently at the Joint Microbiome Facility of the Medical University of Vienna and the University of Vienna (JMF).
First, processing of a log-distributed version of the ZymoBIOMICS Microbial Community Standard (Zymo Research GmbH) containing eight bacterial strains confirmed the advantage of applying the 0.25%-filtering approach. Twenty-five replicates of the same DNA sample were sequenced on five sequencing runs (1-8 replicates per run) using either the V4 region combined with combinatorial dual (CD) barcoding or the V3/4 regions with unique dual (UD) barcoding (two and three runs, respectively). V4-CD vs. V3/4-UD yielded 31 ± 16/8 ± 2 ASVs (min: 13/5, max: 57/10), respectively. Spurious ASVs (i.e., all sequences with a Hamming distance to the reference > 1) were greatly reduced using a 0.25% filtering step from 73 ± 8 to 2 ± 2 and from 13 ± 15 to 0 in V4-CD vs. V3/4-UD, respectively (Fig. 5a). This occurred at a loss of 15% true taxa in the case of V4-CD whilst no change was observed with V3/4-UD (Fig. 5b). As the highest relative abundance reached by any spurious ASV was 0.28% and the true taxa detected corresponded to dominant members of the standard community, the cumulative relative abundance of true taxa was high (>98%) in all cases (Fig. 5b).
Second, peat soil DNA [17] was analyzed to confirm suitability of our filtering approach for non-gut samples. One identical DNA sample was sequenced on three different runs (3-5 replicates per run) using primers 515F/806R (V4 region) and CD barcoding. The ASV table was rarefied to the minimum sum count (9,104) and analyzed without or with filtering (i.e., only ASVs observed at a relative abundance >0.25% in at least one replicate were kept). Richness was calculated using ampvis2 [29]. Applying the 0.25%-cutoff decreased the number of observed ASVs from 408 ± 71 to 139 ± 5 and, more importantly, the IQR from 101 to 7 (Fig. 5b). Unweighted UniFrac distances within and between runs as calculated using ampvis2 were also compared before and after filtering. Sequences were aligned using MAFFT [30] and phylogeny was inferred using FastTree. Whilst the community makeup in the soil sample varied substantially between sequencing runs without additional filtering, the 0.25%-cutoff reduced this variation to the level observed within runs without filtering (Fig. 5c). Replicates within a run were very similar after applying the 0.25%-cutoff. Altogether, these data demonstrate that stringent filtering delivers more stable values obtained for the very same sample sequenced in replicates across several sequencing runs.