Green Cleaner: decontamination model description
In this study, we developed a novel decontamination algorithm called Green Cleaner, in which various decontamination rules were adapted and integrated to complement their inherent limitations. Green Cleaner is composed of less than ten rules and each rule discriminates between contaminated ASV sequences from true ASV sequences. Green Cleaner s applied to each experimental batch and operated based on the 16S rRNA sequencing results of one blank extraction control processed together in each batch. Also, Green Cleaner only accounts for samples with ASV read counts of more than 500 because inadequately sequenced samples may not effectively reflect the overall bacterial community truly present in the sample.
The samples in the processed batch were first classified into three groups according to the level of contamination, which was determined based on the sum of the relative abundances of the five ASVs identified at the highest abundance in the blank extraction control of each batch in the sample. The five ASVs identified with the highest abundance in the blank extraction control are henceforth referred to as the top 5 ASVs in this paper. Samples in Group 1 are uncontaminated and were defined as samples in which the sum of the relative abundances of the top 5 ASVs was 0. The samples in Group 2 have a low level of contamination, as indicated by the sum of the relative abundances of the top 5 ASVs in the sample of less than 5%. The sum of the relative abundances of the top 5 ASVs in Group 3 samples is 5% or above, suggesting a moderate to high level of contamination. In the Green Cleaner algorithm, different decontamination rules are applied to the three groups, depending on the degree of contamination. The overall workflow of Green Cleaner is schematized in Fig. 1, and the details are as follows (Fig. 1).
For Group 1 samples, all ASVs detected in the 16S rRNA sequencing results were considered valid sequences, and none of them were removed. Because the Group 2 samples have a low level of contamination, and contaminants rather than the top 5 ASVs in the sample were thought to be at a very low abundance, we removed the top 5 ASVs as well as the ASVs with a relative abundance of less than 0.5%.
ASVs found in Group 3 samples were further classified into three categories depending on how abundant they were or whether they existed in the blank extraction control of the experimental batch, as follows: the top 5 ASVs were classified as category 1. ASVs that were not among the top 5 ASVs but were detected in the blank extraction control of the experimental batch were classified as category 2. ASV that were not present in the blank extraction control of the experimental batch were classified into category 3. We applied different decontamination rules according to the ASV category.
1) (Category 1 ASV) the top 5 ASVs
Abundant contamination, such as category 1 ASVs, was robustly detected across all moderately to highly contaminated samples as well as the blank extraction control in a sequencing run. The relative proportions of these abundant contaminants were similar in all contaminated samples as well as in the blank extraction control because multiple taxa present in the contamination source were introduced together in the samples. However, it is possible that this feature is both a contaminant and genuine in the studied ecosystem. In this case, the genuine feature was present in the sample of interest at a much higher prevalence compared to the other abundant contaminants, breaking the similar proportions observed across the blank extraction control and contaminated samples. To distinguish between contaminants and genuine features, we measured the Euclidean distance similarity between the compositional data of each sample and a blank extraction control using biplot analysis, in which the relative abundances of the top 5 ASVs of the samples and blank extraction control were normalized to 100. The larger the Euclidean distance similarity, the more similar the proportion of the top 5 ASVs in the sample to that of the blank extraction control, indicating that the top 5 ASV of the sample were contaminants. The smaller the Euclidean distance similarity, the greater the proportion of the top 5 ASVs in the sample that deviated from that of the blank extraction control; some of these may be genuine features. The cutoff of the Euclidean distance similarity was set at 0.019 pragmatically, based on the observation from our 16S rRNA sequencing data that the composition of the top 5 ASVs in the sample with Euclidean distance similarity below this cutoff seemed to be biased considerably from the composition shown in the blank extraction control. Once the samples with Euclidean distance similarity below this cutoff were identified, the feature with the highest loading vector among the top 5 ASVs was considered a genuine feature, and the remaining features were removed.
2) (Category 2 ASV) ASV detected in the blank extraction control but not top 5 ASVs
Category 2 ASVs indicated a relatively low abundance of contaminant microbial DNA, and the majority of contaminants might fall into this category. These low-abundance contaminant ASVs will be detected at a low prevalence in the sample, similar to the blank extraction control. However, some genuine features in a sample may be classified as category 2 because of the well-to-well leakage phenomenon, in which an abundantly present genuine feature in a sample is cross-contaminated into a blank extraction control. Well-to-well leakage commonly occurs within batches during experimental procedures. While contaminant ASVs belonging to category 2 were detected at low levels in most samples, as well as in the blank extraction control, these genuine features might be detected at a much higher ratio. To distinguish between the contaminant and truly present taxa among the ASVs in this category, we used the Z-score method, which deals with shared information across samples within the batch. Statistically, the Z-score quantifies the distance (in standard deviations) of a data point from the mean of the dataset. It is commonly used to identify outliers, which are data points that deviate significantly from the remaining data. A high absolute z-score indicates that a data point is far from the mean, suggesting that it may be an outlier. Because ASVs that exist as contaminants in an experimental batch will be identified at a proportion similar to that of the blank extraction control in samples where the ASV is found, the Z-score, which is calculated using the proportion of ASVs, will show a low value close to zero for these ASVs. Meanwhile, ASV, which is truly present as a genuine feature in a sample, can be expected to be identified at much higher levels than the other samples as well as the blank extraction control. In this case, the Z-score of the ASVs presented as genuine taxa will show a higher value than the other samples. Therefore, we evaluated the Z-score for each ASV belonging to Category 2 to differentiate the truly present features.
However, because the relative abundance of low-abundance contaminants in biological samples may vary according to their contamination levels, the Z-score calculated using the relative abundance of ASVs may have limited accuracy in distinguishing truly present features from contaminants. Therefore, in our framework, the adjusted Z-score, using the value of the relative abundance of the ASVs divided by the sum of the top 5 ASVs, was used rather than a simple Z-score using the relative abundance of the ASV itself.
Additionally, we utilized a modified Z-score that employs the median rather than the mean to calculate the Z-score because this is a more robust way to detect outliers. The cutoff of the adjusted modified Z-score was pragmatically set to 8, and an ASV with an adjusted modified Z-core of 8 or more was considered an actual feature in the sample. Only ASVs found in three or more biological samples were subjected to the adjusted modified Z-score analysis, whereas ASVs found in two or fewer biological samples were subjected to the decontamination rule for category 3 ASV, which are described in the next section.
3) (Category 3 ASV) ASV not detected in the blank extraction control
According to Dyrhovden et al., the majority of contaminants are present in low abundance and are randomly included during pipetting of the PCR template [31]. They are subject to the rule of small numbers, which states that a random sample is unlikely to accurately represent the population from which it is obtained [32]. Therefore, these contaminants will not occur in the blank extraction control, particularly when the number of controls is limited. Most ASVs belonging to category 3 are likely to be contaminants present in low abundance; consequently, many contaminants may only exist in samples without being detected in the blank extraction control. Therefore, it is necessary to distinguish ASVs that can be represented as contaminants in samples, even if they are not detected in a blank extraction control. To determine whether category 3 ASVs are contaminants, we applied ecological plausibility and created an in-house database in our framework.
To determine ecological plausibility, we used BacDive database that is the largest worldwide database for standardized bacterial information and include isolation sources of the strains [33]. If none of the strains belonging to a genus had been isolated from human-related sources, we classified the genus as a non-human source. As result, among the 3239 genera for which isolation sources are registered in the Bacdive database, 2257 genera were classified as non-human sources. In addition, many studies have reported that Proteobacteria, particularly Alpha- and Beta-proteobacteria, include bacteria that essentially contribute to the nitrogen cycle in ecosystems and are well known to dominate ecological environments, such as soil and water [34, 35]. Overall, taxa belonging to Alpha-Proteobacteria or Beta-Proteobacteria as well as taxa belonging to the genus classified as non-human source from the Bacdive database were defined as non-biological contaminants that do not fall under ecological plausibility, and if an ASV classified as category 3 in the sample falls under this list, it was removed as a contaminant.
Specific contaminants originating from the laboratory environment, consumables, and reagents may also be present. We created an in-house blacklist that referred to features that were specifically and recurrently detected in our 16S rRNA sequencing data produced from 2,912 clinical urine samples and 148 blank extraction controls. To create an in-house blacklist, we assumed two concepts: (1) contaminants should be present in the blank extraction controls at a higher relative abundance compared to the biological samples, and (2) category 3 contaminants might not be discovered in a high ratio. Then, we classified the features into an in-house blacklist that met the following criteria: (1) maximum relative abundance of the feature across all biologic urine samples was less than 1%, (2) maximum relative abundance of the feature across all biologic urine samples was less than 5% when the mean relative abundance of all blank extraction controls was higher than the mean relative abundance of the feature of all biologic urine samples, and (3) maximum relative abundance of the feature across all biologic urine samples was less than 5% when the genus assigned to the feature was listed on the contaminants list in the GRIMER repository more than three times. GRIMER [36] is a tool for analyzing, visualizing, and exploring microbiome studies with a focus on contamination detection and compiles an extensive list of common contaminants containing 210 genera and 627 species reported in 22 published articles. There were 85 genera listed more than three in the contaminant list in the GRIMER repository (Table 1). By applying these criteria for in-house blacklist, 54,721 out of 56,010 ASVs identified from 3,060 16S rRNA sequencing data were classified as blacklists. Of the 54,721 blacklisted ASVs, 491 were detected in more than 100 of the 2912 urine samples (Additional File 1).
The following is the order in which contaminants are removed from ASVs that are classified as category 3. First, ASVs corresponding to non-biological contaminants were removed, and then features corresponding to the in-house blacklist with a relative abundance of less than 5% were removed. Additionally, rare features with a relative abundance of less than 0.1% were removed.
Among the ASV features remaining valid after decontamination processes in all categories, ASV features whose sequences were only assigned up to Class level were additionally removed because it is reasonable for genuine taxa to be appropriately assigned to low-level taxonomies using well-curated and complete reference taxonomy databases and well-performing taxonomy assignment algorithms. Finally, ASVs with read counts below 10 were eliminated to exclude the possibility of low-frequency artifacts (e.g., sequencing artifacts or low-lying PCR contamination).
Table 1
Genera listed more than three times on the contamination list in the GRIMER repository.
Genus |
Abiotrophia | Achromobacter | Acidovorax | Acinetobacter |
Actinomyces | Afipia | Agrobacterium | Anaerococcus |
Aquabacterium | Arthrobacter | Bacillus | Bacteroides |
Blautia | Bosea | Bradyrhizobium | Brevibacillus |
Brevibacterium | Brevundimonas | Burkholderia | Capnocytophaga |
Caulobacter | Chryseobacterium | Cloacibacterium | Clostridium |
Comamonas | Corynebacterium | Cupriavidus | Curvibacter |
Cutibacterium | Delftia | Dialister | Dietzia |
Duganella | Enhydrobacter | Enterococcus | Escherichia |
Faecalibacterium | Flavobacterium | Geobacillus | Granulicatella |
Haemophilus | Halomonas | Herbaspirillum | Janthinobacterium |
Kingella | Kocuria | Lactobacillus | Lactococcus |
Leptotrichia | Massilia | Megasphaera | Mesorhizobium |
Methylobacterium | Microbacterium | Micrococcus | Neisseria |
Novosphingobium | Paenibacillus | Parabacteroides | Paracoccus |
Pedobacter | Pelomonas | Phocaeicola | Phyllobacterium |
Porphyromonas | Prevotella | Propionibacterium | Pseudomonas |
Psychrobacter | Ralstonia | Rhizobium | Rhodococcus |
Roseomonas | Rothia | Sediminibacterium | Shewanella |
Sphingobacterium | Sphingobium | Sphingomonas | Sphingopyxis |
Staphylococcus | Stenotrophomonas | Streptococcus | Variovorax |
Veillonella | | | |
Evaluation of decontamination methods
1) Human vaginal microbial dilution series data
To assess the performance of our decontamination algorithm, we prepared a human vaginal microbial dilution series using ten leftover human vaginal microbiome samples. Vaginal microbiome samples were collected using a sterile swab kit containing preservatives (Noble Biosciences, Republic of Korea). Preservative solutions of each vaginal sample were first diluted to 1/1000 and had further undergone six rounds of serial two-fold dilutions with nuclease-free water (NFW) (Invitrogen, USA). Nucleic acid concentrations of the undiluted vaginal samples were ranged from 4.44 to 39.6 ng/ul. Experiments of 16S rRNA sequencing for a total of 10 sets of the vaginal sample dilution series were conducted in the same manner with other catheterized urine samples requested to the laboratory and processed divided into 6 experimental batches along with urine samples. A blank extraction control was included in each batch. This study was approved by the Ethics Committee of GC Laboratories (GCL-2023-1075-02).
2) DNA extraction and 16S rRNA sequencing
DNA extraction was performed using the MagMAX™ Microbiome Ultra Nucleic Acid Isolation Kit (ThermoFisher Scientific, Waltham, MA, USA) according to the manufacturer’s instructions. The prepared DNA was used for 16S library construction using NEXTflex 16S V4 Amplicon-Seq (Bioo Scientific, Austin, TX, USA). The amplification cycle was 8 cycles for PCR I amplification and 22 cycles for PCR II amplification.The final library products were diluted, pooled, and sequenced using the MiSeq system (Illumina) with a paired-end 500-cycle kit. The vaginal microbial dilution series and blank extraction controls were subjected to the same procedure.
3) Bioinformatic Analysis
QIIME 2 was used to analyze the 16S rRNA sequence data [37]. Demultiplexed and primer-trimmed data were quality-filtered and denoised using DADA2 (Divisive Amplicon Denoising Algorithm 2), which uses a parametric model to infer exact biological sequences from quality-filtered reads, known as ASVs [38, 39]. In DADA2, independently denoised forward and reverse reads were merged at the end of the workflow, and the chimeric ASVs were removed. For the taxonomic classification of ASVs, a multinomial naive Bayes machine-learning classifier in the q2-feature-classifier was used against the refseq database[40]. Finally, ASVs that were not assigned to bacteria at the domain level were removed.
4) Benchmarking of decontamination methods
To evaluate the performance of Green Cleaner, we compared the outcomes of the decontamination process using Green Cleaner with the previously published decontamination method SCRuB. SCRuB [41] is a probabilistic in silico decontamination method that incorporates shared information across multiple samples and controls to precisely identify and remove contamination and is reported to outperform alternative decontamination methods under in silico simulations of diverse environments and data types. We tested the decontamination performance of Green Cleaner against SCRuB using 16S rRNA sequencing data produced from the vaginal microbial dilution series, as described above.
To assess the overall performance of the decontamination methods, we calculated the alpha-diversity metrics and Bray–Curtis dissimilarity between the decontaminated data and 16S rRNA sequencing data from the undiluted vaginal samples. We defined ground-truth classification as a contaminant for each ASV based on 16S rRNA sequencing data generated using the undiluted vaginal samples. In other words, a contaminant ASV is an ASV that is not expected to be a part of undiluted vaginal samples, whereas a ground-truth ASV is an ASV that occurs in undiluted vaginal samples. To calculate the overall accuracy of each method, we classified the ASVs as correctly or incorrectly identified as the ground truth or contaminants in the decontamination results. ASVs classified correctly as ground truth or contaminants were referred to as true positives or true negatives, respectively, and ASVs classified incorrectly as ground truth or contaminants were referred to as false positives or false negatives, respectively. Accuracy was calculated using the following formula: (ASV read number of correct predictions) / (total ASV read number of predictions).
5) Statistical Analysis
All statistical analyses were performed using R version 4.0.5. Category 1 ASVs were decontaminated using the prcomp and biplot functions, and Euclidean distance similarity was calculated using the proxy package. When calculating the adjusted modified Z-score, if the median absolute deviation (MAD) value was nonzero, it was multiplied by a weighting factor of 1.4826 and used as the denominator. However, if the MAD value was zero, the average absolute deviation (AAD) was multiplied by 1.2533 and used as the denominator. A statistical hypothesis test comparing the two groups was performed using the Wilcoxon signed-rank sum test, which is a nonparametric test. The smooth curve of the numerical changes according to the proportion of total contaminants in the figures was analyzed using LOESS.