Green Cleaner: Advanced Decontamination Algorithm for Catheterized Urine 16S rRNA Sequencing Data

doi:10.21203/rs.3.rs-4921725/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Background

Contamination of low-biomass samples, such as urine, is a significant challenge in 16S rRNA sequencing. The presence of extraneous DNA in reagents and the environment often obscures microbial DNA, making it difficult to identify and remove contaminants. In silico decontamination algorithms developed so far still have some limitations in identifying and removing contaminants accurately. In this study, we developed a novel decontamination algorithm, Green Cleaner, to enhance the accuracy of 16S rRNA sequencing data by effectively distinguishing and removing contaminants especially from catheterized urine samples.

Results

We evaluated the performance of Green Cleaner against SCRuB using a series of vaginal microbiome dilution experiments as a proxy for low-biomass urine samples. Our results demonstrate that Green Cleaner outperforms SCRuB across all contamination levels, with higher accuracy and F1-score and lower beta-dissimilarity. Specifically, Green Cleaner showed improved specificity and positive predictive value (PPV), correctly removing more contaminant amplicon sequence variant (ASV) features than SCRuB did. This was evidenced by the lower alpha diversity of the Green Cleaner decontamination results compared with those of SCRuB, indicating a more precise elimination of contaminants by Green Cleaner.

Conclusions

Green Cleaner offers a robust solution for decontaminating 16S rRNA sequencing data from low-biomass samples, particularly catheterized urine samples, thus addressing the key limitations of the existing methods. By utilizing a single blank extraction control per batch and a set of intuitive and adjustable decontamination rules, Green Cleaner provides a practical and efficient approach for real-world applications. Our findings suggest that Green Cleaner has the potential to substantially advance urine microbiome research by providing more accurate and reliable microbial profiles.

16S rRNA sequencing

Low biomass samples

Blank extraction control

Microbial contamination

Decontamination algorithms

Urine microbiome

Microbiome research has advanced because next-generation sequencing technology allows more sensitive surveys of microbial communities, genomes, and functions than were previously conceivable. Microbiome studies initially focused on the gut microbiota, followed by other high-biomass sites, such as the vagina, skin, and mouth, which were the major body sites in the Human Microbiome Project launched in 2007 [1]. They have since been extended to samples previously thought to be sterile, such as urine [2], placenta [3], and the lower airway [4]. Consequently, it has been shown that these supposedly sterile samples are inhabited by unique microbiota despite having a low microbial burden. The urinary tract also has a unique microbiota, even in the absence of urinary tract infection [5], and its microbial burden is reported to be 10³ to 10⁵ bacteria per ml of urine. This is at least 10⁶ times smaller than the 10¹¹ bacteria per ml of gut content [6].

Urological disorders were previously thought to have no microbiological etiology; however, the discovery that the urinary tract is not a sterile environment and has a diverse and distinct urobiome has changed our understanding of these conditions. The function of the urobiome in a variety of urological diseases is gaining attention, and its alterations have been reported in a variety of urological diseases, such as chronic recurrent cystitis, neurogenic bladder dysfunction, interstitial cystitis, urgency urinary incontinence, urolithiasis, overactive bladder, and bladder cancer [7–18].

As with other microbiome studies, the most commonly used method for urobiome research is marker gene (amplicon) sequencing because of its low cost and speed. To investigate bacterial communities, a partial hypervariable region of the 16S rRNA gene is specifically targeted. This process consists of extracting bacterial DNA from a sample, amplifying it using polymerase chain reaction (PCR), and sequencing. Amplicon sequencing is an extremely sensitive method, even for low-biomass specimens, and has increased our ability to detect microbes in such samples. However, accurate characterization of microbial communities using marker gene sequencing is challenging in low-biomass specimens containing very little endogenous DNA because of bacterial DNA contamination from exogenous sources introduced during sample collection and processing. Numerous studies have demonstrated how contamination can be amplified and negatively affect biological interpretations, skewing results [19–22].

Extracellular microbial DNA can last for thousands of years and is found in nearly all ecosystems, including soils, sediments, freshwater, and oceans [23]. In addition, its effects are widespread in laboratory environments, and contaminant bacterial DNA can be isolated from many sources, including plastic consumables [24], molecular biology grade water [25, 26], nucleic acid extraction kits [19, 22, 27], and PCR master mixes [20, 26, 28, 29]. Contaminated laboratory reagents in 16S rRNA gene-based experiments have long been recognized in the scientific literature [30], and these contaminating sequences have been previously reported to match water- and soil-associated bacterial genera such as Acinetobacter, Alcaligenes, Bacillus, Bradyrhizobium, Herbaspirillum, Legionella, Leifsonia, Mesorhizobium, Methylobacterium, Microbacterium, Novosphingobium, Pseudomonas, Ralstonia, Sphingomonas, Stenotrophomonas, and Xanthomonas [22].

Several methods are available for detecting and eliminating contamination from microbial sequencing data, including (i) removal of sequences that appear in controls, (ii) removal of sequences below an ad hoc relative abundance threshold, (iii) removal of sequences previously identified as contaminants, and (iv) bioinformatics methods. In particular, the most popular method for controlling and mitigating the impact of contaminant bacterial DNA in low-biomass samples is to sequence blank extraction controls along with the samples, relying on the assumption that sequencing of appropriate blank extraction controls will reveal background contaminants that could possibly occur in the associated clinical samples. Various bioinformatics algorithms have been developed using these controls. However, none of these methods is perfect for recovering endogenous signals, and each has its own limitations.

Furthermore, there is little consensus on how best to mitigate the contamination of microbiome samples, and there is currently no standard technique to remove these contaminants, resulting in inconsistent and controversial results. Catheterized urine samples are also susceptible to contamination, but many urobiome studies have been published without appropriate decontamination procedures. Consequently, it is difficult to reach a consensus on the connection between urological illnesses and the urobiome. Therefore, we developed a decontamination algorithm, Green Cleaner, specifically for catheterized urine microbiome data. We rigorously validated the algorithm using data generated from a multiple dilution series of human vaginal microbial samples and demonstrated that Green Cleaner outperformed an algorithm reported to remove contamination with the highest accuracy among recently reported decontamination tools.

Green Cleaner: decontamination model description

In this study, we developed a novel decontamination algorithm called Green Cleaner, in which various decontamination rules were adapted and integrated to complement their inherent limitations. Green Cleaner is composed of fewer than ten rules, and each rule discriminates contaminant ASV sequences from true ASV sequences. Green Cleaner is applied to each experimental batch and operates on the 16S rRNA sequencing results of one blank extraction control processed together with the samples in that batch. In addition, Green Cleaner only processes samples with ASV read counts of more than 500 because inadequately sequenced samples may not effectively reflect the overall bacterial community truly present in the sample.

The samples in the processed batch were first classified into three groups according to the level of contamination, which was determined from the sum, within each sample, of the relative abundances of the five ASVs that were most abundant in the blank extraction control of that batch. These five ASVs are henceforth referred to as the top 5 ASVs in this paper. Samples in Group 1 are uncontaminated and were defined as samples in which the sum of the relative abundances of the top 5 ASVs was 0. Samples in Group 2 have a low level of contamination, as indicated by a sum of the relative abundances of the top 5 ASVs of less than 5%. In Group 3 samples, the sum of the relative abundances of the top 5 ASVs is 5% or above, suggesting a moderate to high level of contamination. In the Green Cleaner algorithm, different decontamination rules are applied to the three groups, depending on the degree of contamination. The overall workflow of Green Cleaner is schematized in Fig. 1, and the details are as follows.
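The grouping step can be illustrated with a minimal Python sketch. It is not the authors' implementation (their analysis was done in R); the function names and the dictionary-of-read-counts input format are assumptions made for illustration, while the 500-read floor and the 0% and 5% thresholds come from the text above.

```python
def top5_asvs(blank_counts):
    """Return the (up to) five most abundant ASV ids in the blank extraction control."""
    ranked = sorted(blank_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [asv for asv, count in ranked[:5] if count > 0]


def contamination_group(sample_counts, top5):
    """Assign a sample to Group 1, 2, or 3 from the summed top 5 ASV share.

    Returns None for samples at or below the 500-read floor, which are
    not processed at all.
    """
    total = sum(sample_counts.values())
    if total <= 500:
        return None
    share_pct = 100.0 * sum(sample_counts.get(a, 0) for a in top5) / total
    if share_pct == 0:
        return 1  # uncontaminated
    return 2 if share_pct < 5.0 else 3  # low vs. moderate-to-high contamination


# Hypothetical batch: one blank control and three samples.
blank = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10, "f": 1}
top5 = top5_asvs(blank)
groups = [
    contamination_group({"x": 1000}, top5),
    contamination_group({"x": 970, "a": 30}, top5),
    contamination_group({"x": 900, "a": 100}, top5),
]
```

Here the three hypothetical samples carry 0%, 3%, and 10% of their reads in the top 5 ASVs, so they fall into Groups 1, 2, and 3 respectively.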

For Group 1 samples, all ASVs detected in the 16S rRNA sequencing results were considered valid sequences, and none of them were removed. Because the Group 2 samples have a low level of contamination, and contaminants other than the top 5 ASVs were expected to be present in the sample at a very low abundance, we removed the top 5 ASVs as well as the ASVs with a relative abundance of less than 0.5%.
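As a rough illustration of the Group 2 rule, the following Python sketch drops the top 5 ASVs and every ASV below 0.5% relative abundance. The function name and input format are hypothetical, and computing relative abundance against the sample's total reads before any removal is an assumption, since the text does not say at which point the 0.5% is evaluated.

```python
def clean_group2(sample_counts, top5, min_rel_abundance_pct=0.5):
    """Apply the Group 2 rule to one sample given as {asv_id: read_count}.

    Assumption: relative abundance is computed on the raw sample,
    before the top 5 ASVs are removed.
    """
    total = sum(sample_counts.values())
    kept = {}
    for asv, count in sample_counts.items():
        rel_pct = 100.0 * count / total
        if asv in top5 or rel_pct < min_rel_abundance_pct:
            continue  # removed as a presumed contaminant
        kept[asv] = count
    return kept


# Hypothetical sample: "a" is a top 5 ASV, "y" sits at 0.4%, "z" at 0.6%.
cleaned = clean_group2({"x": 960, "a": 30, "y": 4, "z": 6}, ["a", "b", "c", "d", "e"])
```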

ASVs found in Group 3 samples were further classified into three categories depending on how abundant they were in, or whether they were present in, the blank extraction control of the experimental batch. The top 5 ASVs were classified as category 1. ASVs that were not among the top 5 ASVs but were detected in the blank extraction control of the experimental batch were classified as category 2. ASVs that were not present in the blank extraction control of the experimental batch were classified as category 3. We applied different decontamination rules according to the ASV category.
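The three-way categorization for Group 3 samples reduces to two membership checks, sketched here in Python. The function name and the read-count dictionary for the blank control are illustrative assumptions.

```python
def asv_category(asv, top5, blank_counts):
    """Return the category (1, 2, or 3) of an ASV found in a Group 3 sample."""
    if asv in top5:
        return 1  # among the five most abundant ASVs in the blank control
    if blank_counts.get(asv, 0) > 0:
        return 2  # present in the blank control, but not in the top 5
    return 3      # absent from the blank control


# Hypothetical blank control of one batch.
blank = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10, "f": 1}
top5 = ["a", "b", "c", "d", "e"]
categories = {asv: asv_category(asv, top5, blank) for asv in ("a", "f", "x")}
```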

1) (Category 1 ASV) the top 5 ASVs

Abundant contaminants, such as category 1 ASVs, were robustly detected across all moderately to highly contaminated samples as well as the blank extraction control in a sequencing run. The relative proportions of these abundant contaminants were similar in all contaminated samples and in the blank extraction control because multiple taxa present in the contamination source were introduced together into the samples. However, a feature may be both a contaminant and a genuine member of the studied ecosystem. In this case, the genuine feature is present in the sample of interest at a much higher proportion than the other abundant contaminants, breaking the similar proportions observed across the blank extraction control and contaminated samples. To distinguish between contaminants and genuine features, we measured the Euclidean distance similarity between the compositional data of each sample and the blank extraction control using biplot analysis, in which the relative abundances of the top 5 ASVs of the samples and blank extraction control were normalized to 100. The larger the Euclidean distance similarity, the more similar the proportions of the top 5 ASVs in the sample are to those of the blank extraction control, indicating that the top 5 ASVs of the sample are contaminants. The smaller the Euclidean distance similarity, the more the proportions of the top 5 ASVs in the sample deviate from those of the blank extraction control; some of these ASVs may be genuine features. The cutoff of the Euclidean distance similarity was set pragmatically at 0.019, based on the observation from our 16S rRNA sequencing data that the composition of the top 5 ASVs in samples with a Euclidean distance similarity below this cutoff deviated considerably from the composition of the blank extraction control. Once the samples with a Euclidean distance similarity below this cutoff were identified, the feature with the highest loading vector among the top 5 ASVs was considered a genuine feature, and the remaining features were removed.
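A minimal Python sketch of the similarity check is given below. It covers only the comparison of normalized top 5 profiles; the principal component and loading-vector step is not reproduced. The text does not state how the Euclidean distance is turned into a similarity, so the conversion 1 / (1 + distance) used here is an assumption, as are the function names and input format.

```python
import math


def normalize_top5(counts, top5):
    """Rescale the top 5 ASV counts of one profile so that they sum to 100."""
    values = [counts.get(a, 0) for a in top5]
    total = sum(values)
    if total == 0:
        raise ValueError("profile contains no reads from the top 5 ASVs")
    return [100.0 * v / total for v in values]


def euclidean_similarity(sample_counts, blank_counts, top5):
    """Similarity between sample and blank top 5 profiles.

    Assumption: similarity = 1 / (1 + Euclidean distance); the source
    does not spell out the conversion.
    """
    s = normalize_top5(sample_counts, top5)
    b = normalize_top5(blank_counts, top5)
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(s, b)))
    return 1.0 / (1.0 + distance)


def top5_profile_deviates(similarity, cutoff=0.019):
    """True when the sample's top 5 profile departs from the blank's (cutoff from the text)."""
    return similarity < cutoff


# Hypothetical profiles: an even blank, a matching sample, and a skewed sample.
top5 = ["a", "b", "c", "d", "e"]
blank = {a: 20 for a in top5}
same_shape = euclidean_similarity({a: 2 for a in top5}, blank, top5)
skewed = euclidean_similarity({"a": 500, "x": 4500}, blank, top5)
```

In this toy case the skewed sample has all of its top 5 reads in a single ASV, which gives a distance of about 89 on the 0 to 100 scale and a similarity below the 0.019 cutoff, so that ASV would be a candidate genuine feature.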

2) (Category 2 ASV) ASV detected in the blank extraction control but not top 5 ASVs

Category 2 ASVs indicated a relatively low abundance of contaminant microbial DNA, and the majority of contaminants might fall into this category. These low-abundance contaminant ASVs will be detected at a low proportion in the sample, similar to the blank extraction control. However, some genuine features in a sample may be classified as category 2 because of the well-to-well leakage phenomenon, in which an abundantly present genuine feature in a sample is cross-contaminated into a blank extraction control. Well-to-well leakage commonly occurs within batches during experimental procedures. While contaminant ASVs belonging to category 2 were detected at low levels in most samples, as well as in the blank extraction control, these genuine features might be detected at a much higher ratio. To distinguish between the contaminant and truly present taxa among the ASVs in this category, we used the Z-score method, which uses information shared across samples within the batch. Statistically, the Z-score quantifies the distance (in standard deviations) of a data point from the mean of the dataset. It is commonly used to identify outliers, which are data points that deviate significantly from the remaining data. A high absolute Z-score indicates that a data point is far from the mean, suggesting that it may be an outlier. Because ASVs that exist as contaminants in an experimental batch will be identified at a proportion similar to that of the blank extraction control in samples where the ASV is found, the Z-score, which is calculated using the proportion of ASVs, will be close to zero for these ASVs. Meanwhile, an ASV that is truly present as a genuine feature in a sample can be expected to be identified at a much higher level in that sample than in the other samples and the blank extraction control. In this case, the Z-score of the ASV in the sample where it is a genuine taxon will be higher than in the other samples. Therefore, we evaluated the Z-score for each ASV belonging to category 2 to differentiate the truly present features.

However, because the relative abundance of low-abundance contaminants in biological samples may vary according to their contamination levels, the Z-score calculated using the relative abundance of ASVs may have limited accuracy in distinguishing truly present features from contaminants. Therefore, in our framework, the adjusted Z-score, using the value of the relative abundance of the ASVs divided by the sum of the top 5 ASVs, was used rather than a simple Z-score using the relative abundance of the ASV itself.

Additionally, we utilized a modified Z-score that employs the median rather than the mean because this is a more robust way to detect outliers. The cutoff of the adjusted modified Z-score was pragmatically set to 8, and an ASV with an adjusted modified Z-score of 8 or more was considered an actual feature in the sample. Only ASVs found in three or more biological samples were subjected to the adjusted modified Z-score analysis, whereas ASVs found in two or fewer biological samples were subjected to the decontamination rule for category 3 ASVs, which is described in the next section.
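The adjusted modified Z-score can be sketched in Python as follows. The scaling constants (1.4826 for a nonzero median absolute deviation, 1.2533 with the average absolute deviation otherwise) are the ones given in the Statistical Analysis section. Taking both deviations around the median, the input format, and the function name are assumptions, and the sketch is not the authors' R code.

```python
from statistics import median


def adjusted_modified_z(rel_abundances, top5_sums):
    """Adjusted modified Z-scores of one category 2 ASV across a batch.

    rel_abundances: the ASV's relative abundance in each sample where it was found.
    top5_sums: summed relative abundance of the top 5 ASVs in the same samples
               (at least 5% in Group 3 samples, so never zero).
    """
    adjusted = [r / t for r, t in zip(rel_abundances, top5_sums)]
    med = median(adjusted)
    abs_dev = [abs(x - med) for x in adjusted]
    mad = median(abs_dev)
    if mad != 0:
        scale = 1.4826 * mad
    else:
        # Fall back to the average absolute deviation when the MAD is zero.
        scale = 1.2533 * (sum(abs_dev) / len(abs_dev))
    if scale == 0:
        return [0.0] * len(adjusted)  # all values identical: no outliers
    return [(x - med) / scale for x in adjusted]


# Hypothetical ASV seen in five samples of one batch (values in %).
scores = adjusted_modified_z([1.0, 1.1, 0.9, 1.0, 30.0], [10.0] * 5)
genuine_flags = [z >= 8 for z in scores]  # cutoff of 8 from the text
```

Dividing by the top 5 sum first means that a contaminant which simply scales with the overall contamination level of a sample stays near zero, while the fifth hypothetical sample, where the ASV is far more abundant than contamination alone would explain, is the only one flagged as genuine.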

3) (Category 3 ASV) ASV not detected in the blank extraction control

According to Dyrhovden et al., the majority of contaminants are present in low abundance and are randomly included during pipetting of the PCR template [31]. They are subject to the rule of small numbers, which states that a random sample is unlikely to accurately represent the population from which it is obtained [32]. Therefore, these contaminants will not occur in the blank extraction control, particularly when the number of controls is limited. Most ASVs belonging to category 3 are likely to be contaminants present in low abundance; consequently, many contaminants may only exist in samples without being detected in the blank extraction control. Therefore, it is necessary to distinguish ASVs that can be represented as contaminants in samples, even if they are not detected in a blank extraction control. To determine whether category 3 ASVs are contaminants, we applied ecological plausibility and created an in-house database in our framework.

To determine ecological plausibility, we used the BacDive database, which is the largest worldwide database of standardized bacterial information and includes the isolation sources of the strains [33]. If none of the strains belonging to a genus had been isolated from human-related sources, we classified the genus as a non-human source. As a result, among the 3,239 genera for which isolation sources are registered in the BacDive database, 2,257 genera were classified as non-human sources. In addition, many studies have reported that Proteobacteria, particularly Alpha- and Beta-proteobacteria, include bacteria that contribute essentially to the nitrogen cycle in ecosystems and are well known to dominate ecological environments, such as soil and water [34, 35]. Overall, taxa belonging to Alpha-Proteobacteria or Beta-Proteobacteria, as well as taxa belonging to a genus classified as a non-human source from the BacDive database, were defined as non-biological contaminants that are not ecologically plausible, and any category 3 ASV in a sample that fell under this list was removed as a contaminant.

Specific contaminants originating from the laboratory environment, consumables, and reagents may also be present. We created an in-house blacklist of features that were specifically and recurrently detected in our 16S rRNA sequencing data produced from 2,912 clinical urine samples and 148 blank extraction controls. To create the in-house blacklist, we made two assumptions: (1) contaminants should be present in the blank extraction controls at a higher relative abundance than in the biological samples, and (2) category 3 contaminants are unlikely to be found at a high relative abundance. We then placed on the in-house blacklist the features that met the following criteria: (1) the maximum relative abundance of the feature across all biological urine samples was less than 1%, (2) the maximum relative abundance of the feature across all biological urine samples was less than 5% when the mean relative abundance of the feature across all blank extraction controls was higher than its mean relative abundance across all biological urine samples, and (3) the maximum relative abundance of the feature across all biological urine samples was less than 5% when the genus assigned to the feature was listed on the contaminant list in the GRIMER repository more than three times. GRIMER [36] is a tool for analyzing, visualizing, and exploring microbiome studies with a focus on contamination detection, and it compiles an extensive list of common contaminants containing 210 genera and 627 species reported in 22 published articles. There were 85 genera listed more than three times in the contaminant list in the GRIMER repository (Table 1). Applying these criteria, 54,721 of the 56,010 ASVs identified in the 3,060 16S rRNA sequencing datasets were blacklisted. Of the 54,721 blacklisted ASVs, 491 were detected in more than 100 of the 2,912 urine samples (Additional File 1).
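The blacklist criteria can be written as a small predicate, sketched here in Python. Treating the three criteria as alternatives (any one is sufficient) is an assumption, because the text lists them without stating how they combine. The function and argument names are hypothetical, and the genus set below is only a small excerpt of Table 1.

```python
def is_blacklisted(max_sample_pct, mean_sample_pct, mean_blank_pct,
                   genus, grimer_frequent_genera):
    """Decide whether a feature belongs on the in-house blacklist.

    All abundances are relative abundances in percent, summarized over
    every urine sample (max, mean) and every blank control (mean).
    Assumption: meeting any single criterion is sufficient.
    """
    if max_sample_pct < 1.0:                                       # criterion 1
        return True
    if max_sample_pct < 5.0 and mean_blank_pct > mean_sample_pct:  # criterion 2
        return True
    if max_sample_pct < 5.0 and genus in grimer_frequent_genera:   # criterion 3
        return True
    return False


# Excerpt of Table 1 (genera listed more than three times in GRIMER).
frequent = {"Pseudomonas", "Ralstonia", "Sphingomonas"}

rare_everywhere = is_blacklisted(0.4, 0.01, 0.0, "Gardnerella", frequent)
blank_enriched = is_blacklisted(3.0, 0.2, 1.5, "Undibacterium", frequent)
known_contaminant = is_blacklisted(3.0, 0.5, 0.1, "Ralstonia", frequent)
abundant_in_urine = is_blacklisted(40.0, 6.0, 0.1, "Pseudomonas", frequent)
```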

The following is the order in which contaminants are removed from ASVs that are classified as category 3. First, ASVs corresponding to non-biological contaminants were removed, and then features corresponding to the in-house blacklist with a relative abundance of less than 5% were removed. Additionally, rare features with a relative abundance of less than 0.1% were removed.
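The ordering of the category 3 rules is sketched below in Python. The thresholds (5% for blacklisted features, 0.1% for rare features) come from the text; the function name, the returned labels, and the way taxonomy and list membership are passed in are assumptions for illustration.

```python
NON_PLAUSIBLE_CLASSES = {"Alphaproteobacteria", "Betaproteobacteria"}


def category3_decision(genus, tax_class, rel_abundance_pct,
                       non_human_genera, blacklisted):
    """Apply the category 3 rules in the stated order and report the outcome."""
    # Step 1: ecological plausibility (taxonomic class or BacDive-derived genus list).
    if tax_class in NON_PLAUSIBLE_CLASSES or genus in non_human_genera:
        return "remove: not ecologically plausible"
    # Step 2: in-house blacklist, only below 5% relative abundance.
    if blacklisted and rel_abundance_pct < 5.0:
        return "remove: in-house blacklist"
    # Step 3: rare features below 0.1% relative abundance.
    if rel_abundance_pct < 0.1:
        return "remove: rare feature"
    return "keep"


# Hypothetical non-human genus list; the real one is derived from BacDive.
non_human = {"Undibacterium"}

decisions = [
    category3_decision("Ralstonia", "Betaproteobacteria", 12.0, non_human, False),
    category3_decision("Cutibacterium", "Actinomycetia", 2.0, non_human, True),
    category3_decision("Finegoldia", "Tissierellia", 0.05, non_human, False),
    category3_decision("Lactobacillus", "Bacilli", 35.0, non_human, True),
]
```

Note that in this reading a blacklisted feature at 5% or more survives the blacklist step, as the last hypothetical call shows.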

Among the ASV features remaining valid after decontamination processes in all categories, ASV features whose sequences were only assigned up to Class level were additionally removed because it is reasonable for genuine taxa to be appropriately assigned to low-level taxonomies using well-curated and complete reference taxonomy databases and well-performing taxonomy assignment algorithms. Finally, ASVs with read counts below 10 were eliminated to exclude the possibility of low-frequency artifacts (e.g., sequencing artifacts or low-lying PCR contamination).
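The two final filters can be illustrated with a short Python sketch that parses taxonomy strings in the `k_...;p_...;c_...` style shown in Table 2. Treating an ASV whose deepest assigned rank is class or shallower as "only assigned up to Class level" is an assumption about how the rule is read, and the function names are hypothetical.

```python
def deepest_assigned_rank(taxonomy):
    """Return the prefix (k, p, c, o, f, g, s) of the deepest non-empty rank."""
    deepest = None
    for field in taxonomy.split(";"):
        prefix, _, name = field.partition("_")
        if name:
            deepest = prefix
    return deepest


def passes_final_filters(taxonomy, read_count, min_reads=10):
    """Keep an ASV only if it is resolved below class level and has >= 10 reads."""
    too_shallow = deepest_assigned_rank(taxonomy) in (None, "k", "p", "c")
    return (not too_shallow) and read_count >= min_reads


class_only = "k_Bacteria;p_Firmicutes;c_Clostridia;o_;f_;g_;s_"
order_level = "k_Bacteria;p_Firmicutes;c_Clostridia;o_Eubacteriales;f_;g_;s_"
results = [
    passes_final_filters(class_only, 100),
    passes_final_filters(order_level, 100),
    passes_final_filters(order_level, 9),
]
```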

Table 1

Genera listed more than three times on the contamination list in the GRIMER repository.
Genus
Abiotrophia	Achromobacter	Acidovorax	Acinetobacter
Actinomyces	Afipia	Agrobacterium	Anaerococcus
Aquabacterium	Arthrobacter	Bacillus	Bacteroides
Blautia	Bosea	Bradyrhizobium	Brevibacillus
Brevibacterium	Brevundimonas	Burkholderia	Capnocytophaga
Caulobacter	Chryseobacterium	Cloacibacterium	Clostridium
Comamonas	Corynebacterium	Cupriavidus	Curvibacter
Cutibacterium	Delftia	Dialister	Dietzia
Duganella	Enhydrobacter	Enterococcus	Escherichia
Faecalibacterium	Flavobacterium	Geobacillus	Granulicatella
Haemophilus	Halomonas	Herbaspirillum	Janthinobacterium
Kingella	Kocuria	Lactobacillus	Lactococcus
Leptotrichia	Massilia	Megasphaera	Mesorhizobium
Methylobacterium	Microbacterium	Micrococcus	Neisseria
Novosphingobium	Paenibacillus	Parabacteroides	Paracoccus
Pedobacter	Pelomonas	Phocaeicola	Phyllobacterium
Porphyromonas	Prevotella	Propionibacterium	Pseudomonas
Psychrobacter	Ralstonia	Rhizobium	Rhodococcus
Roseomonas	Rothia	Sediminibacterium	Shewanella
Sphingobacterium	Sphingobium	Sphingomonas	Sphingopyxis
Staphylococcus	Stenotrophomonas	Streptococcus	Variovorax
Veillonella

Evaluation of decontamination methods

1) Human vaginal microbial dilution series data

To assess the performance of our decontamination algorithm, we prepared a human vaginal microbial dilution series using ten leftover human vaginal microbiome samples. Vaginal microbiome samples were collected using a sterile swab kit containing preservatives (Noble Biosciences, Republic of Korea). The preservative solution of each vaginal sample was first diluted to 1/1000 and then underwent six rounds of serial two-fold dilution with nuclease-free water (NFW) (Invitrogen, USA). Nucleic acid concentrations of the undiluted vaginal samples ranged from 4.44 to 39.6 ng/µL. 16S rRNA sequencing of the 10 vaginal sample dilution series was conducted in the same manner as for the catheterized urine samples submitted to the laboratory, and the series were processed in 6 experimental batches alongside urine samples. A blank extraction control was included in each batch. This study was approved by the Ethics Committee of GC Laboratories (GCL-2023-1075-02).

2) DNA extraction and 16S rRNA sequencing

DNA extraction was performed using the MagMAX™ Microbiome Ultra Nucleic Acid Isolation Kit (ThermoFisher Scientific, Waltham, MA, USA) according to the manufacturer’s instructions. The prepared DNA was used for 16S library construction using NEXTflex 16S V4 Amplicon-Seq (Bioo Scientific, Austin, TX, USA). PCR I amplification was run for 8 cycles and PCR II amplification for 22 cycles. The final library products were diluted, pooled, and sequenced using the MiSeq system (Illumina) with a paired-end 500-cycle kit. The vaginal microbial dilution series and blank extraction controls were subjected to the same procedure.

3) Bioinformatic Analysis

QIIME 2 was used to analyze the 16S rRNA sequence data [37]. Demultiplexed and primer-trimmed data were quality-filtered and denoised using DADA2 (Divisive Amplicon Denoising Algorithm 2), which uses a parametric model to infer exact biological sequences, known as ASVs, from quality-filtered reads [38, 39]. In DADA2, independently denoised forward and reverse reads were merged at the end of the workflow, and the chimeric ASVs were removed. For the taxonomic classification of ASVs, a multinomial naive Bayes machine-learning classifier in the q2-feature-classifier was used against the RefSeq database [40]. Finally, ASVs that were not assigned to bacteria at the domain level were removed.

4) Benchmarking of decontamination methods

To evaluate the performance of Green Cleaner, we compared the outcomes of the decontamination process using Green Cleaner with the previously published decontamination method SCRuB. SCRuB [41] is a probabilistic in silico decontamination method that incorporates shared information across multiple samples and controls to precisely identify and remove contamination and is reported to outperform alternative decontamination methods under in silico simulations of diverse environments and data types. We tested the decontamination performance of Green Cleaner against SCRuB using 16S rRNA sequencing data produced from the vaginal microbial dilution series, as described above.

To assess the overall performance of the decontamination methods, we calculated the alpha-diversity metrics and Bray–Curtis dissimilarity between the decontaminated data and 16S rRNA sequencing data from the undiluted vaginal samples. We defined the ground-truth classification of each ASV, as genuine or contaminant, based on 16S rRNA sequencing data generated using the undiluted vaginal samples. In other words, a contaminant ASV is an ASV that is not expected to be a part of undiluted vaginal samples, whereas a ground-truth ASV is an ASV that occurs in undiluted vaginal samples. To calculate the overall accuracy of each method, we classified the ASVs as correctly or incorrectly identified as the ground truth or contaminants in the decontamination results. ASVs classified correctly as ground truth or contaminants were referred to as true positives or true negatives, respectively, and ASVs classified incorrectly as ground truth or contaminants were referred to as false positives or false negatives, respectively. Accuracy was calculated using the following formula: (ASV read number of correct predictions) / (total ASV read number of predictions).
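The read-weighted accuracy and the Bray–Curtis dissimilarity can be sketched in Python as below. The accuracy follows the formula in the text, with an ASV counted as a correct prediction when it is kept and genuine or removed and a contaminant. The function names, the set-based inputs, and the toy numbers are assumptions for illustration only.

```python
def read_weighted_accuracy(raw_counts, kept_asvs, ground_truth_asvs):
    """Reads of correctly classified ASVs divided by all reads in the diluted sample.

    raw_counts: {asv_id: reads} before decontamination.
    kept_asvs: ASV ids the method retained (predicted genuine).
    ground_truth_asvs: ASV ids present in the undiluted sample.
    """
    correct = 0
    total = 0
    for asv, reads in raw_counts.items():
        predicted_genuine = asv in kept_asvs
        actually_genuine = asv in ground_truth_asvs
        if predicted_genuine == actually_genuine:  # true positive or true negative
            correct += reads
        total += reads
    return correct / total


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance profiles {asv_id: abundance}."""
    asvs = set(p) | set(q)
    numerator = sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in asvs)
    denominator = sum(p.get(a, 0.0) + q.get(a, 0.0) for a in asvs)
    return numerator / denominator


# Hypothetical diluted sample: v1 and v2 are genuine, c1 and c2 are contaminants.
raw = {"v1": 700, "v2": 200, "c1": 80, "c2": 20}
accuracy = read_weighted_accuracy(raw, kept_asvs={"v1", "c1"},
                                  ground_truth_asvs={"v1", "v2"})
```

In this toy case only v1 (kept, genuine) and c2 (removed, contaminant) are classified correctly, so 720 of the 1,000 reads count toward the numerator.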

5) Statistical Analysis

All statistical analyses were performed using R version 4.0.5. Category 1 ASVs were decontaminated using the prcomp and biplot functions, and Euclidean distance similarity was calculated using the proxy package. When calculating the adjusted modified Z-score, if the median absolute deviation (MAD) value was nonzero, it was multiplied by a weighting factor of 1.4826 and used as the denominator. However, if the MAD value was zero, the average absolute deviation (AAD) was multiplied by 1.2533 and used as the denominator. A statistical hypothesis test comparing the two groups was performed using the Wilcoxon signed-rank sum test, which is a nonparametric test. The smooth curve of the numerical changes according to the proportion of total contaminants in the figures was analyzed using LOESS.

16S rRNA sequencing of vaginal microbial dilution series and blank extraction control.

The total microbial community of the ten undiluted vaginal microbiome samples consisted of 107 ASV features. Among them, 49 ASV features were found at a relative abundance of 1% or above in at least one undiluted vaginal sample, and they mapped to 34 distinct taxa (Table 2).

Table 2

Assigned taxa corresponding to the 49 ASV features found at a relative abundance of 1% or higher in at least one undiluted vaginal sample.
Assigned taxa
k_Bacteria;p_Actinobacteria;c_Actinomycetia;o_Actinomycetales;f_Actinomycetaceae;g_Fannyhessea;s_vaginae
k_Bacteria;p_Actinobacteria;c_Actinomycetia;o_Bifidobacteriales;f_Bifidobacteriaceae;g_Alloscardovia;s_omnicolens
k_Bacteria;p_Actinobacteria;c_Actinomycetia;o_Bifidobacteriales;f_Bifidobacteriaceae;g_Gardnerella;s_
k_Bacteria;p_Actinobacteria;c_Actinomycetia;o_Bifidobacteriales;f_Bifidobacteriaceae;g_Gardnerella;s_vaginalis
k_Bacteria;p_Actinobacteria;c_Coriobacteriia;o_Coriobacteriales;f_Coriobacteriaceae;g_Parvibacter;s_caecicola
k_Bacteria;p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_Porphyromonadaceae;g_Porphyromonas;s_asaccharolytica
k_Bacteria;p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_Prevotellaceae;g_Prevotella;s_amnii
k_Bacteria;p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_Prevotellaceae;g_Prevotella;s_bivia
k_Bacteria;p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_Prevotellaceae;g_Prevotella;s_buccalis
k_Bacteria;p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_Prevotellaceae;g_Prevotella;s_colorans
k_Bacteria;p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_Prevotellaceae;g_Prevotella;s_disiens
k_Bacteria;p_Bacteroidetes;c_Bacteroidia;o_Bacteroidales;f_Prevotellaceae;g_Prevotella;s_timonensis
k_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Aerococcaceae;g_Aerococcus;s_
k_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Lactobacillus;s_
k_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Lactobacillus;s_gasseri
k_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Lactobacillus;s_iners
k_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Lactobacillus;s_jensenii
k_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Streptococcaceae;g_Streptococcus;s_agalactiae
k_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Streptococcaceae;g_Streptococcus;s_anginosus
k_Bacteria;p_Firmicutes;c_Clostridia;o_Eubacteriales;f_;g_;s_
k_Bacteria;p_Firmicutes;c_Clostridia;o_Eubacteriales;f_Oscillospiraceae;g_;s_
k_Bacteria;p_Firmicutes;c_Clostridia;o_Eubacteriales;f_Peptostreptococcaceae;g_Peptostreptococcus;s_anaerobius
k_Bacteria;p_Firmicutes;c_Clostridia;o_Eubacteriales;f_Peptostreptococcaceae;g_Peptostreptococcus;s_stomatis
k_Bacteria;p_Firmicutes;c_Negativicutes;o_Veillonellales;f_Veillonellaceae;g_Dialister;s_
k_Bacteria;p_Firmicutes;c_Negativicutes;o_Veillonellales;f_Veillonellaceae;g_Dialister;s_micraerophilus
k_Bacteria;p_Firmicutes;c_Negativicutes;o_Veillonellales;f_Veillonellaceae;g_Megasphaera;s_
k_Bacteria;p_Firmicutes;c_Tissierellia;o_Tissierellales;f_Peptoniphilaceae;g_Anaerococcus;s_prevotii
k_Bacteria;p_Firmicutes;c_Tissierellia;o_Tissierellales;f_Peptoniphilaceae;g_Finegoldia;s_magna
k_Bacteria;p_Firmicutes;c_Tissierellia;o_Tissierellales;f_Peptoniphilaceae;g_Peptoniphilus;s_
k_Bacteria;p_Firmicutes;c_Tissierellia;o_Tissierellales;f_Peptoniphilaceae;g_Peptoniphilus;s_lacrimalis
k_Bacteria;p_Fusobacteria;c_Fusobacteriia;o_Fusobacteriales;f_Leptotrichiaceae;g_Sneathia;s_sanguinegens
k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Pasteurellales;f_Pasteurellaceae;g_Haemophilus;s_
k_Bacteria;p_Tenericutes;c_Mollicutes;o_Mycoplasmatales;f_Mycoplasmataceae;g_Ureaplasma;s_parvum
k_Bacteria;p_Tenericutes;c_Tenericutes;o_Mycoplasmoidales;f_Metamycoplasmataceae;g_Metamycoplasma;s_hominis

Among the 6 blank extraction controls, 570 ASV features were detected. The most abundant genus was Pseudomonas, followed by Janthinobacterium, Stenotrophomonas, Cutibacterium, and Undibacterium. The assigned taxa of the 13 ASV features that were found in the blank extraction controls and had an average prevalence > 1% are listed in Table 3. The average proportions of phyla for ASV features detected in the blank extraction controls were 70.03, 13.96, 9.47, and 5.28% for Proteobacteria, Actinobacteria, Bacteroidetes, and Firmicutes, respectively.

Table 3

Assigned taxa of 13 ASV features with an average prevalence of > 1% found in the blank extraction controls.
ASV	Assigned taxa	Average Prevalence (%)
5648dccee530d68ceb3e4d7d22cf8756	k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Pseudomonadales;f_Pseudomonadaceae;g_Pseudomonas;s_	9.29
efbe1f58b1e2984ddc53a64f047d94ff	k_Bacteria;p_Proteobacteria;c_Betaproteobacteria;o_Burkholderiales;f_Oxalobacteraceae;g_Janthinobacterium;s_	8.12
dcba105f35d8ebc9e22269c7491ad3a7	k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Xanthomonadales;f_Xanthomonadaceae;g_Stenotrophomonas;s_maltophilia	7.33
da5bc53279a680c25d503cb1bdc0e57a	k_Bacteria;p_Actinobacteria;c_Actinomycetia;o_Propionibacteriales;f_Propionibacteriaceae;g_Cutibacterium;s_acnes	4.91
abd34643df4e48940286e05ff8518132	k_Bacteria;p_Proteobacteria;c_Betaproteobacteria;o_Burkholderiales;f_Oxalobacteraceae;g_Undibacterium;s_oligocarboniphilum	3.76
d15bc449222795a9ff230013aa633686	k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Moraxellales;f_Moraxellaceae;g_Acinetobacter;s_	3.16
f4801b7a68515d9005fa572ee6afdf41	k_Bacteria;p_Proteobacteria;c_Betaproteobacteria;o_Burkholderiales;f_Burkholderiaceae;g_Ralstonia;s_syzygii	3.09
cff91f92ebadff0ecf455925e3e91b54	k_Bacteria;p_Proteobacteria;c_Betaproteobacteria;o_Burkholderiales;f_Comamonadaceae;g_;s_	2.69
637b9b3f4d1cbb1a10c07817619cdf69	k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Pseudomonadales;f_Pseudomonadaceae;g_Pseudomonas;s_	1.47
227253fef87fe5013141954fceea878d	k_Bacteria;p_Actinobacteria;c_Actinomycetia;o_Micrococcales;f_Micrococcaceae;g_;s_	1.30
945f562bda86790338922e12f9854407	k_Bacteria;p_Bacteroidetes;c_Flavobacteriia;o_Flavobacteriales;f_Flavobacteriaceae;g_Flavobacterium;s_	1.24
810d24ee924214144b5ce85d1626f9cd	k_Bacteria;p_Bacteroidetes;c_Sphingobacteriia;o_Sphingobacteriales;f_Sphingobacteriaceae;g_Sphingobacterium;s_faecium	1.04
fb67b286b0f781b0de13d50179318995	k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Xanthomonadales;f_Xanthomonadaceae;g_Stenotrophomonas;s_maltophilia	1.03

In the dilution series of vaginal microbial samples, more diluted samples contained higher proportions of contaminants, defined as sequences that did not match the expected undiluted vaginal microbial profile, although the increase was not as linear as expected (Figs. 2 and 3; Additional file 2). The proportion of total contaminants ranged from 0.88% to 99.44%.
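The contaminant proportion defined above can be computed per sample as the share of reads assigned to ASVs outside the expected undiluted profile. The following is a minimal sketch under that reading; the function name, ASV labels, and read counts are hypothetical and are not taken from the study data.

```python
def contaminant_proportion(counts, expected_asvs):
    """Return the percentage of reads that come from ASVs outside the expected set.

    counts: dict mapping ASV identifier -> read count for one sample.
    expected_asvs: set of ASV identifiers seen in the undiluted reference.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    contaminant_reads = sum(n for asv, n in counts.items() if asv not in expected_asvs)
    return 100.0 * contaminant_reads / total


# Hypothetical data: a lightly diluted and a heavily diluted sample.
expected = {"asv_lactobacillus", "asv_gardnerella"}
d1 = {"asv_lactobacillus": 9000, "asv_gardnerella": 900, "asv_pseudomonas": 100}
d7 = {"asv_lactobacillus": 40, "asv_pseudomonas": 700, "asv_ralstonia": 260}

d1_pct = contaminant_proportion(d1, expected)  # 100 of 10000 reads
d7_pct = contaminant_proportion(d7, expected)  # 960 of 1000 reads
print(f"D1: {d1_pct:.2f}%  D7: {d7_pct:.2f}%")
```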

Comparison of decontamination performance between SCRuB and Green Cleaner

We tested the performance of the SCRuB and Green Cleaner algorithms in identifying and removing contaminant ASVs from the vaginal microbial dilution series. At nearly every dilution stage, the relative abundance of ASVs removed as contaminants was higher with Green Cleaner than with SCRuB, and the difference was most pronounced in the more diluted stages, which had higher contamination proportions (Table 4; the only exception was D7 in Batch 9). Alpha diversity, estimated as species richness with the Chao1 index, showed that Green Cleaner removed more types of ASVs than SCRuB across all dilution stages, indicating that Green Cleaner generally recognizes more ASV types as contaminants than SCRuB does (Fig. 4).
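As a reminder of what the Chao1 richness estimate measures, the sketch below implements the commonly used bias-corrected form, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of ASVs observed exactly once and exactly twice. Whether the study used this exact variant is an assumption, and the count vectors are hypothetical.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate for one sample.

    counts: iterable of per-ASV read counts (zeros are ignored).
    """
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)                          # observed richness
    f1 = sum(1 for c in observed if c == 1)        # singletons
    f2 = sum(1 for c in observed if c == 2)        # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical sample before decontamination: many rare ASVs inflate richness.
before = [500, 300, 120, 2, 2, 1, 1, 1, 1]
# Same sample after the rare contaminant ASVs have been removed.
after = [500, 300, 120, 2]

richness_before = chao1(before)
richness_after = chao1(after)
print(richness_before, richness_after)
```

Removing many rare ASVs lowers both the observed richness and the singleton-driven correction term, which is why a method that removes more ASV types yields a lower Chao1 value.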

Table 4

The proportion of removed ASV reads in the dilution series samples in each batch determined using SCRuB and Green Cleaner.
	Removed ASV (%)
	Batch 1		Batch 2		Batch 3		Batch 4		Batch 5
	SCRuB	Green Cleaner	SCRuB	Green Cleaner	SCRuB	Green Cleaner	SCRuB	Green Cleaner	SCRuB	Green Cleaner
D1	0.89	1.65	0.50	5.54	0.62	1.54	3.04	7.69	0.19	2.76
D2	2.48	3.19	9.02	16.07	2.35	4.18	4.21	8.25	0.66	3.27
D3	3.51	5.35	16.62	24.00	3.72	6.29	10.36	13.89	1.37	4.12
D4	9.40	12.45	25.46	36.36	6.99	9.93	18.75	29.84	4.00	8.48
D5	20.79	27.99	48.35	66.21	16.52	23.29	40.32	52.50	8.14	13.30
D6	33.78	50.93	53.24	74.96	19.49	25.84	46.35	85.13	14.41	20.12
D7	53.71	71.77	56.09	83.74	37.65	50.05	11.65	44.60	19.29	31.33
	Batch 6		Batch 7		Batch 8		Batch 9		Batch 10
	SCRuB	Green Cleaner	SCRuB	Green Cleaner	SCRuB	Green Cleaner	SCRuB	Green Cleaner	SCRuB	Green Cleaner
D1	1.16	2.07	0.80	1.36	13.07	31.37	2.16	2.80	0.19	4.76
D2	2.25	4.85	1.39	2.22	44.28	59.81	3.93	4.59	0.37	4.58
D3	5.18	11.20	1.86	3.36	55.02	74.87	7.37	10.05	0.98	5.20
D4	10.95	17.05	11.05	15.27	57.97	83.22	17.70	25.22	2.78	9.61
D5	23.93	31.15	40.00	48.80	52.48	83.37	25.77	43.00	5.36	13.02
D6	35.10	47.95	51.27	69.78	65.01	88.60	33.82	46.93	14.91	25.74
D7	58.86	77.30	56.05	88.12	56.78	86.58	67.09	61.26	23.82	41.74

To evaluate how well each decontamination method recovered the expected vaginal microbial community profiles from the contaminated dilution series samples, we compared SCRuB and Green Cleaner in each experimental batch in terms of accuracy, F1-score, and similarity of the output to the ground truth, measured as Bray-Curtis dissimilarity. The ASVs that each method correctly or incorrectly classified as members of the undiluted vaginal microbial community or as contaminants in the 10 dilution datasets are presented in Fig. 5 and Additional File 3. Green Cleaner had higher accuracy and F1-score than SCRuB in most diluted samples; however, its F1-score was slightly lower than that of SCRuB in a few highly diluted samples (Fig. 6). Likewise, the beta-dissimilarity values for Green Cleaner were lower than those for SCRuB in most samples, indicating that the Green Cleaner output was more similar to the undiluted samples than the SCRuB output was.
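For readers less familiar with the beta-dissimilarity used here, the sketch below computes the Bray-Curtis dissimilarity between a decontaminated profile and the ground-truth profile. The taxon labels and relative abundances are hypothetical, and the two "tool" profiles are illustrative only; they are not outputs of SCRuB or Green Cleaner.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Each profile is a dict mapping taxon -> abundance. A taxon missing from
    one profile is treated as abundance 0. Returns a value in [0, 1].
    """
    taxa = set(profile_a) | set(profile_b)
    numerator = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    denominator = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return numerator / denominator


# Hypothetical ground truth (undiluted sample) and two decontaminated outputs.
truth = {"taxon_A": 0.70, "taxon_B": 0.30}
output_1 = {"taxon_A": 0.60, "taxon_B": 0.20, "contaminant_X": 0.20}
output_2 = {"taxon_A": 0.68, "taxon_B": 0.28, "contaminant_X": 0.04}

bc_1 = bray_curtis(truth, output_1)
bc_2 = bray_curtis(truth, output_2)
print(round(bc_1, 3), round(bc_2, 3))
```

The output that retains less contaminant signal is closer to the ground truth and therefore has the lower dissimilarity.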

We further examined how accuracy, F1-score, and beta-dissimilarity changed with the contaminant proportion across all dilution samples. Both accuracy and F1-score gradually decreased, and beta-dissimilarity gradually increased, as the contaminant proportion increased (Fig. 7). In particular, the F1-score and beta-dissimilarity tended to change sharply in the highly contaminated samples. Because highly contaminated samples produce imbalanced data, the F1-score (the harmonic mean of precision and recall) and beta-dissimilarity (which quantifies differences in overall taxonomic composition) are likely to reflect performance more faithfully than accuracy does.
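The point about imbalanced data can be illustrated with a small worked example. The sketch below treats retained true ASVs as positives, a convention inferred from the text rather than stated in this section, and uses hypothetical counts for a sample in which 95 of 100 ASVs are contaminants.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of ASVs classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when no positives are involved."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical sample: 5 true ASVs and 95 contaminant ASVs.
# Strategy 1 removes every ASV, including the 5 true ones.
acc_remove_all = accuracy(tp=0, tn=95, fp=0, fn=5)
f1_remove_all = f1_score(tp=0, fp=0, fn=5)

# Strategy 2 keeps 4 of the 5 true ASVs but also keeps 4 contaminants.
acc_selective = accuracy(tp=4, tn=91, fp=4, fn=1)
f1_selective = f1_score(tp=4, fp=4, fn=1)

print(acc_remove_all, acc_selective)                 # identical accuracy
print(round(f1_remove_all, 3), round(f1_selective, 3))  # very different F1
```

Both strategies reach the same accuracy because the contaminant class dominates, whereas the F1-score separates them.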

Furthermore, to evaluate whether the difference in performance between SCRuB and Green Cleaner depended on the contaminant proportion, we divided all diluted samples into two groups using a 90% cut-off for the contaminant proportion and compared the F1-score and beta-dissimilarity of the two methods within each group. Green Cleaner showed significantly better F1-score and beta-dissimilarity in the group with a contaminant proportion below 90%; in contrast, neither metric differed significantly in the group with a contaminant proportion above 90% (Fig. 8).
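The stratified comparison can be sketched as follows. This section does not name the statistical test that was used, so the exact paired sign-flip permutation test below is a stand-in chosen only because it needs no external library; the field names and all values are hypothetical.

```python
from itertools import product
from statistics import mean


def split_by_contamination(samples, cutoff=90.0):
    """Split samples into (below cutoff, at or above cutoff) by contaminant percentage."""
    low = [s for s in samples if s["contaminant_pct"] < cutoff]
    high = [s for s in samples if s["contaminant_pct"] >= cutoff]
    return low, high


def sign_flip_p_value(diffs):
    """Exact two-sided paired permutation p-value for the mean difference."""
    observed = abs(mean(diffs))
    patterns = list(product((1, -1), repeat=len(diffs)))
    extreme = sum(
        1 for signs in patterns
        if abs(mean([s * d for s, d in zip(signs, diffs)])) >= observed - 1e-12
    )
    return extreme / len(patterns)


# Hypothetical per-sample F1-scores for two methods (a = method A, b = method B).
raw = [(5, 0.90, 0.95), (20, 0.85, 0.92), (35, 0.80, 0.88), (50, 0.70, 0.80),
       (70, 0.60, 0.72), (85, 0.50, 0.56), (92, 0.30, 0.32), (95, 0.25, 0.24),
       (99, 0.10, 0.11)]
samples = [{"contaminant_pct": p, "f1_a": a, "f1_b": b} for p, a, b in raw]

low, high = split_by_contamination(samples)
p_low = sign_flip_p_value([s["f1_b"] - s["f1_a"] for s in low])
p_high = sign_flip_p_value([s["f1_b"] - s["f1_a"] for s in high])
print(len(low), len(high), p_low, p_high)
```

With these made-up values the paired difference is consistent below the cut-off and inconsistent above it, which mirrors the pattern described in the text without reproducing the actual data.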

Discussions

Several software tools have been developed to identify and control bacterial DNA contamination in 16S rRNA sequencing data. Decontam [42] applies a set of rules that recognize and remove contaminant taxa, namely those that are more prevalent in controls than in the samples of interest and/or more frequent in samples with lower DNA concentrations. However, it is limited in handling a taxon that is both a contaminant and truly present in a biological sample. To address this issue, MicroDecon [43] partially removes possible contaminants by calculating the ratio of taxa found in the controls to anchor contaminants. However, MicroDecon processes only one sample at a time, disregarding the information shared among samples. Over the last decade, several computational techniques have been proposed for tracking and identifying the origins of potentially complex microbial communities, a process known as "microbial source tracking." These methods have shown great promise, particularly for quantifying contaminants [41, 44, 45]. In particular, SCRuB [41] has recently been reported to perform well: it not only precisely identifies and removes latent contamination in a sample of interest but also enables the partial removal of taxa that are both contaminants and present in the ecosystem of interest. Notably, it accounts for well-to-well leakage, in which material from biological samples leaks into controls during experimental procedures, especially DNA extraction. In Decontam and MicroDecon, truly present taxa affected by well-to-well leakage are misclassified as contaminants and removed.

Despite its high performance, SCRuB works best when the controls represent the multiple distinct contamination sources that potentially affect the samples of interest. In practice, however, it is difficult to obtain multiple controls that capture as many distinct contamination sources as possible. Additionally, the contaminant taxonomic profile changes over time with the researcher, external environment, and season; therefore, blank extraction controls should be included and sequenced for every extraction batch [46]. In contrast, Green Cleaner is simple to apply because it uses a single blank extraction control per processed batch to eliminate contamination from the samples of that batch. Furthermore, Green Cleaner consists of conceptual and intuitive rules for distinguishing contaminants from true features and can be applied to any data, regardless of experimental method, with some modification and adjustment.

In addition, Green Cleaner outperformed SCRuB, as shown by the F1-score and beta-dissimilarity results in the evaluation using the vaginal microbiome dilution series. The better performance of Green Cleaner was due to its higher specificity and PPV (Additional File 4). In other words, Green Cleaner correctly removes more contaminant ASV features than SCRuB, which is consistent with the findings that a larger proportion of ASV reads is removed by Green Cleaner than by SCRuB and that the alpha diversity after decontamination is lower with Green Cleaner than with SCRuB. To characterize the contaminant ASV features that were removed by Green Cleaner but retained by SCRuB, we assigned them to ASV categories. Most were Category 3 contaminants, that is, contaminants not present in the blank extraction control (Additional File 5). Even though SCRuB can use numerous controls, contaminants that exist at low abundance may not be represented in those controls; therefore, low-abundance contaminants in the sample of interest may not be adequately removed by SCRuB. Although an individual low-proportion contaminant has little effect on the microbial composition, the combined effect can substantially alter the overall bacterial composition as the number of such contaminants increases. We found that a substantial portion of the contamination was absent from the control. Therefore, removing contamination using only the experimental control has limitations, and applying a pre-defined database, as Green Cleaner does, was effective for removing contaminants.
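The link between removing more contaminants and higher specificity and PPV can be made concrete with a short sketch. As before, retained true ASVs are taken as positives, which is an inference from the text, and the counts describe two hypothetical tools rather than the measured results.

```python
def specificity(tn, fp):
    """Share of contaminant ASVs that were correctly removed."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def ppv(tp, fp):
    """Share of retained ASVs that are true features (positive predictive value)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Share of true ASVs that were retained."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical sample with 20 true ASVs and 80 contaminant ASVs.
# Conservative tool: keeps all true ASVs, removes only 40 contaminants.
conservative = {"tp": 20, "fn": 0, "tn": 40, "fp": 40}
# Aggressive tool: removes 72 contaminants but also loses 1 true ASV.
aggressive = {"tp": 19, "fn": 1, "tn": 72, "fp": 8}

for name, c in (("conservative", conservative), ("aggressive", aggressive)):
    print(
        name,
        round(specificity(c["tn"], c["fp"]), 3),
        round(ppv(c["tp"], c["fp"]), 3),
        round(sensitivity(c["tp"], c["fn"]), 3),
    )
```

In this illustration the aggressive tool gains specificity and PPV at a small cost in sensitivity, the same trade-off discussed in the following paragraphs.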

Although Green Cleaner showed better specificity than SCRuB overall, SCRuB showed higher specificity in some cases. In most datasets, the proportion of contaminant taxa increased with the dilution level. However, in dataset 4, the proportion of the 10 most abundant contaminant taxa decreased markedly from the D6 sample to the D7 sample, because the D7 sample contained a high proportion of sample-specific contaminants that were absent from the blank extraction control. In addition, in dataset 9, one of the 10 most abundant contaminant taxa was specifically amplified in the D6 and D7 samples, distorting the relative proportions of these taxa compared with the other samples of the dataset (Additional file 6). Because Green Cleaner relies on a database of non-biological contaminants or an in-house blacklist to remove ASV features that are not detected in the blank extraction control, it cannot remove sample-specific contaminants that are covered by neither resource; therefore, a substantial proportion of the abundant sample-specific contaminants in the D7 sample of dataset 4 could not be removed. Also, if contaminant ASVs identified in the blank extraction control are over-amplified in some samples, the Euclidean distance similarity or the adjusted modified Z-score in Green Cleaner can flag these over-amplified contaminants as real features. As a result, over-amplified contaminants in the D6 and D7 samples of dataset 9 were not removed, producing false positives. When such unusual contamination patterns occur, particularly in samples with very high contamination rates, the decontamination performance of Green Cleaner appears to decline.

Green Cleaner appeared to have slightly lower sensitivity than SCRuB, although the true features it eliminated accounted for a relatively small proportion of the samples with very high contamination levels. In dataset 2, a truly present feature classified as S. agalactiae was present at a low proportion in the D7 sample and at high proportions in the less diluted samples as well as in their blank extraction control. Because the proportion of S. agalactiae in the D7 sample was low relative to the other samples, its adjusted modified Z-score in that sample fell below the cutoff, and Green Cleaner removed it as a contaminant. Thus, if a true biological feature is present at low prevalence in one sample and at high prevalence in the other samples of the same batch, and has also reached the blank extraction control through well-to-well leakage, Green Cleaner may eliminate it from the low-prevalence sample.
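To show why a low proportion in one sample yields a low score, the sketch below uses the standard modified Z-score based on the median absolute deviation (MAD), with the usual fallback to the average absolute deviation (AAD) when the MAD is zero. This is the textbook formulation, not the "adjusted" variant used by Green Cleaner, whose exact definition and cutoff are not given in this section; the proportions are hypothetical.

```python
from statistics import mean, median


def modified_z_scores(values):
    """Standard modified Z-scores for a list of values.

    Uses 0.6745 * (x - median) / MAD, and falls back to
    (x - median) / (1.253314 * AAD) when the MAD is zero.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad > 0:
        return [0.6745 * (v - med) / mad for v in values]
    aad = mean(abs_dev)
    if aad > 0:
        return [(v - med) / (1.253314 * aad) for v in values]
    return [0.0 for _ in values]  # all values identical


# Hypothetical proportions of one feature across samples D1..D7 of a batch.
proportions = [0.30, 0.28, 0.27, 0.25, 0.24, 0.22, 0.02]
scores = modified_z_scores(proportions)
for label, score in zip(["D1", "D2", "D3", "D4", "D5", "D6", "D7"], scores):
    print(label, round(score, 2))
```

Here the D7 value sits far below the batch median, so its score is strongly negative while the other samples stay close to zero. Any rule that removes features scoring below a cutoff would therefore drop the feature from D7 only.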

Although Green Cleaner performed significantly better than SCRuB overall, the size of the difference depended on the contamination rate, and samples with an extremely high contamination rate (> 90%) showed no significant difference in performance between the two methods. Therefore, the advantage of Green Cleaner over SCRuB may be negligible in samples with very high contamination rates.

We offer several additional considerations regarding the use of Green Cleaner. First, for taxa that are both contaminants and truly present in the ecosystem of interest, Green Cleaner does not partially remove the proportion originating from contamination. Second, the in-house blacklist applied in our algorithm was created using data generated in our laboratory. To use Green Cleaner, this blacklist needs to be customized with each laboratory's own data, in the same way that the blacklist in this study was developed. Third, this algorithm was developed to investigate catheterized urine microbiome samples, and further investigation is required to determine whether it can be applied to other low-biomass microbiome samples with different microbiome compositions. Fourth, the cutoff values of the Euclidean distance and adjusted modified Z-score were determined pragmatically from existing data. Because performance may vary with the number of samples and the microbial distribution of the samples in a batch, the cutoff values should be optimized for the specific analysis environment. Fifth, Green Cleaner may not be effective for samples with extremely high contamination proportions; therefore, care should be taken when interpreting such results.

Conclusions

We showed, using a dilution series of human vaginal microbial communities, that Green Cleaner outperformed SCRuB, which was recently reported to perform well in the decontamination of 16S rRNA sequencing data. Green Cleaner is expected to advance urine microbiome research by providing accurate decontaminated results, particularly for low-biomass catheterized urine samples. Further studies of the catheterized urine microbiome using Green Cleaner are required.

Abbreviations

PPV

positive predictive value

ASV

amplicon sequence variant

PCR

polymerase chain reaction

MAD

median absolute deviation

AAD

average absolute deviation

TP

true positive

TN

true negative

FP

false positive

FN

false negative

Declarations

Ethics approval and consent to participate

All procedures involving leftover human vaginal samples in this study were approved by the Ethics Committee of GC Labs (GCL-2023-1075-02).

Consent for publication

All authors have provided their consent for publication.

Availability of data and material

The datasets generated and/or analysed during the current study are available in the GitHub repository, https://github.com/BITsmyoon/Green_Cleaner. All data used in the analysis, including metadata and ASV counts, can be found at https://github.com/BITsmyoon/Green_Cleaner/data and the code used in the analysis can be found at https://github.com/BITsmyoon/Green_Cleaner/script.

Competing interests

The authors declare that they have no competing interests.

Funding

No funding was received for this study.

Authors' contributions

SM and JS conceived of the presented idea and designed the model. SM designed the computational framework and analysed the data. JS wrote the manuscript. JS is in charge of overall direction. All authors have read and approved the final version of the manuscript.

Acknowledgements

I would like to thank YEJI KANG for conducting all of the experiments.

References

1. Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, Gordon JI. The human microbiome project. Nature. 2007;449(7164):804–10.
2. Brubaker L, Wolfe AJ. The new world of the urinary microbiota in women. Am J Obstet Gynecol. 2015;213(5):644–9.
3. Theis KR, Romero R, Winters AD, Greenberg JM, Gomez-Lopez N, Alhousseini A, et al. Does the human placenta delivered at term have a microbiota? Results of cultivation, quantitative real-time PCR, 16S rRNA gene sequencing, and metagenomics. Am J Obstet Gynecol. 2019;220(3):267.e1-267.e39.
4. Aho VTE, Pereira PAB, Haahtela T, Pawankar R, Auvinen P, Koskinen K. The microbiome of the human lower airways: a next-generation sequencing perspective. World Allergy Organ J. 2015;8(1):23.
5. Pohl HG, Groah SL, Pérez-Losada M, Ljungberg I, Sprague BM, Chandal N, et al. The urine microbiome of healthy men and women differs by urine collection method. Int Neurourol J. 2020;24(1):41–51.
6. Neugent ML, Hulyalkar NV, Nguyen VH, Zimmern PE, De Nisco NJ. Advances in understanding the human urinary microbiome and its potential role in urinary tract infection. mBio. 2020;11(2).
7. Whiteside SA, Razvi H, Dave S, Reid G, Burton JP. The microbiome of the urinary tract—a role beyond infection. Nat Rev Urol. 2015;12(2):81–2.
8. Magistro G, Stief CG. The urinary tract microbiome: the answer to all our open questions? Eur Urol Focus. 2019;5(1):36–8.
9. Bschleipfer T, Karl I. Bladder microbiome in the context of urological disorders—is there a biomarker potential for interstitial cystitis? Diagnostics (Basel). 2022;12(2):281–90.
10. Lee H-Y, Wang JW, Juan YS, Li CC, Liu CJ, Cho SY, et al. The impact of urine microbiota in patients with lower urinary tract symptoms. Ann Clin Microbiol Antimicrob. 2021;20(1).
11. Brubaker L, Wolfe AJ. The female urinary microbiota, urinary health and common urinary disorders. Ann Transl Med. 2017;5(2):34.
12. Li K, Chen C, Zeng J, Wen Y, Chen W, Zhao J, et al. Interplay between bladder microbiota and overactive bladder symptom severity: a cross-sectional study. BMC Urol. 2022;22(1).
13. Hiergeist A, Gessner A. Clinical implications of the microbiome in urinary tract diseases. Curr Opin Urol. 2017;27(2):93–4.
14. Patel SR, Ingram C, Scovell JM, Link RE, Mayer WA. The microbiome and urolithiasis: current advancements and future challenges. Curr Urol Rep. 2022;23(3):47–56.
15. Jayalath S, Magana-Arachchi D. Dysbiosis of the human urinary microbiome and its association to diseases affecting the urinary system. Indian J Microbiol. 2022;62(2):153–66.
16. Bae S, Chung H. The urobiome and its role in overactive bladder. 2022:190–200.
17. Shim JH, Gook JH, Chang IH, Sohn JM, Seong SW, Chi BH. Clinical implications of urinary microbiome in bladder cancer. Korean J Urol Oncol. 2021;19(2):71–8.
18. Choi HW, Lee KW, Kim YH. Microbiome in urological diseases: axis crosstalk and bladder disorders. Investig Clin Urol. 2023;64(2):126–39.
19. Glassing A, Dowd SE, Galandiuk S, Davis B, Chiodini RJ. Inherent bacterial DNA contamination of extraction and sequencing reagents may affect interpretation of microbiota in low bacterial biomass samples. Gut Pathog. 2016;8:1–12.
20. Grahn N, Olofsson M, Ellnebo-Svedlund K, Monstein HJ, Jonasson J. Identification of mixed bacterial DNA contamination in broad-range PCR amplification of 16S rDNA V1 and V3 variable regions by pyrosequencing of cloned amplicons. FEMS Microbiol Lett. 2003;219(1):87–91.
21. Laurence M, Hatzis C, Brash DE. Common contaminants in next-generation sequencing that hinder discovery of low-abundance microbes. PLoS ONE. 2014;9(5).
22. Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 2014;12(1).
23. Nagler M, Insam H, Pietramellara G, Ascher-Jenull J. Extracellular DNA in natural environments: features, relevance and applications. Appl Microbiol Biotechnol. 2018;102(15):6343–56.
24. Motley ST, Picuri JM, Crowder CD, Minich JJ, Hofstadler SA, Eshoo MW. Improved multiple displacement amplification (iMDA) and ultraclean reagents. BMC Genomics. 2014;15(1).
25. Kulakov LA, McAlister MB, Ogden KL, Larkin MJ, O'Hanlon JF. Analysis of bacteria contaminating ultrapure water in industrial systems. Appl Environ Microbiol. 2002;68(4):1548.
26. Shen H, Rogelj S, Kieft TL. Sensitive, real-time PCR detects low-levels of contamination by Legionella pneumophila in commercial reagents. Mol Cell Probes. 2006;20(3):147–53.
27. Mohammadi T, Reesink HW, Vandenbroucke-Grauls CM, Savelkoul PH. Removal of contaminating DNA from commercial nucleic acid extraction kit reagents. J Microbiol Methods. 2005;61(2):285.
28. Lo SC, Li BJ, Zou N, Lo SC. Presence of bacterial phage-like DNA sequences in commercial Taq DNA polymerase reagents. J Clin Microbiol. 2004;42(5):2264.
29. Rand KH, Houck H. Taq polymerase contains bacterial DNA of unknown origin. Mol Cell Probes. 1990;4(6):445.
30. Corless CE, Guiver M, Borrow R, Edwards-Jones V, Kaczmarski EB, Fox AJ. Contamination and sensitivity issues with a real-time universal 16S rRNA PCR. J Clin Microbiol. 2000;38(5):1747.
31. Dyrhovden R, Rippin M, Øvrebø KK, Nygaard RM, Ulvestad E, Kommedal Ø. Managing contamination and diverse bacterial loads in 16S rRNA deep sequencing of clinical samples: implications of the law of small numbers. mBio. 2021;12(3):e0059821.
32. Rabin M. Inference by believers in the law of small numbers. Q J Econ. 2002;117(3):775–816.
33. Reimer LC, Sardà Carbasse J, Koblitz J, Ebeling C, Podstawka A, Overmann J. BacDive in 2022: the knowledge base for standardized bacterial and archaeal data. Nucleic Acids Res. 2022;50(D1):D741-D746.
34. Tang Z, Zhang L, He N, Gong D, Gao H, Ma Z, et al. Soil bacterial community as impacted by addition of rice straw and biochar. Sci Rep. 2021;11(1):22185.
35. Wang Q, Han Y, Lan S, Hu C. Metagenomic insight into patterns and mechanism of nitrogen cycle during biocrust succession. Front Microbiol. 2021;12:633428.
36. Piro VC, Renard BY. Contamination detection and microbiome exploration with GRIMER. Gigascience. 2022;12.
37. Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol. 2019;37(8):852–7.
38. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods. 2016;13(7):581–3.
39. Callahan BJ, McMurdie PJ, Holmes SP. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J. 2017;11(12):2639.
40. Bokulich NA, Kaehler BD, Rideout JR, Dillon M, Bolyen E, Knight R, et al. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin. Microbiome. 2018;6(1):1–17.
41. Austin GI, Park H, Meydan Y, Seeram D, Sezin T, Lou YC, et al. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Nat Biotechnol. 2023;41(12):1820–8.
42. Davis NM, Proctor DM, Holmes SP, Relman DA, Callahan BJ. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome. 2018;6(1):226.
43. McKnight DT, Huerlimann R, Bower DS, Schwarzkopf L, Alford RA, Zenger KR. microDecon: a highly accurate read-subtraction tool for the post-sequencing removal of contamination in metabarcoding studies. Environ DNA. 2019;1(1):14–25.
44. An U, Shenhav L, Olson CA, Hsiao EY, Halperin E, Sankararaman S. STENSL: microbial source tracking with environment selection. mSystems. 2022;7(5):e00995-21.
45. Shenhav L, Thompson M, Joseph TA, Briscoe L, Furman O, Bogumil D, et al. FEAST: fast expectation-maximization for microbial source tracking. Nat Methods. 2019;16(7):627–32.
46. Weyrich LS, Farrer AG, Eisenhofer R, Arriola LA, Young J, Selway CA, et al. Laboratory contamination over time during low-biomass sample analysis. Mol Ecol Resour. 2019;19(4):982–96.

