Evaluating the Efficiency of 16S-ITS-23S operon Sequencing: A Comparison of Primer Pairs, Sequencing Platforms, and Taxonomic Classifiers

doi:10.21203/rs.3.rs-4006805/v1


Method Article



This work is licensed under a CC BY 4.0 License

Version 1


Background

The field of 16S rRNA-targeted metagenetics has been enhanced through the improved accuracy of long-read sequencing. More specifically, recent advances have facilitated the transition from short-read sequencing of 16S rRNA gene regions to full-length sequencing of the entire 16S gene (~1500 bp) and, in turn, sequencing of the 16S, Internal Transcribed Spacer (ITS), and 23S regions covering a DNA region known as the ribosomal RNA operon (RRN) (~4500 bp). These technological advances offer the potential to achieve at least species-level resolution when analysing microbiomes, increasing interest in RRN sequencing. However, before widespread adoption of this approach can occur successfully, a thorough assessment of its strengths and limitations is necessary.

Results

This study assesses the effects of RRN primer pairs and sequencing platforms on RRN sequencing, while also aiming to benchmark taxonomic classification methods. In this context, we study the effect of four RRN primer combinations, four mock communities, three sequencing platforms (PacBio, Oxford Nanopore Technologies, and Illumina), two classification approaches (Minimap2 alignment and OTU clustering), and four RRN reference databases (MIrROR, rrnDB, and two iterations of FANGORN) alongside two 16S databases (Greengenes2 and SILVA). Our study reveals that the choice of primer pair and sequencing platform does not substantially bias the taxonomic profiles provided by RRN sequencing for a majority of the mock communities. However, community composition was identified as a confounding factor. The classification method significantly impacts the accuracy of species-level taxonomic assignment. Applying Minimap2 in combination with the FANGORN database was found to provide the most accurate profile for most microbial communities, irrespective of sequencing platform.

Conclusions

Long-read sequencing of the RRN operon provides species-level resolution surpassing that of Illumina-based 16S rRNA gene sequencing. Our findings advocate for the use of RRN sequencing in species-level microbial profiling. We extensively benchmark the factors involved to provide a valuable resource, aiding the advancement and adoption of RRN sequencing, while highlighting some ongoing challenges.

RRN operon

16S-ITS-23S

long-read sequencing

PacBio

Oxford Nanopore Technology (ONT)

16S rRNA sequencing

The 16S rRNA gene is a key marker for bacterial identification [1, 2]. Traditional methods such as Sanger sequencing provide detailed species level resolution by encompassing the entire length of the 16S rRNA gene [3, 4]. However, the low throughput of Sanger sequencing hinders its use in modern microbiome research [5–7]. Next-Generation Sequencing (NGS) technologies have overcome these throughput-related limitations, thereby enhancing scalability and cost-efficiency of sequencing in microbiome research [8–11]. Among the NGS platforms, Illumina has become the leading choice for 16S rRNA sequencing supported by a robust collection of bioinformatic tools and standardized workflows [12–19]. However, NGS platforms such as Illumina come with sequencing length constraints, in that only a limited number of the nine hypervariable regions of the 16S rRNA gene can be sequenced in a single read, reducing phylogenetic resolution, often to genus or family, rather than species level [20–27].

The introduction of third generation, or long-read, sequencing technologies has enhanced NGS capabilities by allowing the sequencing of longer DNA segments. This advancement has revolutionised 16S rRNA gene studies, allowing for full-length 16S gene sequencing (~1500 bp), thereby substantially improving phylogenetic resolution, often to the species level [23, 28–33]. The platforms provided by PacBio and Oxford Nanopore Technology (ONT) are at the forefront of long-read sequencing. Although initially challenged by high error rates, the two platforms have recently achieved higher accuracies. The recent introduction of ONT’s Q20+ chemistry has notably increased its reliability for 16S rRNA gene studies [34, 35]. Conversely, PacBio has benefited from higher accuracy for a longer period, partly through the application of consensus or HiFi sequencing [21, 36]. Additionally, PacBio has made significant progress in reducing the costs of 16S rRNA sequencing with the introduction of their Kinnex 16S rRNA kit, which utilises the MAS-Seq method [37] coupled with their latest sequencing platform enhancements to increase the accessibility of full-length 16S rRNA sequencing. The growing preference for full-length 16S rRNA sequencing using long-read platforms reflects a trend towards requiring species-level resolution in microbiome research. Long-read sequencing also allows the sequencing of the 16S rRNA, Internal Transcribed Spacer (ITS), and 23S rRNA genes in a single read, known as ribosomal RNA operon (RRN) sequencing (~4500 bp). This approach enhances phylogenetic resolution to species, and potentially, strain level [38–40]. RRN sequencing can be especially useful for distinguishing closely related bacteria, such as Escherichia coli and Shigella spp., or species within the Streptococcus mitis group, which exhibit over 99% sequence similarity in the 16S rRNA gene [41–44]. This approach has been applied to study a range of sample types, providing (sub)species-level resolution [45–52]. However, the impact of various factors across the entire workflow, from wet lab techniques to computational processes, remains understudied. Existing work has focussed on classification tools but overlooked the impact of RRN primer bias, sequencing platforms, and their interplay with computational workflows. A comprehensive examination of these elements is therefore crucial to uncover pitfalls and enhance the entire process of RRN sequencing.

Here, we aim to benchmark the various steps involved in RRN sequencing, including primer selection, sequencing platforms, and classification methods. While four RRN primer pairs have been documented in previous publications, this is the first study to compare all four directly. We tested these primers using four compositionally distinct mock communities: the first mirrors real environmental samples; the next two, developed in-house, contain closely related species of the same genus and were each pooled using a different strategy to assess RRN primer biases; and the final one is logarithmically distributed and contains very low abundance members to assess sensitivity and limits of detection. The inclusion of the four mock communities allows a more robust assessment of the RRN sequencing biases. Furthermore, this study compares the two widely adopted long-read sequencing platforms, PacBio and ONT, with the aim of assessing their advancements in terms of accuracy and their suitability for achieving species-level resolution in RRN sequencing. Additionally, we compare these long-read technologies with the recent developments in v3-v4 based 16S rRNA sequencing using Illumina at both species and genus levels. Computational tools have been identified as confounding factors in RRN sequencing; we therefore compared two methods: (i) direct alignment of reads to a reference database using Minimap2, and (ii) OTU clustering using vsearch (common practice in short-read 16S amplicon sequencing), followed by classification with QIIME2’s QIIME-BLAST classifier. Finally, we evaluated the performance of publicly available RRN reference databases, MIrROR, rrnDB, and two versions of FANGORN.

DNA extraction and mock communities

The performance of RRN sequencing was assessed using four mock microbial communities, serving as simplified representatives of real microbiome samples. The first mock community, ATCC ABRF-MGRG (Cat No. MSA-3001), comprises an even mixture of genomic DNA (gDNA) from 10 microbial strains (9 bacteria, 1 archaeon). This was selected to typify environmental microbiome samples. The second, ZymoBIOMICS™ DNA Standard II (Cat No. D6311), is a log-distributed mixture of gDNA isolated from 10 microbial strains (8 bacteria, 2 fungi).

In addition, an in-house mock community with 24 distinct microbial species was constructed. To compose this in-house community, the gDNA was extracted from pure cultures of bacterial species using the GenElute™ Bacterial Genomic DNA Kit (Sigma-Aldrich, Germany) according to the manufacturer's instructions. Two versions of the in-house mock community were developed. For the Mock Community Amplicons Pool (MCAP), RRN amplicons were generated using gDNA from each species individually as the template. Those amplicons were then pooled in equal amounts (in ng) before sequencing. In the second version, the Mock Community gDNA Pooled (MCGD), gDNA from each of the species was initially pooled in equal amounts (in ng) and then amplified to create RRN amplicons for the entire community. The microbial composition of each mock community is detailed in Supplementary Table 1.

PCR amplifications

Four RRN primers, 27F, 519F, 2241R, and 2428R, were selected based on their development in previous publications [39, 45, 46, 48, 49, 53]. All four possible pairings of these forward and reverse primers were employed in PCRs using the four mock communities mentioned above. Of these, 27F-2241R [39, 45, 48, 49, 53] and 519F-2428R [46, 47] have been applied previously in RRN studies. A two-step PCR approach was used in this study to generate barcoded amplicons. First-round primers consisted of M13 sequences at the 5’ end and a 3’ RRN-specific region matching the four primers listed above. The primers used during the second-round PCR target the M13 sequences incorporated into the first-round amplicons and allow the generation of second-round PCR products containing asymmetric barcode sequences [54]. For each sample, unique forward and reverse barcodes were utilised, with each barcode differing in sequence from the other. Further, the combination of forward and reverse primers was distinct for each sample to prevent index cross-talk during amplification. The sequences and details of the employed primers are provided in Supplementary Table 2. The PCR amplification mix for each round consisted of Q5® High-Fidelity DNA Polymerase 2X master mix (NEB, UK), forward and reverse primers (10 µM), and template DNA (2 µL), made up to a volume of 25 µL using PCR grade water. PCR cycling conditions were as per Supplementary Table 3. After each PCR round, amplicons were cleaned and size-selected using NucleoMag (Macherey-Nagel GmbH & Co., Düren, Germany) at a 0.6X ratio. Their quantities were then determined using a Qubit™ fluorometer (Life Technologies, Carlsbad, CA). Each of the barcoded and cleaned RRN amplicons from the second round of PCR was pooled at equal amounts of 8 ng. Size validation was conducted on a 2100 Bioanalyzer instrument (Agilent Technology, California, United States), employing the Agilent DNA 1000 Kit. An overview of the wet-lab methods used in library preparation is presented in Fig. 1A.

Sequencing of the v3-v4 region of the 16S rRNA gene was also conducted using Illumina technology to compare with RRN sequencing at genus and species levels. The v3-v4 libraries were prepared following the procedures for Illumina 16S Metagenomic Sequencing Library Preparation [55, 56], using the Nextera XT Index kit v2 (Illumina, California, United States).

Sequencing and data processing

The pool of cleaned, barcoded RRN amplicons was sequenced using two long-read platforms, Oxford Nanopore Technology (ONT, Oxford, UK) and PacBio (California, United States). One half of the pool of amplicons was prepared for ONT sequencing according to the guidelines provided in the Nanopore protocol for PCR barcoding amplicons, version PBA96_9152_v112_revB_09Feb2022 [57], using the Ligation Sequencing Kit (Q20+) (SQK-LSK112; ONT). Given that the amplicons were already barcoded, the library processing commenced from the “DNA repair and end-prep” stage as outlined in the referenced Nanopore protocol. Sequencing was then carried out using R10.4 flow cells (FLO-MIN112; ONT) on a GridION sequencer following the manufacturer’s instructions. The raw fast5 data from the GridION was base-called by Guppy v. 6.2.7 (ONT) with the High-accuracy model. MinKNOW v. 22.08.6 (ONT) was employed to apply a minimum quality score of 9. Subsequent demultiplexing of the asymmetric barcodes and the region-specific RRN primers was achieved using Nanoplexer [58]. Finally, primers and barcodes were removed using Cutadapt v. 2.6 [59].

The other half of the amplicon pool was prepared for PacBio sequencing according to the PacBio protocol for Preparing multiplexed amplicon libraries using SMRTbell prep kit 3.0 [54]. Sequencing was carried out on an 8M SMRT cell on a Sequel IIe instrument using the Sequel II Binding kit 3.2 and Sequencing chemistry v2.0 (PacBio). Loading was performed by diffusion, with a movie time of 30 hours, and a pre-extension time of 60 minutes. CCS analysis was performed on the Sequel IIe instrument. For demultiplexing, the Lima software from SMRT Tools v11 (PacBio) was used via command line without specifying a minimum barcode quality score, following which all primer and barcode sequences were removed. The PacBio sequencing service was provided by the Norwegian Sequencing Centre (Oslo, Norway).

For the v3-v4 Illumina library, sequencing was carried out on the Illumina NextSeq 2000 at the Teagasc sequencing facility, using a 2 × 300 cycle v3 kit, following standard Illumina sequencing protocols. Demultiplexing of reads was performed on the NextSeq system. Primers were removed using Cutadapt v. 2.6. Details on each sample, including number of reads, median read lengths, and median read quality, are shown in Supplementary Table 4.

Bioinformatic analysis

For the RRN amplicons, two primary approaches to compositional analysis were applied: (i) OTU clustering using vsearch [60], and (ii) direct alignment via Minimap2 [61]. Regardless of the chosen method, a consistent quality threshold of > Q12 and a length filter ranging from 3500 to 5000 bp were applied. After quality filtering, chimeric sequences were removed. The specific tools utilised for these processes varied based on the approach used and are mentioned in the respective sub-sections. The quality threshold of > Q12 was set to ensure consistency between the ONT and PacBio data, while removing a number of low-quality sequences for the ONT data. This was particularly crucial given the disparity in quality scores, with ONT data averaging at 14.7 and PacBio considerably higher at 51. PacBio data was suitable for both the OTU and direct alignment approaches, while the ONT data, owing to its comparatively lower quality scores, was only compatible with the direct alignment approach using Minimap2.
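As an illustration only, the shared quality and length criteria described above (> Q12, 3500 to 5000 bp) can be sketched in Python. This is not the vsearch, Filtlong, or Cutadapt implementation: the function names are hypothetical, and summarising read quality by averaging the per-base error probabilities is an assumption, since tools differ in how they compute a mean quality score.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality, obtained by averaging per-base error probabilities.

    Assumption: Phred+33 encoded qualities; other tools may average differently.
    """
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filters(sequence, quality_string, min_q=12.0, min_len=3500, max_len=5000):
    """Return True if a read satisfies the length window and exceeds the quality threshold."""
    if not (min_len <= len(sequence) <= max_len):
        return False  # length is checked first, so empty reads never reach mean_phred
    return mean_phred(quality_string) > min_q
```

A read of 4000 bases with uniform Q20 qualities would pass under these assumptions, whereas the same read at Q10, or a 3000-base read at any quality, would be discarded.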

OTU clustering using vsearch

The following steps were performed using vsearch v. 2.22.1. FASTQ files from PacBio sequencing were quality (> Q12) and length filtered (3500–5000 bp) with --fastq_filter. Reads were then de-replicated using --derep_fulllength. To reduce the impact of sequencing error on chimera removal, pre-clustering was performed at a 98% similarity threshold using --cluster_size. De novo and reference-based chimera removal was performed with --uchime_denovo and --uchime_ref, respectively. OTU clustering was carried out using --cluster_size at five different clustering thresholds of 97%, 98%, 99%, 99.5%, and 99.9%. The resulting OTU table was then imported into QIIME2 v. 2023.2 [12], where taxonomic classification was performed using the BLAST+ consensus taxonomy classifier, hereafter referred to as QIIME-BLAST (QB), with the feature-classifier classify-consensus-blast option. Two other classifiers within QIIME2, Naïve Bayes (NB) trained (QNBT) and VSEARCH exact match + sklearn (QVPSK), were also used with the options feature-classifier classify-sklearn and feature-classifier classify-hybrid-vsearch-sklearn, respectively.

Direct alignment using Minimap2

FASTQ files from PacBio and ONT sequencing were quality filtered (> Q12) using Filtlong v. 0.2.0 and length filtered (3500–5000 bp) using Cutadapt v. 2.6. Chimeric reads were detected and removed using yacrd v. 1.0. Taxonomic classification was performed using Minimap2 v. 2.17-r974 as the classifier, hereafter referred to as Minimap (MM). When using MM the seed length was provided with -z 70, as suggested by Cuscó et al. [49].
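To make the alignment-based classification step more concrete, the sketch below tallies a taxonomic profile from Minimap2 output in PAF format, whose mandatory tab-separated columns include the query name (column 1), the target name (column 6), and the number of matching residues (column 10). It is a minimal illustration and not the pipeline used in this study: keeping the alignment with the most matching residues per read is an assumed selection rule, and the function names and the target-to-taxon lookup are hypothetical.

```python
from collections import Counter


def best_hits(paf_lines):
    """Map each read to the reference with the most matching residues (assumed rule)."""
    best = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed lines; PAF has 12 mandatory columns
        read, target, matches = fields[0], fields[5], int(fields[9])
        if read not in best or matches > best[read][1]:
            best[read] = (target, matches)
    return {read: target for read, (target, _) in best.items()}


def relative_abundance(assignments, target_to_taxon):
    """Convert per-read assignments into relative abundances (%) per taxon."""
    counts = Counter(target_to_taxon.get(t, "Unassigned") for t in assignments.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}
```

References absent from the lookup are reported as "Unassigned" here; how unassigned reads were handled in practice depends on the database and is not specified by this sketch.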

For v3-v4 Illumina data an average quality score of 33 was obtained. OTU clustering was performed as illustrated by the vsearch developer [62] and was similar to the OTU pipeline run on the PacBio data, with the exception that, for the Illumina data, forward and reverse reads were first merged using the option --fastq_mergepairs in vsearch v. 2.22.1. Following this, quality and length filtering was performed using the option --fastq_filter to keep only the reads > Q30 and between 250 and 446 bp in length. The steps from dereplication to chimera removal were kept the same as for the PacBio data. OTU clustering was performed using --cluster_size at three different clustering thresholds of 97%, 98%, and 99%. The resulting OTU table was then imported into QIIME2 v. 2023.2 and taxonomic classification was performed with the Naïve Bayes (NB) trained classifier, hereafter referred to as QIIME-NB trained (QNBT), using the option feature-classifier classify-sklearn. An overview of the bioinformatic methods used to taxonomically classify the amplicon data is presented in Fig. 1B.

Taxonomy reference databases

In our analysis, six distinct reference databases were utilised. Four of these databases contained sequences of the RRN operon, with the remaining two consisting of full-length 16S rRNA sequences. The rrnDB comprises 16S-ITS-23S sequences extracted from a total of 67,199 bacterial genomes that have been retrieved from NCBI GenBank [39]. The MIrROR database includes curated 16S-ITS-23S sequences, derived through in silico PCRs employing the 27F-2241R primer pair. These sequences, clustered at 99% similarity, originated from 43,653 bacterial genomes listed in NCBI GenBank [53]. The FANGORN database comprises quality-checked, extracted 16S-ITS-23S sequences, clustered at a 99.9% similarity threshold. Within FANGORN, two primary datasets are available: one sourced from 317,541 bacterial and archaeal GTDB genomes, referred to as FANGORN-GTDB, and another sourced from 253,840 bacterial and archaeal RefSeq genomes, referred to as FANGORN-RefSeq [63]. The four RRN databases were used to taxonomically classify long-read RRN amplicons. The rrnDB, FANGORN-GTDB, and FANGORN-RefSeq databases were employed with both the QIIME-BLAST and Minimap classifiers. In contrast, MIrROR, tailored for Minimap2 applications, was exclusively used with our Minimap-based classification method. The two 16S rRNA databases, (i) SILVA v138.1 [64] and (ii) Greengenes2 [65], were used to taxonomically classify Illumina short-read v3-v4 amplicons using QIIME-NB trained classifiers.

Selection of classification methods

The classification methods included combinations of classifiers and databases (presented as classifier_database from here on). Each of the four RRN databases, FANGORN-G (FANGORN-GTDB), FANGORN-R (FANGORN-RefSeq), rrnDB, and MIrROR, was applied in combination with the Minimap (MM) classifier for ONT and PacBio RRN amplicon sequencing data. Of the three QIIME-based classifiers applied in combination with three of the four RRN databases, FANGORN-G, FANGORN-R, and rrnDB, only the QIIME-BLAST classifier was selected for further analysis of the PacBio RRN data at species and genus levels. While all three QIIME-based classifiers provided taxonomic profiles similar to the expected community (Supplementary Table 5), the QIIME-NB trained (QNBT) and QIIME-VSEARCH exact match + sklearn (QVPSK) classifiers were excluded from further analysis as they were observed to perform inconsistently even within the same mock community, cluster further away from the expected community (Supplementary Fig. 1), and report a higher number of unassigned taxa at the species level (Supplementary Fig. 2). The classification methods applied for the v3-v4 Illumina data included the QIIME-NB trained (QNBT) classifier with the Greengenes2 database at species and genus levels and the QIIME-NB trained (QNBT) classifier with the SILVA v138.1 database for genus-level taxonomic assignments. For the v3-v4 data, OTU clustering thresholds of 99% and 97% were applied for species- and genus-level assignments, respectively.

Statistical analysis

Statistical analysis was performed in R v.4.1.2. The vegan package v. 2.6.4 [66] was used for Bray-Curtis Dissimilarity calculations and for PERMANOVA (permutational analysis of variance) analysis using the adonis function specifying 10,000 permutations. The stats package v. 4.1.2 in R was used for Classical Multidimensional Scaling and for the pairwise Wilcoxon test to identify significant differences. P values were adjusted using the Benjamini-Hochberg method and significance was accepted as p ≤ 0.05. The ggplot2 package v. 3.4.2 [67] was used for data visualisation.
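For readers less familiar with the quantities named above, the following Python sketch shows how a Bray-Curtis dissimilarity between two abundance profiles and a Benjamini-Hochberg adjustment of a set of p values can be computed. The analyses in this study were run in R with the vegan and stats packages; the code below is a re-statement of the standard definitions under hypothetical function names, not those packages' implementations.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles."""
    taxa = set(profile_a) | set(profile_b)
    numerator = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    denominator = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return numerator / denominator


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Identical profiles give a dissimilarity of 0 and profiles with no shared taxa give 1, which is the scale on which the observed and expected mock community profiles are compared.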

Accuracy metrics

The accuracy of microbial taxa detection at both genus and species levels, in terms of their presence or absence, was evaluated using three metrics: precision, recall, and F1 scores. The R package yardstick v. 1.1.0 was used for these calculations. Precision measures the proportion of correctly identified true positives in relation to all claimed true positives. Recall, on the other hand, indicates the percentage of actual positives in the sample that were successfully detected by the classification methods. The F1 score, serving as a balance between precision and recall, is the harmonic mean of these two metrics. These three metrics were calculated in the following manner:

$$\text{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}$$

$$\text{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$

$$\text{F1 score}=2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$
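The three formulas can be applied to presence/absence data as in the short Python sketch below. The metrics in this study were calculated with the R package yardstick; the function here is a hypothetical stand-in that follows the definitions above, treating a taxon as a true positive when it is both expected and detected, and returning 0 when a denominator is zero (an assumed convention).

```python
def detection_metrics(expected_taxa, detected_taxa):
    """Precision, recall and F1 for taxon presence/absence against a known community."""
    expected, detected = set(expected_taxa), set(detected_taxa)
    tp = len(expected & detected)   # expected and detected
    fp = len(detected - expected)   # detected but not expected
    fn = len(expected - detected)   # expected but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, if a mock community contains four species and a method reports three of them plus two species that are not in the community, precision is 3/5, recall is 3/4, and the F1 score is 2/3.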

OTU clustering threshold did not significantly impact community taxonomy profiles

RRN sequencing data from both PacBio and ONT platforms were subjected to classification using two distinct methods: OTU clustering with vsearch and direct alignment employing Minimap2. However, due to the lower quality of the ONT data (Q14.7), OTU clustering was not feasible, allowing only direct alignment with Minimap2. In contrast, PacBio data (Q51) provided higher quality, enabling testing of both approaches on the RRN reads. Across the four mock communities (ATCC, MCAP, MCGD, Zymo), varying the OTU clustering threshold (from 97% to 99.9%) within the PacBio dataset did not significantly impact community taxonomic profiles, as determined by PERMANOVA (ATCC: p = 0.961, R² = 0.006; MCAP: p = 0.626, R² = 0.017; MCGD: p = 0.979, R² = 0.005; Zymo: p = 0.993, R² = 0.002). Consequently, a clustering threshold of 99.9% was chosen for subsequent downstream analysis.

Applying a relative abundance cut-off improves effectiveness of the classification methods

Before examining the impacts of different classification methods on community profiles, the influence of relative abundance cut-offs was assessed on the basis of three metrics, F1 score, precision and recall. Five relative abundance cut-off values were selected and applied individually: 0% (no cut-off), 0.001%, 0.01%, 0.05%, and 0.1%. The Zymo mock community, being log-distributed and composed of sparsely abundant microbial species, was not subjected to any relative abundance cut-off.
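A relative abundance cut-off of this kind can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used here: the function name is hypothetical, taxa exactly at the cut-off are retained, and the surviving counts are not renormalised.

```python
def apply_cutoff(counts, cutoff_percent):
    """Keep taxa whose relative abundance (%) is at or above the cut-off.

    Assumptions: `counts` maps taxon -> read count, and retained counts
    are returned unchanged (no renormalisation).
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        taxon: count
        for taxon, count in counts.items()
        if 100.0 * count / total >= cutoff_percent
    }
```

Applying the function with each of the five cut-off values (0, 0.001, 0.01, 0.05, and 0.1) to the same count table yields the successive profiles on which precision, recall, and F1 scores are then compared.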

The application of a 0.01% relative abundance cut-off led to improved F1 scores for most species-level classification methods (Fig. 3A.i). Notably, three of the four MM methods, MM_FANGORN-G, MM_FANGORN-R and MM_MIrROR, exhibited a substantial F1 score increase for both ONT and PacBio datasets in the case of the ATCC mock community. However, for the MCAP and MCGD communities, this notable increase was observed only in the ONT datasets. The QB methods, exclusively applied to PacBio data, maintained a consistent F1 score across all five relative abundance cut-off values for the ATCC community. However, for the MCAP and MCGD communities, the same methods initially displayed consistent F1 scores at the 0% (no cut-off) and 0.001% cut-offs, followed by a gradual decline above the 0.001% cut-off. Generally, across the three mock communities, the implementation of a relative abundance cut-off did not improve F1 scores for the QB methods, suggesting that, for QB methods applied to PacBio data, a relative abundance cut-off may not be necessary. For the MM_rrnDB and QNBT_greengenes2 methods, higher cut-offs were associated with improved F1 scores as higher cut-offs resulted in further reductions in the number of false positives. Similar to the patterns observed with F1 scores, the precision of the MM methods, MM_FANGORN-G, MM_FANGORN-R, and MM_MIrROR, substantially increased at the 0.01% cut-off (Fig. 3A.ii). This increased precision was observed across the ATCC, MCAP and MCGD mock communities and for both ONT and PacBio datasets. Precision for the QB methods, QB_FANGORN-G and QB_FANGORN-R, was consistent across the various cut-offs, communities, and sequencing platforms. Increased precision was driven by decreasing false positives across the various cut-offs. Most classification methods experienced a decline in recall at cut-offs above 0.01% (Fig. 3A.iii). However, for the Illumina-based QNBT_greengenes2 classification method, consistent recall scores across the cut-offs and communities were observed. A decrease in recall was seen to correspond with an increase in false negatives over the cut-off values. Ultimately, a 0.01% relative abundance cut-off was chosen for all the classification methods as it notably decreased the number of false positives reported by a majority of the classifiers, while only marginally reducing the number of true positives (Additional File 1). This choice represented a considered trade-off between reducing errors and sustaining accurate detection across the tested classification methods.

At the genus classification level, the general patterns observed for F1 score, precision and recall across the various classification methods and communities were consistent with those of the species-level classifications (Fig. 3B). For the QNBT_SILVA classification method, applied only at the genus level, the F1 score and precision increased with higher relative abundance cut-off values as the number of false positives decreased, while recall was mostly consistent over the five cut-off values. Similar to the species-level approach, a relative abundance cut-off of 0.01% was applied at the genus level.

Comparative analysis of taxa detection by different classification methods

Following investigation into the F1 scores, precision and recall, each of the classification methods was assessed in terms of the taxa detected and missed at genus and species levels, along with their relative abundance estimates for the four mock communities.

Consistent with the findings above, RRN-based approaches were able to resolve the microbial communities better, with fewer misclassifications, at species level compared to v3-v4 approaches. For the RRN-based methods, comparable results between the ONT and PacBio sequenced datasets were observed across a majority of the microbial mock communities. Notably, within these RRN-based classification techniques, three out of the four MM methods, MM_FANGORN-G, MM_FANGORN-R, and MM_MIrROR, reported no unassigned taxa at either genus or species level. In contrast, MM_rrnDB and the QB methods exhibited a higher count of unassigned taxa at both taxonomic levels.

The ATCC mock community was included to evaluate the performance of RRN sequencing on a diverse range of bacterial and archaeal species typical of environmental samples. Across both ONT and PacBio sequencing platforms, all RRN-based methods showed similar performance at the species level. Notably, none of the RRN-based methods could detect the archaeal community member Haloferax volcanii at either the genus or species level. However, the v3-v4 based methods, specifically QNBT_greengenes2 and QNBT_SILVA, successfully identified Haloferax at the genus level (Supplementary Fig. 4.i). Upon examining the RRN databases, it was observed that FANGORN-G and FANGORN-R included reference sequences for the archaea, whereas MIrROR and rrnDB lacked such references. Despite this, the failure of all RRN methods to identify archaeal members suggests a potential limitation of the RRN primers in effectively detecting archaeal taxa compared to the v3-v4 primers. However, with the QNBT_greengenes2 classification method the Haloferax genus detected could not be resolved accurately at species level. Furthermore, a bacterial community member, Pseudoalteromonas haloplanktis, was only identifiable at the genus level by both v3-v4 and RRN-based methods (see Fig. 5.i). This issue seems to stem more from the reference databases than from the RRN primers themselves. None of the RRN databases included reference sequences for Pseudoalteromonas haloplanktis, although they contained sequences for other species within the same genus. This absence has resulted in a higher frequency of mis-assignments to other species within the genus. The identification of Micrococcus luteus through RRN sequencing appears to depend on the classifier chosen. This becomes apparent as all MM-based methods failed to classify it at the genus or species level, despite the presence of its reference sequence in all the evaluated RRN databases. In contrast, QB methods effectively detected Micrococcus luteus at both genus and species levels.

The in-house mock communities, MCAP and MCGD, were specifically designed to include microbial community members sharing the same genus but differing at the species level. The aim was to assess whether RRN-based methods could more accurately differentiate these communities at the species level compared to v3-v4 based methods, and to investigate the impact of PCR biases on the taxonomic profiles of the two communities when using RRN amplicons-pooled versus gDNA-pooled strategies.

The results showed that RRN-based methods were indeed more effective in distinguishing species within these communities than v3-v4 based methods. For instance, in the case of the RRN methods, the taxa belonging to Lactococcus lactis and Lactococcus cremoris were accurately identified at the species level, whereas v3-v4 methods like QNBT_greengenes2 could not differentiate them at this level, only identifying the genus Lactococcus. Similarly, Lacticaseibacillus casei, Lactiplantibacillus plantarum, Lactobacillus crispatus, and Limosilactobacillus reuteri were correctly identified at the species level by RRN methods (Fig. 5.ii & 5.iii) but only at the genus level by v3-v4 methods. Additionally, certain taxa, including Staphylococcus aureus and Enterococcus faecalis, were completely missed by v3-v4 methods at both genus and species levels (Supplementary Fig. 4.ii and 4.iii). Among the RRN methods, MM_FANGORN-G and MM_FANGORN-R demonstrated fewer mis-assignments and unassigned taxa at both genus and species levels compared to MM_MIrROR and MM_rrnDB across the two mock communities. In contrast, QB-based methods showed slight variations between the two mock communities. The QB methods seemed to perform better with the RRN amplicons-pooled community (MCAP), with fewer missed taxa, but with greater variation in relative abundance estimates compared to the expected values. However, the QB methods did not perform as well with the gDNA-pooled community (MCGD), showing more mis-assignments and unassigned taxa. It is also essential to acknowledge the challenges in accurately pooling the individual RRN amplicons within a mock community, as is the case with MCAP, which could potentially contribute to the variations in relative abundance estimates observed within this community. Despite these differences, both the RRN amplicons-pooled (MCAP) and gDNA-pooled (MCGD) communities showed similar or overlapping beta diversity when evaluated through a PCoA plot (Supplementary Fig. 5), suggesting a consistent representation between the two communities.

The Zymo community, characterized by log-distributed species abundances, served as a test case to evaluate the ability of RRN-based methods to detect low-abundance taxa. The v3-v4 classification method, namely QNBT_greengenes2, struggled to accurately represent most members of the microbial community at the species level, including those that were most abundant. However, at genus level, both QNBT_greengenes2 and QNBT_SILVA succeeded in detecting the most abundant taxa with accurate estimates of their relative abundances (Supplementary Fig. 4.iv). Even with the RRN-based approaches, lower abundance taxa (below 0.001%) were generally missed at both genus and species levels. MM_FANGORN-G and MM_FANGORN-R showed the best performance of the tested methods, with fewer misclassified taxa at both taxonomic levels compared to MM_MIrROR and MM_rrnDB (Fig. 5.iv). Notably, there were differences in performance between the ONT and PacBio sequenced datasets within the MM methods, with ONT showing fewer misclassifications at both genus and species levels compared to PacBio data. Interestingly, the Zymo community was the one instance where the QB_rrnDB method performed relatively better. This improved performance is likely attributable to the unique composition of the Zymo community, which, unlike the other mock communities, contained low-abundance taxa, leading to fewer false positives and unassigned taxa. Additionally, the small number of more abundant taxa (above 0.00898%) were detected equally well by QB_rrnDB, rendering its performance more on par with the other RRN-based classification methods.

Advancements in long-read sequencing, such as PacBio's HiFi sequencing and ONT's Q20+ kit chemistry, have greatly enhanced the read quality of these technologies. Reduced sequencing costs, and improved reliability and accessibility, have initiated a shift from short-read sequencing of regions of the 16S rRNA gene to long-read sequencing of its full length. More recently, this progress has further steered the focus towards RRN sequencing. This trend highlights the push in microbiome research towards attaining more precise species-level resolution. In this study, we assessed the phylogenetic resolution of RRN sequencing compared to v3-v4 Illumina sequencing, verifying the superior species-level resolution of the former. In line with the findings of Cuscó et al. [49], we note that while RRN sequencing enhances species-level resolution, it also tends to yield skewed relative abundance estimates, highlighting the importance of primer design. Additionally, copy number variations of the RRN operon can influence the relative abundance estimates, necessitating normalisation for more precise microbial community profiling [68–70]. While software exists to address these needs in 16S rRNA gene sequencing, the developing field of RRN sequencing currently requires a more manual approach [71]. In this context, the FANGORN database emerges as a valuable resource, providing access to RRN operon copy numbers to support these tasks [63].
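As an illustration of the manual copy number normalisation mentioned above, the following is a minimal Python sketch, not the procedure used in this study. The function name, the taxon labels, the counts, and the copy numbers are all hypothetical, and the fallback of one copy for taxa without a known copy number is an assumption.

```python
def normalise_by_copy_number(read_counts, copy_numbers):
    """Divide each taxon's read count by its RRN operon copy number,
    then rescale the adjusted counts to relative abundances (fractions)."""
    adjusted = {}
    for taxon, count in read_counts.items():
        # Assumption: treat taxa with no known copy number as single-copy.
        copies = copy_numbers.get(taxon, 1)
        adjusted[taxon] = count / copies
    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("no reads to normalise")
    return {taxon: value / total for taxon, value in adjusted.items()}


# Hypothetical example: raw counts suggest a 70:30 split, but the difference
# is explained entirely by operon copy number (7 versus 3 copies).
read_counts = {"Species A": 700, "Species B": 300}
copy_numbers = {"Species A": 7, "Species B": 3}
normalised = normalise_by_copy_number(read_counts, copy_numbers)
```

In this toy case both taxa end up at equal relative abundance after normalisation. In practice the copy numbers would be looked up per taxon from a resource such as FANGORN.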

This study evaluated the primer-derived biases in RRN amplicon generation by using four combinations of previously published RRN primers, unlike previous studies which utilised just one primer pair. Our approach also involved comparing these primers across four compositionally distinct mock communities. We found that for most mock communities, the use of different RRN primer pairs did not significantly alter the taxonomic profiles or introduce substantial biases. However, we noted that the composition of the mock community itself was a crucial factor. When the ATCC community, containing a diverse array of bacterial species isolated from various environmental sources, was used, the 519F-2428R primer pair exhibited a slightly enhanced performance compared to other primer combinations. This advantage manifested as a higher number of true positive species detected and a reduced Bray Curtis Dissimilarity to the expected mock community. Similarly, reports from Kinoshita et al. [47] and Martijn et al. [46] also highlight the capacity of the 519F-2428R primer pair with respect to detecting and amplifying a diverse range of bacterial species. The 519F-2428R primer pair also showed marginally improved performance, in terms of consistently lower Bray Curtis Dissimilarity to the expected community, in the other three mock communities, leading to its selection for more detailed downstream analyses.
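The Bray Curtis Dissimilarity between an observed profile and the expected mock community composition, used above to compare primer pairs, can be sketched as follows. This is a plain Python illustration with hypothetical taxa and abundances; it is not the code used in this study.

```python
def bray_curtis(observed, expected):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Each profile maps taxon name -> abundance; taxa missing from one
    profile are treated as having abundance 0 in that profile.
    Returns 0 for identical profiles and 1 for profiles sharing no taxa.
    """
    taxa = set(observed) | set(expected)
    numerator = sum(abs(observed.get(t, 0.0) - expected.get(t, 0.0)) for t in taxa)
    denominator = sum(observed.get(t, 0.0) + expected.get(t, 0.0) for t in taxa)
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return numerator / denominator


# Hypothetical evenly mixed two-member community and an observed profile
# containing one false positive (Species C).
expected_profile = {"Species A": 0.5, "Species B": 0.5}
observed_profile = {"Species A": 0.6, "Species B": 0.3, "Species C": 0.1}
dissimilarity = bray_curtis(observed_profile, expected_profile)
```

A lower value means the observed profile is closer to the expected composition, which is the sense in which a primer pair with consistently lower dissimilarity is described as performing better.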

The ATCC mock community was included to evaluate the capability of RRN sequencing to detect archaeal species. Previous studies have indicated that the 519F-2428R primer pair is capable of identifying archaeal species [46, 47]. Contrary to these claims, our study found that this primer pair failed to detect the archaeal species H. volcanii from the ATCC mock community at both species and genus levels when using the RRN methods. It was, however, identified at the genus level through v3-v4 methods, suggesting a need for refinement in primer design. Apart from this, two of the four RRN databases lacked the relevant reference sequences. A similar issue, in terms of reference databases, was observed for another species, P. haloplanktis, in the ATCC mock community, for which reference sequences were not present in any of the RRN reference databases. However, reference sequences for other species within the genus Pseudoalteromonas were present leading to mis-assignment at the species level. These two cases highlight areas where current RRN reference databases require enhancements, specifically in improving the representation of archaeal and rare bacterial species. Despite these issues, RRN reference databases have undergone significant advancements in recent years. This is evident from the curation and release of high quality publicly available RRN reference databases, such as FANGORN and MIrROR. The open access nature of these databases, coupled with the assurance of continual updates and regular maintenance makes them an important resource for comparable and reproducible microbiome research based on sequencing of the RRN operon. In this study, both of the recently developed RRN reference databases, FANGORN and MIrROR, were superior to an older RRN reference database, rrnDB, in terms of F1 scores, precision and recall. Along with this, fewer false positives and more true positives were reported for both FANGORN and MIrROR compared to rrnDB. 
Between FANGORN and MIrROR, FANGORN was observed to perform better, especially in discriminating closely related species in the MCAP and MCGD mock communities. Potential factors contributing to FANGORN's effectiveness could include its higher resolution, with reference reads clustered at 99.9% compared to the 99% clustering in MIrROR [53, 63]. Additionally, FANGORN stands out as the most extensive database among those employed in our study, in terms of the number of bacterial and archaeal reference sequences it contains. Particularly, the GTDB version of the FANGORN database surpasses others by incorporating Metagenome Assembled Genomes (MAGs), while maintaining a more uniform taxonomy system.
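For readers less familiar with the accuracy metrics used in the database comparison above, the sketch below shows how precision, recall, and F1 score follow from the true positive, false positive, and false negative counts for a mock community. It is an illustrative Python sketch only; the function name and the detected species list are hypothetical.

```python
def classification_metrics(detected, expected):
    """Compute precision, recall and F1 from detected and expected species sets."""
    detected, expected = set(detected), set(expected)
    tp = len(detected & expected)   # expected species that were detected
    fp = len(detected - expected)   # detected species not in the mock community
    fn = len(expected - detected)   # expected species that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall, "F1": f1}


# Hypothetical result: three of four expected species found, one mis-assignment.
expected_species = {"Lactococcus lactis", "Lactococcus cremoris",
                    "Staphylococcus aureus", "Enterococcus faecalis"}
detected_species = {"Lactococcus lactis", "Lactococcus cremoris",
                    "Staphylococcus aureus", "Staphylococcus epidermidis"}
metrics = classification_metrics(detected_species, expected_species)
```

Here one false positive and one false negative give a precision, recall, and F1 score of 0.75 each.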

Classification method was observed to play a crucial role in our study, wherein we employed two methods for taxonomic classification of RRN amplicons. The first method involved direct alignment to reference sequences using Minimap2 (MM-based approaches), while the second utilised OTU clustering. In the OTU-based approach, vsearch was used to generate OTUs, which were then classified using QIIME2’s BLAST classifier (QB-based approaches). Both approaches had advantages and limitations, and their performance varied depending on the mock community. Generally, applying MM-based approaches led to a higher incidence of false positives, aligning with the findings of Cuscó et al [49]. To address the issue of false positives in MM-based methods, based on the F1 scores, we propose implementing a relative abundance cut-off of 0.01%, which is consistent with similar studies in the field of amplicon sequencing [72, 73]. This approach proved effective for evenly mixed mock communities like ATCC, MCAP, and MCGD. MM methods utilising FANGORN as the reference database, post relative abundance cut-off application, consistently demonstrated superior performance in terms of F1 score, precision, recall, and lower Bray Curtis Dissimilarity to the expected community compared to other classification methods. A notable limitation to the relative abundance cut-off strategy is that it is not feasible for communities with staggered or logarithmically distributed taxa, such as the Zymo mock community, which contained members at an abundance as low as 0.00009%. In these cases, applying a cut-off could exclude genuinely present low-abundance community members. Despite the lack of a relative abundance cut-off being implemented, MM methods with FANGORN for the ONT dataset fared well in terms of F1 scores, precision, recall, and with the detection of a majority of the true positive species within the Zymo community. 
However, this largely depended on the sequencing platform, with poor performance observed for MM_FANGORN (G or R) with PacBio data, a pattern also reported by Zhang et al. [34], where direct alignment of PacBio data had a tendency to miss more true positives at low abundances. The Zymo mock community was the only one where the choice of sequencing platform had a discernible impact. It was observed that QB methods outperformed MM methods when working with PacBio data, achieving performance levels similar to the MM_FANGORN method with ONT data. In summary, this study underscores the variability in classifier performance based on the specific mock community and, in turn, the effect of sequencing platforms. MM_FANGORN generally performed well across most mock communities. Substantially shorter run times were observed for MM methods compared to QB methods, supporting the former's applicability in sectors necessitating fast and efficient microbial profiling. However, faster run times can be achieved for QB methods, which employ OTU clustering, by implementing lower clustering thresholds (e.g., 97%); this alleviates the processing burden without compromising the resolution of taxonomic profiling, as evidenced by this study.
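The relative abundance cut-off discussed above, and its limitation for staggered communities, can be illustrated with a short Python sketch. The 0.01% threshold and the 0.00009% abundance come from the text; the function name, the taxa, the other abundances, the inclusive comparison, and the optional renormalisation step are assumptions made for illustration and may differ from the implementation used in this study.

```python
def apply_abundance_cutoff(profile, cutoff_percent=0.01, renormalise=True):
    """Drop taxa whose relative abundance (in percent) is below the cut-off.

    Assumption: taxa exactly at the cut-off are kept, and the remaining
    abundances are optionally rescaled to sum to 100%.
    """
    kept = {taxon: abundance for taxon, abundance in profile.items()
            if abundance >= cutoff_percent}
    if renormalise and kept:
        total = sum(kept.values())
        kept = {taxon: 100.0 * abundance / total
                for taxon, abundance in kept.items()}
    return kept


# Hypothetical evenly mixed community with one spurious low-abundance hit.
even_profile = {"Species A": 60.0, "Species B": 39.995, "Spurious hit": 0.005}
even_filtered = apply_abundance_cutoff(even_profile)

# Hypothetical staggered community: a genuine member at 0.00009% is also
# removed, which is why the cut-off is unsuitable for such communities.
staggered_profile = {"Dominant member": 99.99991, "Rare member": 0.00009}
staggered_filtered = apply_abundance_cutoff(staggered_profile)
```

In the first profile the filter removes only the spurious hit, whereas in the second it removes a true community member, mirroring the trade-off described for the Zymo mock community.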

We have demonstrated here that RRN sequencing offers significantly enhanced species-level resolution of microbial communities compared to 16S rRNA sequencing utilising Illumina platforms. Looking forward, we anticipate that the next phase in the development of RRN sequencing will involve efforts to achieve accurate strain-level resolution. Although few studies have ventured into employing RRN sequencing for this purpose, the outcomes have varied, particularly across different environments [38, 40, 52]. Achieving strain-level resolution has proven to be notably challenging in marine and terrestrial contexts compared to human or biomedical samples. This is largely due to the scarcity of reference genomes for these environmental microbes in public repositories, such as NCBI RefSeq [38]. As the scope of RRN sequencing projects expands to encompass a broader spectrum of microbial habitats, we can expect an increase in the availability of RRN sequencing data. Consequently, this will enrich the repository of reference sequences obtainable for future studies, thereby potentially enhancing the strain-level classification capabilities of RRN sequencing. Given that strain-level detection necessitates highly accurate sequencing data, we foresee further enhancements in long-read sequencing accuracy. PacBio has achieved a commendable milestone in providing high-quality data, as evidenced in this study by its capability to generate OTUs with 99.9% similarity. PacBio’s higher sequencing quality has currently made it the more suitable choice for strain-level detection [21, 22, 40]. ONT, on the other hand, requires additional advancements to produce data with sufficient accuracy to provide strain-level results, or even OTUs with 97% similarity which was another challenge identified in this study. 
One promising approach involves the application of Unique Molecular Identifiers (UMIs), which have assisted ONT in overcoming its limitations related to lower quality, thereby enabling the generation of accurate OTUs [35, 74]. As the landscape of long-read sequencing continues to advance, we anticipate a surge in technological advancements that will further augment the applicability and precision of RRN sequencing.

In summary, RRN sequencing demonstrated significantly superior species-level resolution compared to Illumina-based 16S rRNA gene sequencing. The efficiency and accuracy of RRN sequencing were primarily influenced by the classification methods employed and the composition of the microbial communities themselves, with primer pairs showing minimal impact. Although the choice of sequencing platform did not have a direct impact, it did influence the selection of classification methods employed. Nevertheless, across the various sequencing platforms and mock communities evaluated, direct alignment using Minimap2 with the FANGORN database consistently performed well in species-level classifications. Increasing yields from, and improved accessibility of, long-read sequencing platforms are likely to lead to greater adoption of RRN sequencing in the microbiome research community. It is anticipated that this study will provide valuable guidance to microbiome researchers in designing RRN experiments to capitalise on the enhanced resolution offered by this approach.

FN: False Negative
FP: False Positive
TN: True Negative
TP: True Positive
OTU: Operational Taxonomic Unit
QIIME2: Quantitative Insights Into Microbial Ecology 2
RRN: 16S–ITS–23S rRNA operon
UMI: Unique Molecular Identifier
PCR: Polymerase Chain Reaction

Ethics approval and consent to participate

No ethics approval or consent to participate was required.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Data Summary

All supporting data and protocols have been provided within the article or as supplementary and additional data files. Five supplementary figures and eleven supplementary tables in the supplementary material (currently .docx format), along with three tables in additional files (.xlsx format), are provided with this manuscript.

Funding

M.S. is funded by the Teagasc Walsh Scholar Scheme (ref no. 2020018). This publication also received financial support from Science Foundation Ireland under grant number 12/RC/2273. D.v.S., P.D.C. and J.G.K receive support from Science Foundation Ireland (SFI) under [grant number SFI/12/RC/2273_P2] (APC Microbiome Ireland). O.O.S., P.D.C. and J.G.K received support from Science Foundation Ireland (SFI) under [grant number SFI/16/RC/3835] (VistaMilk).

Author Contribution

M.S.: conceptualization, data curation, formal analysis, investigation, methodology, visualization, writing – original draft. C.J.W.: conceptualization, formal analysis, methodology, software, writing – review & editing. F.C.: resources, supervision, writing – review & editing. O.O.S.: project administration, supervision, writing – review & editing. P.D.C.: funding acquisition, project management, supervision, writing – review & editing. D.v.S.: funding acquisition, supervision, writing – review & editing. J.G.K.: conceptualization, funding acquisition, methodology, supervision, writing – review & editing. All authors reviewed the manuscript.

Acknowledgements

We would like to thank Paul Cormican for his assistance in installing the bioinformatic tools used in this study; Amy Fitzpatrick for help with ONT library preparation; and the Norwegian Sequencing Centre for PacBio library preparation and sequencing. Figure 1 was created with Biorender.com.

Availability of data and materials

Raw amplicon data for the RRN and v3-v4 16S rRNA gene sequencing approaches can be retrieved from the European Nucleotide Archive under the project accession number PRJEB72454. The codes for the main methods used in this project are available at https://github.com/Meghana9854/RRN-sequencing.git.

Tian R-M, Cai L, Zhang W-P, Cao H-L, Qian P-Y. Rare Events of Intragenus and Intraspecies Horizontal Transfer of the 16S rRNA Gene. Genome Biol Evol. 2015;7:2310–20.
Reller LB, Weinstein MP, Petti CA. Detection and Identification of Microorganisms by Gene Amplification and Sequencing. Clin Infect Dis. 2007;44:1108–14.
Chen L, Cai Y, Zhou G, Shi X, Su J, Chen G, et al. Rapid Sanger Sequencing of the 16S rRNA Gene for Identification of Some Common Pathogens. PLOS ONE. 2014;9:e88886.
Church DL, Cerutti L, Gürtler A, Griener T, Zelazny A, Emler S. Performance and Application of 16S rRNA Gene Cycle Sequencing for Routine Identification of Bacteria in the Clinical Microbiology Laboratory. Clin Microbiol Rev. 2020;33:10.1128/cmr.00053-19.
Sabat AJ, van Zanten E, Akkerboom V, Wisselink G, van Slochteren K, de Boer RF, et al. Targeted next-generation sequencing of the 16S-23S rRNA region for culture-independent bacterial identification - increased discrimination of closely related species. Sci Rep. 2017;7:3434.
Vincent AT, Derome N, Boyle B, Culley AI, Charette SJ. Next-generation sequencing (NGS) in the microbiological world: How to make the most of your money. J Microbiol Methods. 2017;138:60–71.
Salipante SJ, Sengupta DJ, Rosenthal C, Costa G, Spangler J, Sims EH, et al. Rapid 16S rRNA Next-Generation Sequencing of Polymicrobial Clinical Samples for Diagnosis of Complex Bacterial Infections. PLOS ONE. 2013;8:e65226.
Hu T, Chitnis N, Monos D, Dinh A. Next-generation sequencing technologies: An overview. Hum Immunol. 2021;82:801–11.
Sanschagrin S, Yergeau E. Next-generation Sequencing of 16S Ribosomal RNA Gene Amplicons. JoVE J Vis Exp. 2014;:e51709.
Salipante SJ, Kawashima T, Rosenthal C, Hoogestraat DR, Cummings LA, Sengupta DJ, et al. Performance Comparison of Illumina and Ion Torrent Next-Generation Sequencing Platforms for 16S rRNA-Based Bacterial Community Profiling. Appl Environ Microbiol. 2014;80:7583–91.
Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci. 2011;108 supplement_1:4516–22.
Hall M, Beiko RG. 16S rRNA Gene Analysis with QIIME2. In: Beiko RG, Hsiao W, Parkinson J, editors. Microbiome Analysis: Methods and Protocols. New York, NY: Springer; 2018. p. 113–29.
Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods. 2016;13:581–3.
Özkurt E, Fritscher J, Soranzo N, Ng DYK, Davey RP, Bahram M, et al. LotuS2: an ultrafast and highly accurate tool for amplicon sequencing analysis. Microbiome. 2022;10:176.
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Appl Environ Microbiol. 2009;75:7537–41.
Schloss PD. Reintroducing mothur: 10 Years Later. Appl Environ Microbiol. 2020;86:e02343-19.
Barriuso J, Valverde JR, Mellado RP. Estimation of bacterial diversity using next generation sequencing of 16S rDNA: a comparison of different workflows. BMC Bioinformatics. 2011;12:473.
Ju F, Zhang T. 16S rRNA gene high-throughput sequencing data mining of microbial diversity and interactions. Appl Microbiol Biotechnol. 2015;99:4119–29.
Keegan KP, Glass EM, Meyer F. MG-RAST, a Metagenomics Service for Analysis of Microbial Community Structure and Function. In: Martin F, Uroz S, editors. Microbial Environmental Genomics (MEG). New York, NY: Springer; 2016. p. 207–33.
Pollock J, Glendinning L, Wisedchanwet T, Watson M. The Madness of Microbiome: Attempting To Find Consensus “Best Practice” for 16S Microbiome Studies. Appl Environ Microbiol. 2018;84:e02627-17.
Callahan BJ, Wong J, Heiner C, Oh S, Theriot CM, Gulati AS, et al. High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Res. 2019;47:e103.
Johnson JS, Spakowicz DJ, Hong B-Y, Petersen LM, Demkowicz P, Chen L, et al. Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nat Commun. 2019;10:5029.
Matsuo Y, Komiya S, Yasumizu Y, Yasuoka Y, Mizushima K, Takagi T, et al. Full-length 16S rRNA gene amplicon analysis of human gut microbiota using MinION™ nanopore sequencing confers species-level resolution. BMC Microbiol. 2021;21:35.
Boers SA, Jansen R, Hays JP. Understanding and overcoming the pitfalls and biases of next-generation sequencing (NGS) methods for use in the routine clinical microbiological diagnostic laboratory. Eur J Clin Microbiol Infect Dis. 2019;38:1059–70.
Park C, Kim SB, Choi SH, Kim S. Comparison of 16S rRNA Gene Based Microbial Profiling Using Five Next-Generation Sequencers and Various Primers. Front Microbiol. 2021;12.
Abellan-Schneyder I, Matchado MS, Reitmeier S, Sommer A, Sewald Z, Baumbach J, et al. Primer, Pipelines, Parameters: Issues in 16S rRNA Gene Sequencing. mSphere. 2021;6:10.1128/msphere.01202-20.
Deissová T, Zapletalová M, Kunovský L, Kroupa R, Grolich T, Kala Z, et al. 16S rRNA gene primer choice impacts off-target amplification in human gastrointestinal tract biopsies and microbiome profiling. Sci Rep. 2023;13:12577.
Waechter C, Fehse L, Welzel M, Heider D, Babalija L, Cheko J, et al. Comparative analysis of full-length 16s ribosomal RNA genome sequencing in human fecal samples using primer sets with different degrees of degeneracy. Front Genet. 2023;14:1213829.
Catozzi C, Ceciliani F, Lecchi C, Talenti A, Vecchio D, De Carlo E, et al. Short communication: Milk microbiota profiling on water buffalo with full-length 16S rRNA using nanopore sequencing. J Dairy Sci. 2020;103:2693–700.
Stevens BM, Creed TB, Reardon CL, Manter DK. Comparison of Oxford Nanopore Technologies and Illumina MiSeq sequencing with mock communities and agricultural soil. Sci Rep. 2023;13:9323.
Dueholm MS, Andersen KS, McIlroy SJ, Kristensen JM, Yashiro E, Karst SM, et al. Generation of Comprehensive Ecosystem-Specific Reference Databases with Species-Level Resolution by High-Throughput Full-Length 16S rRNA Gene Sequencing and Automated Taxonomy Assignment (AutoTax). mBio. 2020;11:10.1128/mbio.01557-20.
Huggins LG, Colella V, Atapattu U, Koehler AV, Traub RJ. Nanopore Sequencing Using the Full-Length 16S rRNA Gene for Detection of Blood-Borne Bacteria in Dogs Reveals a Novel Species of Hemotropic Mycoplasma. Microbiol Spectr. 2022;10:e03088-22.
Handy MY, Sbardellati DL, Yu M, Saleh NW, Ostwald MM, Vannette RL. Incipiently social carpenter bees (Xylocopa) host distinctive gut bacterial communities and display geographical structure as revealed by full-length PacBio 16S rRNA sequencing. Mol Ecol. 2023;32:1530–43.
Zhang T, Li H, Ma S, Cao J, Liao H, Huang Q, et al. The newest Oxford Nanopore R10.4.1 full-length 16S rRNA sequencing enables the accurate resolution of species-level microbial community profiling. Appl Environ Microbiol. 2023;89:e00605-23.
Lin X, Waring K, Tyson J, Ziels RM. High-accuracy meets high-throughput for microbiome profiling with near full-length 16S rRNA amplicon sequencing on the Nanopore platform. 2023;:2023.06.19.544637.
Earl JP, Adappa ND, Krol J, Bhat AS, Balashov S, Ehrlich RL, et al. Species-level bacterial community profiling of the healthy sinonasal microbiome using Pacific Biosciences sequencing of full-length 16S rRNA genes. Microbiome. 2018;6:190.
Al’Khafaji AM, Smith JT, Garimella KV, Babadi M, Popic V, Sade-Feldman M, et al. High-throughput RNA isoform sequencing using programmed cDNA concatenation. Nat Biotechnol. 2023. https://doi.org/10.1038/s41587-023-01815-7.
Kerkhof LJ, Roth PA, Deshpande SV, Bernhards RC, Liem AT, Hill JM, et al. A ribosomal operon database and MegaBLAST settings for strain-level resolution of microbiomes. FEMS Microbes. 2022;3:xtac002.
Benítez-Páez A, Sanz Y. Multi-locus and long amplicon sequencing approach to study microbial diversity at species level using the MinION™ portable nanopore sequencer. GigaScience. 2017;6:1–12.
Gehrig JL, Portik DM, Driscoll MD, Jackson E, Chakraborty S, Gratalo D, et al. Finding the right fit: evaluation of short-read and long-read sequencing approaches to maximize the utility of clinical microbiome data. Microb Genomics. 2022;8:000794.
Deurenberg RH, Bathoorn E, Chlebowicz MA, Couto N, Ferdous M, García-Cobos S, et al. Application of next generation sequencing in clinical microbiology and infection prevention. J Biotechnol. 2017;243:16–24.
Devanga Ragupathi NK, Muthuirulandi Sethuvel DP, Inbanathan FY, Veeraraghavan B. Accurate differentiation of Escherichia coli and Shigella serogroups: challenges and strategies. New Microbes New Infect. 2018;21:58–62.
Lal D, Verma M, Lal R. Exploring internal features of 16S rRNA gene for identification of clinically relevant species of the genus Streptococcus. Ann Clin Microbiol Antimicrob. 2011;10:28.
Kalia VC, Kumar R, Kumar P, Koul S. A Genome-Wide Profiling Strategy as an Aid for Searching Unique Identification Biomarkers for Streptococcus. Indian J Microbiol. 2016;56:46–58.
Kerkhof LJ, Dillon KP, Häggblom MM, McGuinness LR. Profiling bacterial communities by MinION sequencing of ribosomal operons. Microbiome. 2017;5:116.
Martijn J, Lind AE, Schön ME, Spiertz I, Juzokaite L, Bunikis I, et al. Confident phylogenetic identification of uncultured prokaryotes through long read amplicon sequencing of the 16S-ITS-23S rRNA operon. Environ Microbiol. 2019;21:2485–98.
Kinoshita Y, Niwa H, Uchida-Fujii E, Nukada T. Establishment and assessment of an amplicon sequencing method targeting the 16S-ITS-23S rRNA operon for analysis of the equine gut microbiome. Sci Rep. 2021;11:11884.
Planý M, Sitarčík J, Pavlović J, Budiš J, Koreňová J, Kuchta T, et al. Evaluation of bacterial consortia associated with dairy fermentation by ribosomal RNA (rrn) operon metabarcoding strategy using MinION device. Food Biosci. 2023;51:102308.
Cuscó A, Catozzi C, Viñes J, Sanchez A, Francino O. Microbiota profiling with long amplicons using Nanopore sequencing: full-length 16S rRNA gene and the 16S-ITS-23S of the rrn operon. F1000Research. 2019;7:1755.
Ibironke O, McGuinness LR, Lu S-E, Wang Y, Hussain S, Weisel CP, et al. Species-level evaluation of the human respiratory microbiome. GigaScience. 2020;9:giaa038.
Dowden RA, McGuinness LR, Wisniewski PJ, Campbell SC, Guers JJ, Oydanich M, et al. Host genotype and exercise exhibit species-level selection for members of the gut bacterial communities in the mouse digestive system. Sci Rep. 2020;10:8984.
Spreckels JE, Fernández-Pato A, Kruk M, Kurilshikov A, Garmaeva S, Sinha T, et al. Analysis of microbial composition and sharing in low-biomass human milk samples: a comparison of DNA isolation and sequencing techniques. ISME Commun. 2023;3:116.
Seol D, Lim JS, Sung S, Lee YH, Jeong M, Cho S, et al. Microbial Identification Using rRNA Operon Region: Database and Tool for Metataxonomics with Long-Read Sequence. Microbiol Spectr. 2022;10:e02017-21.
Procedure-checklist-Preparing-SMRTbell-libraries-using-PacBio-barcoded-M13-primers-for-multiplex-SMRT-sequencing.pdf.
Cullen JT, Lawlor PG, Cormican P, Crispie F, Gardiner GE. Optimisation of a bead-beating procedure for simultaneous extraction of bacterial and fungal DNA from pig faeces and liquid feed for 16S and ITS2 rDNA amplicon sequencing. Anim - Open Space. 2022;1:100012.
Walsh AM, Crispie F, Kilcawley K, O’Sullivan O, O’Sullivan MG, Claesson MJ, et al. Microbial Succession and Flavor Production in the Fermented Dairy Beverage Kefir. mSystems. 2016;1:10.1128/msystems.00052-16.
Ligation sequencing amplicons - PCR barcoding (SQK-LSK112 with EXP-PBC096). Oxford Nanopore Technologies. https://community.nanoporetech.com/protocols/pcr-barcoding-96-amplicons-sqk-lsk112/v/pba96_9152_v112_revh_09feb2022. Accessed 20 Feb 2024.
Han Y. hanyue36/nanoplexer. 2023.
Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17:10–2.
Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: a versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.
VSEARCH pipeline · torognes/vsearch Wiki. https://github.com/torognes/vsearch/wiki/VSEARCH-pipeline. Accessed 20 Feb 2024.
Walsh CJ, Srinivas M, Sinderen D van, Cotter PD, Kenny JG. FANGORN: A quality-checked and publicly available database of full-length 16S-ITS-23S rRNA operon sequences. 2022;:2022.10.04.509801.
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013;41:D590–6.
McDonald D, Jiang Y, Balaban M, Cantrell K, Zhu Q, Gonzalez A, et al. Greengenes2 unifies microbial data in a single reference tree. Nat Biotechnol. 2023;:1–4.
Dixon P. VEGAN, a package of R functions for community ecology. J Veg Sci. 2003;14:927–30.
Wilkinson L. ggplot2: Elegant Graphics for Data Analysis by WICKHAM, H. Biometrics. 2011;67:678–9.
Olivier SA, Bull MK, Strube ML, Murphy R, Ross T, Bowman JP, et al. Long-read MinION™ sequencing of 16S and 16S-ITS-23S rRNA genes provides species-level resolution of Lactobacillaceae in mixed communities. Front Microbiol. 2023;14:1290756.
de Oliveira Martins L, Page AJ, Mather AE, Charles IG. Taxonomic resolution of the ribosomal RNA operon in bacteria: implications for its use with long-read sequencing. NAR Genomics Bioinforma. 2020;2:lqz016.
Lavrinienko A, Jernfors T, Koskimäki JJ, Pirttilä AM, Watts PC. Does Intraspecific Variation in rDNA Copy Number Affect Analysis of Microbial Communities? Trends Microbiol. 2021;29:19–27.
Gao Y, Wu M. Accounting for 16S rRNA copy number prediction uncertainty and its implications in bacterial diversity analyses. ISME Commun. 2023;3:1–9.
Nikodemova M, Holzhausen EA, Deblois CL, Barnet JH, Peppard PE, Suen G, et al. The effect of low-abundance OTU filtering methods on the reliability and variability of microbial composition assessed by 16S rRNA amplicon sequencing. Front Cell Infect Microbiol. 2023;13.
Curry KD, Wang Q, Nute MG, Tyshaieva A, Reeves E, Soriano S, et al. Emu: species-level microbial community profiling of full-length 16S rRNA Oxford Nanopore sequencing data. Nat Methods. 2022;19:845–53.
Karst SM, Ziels RM, Kirkegaard RH, Sørensen EA, McDonald D, Zhu Q, et al. High-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing. Nat Methods. 2021;18:165–9.
