The practicality of using 2bRAD-M for detecting the HoC microbiome through in vitro simulation
The performance in microbial identification and abundance estimation of 2bRAD-M was benchmarked against 16S rRNA sequencing and WMS sequencing methods. The conventional 16S method targeting the V4-V5 region was applied to generate genus-level taxonomic profiles with QIIME2 [19] based on the Silva database [20]. Additionally, a recently developed 16S rDNA sequencing protocol called the "5R 16S method" was employed to predict taxonomic profiles at the species level [21]. For WMS sequencing, widely recognized bioinformatic tools like MetaPhlAn4 [22] and mOTUs2 [23] were utilized to taxonomically derive profiles down to genus and species levels (Fig. 2 and Table. S1 in Additional File). To critically assess the performance of profiling tools, the precision-recall curve was used as the main indicator for microbial identification [24]. It graphically represents precision and recall scores at distinct abundance thresholds, offering a comprehensive evaluation of classification performance. By comparing the generated taxonomic profiles with the ground truth, the area under the precision-recall curve (AUPR) was calculated to yield a singular metric for consolidating precision and recall scores and effectively evaluating the presence/absence patterns of microbes [25]. Abundance estimation was also assessed based on L2 similarity for a more thorough analysis.
To assess the efficacy of the aforementioned methods in profiling microbial communities amidst substantial interference from host DNA, we employed a mock microbial community comprising evenly mixed DNA from 20 bacterial species (across 18 genera) [26]. This composite served as the basis for preparing stocks spiked with 90% and 99% human DNA respectively. For each type of stock, two technical replicates were generated and subjected to 2bRAD-M, 16S rRNA sequencing, and WMS sequencing methods. Subsequently, the resultant profiles by each methodology, along with the AUPR and L2 similarity metrics, were then compared with the ground truth across the taxonomic ranks, encompassing the genus and species levels.
2bRAD-M exhibited robust microbial identification and abundance estimation capabilities, coupled with commendable technical reproducibility across replicates. In the scenario where host DNA comprises 90% of the sample, the AUPR and L2 similarity scores for 2bRAD-M exhibited exemplary performance, attaining notably high percentages at both the genus and species levels. Conversely, 16S rRNA sequencing yielded relatively lower AUPR and L2 similarity scores across taxonomic levels, whereas WMS demonstrated metric values akin to those of 2bRAD-M. Upon reaching a host DNA proportion of 99% in the sample, the AUPR and L2 similarity scores for 2bRAD-M significantly outstripped those of 16S rRNA sequencing at both the genus and species levels, underscoring its superior capability in delineating the taxonomic composition of the microbial community, even in the presence of high host DNA contamination. Although WMS exhibited AUPR values comparable to those of 2bRAD-M, its L2 similarity values were lower. Notably, under the HoC scenarios, outstanding challenges have been widely observed in experiments using 16S rRNA sequencing, characterized by a pronounced false positive issue (Fig. 2a-b). The elevated false positive rate and diminished accuracy in the abundance estimation correlated with the augmentation of host DNA within mock samples.
Fig. 2. 2bRAD-M revealed highly accurate microbiota composition in HoC mock samples.
In both (a) and (b), the left panel shows the taxonomic profiling results of MSA1002 mixed with 90% human DNA, while results from MSA1002 mixed with 99% human DNA are shown in the right panel. In the stacked bar plots, the colored bars indicate bacteria taxa that can be found in the ground truth, whereas the white bars indicate false positives identified in the profiling results of a given method. AUPR measures the microbial identification accuracy, while L2 similarity measures the similarity between the ground truth and predicted taxonomic profiles.
(a) Genus-level profiling results of the three sequencing methods (16S, 2bRAD-M, WMS) on the same DNA mock community (MSA 1002).
(b) Species-level profiling results of the three metagenomics methods.
Benchmarking the taxonomic profiling performance of 2bRAD-M with diurnal saliva samples
Although the salivary microbiome has been reported to be significantly associated with various oral and systematic diseases, the underlying mechanisms remain unclear. Elucidating the dynamical changes of the salivary microbiota could contribute to a deeper understanding of its relationship with human diseases. To scrutinize the intricate temporal dynamics of salivary microbiota as well as the performance of 2bRAD-M on actual HoC samples, a cohort of eight participants was enlisted to provide saliva samples at four distinct time points: 9 AM, 11 AM, 1 PM, and 5 PM, resulting in a total of 32 samples (hereafter referred to as diurnal saliva samples). Each sample was partitioned into three aliquots for subsequent analysis via WMS, 2BRAD-M, and 16S rRNA sequencing, respectively. (Fig. 1b).
Considering that the metagenomic profilers employed in our study rely on disparate reference databases, the potential for this discrepancy to introduce confounding variables during the evaluation of classification performance across different methodologies becomes apparent. Hence, it is imperative to standardize the database before initiating comparative analyses. 16S data analysis hinges upon the utilization of 16S rRNA databases, such as Greengenes and the SILVA database. In contrast, 2bRAD-M relies on the GTDB; WMS data analysis is conducted using MetaPhlAn4, which provides a script facilitating the conversion of profiles to GTDB-based outcomes. It is noteworthy that the SILVA and GTDB databases do not share a common data universe for microbiome investigations. To facilitate a more rigorous comparison of technical disparities among sequencing methodologies, we undertook the training of a taxonomic classifier utilizing the GTDB within the QIIME2 framework (refer to the code provided in Additional File 2). In light of the potential influence of sequencing depth-related inaccuracies of WMS, the conventional gold standard, with authentic samples, the comparative analysis did not incorporate the AUPR values. Rather, the evaluation centered on the computation of Pearson correlation coefficients and L2 similarity metrics across different methodologies.
After standardization, to comprehensively evaluate the profiling performance of 2bRAD-M with diurnal saliva samples, we compared the taxonomic profiles generated by 2bRAD-M, whole metagenome shotgun sequencing (WMS), and 16S rRNA sequencing at the genus level (Fig. 3a-c). The taxonomic profiles obtained from 2bRAD-M and WMS exhibited high concordance, as evidenced by an average Pearson correlation coefficient (R) of 92.7% and an average L2 similarity score (L2) of 94.7% (average Pearson correlation R = 92.7% and average L2 similarity L2 = 94.7%) (Fig. 3a; Fig. S1 in Additional File 1). Conversely, a notable disparity is observed when comparing these taxonomic profiles to those generated using 16S rRNA sequencing: the average R and L2 are 87.4% and 83.9% between 2bRAD-M and 16S, as well as 82.5% and 85.1% between WMS and 16S (Fig. 3b and 3c; Fig. S2-S3 in Additional File 1).
Fig. 3. The correlation of microbial abundance at the genus or species level between each pair of three sequencing methods (2bRAD-M, 16S, and WMS) in diurnal saliva samples.
Within each group, three out of the total 32 samples were selected to illustrate the correlation between different methodologies. "Average R" and "Average L2" labeled in the figure represent the respective average indices for all 32 samples. The gray box on top of each scatter plot displays the sample ID. Points on the coordinate axis represent unique features identified by the corresponding method, while shared features are indicated by light blue points in the white area of the plot.
(a-c) Comparative analysis of genus-level taxonomic profiling results obtained from 2bRAD-M, WMS and 16S rRNA sequencing. The profiles of 2bRAD-M and WMS are highly concordant.
(d-f) Comparative analysis of species-level taxonomic profiling results of 2bRAD-M and WMS based on overall divergence, sequencing method disparity and computational pipeline variance. 2bRAD-M and WMS are highly similar in sequencing methods, but significant differences were observed in their corresponding computational pipelines. (WMS)PMA: analyzing WMS data using MetaPhlAn4; (WMS)2bRAD-M: analyzing WMS data using the 2bRAD-M computational pipeline; (2b)2bRAD-M: analyzing 2bRAD-M sequencing data with 2bRAD-M computational pipeline.
We next scrutinize the taxonomic profiling efficacy of 2bRAD-M at the species level. Given the inherent limitations in accurately classifying species using 16S data, we opted to exclusively benchmark 2bRAD-M alongside WMS (Fig. 3d-f). Initially, the two approaches were quite dissimilar (average R = 52.0% and average L2 = 69.9%) and both methods identified substantial numbers of unique species absent in the other (Fig. 3d; Fig. S4 in Additional File 1). Although comparing the shared species could enhance the average Pearson correlation to 71.1%, the outcome remained suboptimal (Fig. S5 in Additional File 1). We postulated that the disparity observed in the taxonomic analysis outcomes were attributed to variations in both the sequencing methodologies and the subsequent computational pipelines. The combination of variances from these two factors might have contributed to the high heterogeneity evident in the overall taxonomic profiles.
To assess the disparity in sequencing methodologies, we contrasted the species-level profiling results generated from WMS data processed within the computational framework of 2bRAD-M, denoted as the (WMS)2bRAD-M combination, with those from 2bRAD-M sequencing data processed through the 2bRAD-M pipeline, referred to as the (2b)2bRAD-M combination. Remarkably, the paired results exhibited high degrees of similarity, with an average R = 96.7% and L2 = 96.7% (Fig. 3e, Fig. S6 in Additional File 1). While certain unique species were detected by either data processing combination, the shared species identified by both combinations represented 98.65% and 99.76% of all reads in the 2bRAD-M data and the WMS data, respectively. These results suggest that taxa identified as rare in the two combinations concerned only a minor fraction of the total biomass. Additionally, correlation analysis conducted on the shared species revealed marginally enhanced metrics in the taxonomic profiling outcomes, with both average R and average L2 = 96.9% (Fig. S7 in Additional File 1).
Furthermore, the discrepancy of computational pipelines was compared. We juxtaposed the taxonomic profiles derived from WMS data analyzed with MetaPhlAn4, denoted as (WMS)MPA combination, with those obtained from the (WMS)2bRAD-M combination, revealing a notable discrepancy (Fig. 3f, Fig. S8 in Additional File 1). These findings corroborated a substantial concordance between the sequencing methodologies of 2bRAD-M and WMS, with disparities in taxonomic profiles primarily attributed to different computational pipelines.
2bRAD-M exhibits a significant congruence with WMS when conducting diversity analyses
To substantiate the capability of 2bRAD-M in yielding biologically equivalent results to other methodologies, particularly WMS, we conducted simultaneous analyses of these samples using 2bRAD-M, 16S, and WMS methodologies. All saliva samples (n = 32) were stratified into the gingivitis group (n = 16) and the healthy group (n = 16) according to the oral health status of the participants, with four individuals diagnosed with gingivitis and four without. Both 2bRAD-M and WMS exhibited significant differences between the two groups in alpha diversity, as evidenced by Shannon diversities of 7.36 and 12.28, respectively (p < 0.05). In contrast, 16S data did not reveal any significant difference between the two groups (p = 0.1) (Fig. 4a). We further conducted beta diversity analysis using four distance matrices (Jaccard distance, Bray-Curtis dissimilarity, unweighted UniFrac distance, and weighted UniFrac distance) for the taxonomic profiles generated by each sequencing method. Based on the pseudo-F statistic from the Adonis test, all methods distinguished between gingivitis and health. Remarkably, the statistical power of 2BRAD-M closely approximated that of WMS, and significantly surpassed that of the 16S method (Fig. S9 in Additional File 1). Collectively, both 2bRAD-M and WMS methods demonstrated high efficacy in discerning microbial diversity variances between gingivitis and health states. Conversely, the 16S method exhibited slightly lower sensitivity to inter-group differences due to limited resolution.
To evaluate the technical consistency across methodologies, we performed the Mantel test to assess the similarity of the corresponding distance matrices derived from different sequencing methods (Fig. 4b). We observed a robust correlation (Pearson r = 0.97) between the genus-level Bray-Curtis dissimilarity matrices for 2bRAD-M and WMS. However, the similarity between the 16S and WMS matrices was notably lower (Pearson r = 0.49). At the species level, the Bray-Curtis and weighted UniFrac distance matrices derived from 2bRAD-M and WMS demonstrated strong consistency, with outstanding Pearson correlations of 0.97 and 0.95, respectively. However, the performance of the 16S method was far less satisfactory, as indicated by similarity values of 0.39 and 0.25 to WMS for the respective distance matrices. These suggest 16S yields less accurate abundance estimates for HoC samples, resulting in less correlated distance matrices compared to WMS and 2bRAD-M. Evidently, 2bRAD-M and WMS exhibited high concordance in beta diversity analysis in terms of Bray-Curtis and weighted UniFrac distances. While 2bRAD-M and WMS might have identified unique species, the shared species identified by both methods were predominant and accounted for the majority of the sequencing reads. The identity and relative abundance of dominant species identified by the two methods exhibited an ultra-high degree of similarity. Collectively, 2bRAD-M technology demonstrates both biological and methodological robustness comparable to those of WMS when sequencing HoC samples, represented by saliva specimens in this study.
Fig. 4 2bRAD-M has a high correlation with whole metagenomic sequencing in saliva specimens.
(a) Comparison of alpha diversities (in Shannon index) between health and gingivitis-afflicted samples using 2bRAD-M, WMS and 16S rRNA sequencing. Kruskal-Wallis test was employed to determine the statistical significance of the observed differences.
(b) Mantel test for the correlation between the corresponding distance matrices derived from different methods, with each point denoting the distance between two samples. The distance matrices obtained from 2bRAD-M, whether based on non-phylogenetic distance (Bray-Curtis dissimilarity) or phylogenetic distance (Weighted UniFrac distance), are highly similar to those obtained from WMS.
(c) Rarefaction analysis of 2bRAD-M and WMS datasets. In the graph, each colored line represents a sample. From each sample, three subsamples were randomly sampled at several fixed sequencing depths (if the set subsampling volume exceeded the total data volume of the sample, subsampling would be halted). The average alpha and beta diversity indices for each of the three subsamples at each sequencing depth were calculated and compared to the pre-rarefaction results. The suggested minimal sequencing depth for 2bRAD-M is defined as the depth at which the similarity between the index obtained through resampling 1M and the pre-rarefaction result in 2bRAD-M does not improve by more than 1%. In WMS, it is defined as the depth at which the similarity after resampling 10M does not improve by more than 1%.
2bRAD-M can capture similar biological signals to WMS with just 5-10% of the sequencing effort from host-rich samples
The effectiveness of metagenomic sequencing demonstrates a robust correlation with sequencing depth [27]. To evaluate the minimal sequencing depth for elucidating microbial composition and studying community ecology in 2bRAD-M and WMS sequencing, rarefaction analysis was performed on datasets from both methods across a range of increasing sequencing depths. We sequenced 32 saliva samples with an average read count of 20 million per sample using 2bRAD-M. In comparison, WMS sequencing produced an average of 207 million reads per sample, representing a tenfold increase in read count compared to 2bRAD-M. Subsequently, we bootstrapped these samples down to 0.1, 1, 2, 5, 7.5, 10, 12.5, 15, 17.5, 20, and 22.5 million reads for 2bRAD-M (bootstrap was not conducted if the sequencing depth did not meet the corresponding threshold, totaling 256 samples). WMS data were subsampled to 1, 10, 20, 50, 75, 100, 125, 150, 175, 200, and 225 million sequences per sample (totaling 308 samples). We aimed to determine the sequencing depth necessary for accurate quantification of crucial ecological indicators, including alpha diversity (Shannon entropy) and beta diversity (Bray-Curtis dissimilarity and UniFrac distance), in both 2bRAD-M and WMS datasets. Notably, with a sequencing depth of 5 million reads per sample, corresponding to about 360 megabases (Mb) of sequencing data, the rarefaction curve of 2bRAD-M reaches saturation and attains Shannon Entropy, Bray-Curtis distance, and weighted UniFrac Distance values that are highly congruent with those of the original samples (similarity > 99%). Conversely, WMS data requires 50 million reads to generate accurate diversity estimation comparable with its deepest read depth (Fig. 4c). Given that the reads in the 2bRAD-M dataset are 32bp long, while those in the WMS dataset are 150bp long, our findings indicate that 2bRAD-M needs merely 5-10% of the sequencing data required by WMS to generate the species-resolved taxonomic profile with comparable accuracy, thereby decreasing costs by 50 to 100 times. These cost reductions are especially significant in studies requiring deep sequencing, such as revealing true microbial composition or detecting scarce but ecologically important species in intricate microbial communities with high host DNA. Therefore, 2bRAD-M is particularly well-suited for analyzing HoC samples.
2bRAD-M captured the subtle temporal changes in the saliva microbial communities
Studies have reported the diurnal oscillation patterns of the oral microbiome [28, 29], but further investigation is needed to elucidate the dynamics of microbial communities at the high taxonomic resolution over shorter time frames, such as during the daytime. To benchmark the capacity of 2bRAD-M in capturing subtle oscillations within the microbiome of HoC samples, we partitioned the diurnal saliva dataset into four cohorts based on the sampling time and computed the Bray-Curtis dissimilarity for each participant's salivary microbiome at 11 AM, 1 PM, and 5 PM compared to their baseline (9 AM). Both the analysis of 2bRAD-M and WMS unveiled marked alterations in microbial communities between the 9 AM-5 PM and 9 AM-11 AM time combinations (Fig. 5a, p = 0.02). This observation aligns with previous literature, indicating the detectability of rhythmic microbial community variations over extended periods [28]. In contrast, the results obtained from 16S sequencing showed no significant changes in microbial communities at different time points compared to the baseline (p = 0.20). These suggest that 2bRAD-M exhibits high sensitivity in capturing the temporal oscillations in microbial communities immersed in host DNA.
Using either the 2bRAD-M or the WMS method, we clustered the microbiotas using the principal coordinate analysis (PCoA) plot and further identified which individual species were responsible for the overall diurnal pattern of saliva microbiota. Samples from two times (9 AM and 5 PM) were differentiated and connected by gray lines to indicate if they belong to the same host (Fig. 5b). The taxonomic profiles were projected onto the first principal coordinate (PC1) and showed a gradient-like heterogeneity of microbial community over the daytime period. Changes in PC1 within the host were associated with microbiome alterations in beta diversity from 9 AM to 5 PM. Therefore, PC1 can serve as the primary descriptor for quantitatively measuring the development of microorganisms during daytime oscillation.
Next, we identified individual microbes responsible for changes in the microbiome along PC1 or compositional change using both WMS and 2bRAD-M methods. While we identified 58 PC1-associated bacterial species using 2bRAD-M, 19 were identified using WMS, where eight species were commonly found in both profiles. In the PCoA plot, the 9-AM samples consistently appear on the right side of those at 5-PM ones (Fig. 5b). Six out of 58 species identified by 2bRAD-M negatively correlated with PC1 (i.e., enriched at 5 PM), the remaining 52 species increased in abundance with PC1 (i.e., enriched at 9 AM). Likewise, 11 out of 19 species identified by WMS positively correlated with PC1. The relative abundance of eight shared species between the two methods commonly increases along PC1 (i.e., enriched at 9 AM). We further confirmed the declining trend in the relative abundance of these eight species from 9 AM to 5 PM for each host (Fig. S10 in Additional File 1; Spearman rho > 0.7, FDR q < 0.2). These shared species included Filifactor alocis, Desulfobulbus oralis, Eubacterium_M brachy, Tannerella_forsythia, Treponema_B denticola, Porphyromonas endodontalis, F0058 sp000163695, and Campylobacter_A rectus. In previous literature, the genus Porphyromonas has been shown to exhibit a notable decreasing trend in the saliva of one or multiple individuals [28]. Consistent with these observations, we noted a similar trend for Porphyromonas endodontalis in our study (Fig. 5c). Therefore, 2bRAD-M provides a promising avenue for studying the subtle dynamics of the microbiome in HoC samples and holds potential applications in assessing the relationship between this variation and the host's physiological state.
Fig. 5 The diurnal pattern of saliva microbiota profiled and compared by three sequencing methods.
(a) The box plot shows the Bray-Curtis distance between each pair of samples collected from the same hosts at different time points (i.e., 9 AM VS 5 PM, 9 AM VS 11 AM). Distance values were calculated from the taxonomic profiles produced by 2bRAD-M, WMS and 16S rRNA sequencing. To ensure visual clarity, time point combinations with non-significant differences (e.g., 9 AM vs 1 PM, 11 AM vs 1 PM, 11 AM vs 5 Pm and 1 PM vs 5 PM) are not depicted in the plot. The Wilcoxon signed-rank tests were conducted to indicate the significance between the 9 AM-5 PM combination and the 9 AM-11 AM combination.
(b) Gradient-like changes in microbial beta diversity over two sampling times, 9 AM (red) and 5 PM (blue). PCoA plot was used to visualize the Bray-Curtis dissimilarity matrices of samples collected at the two time points. Gray lines connected samples from the same host.
(c) The relative abundance of Porphyromonas endodontalis increased along the PC1 axis. Each graph includes a color gradient in the bottom left corner indicating relative abundance, ranging from low (dark blue) to high (brown). Rho represents the Spearman correlation coefficient, and p.adj represents the p-value adjusted by Bonferroni correction.
2bRAD-M demonstrates a stronger discriminatory power in classifying early childhood caries (ECC) than conventional metagenomics
Understanding the human microbiome is key to enhancing disease classification, treatment, and prevention by leveraging insights from microbiota profiles. The use of oral microbiota to predict disease states demonstrated in studies utilizing saliva and plaque samples for diseases like Early Childhood Caries (ECC) [30] and gingivitis [31], is a burgeoning area of research. ECC, affecting nearly half of children worldwide, results in considerable social and economic burdens [32]. It causes permanent dental damage, increasing the risk of further decay and tooth loss over a child's life. Therefore, timely diagnosis and prevention of ECC are of utmost clinical significance. Recent studies have shown the effectiveness of using species-level taxonomic profiles generated from 16S short-read sequencing of saliva and plaque to predict ECC [30]. However, models based on saliva alone were less effective (AUC=0.68), potentially due to the dynamic nature of saliva as a sink for other oral microbiome habitats, adding variability to its microbial profile [33]. Furthermore, implementing more refined sequencing techniques like 2bRAD-M may enhance model performance by providing finer species-level or even strain-level resolution.
To evaluate the performance of 2bRAD-M against conventional sequencing methods in diagnosing ECC, we initiated our study by selecting 19 high-quality samples from each of the ECC and healthy control groups in the aforementioned study. These selections were based on the integrity and completeness of data from previous 16S short-read sequencing efforts, which specifically targeted the V3V4 regions. Subsequently, these samples were subjected to both 2bRAD-M and 16S long-read sequencing. For a consistent and fair comparison across all three methods, we used the GTDB release 202 as our reference database.
In our study, we performed diversity analyses and developed machine learning models to compare the effectiveness of three sequencing methods in differentiating ECC from healthy oral microbiomes. Our analyses revealed that both 2bRAD-M and 16S short-read sequencing effectively distinguish between the ECC-affected and healthy groups. Conversely, the long-read sequencing method was not as successful in making this distinction. Specifically, the results of alpha diversity analysis indicate a significant difference in the Shannon Index between the two groups for both 2bRAD-M and 16S short-read sequencing, as confirmed by the Kruskal-Wallis test (p < 0.05) (Fig. 6a). Notably, the Shannon diversity was higher when using 2bRAD-M, suggesting it may detect a broader array of species compared to the 16S method. We then performed beta diversity analysis to examine ecological difference between oral microbiotas of ECC and healthy groups. 2bRAD-M outperforms other methods, evidenced by significant between-sample disparities using both unweighted UniFrac distances and Bray-Curtis dissimilarities (Fig. 6b). 16S short-read data performed moderately well, distinguishing the two groups under Bray-Curtis dissimilarities but not in unweighted UniFrac distances. In contrast, 16S long-read sequencing data exhibits poor performance probably due to its high sequence error rate, failing to discern ECC from health. These findings highlight the superior performance of 2bRAD-M in capturing the complexity of oral microbiomes, emphasizing its potential for enhancing ECC diagnostics.
The efficacy of microbial abundance profiles as diagnostic models for ECC was evaluated using the random forest algorithm. We measured the models’ effectiveness by their area under the receiver operating characteristic curve (AUC), which quantifies their ability to differentiate between ECC and ECC-free cases. The 2bRAD-M sequencing approach yielded the most accurate results (AUC=0.92), surpassing the 16S short-read sequencing (AUC=0.83) and significantly outperforming the 16S long-read sequencing, which showed minimal discrimination (AUC=0.52) (Fig. 6c). Both the 2bRAD-M and 16S short-read sequencing models pinpointed Streptococcus mutans as a key indicator for ECC. Notably, the 2bRAD-M model identified additional biomarkers, underscoring its capability to reveal more nuanced disease state indicators (Fig. 6d). Within the predictive architecture of the 2bRAD-M model, an intriguing phenomenon was observed: confining the predictive input to merely the quartet of most discriminative bacterial species culminated in an optimal model performance. This apex of predictive accuracy was substantiated by a rigorous 10-fold cross-validation approach, underscoring the outstanding discriminative power of these selected taxa (Fig. 6e). Alongside Streptococcus mutans, this cohort includes Propionibacterium acidifaciens, Mitsuokella sp000469545, and Bifidobacterium dentium. Notably, the genus Mitsuokella, in concert with the other three species, has been empirically corroborated as exhibiting a pronounced association with pathogenicity in dental caries, as evidenced by a litany of scholarly inquiries [34-37] (Fig. 6F). These findings demonstrate that the occurrence of dental caries is not solely associated with a single species but rather with multiple species or even a complex sub-community. The high-precision profiling results provided by 2bRAD-M are crucial for identifying key species related to ECC and for constructing highly performant machine learning models.
Fig. 6. Sequencing of ECC saliva samples using 2bRAD-M, long-read 16S sequencing, and short-read 16S sequencing, with diversity analysis and disease classification results based on species-level profiles.
Diversity analysis encompasses (a) the alpha diversity indices and (b) the beta diversity disparities discerned between healthy (blue) and ECC-afflicted (red) samples. The Kruskal-Wallis test was used to evaluate alpha diversity differences, and PERMANOVA was used for beta diversity. The H and F values represent the test statistic, quantitatively indicating the magnitude of the differences.
(c) Classification performance of Random Forest model using species profiles of the three sequencing strategies, assessed by AUC (area under ROC curve).
(d) The scatter plot displays the relationship between the importance scores of different microbial species and their respective AUC values. Red dots indicate an increase in ECC, blue dots signify a decrease, and grey dots represent a neutral status. The size of the dots represents the mean abundance of these species.
(e) Relationship between the numbers of variables included in the 2bRAD-M model and the corresponding predictive performance (the error bar of each dot denotes the standard deviation).
(f) The four most discriminant species in the predictive model were shown using boxplot. The left half of the graph shows the log10-transforme relative abundance of each species in the healthy or ECC group. The utility of each taxon as a potential ECC marker is assessed by AUC in the right half of the graph.