SE sequencing read analyses in microbiome research were evaluated using data generated on the Illumina MiSeq platform, which is widely used for amplicon-based microbiome research worldwide. The results of this study support the usefulness of SE data analyses in environmental microbiome studies.
Quality of reverse reads influences the sample sequencing depth
First, the impact of R2 read quality on the generation of full-length amplicons was analyzed. Briefly, the R1 and R2 reads were truncated at different lengths, the PE data were merged, and the results were compared (Fig. 2). The results show that the depth (number of sequences/reads per sample) of the R2 and merged sequences decreases as the truncation length of the R2 reads increases, whereas the truncation length of the R1 reads does not have a great impact on merging the PE dataset. The results indicate that the quality of R2 reads obtained from the Illumina MiSeq platform decreases as read length increases, and that the low base quality of R2 reads reduces the sequencing depth of samples, irrespective of amplicon region. A similar decrease in the quality of R2 reads relative to R1 reads was reported in earlier studies (Chen et al. 2018; Liu et al. 2020). Thus, low quality of R2 reads in Illumina MiSeq datasets is a consistent phenomenon that reduces the sequencing depth of samples, which in turn could affect environmental microbiome studies (Pereira-Marques et al. 2019; Ramakodi 2021).
SE versus PE datasets for the inference of alpha and beta diversity
This study evaluated the R1, R2 and PE datasets of four different amplicons, V1V3, V3V4, V4V5, and V6V8, for inferring the microbiome structure in environmental microbiome research. The analyses showed that the number of non-chimeric sequences obtained for the R1 datasets (Median: V1V3- 43,333; V3V4- 33,063; V4V5- 37,313; V6V8- 34,932) is higher than that of the R2 (Median: V1V3- 19,166; V3V4- 30,892; V4V5- 20,390; V6V8- 21,825) and PE (Median: V1V3- 9,579; V3V4- 16,539; V4V5- 20,257; V6V8- 19,652) datasets. Similarly, the number of observed ASVs was significantly higher (P-value: < 2.2e-16 to 0.012) for the R1 datasets than for the other datasets, irrespective of amplicon region. The comparison of the distribution of observed ASVs based on the R1, R2 and PE datasets is shown in Fig. 3. Earlier work showed that sequencing depth could influence the number of ASVs (Ramakodi 2021). Thus, the higher number of observed ASVs in the R1 datasets compared with their counterparts could be due to the higher sequencing depth of the R1 datasets. The beta diversity of samples was studied using the Bray-Curtis dissimilarity index, and the dendrograms based on the Bray-Curtis distance values are shown in Supplementary Figs. 2 to 5 for V1V3, V3V4, V4V5, and V6V8, respectively. The beta diversity analyses show that the relationships between samples inferred from the R1, R2 and PE datasets vary, irrespective of amplicon region. The Bray-Curtis distance is based on the microbial community composition of the datasets. The alpha diversity results show that the R1, R2 and PE datasets exhibit different proportions of ASVs within each amplicon region. Thus, the differences in the distribution of the number of ASVs/microbial taxa among the R1, R2 and PE datasets could have influenced beta diversity.
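To make the link between ASV counts and beta diversity concrete, the following is a minimal sketch of the standard Bray-Curtis dissimilarity, BC = sum|x_i - y_i| / sum(x_i + y_i), computed on raw counts. The function name `bray_curtis` and the count vectors are hypothetical and chosen only for illustration; they are not data or code from this study.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length abundance vectors."""
    if len(u) != len(v):
        raise ValueError("abundance vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("both samples are empty")
    # Sum of absolute count differences divided by the total count of both samples
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical ASV counts (illustrative only, not taken from the study)
r1 = [120, 30, 10, 5, 0]     # deeper dataset that retains two rare ASVs
pe = [60, 15, 0, 0, 0]       # shallower dataset that has lost the rare ASVs
other = [0, 0, 0, 40, 60]    # a sample sharing no ASVs with `pe`

print(bray_curtis(r1, pe))     # 0.375
print(bray_curtis(pe, other))  # 1.0
```

In this toy case the two dominant ASVs have identical relative abundances in `r1` and `pe`, yet the dissimilarity is non-zero because the datasets differ in depth and in the rare ASVs they retain, which is the kind of effect discussed above.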
Core microbiome composition: SE versus PE datasets
A primary goal of microbiome studies is to determine the structure of the core microbiome, and this study therefore also evaluated the utility of SE sequencing data for inferring core microbiome composition. The comparison of the core microbiome at the genus level, based on the SE and PE datasets of different amplicon regions, is shown in Fig. 4, and the relative abundance of taxa (class level) is shown in Fig. 5. The analyses showed that the R1 datasets have a higher number of unique genera than the R2 and PE datasets. Similarly, the R1 datasets exhibited higher numbers of unique class-, order- and family-level taxa for all the amplicon regions except V3V4 (Supplementary Fig. 6). The results suggest that the SE datasets, especially those comprising exclusively R1 reads, could infer more taxa than the PE datasets. The higher number of taxa observed in the R1 datasets could be attributed to sequencing depth. Earlier studies highlighted that microbiome data are compositional in nature, which means that the observed microbiome structure depends on the number of reads available in the dataset (Gloor et al. 2017; Susin et al. 2020). A dataset with higher sequencing depth could infer more taxa than a dataset with low sequencing depth (Zaheer et al. 2018). In this study, the R1 datasets had a higher number of non-chimeric sequences and, consequently, yielded a higher number of taxa. These observations suggest that the R1 reads could provide more information on the microbiome, including the rare taxa whose detection requires higher sequencing depth. However, some discrepancies in the inferred core microbiome were observed between the SE and PE sequencing data, which could be attributed to the shorter length of SE sequences, which limits phylogenetic resolution (Fuks et al. 2018; Johnson et al. 2019).
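The dependence of the number of observed taxa on sequencing depth can be illustrated with a simple subsampling (rarefaction-style) sketch. The community below is hypothetical, and the helper `observed_taxa` is an illustrative name rather than part of any pipeline used in this study.

```python
import random


def observed_taxa(counts, depth, seed=0):
    """Count distinct taxa after drawing `depth` reads without replacement."""
    # Expand the count vector into one taxon label per read
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return len(set(rng.sample(reads, depth)))


# Hypothetical community: 3 dominant taxa plus 20 rare taxa with 2 reads each
community = [5000, 3000, 1960] + [2] * 20  # 10,000 reads, 23 taxa in total

for depth in (500, 2000, 10000):
    print(depth, observed_taxa(community, depth))
```

At the full depth of 10,000 reads all 23 taxa are recovered, whereas shallower subsamples can miss some of the rare taxa, which is consistent with the interpretation that deeper R1 datasets capture more taxa.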
Also, the microbiome composition observed for different amplicon regions varies, which is not surprising, as the original study from which the data were obtained also reported discrepancies in microbiome composition among amplicon regions (Soriano-Lerma et al. 2020). In summary, this study suggests that SE sequencing data, especially the R1 reads, provide results comparable to PE datasets. In fact, the R1 datasets yielded more taxa than the PE datasets, irrespective of amplicon region. Thus, SE sequencing data analyses could be adopted as an alternative approach to infer microbiome composition when the quality of the R2 reads generated by the Illumina MiSeq platform is low, significantly reduces the PE data and, consequently, affects the sample size of the study. However, the results obtained from SE sequencing datasets need careful interpretation, as some of the taxa observed in the PE datasets were absent from the SE datasets.