Diet is often considered to be a driver of microbiome variation [47]. In observational population-based studies, however, diet consistently accounts for only a small proportion of microbiome variation, partly because of large interindividual differences, small sample sizes and limitations in study design, such as potentially insufficient washout periods in crossover studies [47–51]. In general, interindividual variation in gut microbiota is higher in human subjects than in animal species [52]. Nevertheless, using animal models to study the causal role of gut microbiota in health and disease is an established practice, although animal models lack the specific interactions present in the complex system of the human organism [53].
Univariate and multivariate analyses provide information at different levels. Biologists will often find the outputs of univariate analyses easier to interpret than those of multivariate analyses, although the assumptions are similar for both types of method [9]. In general, multivariate methods provide a more holistic overview of differences between samples and account for correlations and interactions between the variables, whereas univariate methods are well suited to pointing out differences for specific microbial groups. The two levels therefore provide complementary information, and we think that it is generally of biological interest to report differences at both levels.
Multivariate ANOVA methods, namely FFMANOVA and ASCA, account for the underlying covariance between multiple variables, unlike traditional ANOVA. In complex microbial data sets, where the variables (i.e. OTUs) are not independent, FFMANOVA and ASCA can serve as suitable tools for making biological inference at both the community and OTU levels. Our results show that these methods performed similarly on the example data (similar community-level effect sizes and similar ranking of OTUs; see Table 2 and Fig. 2). Distance-based PERMANOVA provided similar results at the community level, but its results should be complemented by other methods to gain insight into which OTUs are affected.
ASCA, FFMANOVA and LEfSe (through the LDA step) are based on the covariance structure between OTUs, and therefore depend on the relative scaling of the OTUs. It is usual practice in many areas to scale all variables to equal variance, giving them equal weight in the model. Another option, which retains more of the original variability, is Pareto scaling. An overview of different scaling options used in metabolomics, also relevant for microbiota, can be found in [54]. To prevent the results from being dominated by a few highly abundant OTUs, data should be scaled when analysing abundances. The clr transformation puts the variables on comparable levels, so the need for scaling is less obvious. However, the highly abundant OTUs might still have slightly higher variance, and scaling should be considered depending on the data characteristics.
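The scaling options discussed above can be sketched as follows. This is a minimal illustration rather than the preprocessing used in this study: the function names, the toy count matrix and the pseudocount of 0.5 used to handle zeros before taking logarithms are all assumptions made for the example. Scaling to equal variance divides each centred OTU by its standard deviation, Pareto scaling divides by the square root of the standard deviation, and the clr transformation subtracts the per-sample mean of the log abundances.

```python
import numpy as np

def autoscale(X):
    """Centre each column (OTU) and scale it to unit variance."""
    Xc = X - X.mean(axis=0)
    return Xc / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Centre each column and divide by the square root of its standard deviation."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt(X.std(axis=0, ddof=1))

def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform per sample (row).

    The pseudocount is an assumption of this sketch; it avoids log(0).
    """
    logx = np.log(counts + pseudocount)
    return logx - logx.mean(axis=1, keepdims=True)

# Hypothetical toy data: rows are samples, columns are OTUs.
counts = np.array([
    [1200.0, 30.0, 0.0],
    [900.0, 45.0, 2.0],
    [1500.0, 20.0, 1.0],
    [1100.0, 38.0, 4.0],
])

auto = autoscale(counts)
pareto = pareto_scale(counts)
clr_values = clr(counts)

# After autoscaling every OTU has variance 1; after Pareto scaling the
# abundant OTU keeps a larger variance than the rare ones.
print(auto.var(axis=0, ddof=1))
print(pareto.var(axis=0, ddof=1))
print(clr_values.var(axis=0, ddof=1))
```

Both scaling functions divide by the column standard deviation, so constant OTUs (zero variance) would need to be removed beforehand.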
The tools tested in the present study vary in how flexible they are regarding possible comparisons, namely two-group comparisons (SIMPER, ASCA and LEfSe), the possibility to adjust for confounders (ANCOM, FFMANOVA and ASCA) and the possibility to specify complex models with interactions (ALDEx2, FFMANOVA and ASCA). These aspects should be considered when selecting methods, as different study designs might require different types of statistical models and tests.
Although Spearman’s rank correlation indicated good agreement for the animal studies, little overlap between the lists of “significant” results could be detected. There can be several reasons for this. One important aspect is that different criteria must be used to define “significance” or, more generally, “importance”. FFMANOVA and ALDEx2 provide p-values, whereas more heuristic tools must be applied with ASCA, SIMPER and ANCOM. Also note that for some methods (FFMANOVA, ALDEx2 and ANCOM), the ranking is related to all levels of the experimental factors, whereas the other methods use pairwise comparisons only. Moreover, differences in sample collection, sample preparation and sequencing contribute additional variability, which, in turn, affects the validity of the results [55]. Hence, comparisons across different studies with similar interventions would be even more difficult.
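A small sketch can illustrate how two methods may rank OTUs almost identically and still produce lists of “significant” OTUs that overlap only partly, simply because each method applies its own cut-off. The scores, the method names and the cut-offs below are hypothetical and are not taken from the present study; the rank function assumes no ties.

```python
import numpy as np

def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(a, b):
    """Spearman's rank correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def jaccard(set_a, set_b):
    """Overlap of two lists: size of the intersection over size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Hypothetical importance scores for eight OTUs from two methods.
otus = [f"OTU_{i}" for i in range(1, 9)]
scores_a = np.array([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
scores_b = np.array([7.0, 8.0, 5.0, 6.0, 3.0, 4.0, 1.0, 2.0])

rho = spearman(scores_a, scores_b)

# Each method applies its own (hypothetical) criterion for "significance".
selected_a = {o for o, s in zip(otus, scores_a) if s >= 7.0}
selected_b = {o for o, s in zip(otus, scores_b) if s >= 4.0}
overlap = jaccard(selected_a, selected_b)

print(f"Spearman rho = {rho:.3f}, Jaccard overlap = {overlap:.2f}")
```

Here the rankings differ only by swaps of neighbouring OTUs, so the rank correlation is high, while the strict criterion of the first method and the lenient criterion of the second give lists of different lengths.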
Newly introduced statistical tools can produce contrasting results compared to the existing methods, but such comparisons tend to be biased in favour of the new approach [56]. Past benchmarking studies [57–59] have widely reported varying results from different tools, which was also confirmed in our study. Currently, there is no consensus on the best existing tool for detecting differentially abundant microbial taxa, and there is no reason to believe that one single method is best in all cases. It is therefore good scientific practice to compare and report outputs from several methods. The strategy for making inference from multiple outputs depends on whether it is more important to reduce the number of false negatives or of false positives. If it is important not to miss any possible findings (typically in early stages of research), any OTU listed as significant/important by any method should be investigated further. To obtain robust results, on the other hand, OTUs should be reported as differentially abundant only if they were selected as “significant” by several methods [56].
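The two strategies can be expressed as a simple voting rule over the per-method lists. The sketch below is one possible way to do this, not a procedure prescribed by the tools themselves; the function name, the method labels and the OTU identifiers are hypothetical.

```python
from collections import Counter

def combine_calls(calls, min_methods):
    """Return the OTUs flagged by at least `min_methods` of the methods.

    `calls` maps a method name to the set of OTUs that method flagged.
    """
    votes = Counter(otu for flagged in calls.values() for otu in set(flagged))
    return {otu for otu, n in votes.items() if n >= min_methods}

# Hypothetical lists of "significant" OTUs from three methods.
calls = {
    "method_A": {"OTU_1", "OTU_2", "OTU_5"},
    "method_B": {"OTU_2", "OTU_5", "OTU_7"},
    "method_C": {"OTU_2", "OTU_9"},
}

# Exploratory setting: keep anything flagged by any method (fewer false negatives).
sensitive = combine_calls(calls, min_methods=1)

# Confirmatory setting: require agreement of several methods (fewer false positives).
robust = combine_calls(calls, min_methods=2)

print(sorted(sensitive))
print(sorted(robust))
```

Setting `min_methods=1` gives the union of all lists, whereas setting it to the number of methods gives their intersection; intermediate values trade sensitivity against robustness.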
Finally, microbial data at the OTU level are zero-inflated, and rare OTUs should be removed prior to downstream statistical analyses. We have observed that the threshold for filtering out OTUs can substantially affect the results, at both the community and OTU levels; this is a topic for further research.
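As an illustration of how the filtering threshold changes the data that enter the downstream analysis, the following sketch removes OTUs by prevalence (the fraction of samples in which an OTU is observed). Prevalence is only one of several possible filtering criteria, and the function name, the thresholds and the toy count matrix are assumptions made for this example.

```python
import numpy as np

def filter_rare_otus(counts, min_prevalence):
    """Keep OTUs (columns) observed in at least `min_prevalence` of the samples (rows).

    Returns the filtered count matrix and the boolean mask of retained OTUs.
    """
    prevalence = (counts > 0).mean(axis=0)
    keep = prevalence >= min_prevalence
    return counts[:, keep], keep

# Hypothetical toy data: rows are samples, columns are OTUs.
counts = np.array([
    [10, 0, 3, 0],
    [12, 0, 0, 0],
    [8, 1, 5, 0],
    [15, 0, 2, 0],
])

# The number of retained OTUs depends on the chosen threshold.
retained = {}
for threshold in (0.1, 0.5, 0.9):
    filtered, keep = filter_rare_otus(counts, threshold)
    retained[threshold] = int(keep.sum())
    print(f"min_prevalence={threshold}: {retained[threshold]} OTUs retained")
```

Repeating the downstream analysis over a small grid of thresholds, as in the loop above, is one way to check how sensitive the conclusions are to this choice.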