Soil sampling and edaphic variables analysis
Soil samples were collected in July 2017 from 32 sites located in 12 provinces of China (Additional file 1: Figure S1). Each site comprised agricultural fields with three different land-use types, including open field cultivated with main crops (CF), open field cultivated with vegetables (VF) and greenhouse cultivated with vegetables (VG). Within each site, fields with different land-use types were adjacent to each other (less than 1 km apart). For CF and VF treatments, the field had been cultivated with main crops or vegetables, respectively, for more than 20 years, respectively. For the VG treatment, the field had been converted to greenhouse and cultivated with vegetables for three to more than 15 years (Additional file 1: Table S1). It should be noted that, all greenhouses of the VG treatment were for soil-based vegetable production, which mean vegetables were cultivated in the soil, but not for soilless production. Since different crops were cultivated at our sampling sites (Additional file 1: Table S1) and plant roots can strongly affect soil microbial community composition [1], bulk soils were samples in this study. For each kind of land-use type at each site, three plots with an area of about 50 m2 each were randomly selected, and ten soil cores per plot were taken from the upper soil layer (0-15 cm) and pooled. Therefore, there were three composite samples for each land-use type at each site. In total, 288 soil samples were collected. Soils were sieved (2 mm), and large stones and plant debris were removed. One portion of these sampled soils was used for edaphic variables analyses (Additional file 1: Supplementary Methods) and the other portion was stored at -80°C for DNA extraction.
Soil DNA extraction
Soil DNA was extracted from 0.25 g of soil with the PowerSoil DNA Isolation Kit (MO BIO Laboratories, Carlsbad, USA) following the manufacturer's instructions. The quality of extracted DNA was checked with electrophoresis in a 1.2% (w/v) agarose gel and a NanoDrop 2000 spectrophotometer (ThermoFisher Scientific, Wilmington, USA). Each composite soil sample was extracted in triplicate, and the extracted DNA solutions were pooled.
Quantitative PCR analysis
SYBR Green quantitative PCR assays were used to estimate soil bacterial abundance with the primer set of 338F/518R, targeting the V3 region of the bacterial 16S rRNA gene [34]. Quantitative PCR assays were conducted with a qTOWER 3G touch real-time PCR system (Analytik Jena, Jena, Germany) (the detailed PCR conditions are shown in Additional file 1: Supplementary Methods). All amplifications were performed in triplicate. The specificity of the products was confirmed by melting curve analysis and agarose gel electrophoresis.
High-throughput amplicon sequencing and data processing
Extracted soil DNA was amplified with the primer set F338/R806, targeting the V3-V4 regions of the bacterial 16S rRNA gene [35]. Both the forward and reverse primers were tagged with the specific overhang Illumina adapters. Detailed PCR conditions and library preparation protocol are shown in Additional file 1: Supplementary Methods. The libraries were sequenced (2×300) on an Illumina Miseq platform (Illumina Inc., San Diego, USA).
Raw data generated from sequencing were analyzed using the QIIME pipeline [36]. Briefly, adaptor sequence, barcode and 30 low-quality bases at the end of each read were removed. Paired reads were joined (minimum overlapping read length of 20 base pairs) and quality filtered (Phred score of 20), and reads with less than 200 base pairs were removed. Chimeras were screened and removed with USEARCH using the UCHIME algorithm [37]. Sequences with ≥97% similarity were assigned to the same operational taxonomic unit (OTU) with UPARSE using an agglomerative clustering algorithm [37]. Taxonomies were assigned to the representative sequence of each OTU using the SILVA database release 128 [38]. OTUs classified as chloroplasts and mitochondria, and singleton OTUs were removed. To avoid potential bias caused by sequencing depth, sequence counts of all samples were normalized to a minimum number of sequences (10,763) per sample.
Statistical analyses
Bacterial community α- and β-diversity analyses
Unless stated otherwise, statistical analyses were conducted in “R” (v3.6.2, http://www.r-project.org/). Taxon accumulation curves, Good’s coverage and truncated Preston log-normal model were used to evaluate the sampling effort of the sequencing data using the “vegan” package [39]. Bacterial community α-diversity indices, including number of observed OTUs, Shannon index and Faith’s phylogenetic diversity, were calculated using the “vegan” [39] and “picante” [40] packages.
For bacterial community β-diversity analysis, taxonomy-based Bray-Curtis distance and phylogeny-based β-mean nearest taxon distance (βMNTD) were calculated using the vegdist function in the “vegan” [39] and the comdistnt function in the “picante” [40] packages, respectively. Variation in bacterial community composition was visualized with principal coordinates analysis (PCoA) using the “vegan” package [39]. The proportion of variation in community composition explained by land-use type was quantified by constrained analysis of principal coordinates (CAP), and statistical significance was tested using the permutest function with 9999 permutations in the “vegan” package [39]. To test the effects of sampling sites and land-use types on community dissimilarity, permutational analysis of variance (PERMANOVA) was performed using the adonis function with 9999 permutations in the “vegan” package [39]. Pairwise differences between land-use types were tested using the pairwise.adonis function with 9999 permutations and P values were adjusted with the Benjamini-Hochberg (BH) method in the “pairwiseAdonis” package [41].
Identification of land-use type sensitive OTUs
Land-use type sensitive OTUs were identified following the procedure described by Hartman et al. [42]. First, indicator species analysis was performed to calculate the point-biserial correlation coefficient of an OTU’s positive association to one or a combination of land-use types. This analysis was performed using the multipatt function with 9999 permutations in the “indicspecies” package [43]. Then, likelihood ratio test was used to evaluate differential OTU abundance among land-use types with the “edgeR” package [44] and the P values were adjusted with the BH method. Finally, sensitive OTUs were defined as these validated by both indicator species analysis and likelihood ratio test at P<0.05. A maximum likelihood tree was constructed based on representative sequences for each sensitive OTU with the “phangorn” package [45]. Then, the phylogenetic tree with the relative abundances of each OTU in each land-use type was visualized using the iTOL tool [46].
Co-occurrence network and cohesion analyses
Co-occurrence networks were constructed for all samples and for each land-use type based on relative abundances of OTUs. Spearman’s correlations between OTUs with relative abundances higher than 0.005% were calculated using the rcorr function in the “Hmisc” package [47] and the P values were adjusted with the BH method. A correlation coefficient was considered statistically robust if Spearman’s correlation coefficient (ρ) was >0.6 and the P value was <0.01. The Gephi, an open-source software, was used to visualize the network graphs [48]. The nodes in the networks represent bacterial OTUs, and the edges represent strong and significant correlations among nodes. Some key topological features of the networks (including number of nodes and edges, average connectivity, average path length, clustering coefficient, network density and modularity) and topological features of nodes (including degree and betweenness centrality) were calculated using the “igraph” package [49]. A detailed description of these topological features is shown in Additional file 1: Table S). Possible keystone OTUs were those that had low betweenness centrality and high degree values [50]. To determine whether our networks were nonrandom, 1000 Erdös-Rényi random networks with the same number of nodes and edges as the observed networks were constructed, and the topological features of the real and random networks were compared. The differences in bacterial co-occurrence network structure was determined by PERMANOVA analysis of the Spearman’s correlational distances matrix as described by Williams et al. [51].
Community connectivity was measured by the cohesion metric following the protocol outlined by Herren and McMahon [33]. Briefly, pairwise correlations were calculated between all OTUs in a community. Then, a null model was used to verify the strength of these correlations. Expected correlations generated by the null model were subtracted from the observed correlations, yielding positive and negative connectedness values for each OTU. Finally, cohesion metrics were calculated for each community by taking the sum of each OTU’s connectedness value multiplied by its abundance in the community, yielding a positive and negative cohesion metric for each community.
Distance-decay relationship analysis
Geographical distances among sampling sites were calculated from the sampling coordinates. Spatial turnover rates (i.e., the distance decay of similarity) of bacterial community and habitat were determined by regressing the pairwise community similarity (1-dissimilarity of the Bray-Curtis or βMNTD metrics) or habitat similarity, respectively, against the pairwise geographic distance using ordinary least-squares linear regressions. Habitat similarity was calculated based on edaphic variables [26]: Ed=(1-Eucd/Eucmax), where Eucd is the Euclidean distance between two sites and Eucmax is the maximum Euclidean distance between sites in the distance matrix. For edaphic variables (except pH), proportion data were arcsine(sqrt)-transformed, and others were log-transformed. Mantel tests were used to assess the significance of the correlation between community similarity or habitat similarity and geographic distance using the mantel function with the Pearson’s correlation method and 9999 permutations in the “ecodist” package [52]. Differences of slopes between land-use types were evaluated using the diffslope function with 9999 permutations in the “simba” package [53].
To evaluate the relationship between bacterial community similarity, geographic distance and environmental similarity we performed with the partial Mantel test using the mantel function with the Spearman’s correlation method and 9999 permutations in the “ecodist” package [52]. The environmental distance matrix was created with non-redundant edaphic variables best explaining bacterial community dissimilarities, which were identified using the bioenv function in the “vegan” package [39]. To tease apart the relative importance of land-use type, crop type, and edaphic and geographic variables on bacterial community β-diversity, multiple regressions on matrices (MRM) was performed with the Spearman’s correlation method and 9999 permutations using the “ecodist” package [52]. Land-use type and crop type were transformed into categorical variables. Before applying MRM, variable clustering was used to remove redundant environmental variables with the Spearman’s correlation method (cutoff of ρ < 0.6) using the varclus function in the “Hmisc” package [47]. The partial regression coefficients (b) of an MRM model gave a measure of the per-unit effect of an independent variable on the dependent variable (bacterial community β-diversity) while controlling for the effects of the other independent variables. To gain a better understanding of the influences of edaphic variables on bacterial community β-diversity, a distance-based redundancy analysis was performed with forward selection of edaphic variables using capscale and ordiR2step functions in the “vegan” package [39].
Ecological modeling
To evaluate the relationship between OTUs’ niche differences and phylogenetic distances, we used phylogenetic Mantel correlograms as described by Stegen et al. [11]. Distances between OTUs’ environmental optima were calculated with the abundance-weighted mean approach [11]. The phylogenetic distance matrix was generated using the cophenetic.phylo function in the “ape” package [54]. The correlation between phylogenetic distance and the difference in environmental optima was estimated using the mantel.correlog function with 999 permutations and BH correction in the “vegan” package [39]. A phylogenetic signal was detected if OTUs’ niche differences were significantly related to between-OTU phylogenetic distances.
To evaluate the phylogenetic community assembly, the standardized effect size measure of the mean nearest taxon distance (SES.MNTD) was calculated using the ses.mntd function with the “taxa.labels” null model and 999 randomizations in the “picante” package [40]. SES.MNTD values greater or less than zero indicate communities are more distantly or closely, respectively, phylogenetic clustered than expected by chance [7]. To quantify the potential ecological processes (variable selection, homogeneous selection, dispersal limitation, homogenizing dispersal or drift) that govern bacterial community assembly, the null model method was used as proposed by Stegen et al [11, 32]. β-nearest taxon index (βNTI) and Bray-Curtis-based Raup-Crick (RCbray), which are based on the difference between the observed βMNTD and Bray-Curtis dissimilarities and their null distributions (999 community randomizations), respectively, were calculated using R scripts written by Stegen et al. [11] (available at https://github.com/stegen/Stegen_etal_ISME_2013). For a given community, |βNTI| > 2 and < 2 indicated the dominance of deterministic and stochastic processes, respectively [11, 32]. Furthermore, the relative influence of variable or homogeneous selection were quantified as the fraction of pairwise comparisons with βNTI > +2 or βNTI < -2. The relative influence of dispersal limitation or homogenizing dispersal were quantified as the percentage of pairwise comparisons with -2 < βNTI < 2 and RCbray > 0.95 or < -0.95. The undominated fraction was quantified as the percentage of pairwise comparisons with -2 < β-NTI < 2 and -0.95 < RCbray < 0.95.
To identify the major factors that influenced the bacterial community assembly processes, MRM test was performed with geographic distance and the Euclidean distance matrices of each of the edaphic variables as independent variables and with the βNTI matrix among samples as the dependent variable. Partial Mantel tests were used to evaluate the relationship between βNTI and either geographic distance, soil pH and edaphic variables excluding pH. To further assess the role of soil pH in driving bacterial community assembly processes, βNTI values were regressed against Euclidean distance matrix of soil pH. The statistical significance of the resulting comparisons was determined by Mantel tests with 999 permutations in the “ecodist” package [52].
The Sloan neutral model, which assumes that community assembly is driven solely by chance and dispersal, was used to determine the potential importance of neutral processes to community assembly based on the community abundance data [31]. This model was fit to the frequency of detection of OTUs and their abundances by the parameter m (migration rate), and the fit of the model was assessed with a generalized R2.
Two-way ANOVA was conducted to analyze the effects of sampling sites and land-use types on bacterial abundance, α-diversity indices, soil physicochemical properties, SES.MNTD and βNTI. In cases where data did not satisfy normality and homogeneity of variance, values were arcsine(sqrt)-transformed or log-transformed before analysis. Means were compared between land-use types based on Tukey’s HSD test. Spearman’s correlations between bacterial abundance, bacterial community α-diversity indices, relative abundances of sensitive OTUs and soil physicochemical properties were tested using the rcorr function in the “Hmisc” package [47] and the P values were adjusted with the BH method.