SLAF sequencing and SNP discovery
The genome of the cultivated soybean (G. max) was used for program prediction in this project, and the RsaI+HaeIII enzyme was selected for enzyme digestion. SLAF label length ranged from 364-444bp. Once the data had been cleaned, the clean reads derived from each sample ranged between 453 and 2202 Mb for each individual, with most reads being about 800 Mb long. The average number of reads assigned to each individual was 3,859,551, with minimum and maximum read numbers per individual of 2,266,715 and 11,010,066, respectively (Table S1). Phred quality scores were high (30 ≥ 89.82%) and the GC content was found to range between 37.9% and 41.4% (Table S1). A total of 1,784,121 SLAFs were predicted, of which 548,804 were heterozygous SLAF tags. The average number of SLAF labels obtained by each individual was 202,663, with an overall average depth of 11.9x. A total of 2,436,305 SNPs were discovered. SNPs which fulfilled the following criteria were then discarded: (1) those with a minor allele frequency < 1% (2) those with a Hardy-Weinberg Equilibrium p value < 1 × 10−5; and (3) those missing more than 10% of their genotype data. Individuals missing more than 10% of the genotyped data were also discarded. A total of 56,489 SNPs were retained for downstream genetic diversity analysis. The SNPs showed a largely even distribution throughout the genome (Fig. 1).\
Genetic diversity at the species and population levels
The observed heterozygosity (Ho) was 0.0157 for all loci polymorphic at the species level, with the expected heterozygosity (He) being 0.1459, nucleotide diversity (π), 0.1465, and the inbreeding coefficient (FIS), 0.8533. When considering all nucleotide positions, including the non-polymorphic ones, the observed heterozygosity decreased to 0.0004 when the non-polymophic nucleotide positions were included in the analysis, with the expected heterozygosity decreasing to 0.0035, the nucleotide diversity decreasing to 0.0035, and the inbreeding coefficient decreasing 0.0205 under the same conditions.
Statistical analyses for each population are given in Table 2 and Figure 2. Across the loci that showed polymorphism in one or more populations, average observed heterozygosity (Ho) was found to range between 0.0199 (DQ) and 0.0460 (KR), expected heterozygosity (He) between 0.0119 (DQ) and 0.3492 (KR), nucleotide diversity (π) between 0.0130 (JK) and 0.3789 (KR), and inbreeding coefficient between -0.0003 (JK) and 0.0230 (KR).
If all nucleotides, including nonpolymorphic nucleotides were considered, the observed heterozygosity was found from 0.0005 to 0.0016, with the expected heterozygosity from 0.0003 to 0.0121. The observed nucleotide diversity ranged between 0.0003 and 0.0131, and the inbreeding coefficient from -0.0003 to 0.0230. The number of private alleles observed for each population ranged between 1755 (HH) and 12083 (QQHE). From all the measures, the highest genetic diversity was found in the KR population, followed by QQHE. The lowest nucleotide diversity and heterozygosity was seen in the JK population, with the lowest observed heterozygosity found in the DQ population.
Population structure analysis and Linkage disequilibrium
The average pairwise FST values between different populations were used to reconstruct an UPGMA neighbor-joining tree in Mega v6.0 (Fig. 3). The 147 wild soybean accessions clustered into two groups: lineage I comprised five populations from Northern and Central China as well as a population from Japan. Lineage II comprised the other 11 populations, which had Northeast Chinese, Korean and Yangtze River origins. Analysis of gene flow suggested that historical migrations of wild soybean may have occurred, from south to north across the East-Asia land-bridge. The phylogeographic history of wild soybean provides us with new insights into the migration patterns of herbaceous plants across the Sino-Japanese Floristic Region.
The 147 wild soybean accessions clustered into two groups: lineage I comprised five populations (SY, TJ, JN, WN, WH) from Northern and Central China as well as a population from Japan (JK). Lineage II comprised the other 11 populations (HEB, QQHE, CC and DQ from Northeast China; KO, KR and JT from the Korean Peninsula; NJ, YW and HH from the Yangtze River Region). The Korean and Japanese populations did not form independent lineages, with Japanese population falling into Lineage I, and the Korean populations clustering with Lineage II. Individuals sampled from population SY formed two groups.
To further investigate the population structure of the sampled wild soybean populations, we analyzed the 56,489 generated SNPs in Structure2.3 using the “admixture” and “correlated alleles frequencies” models. Changes in LnP(D) and delta K were assessed following, suggesting that the model that best fit our data was K = 10 (Fig. 4A). Populations DQ, CC and QQHE were found to be most genetically similar to each other, from the posterior probabilities (Fig. 4B). Three individuals from population SY were found to be genetically distinct from other individuals in SY. The JT population is highly genetically diverse. The YW and NJ populations are genetically similar.
The principal coordinates analysis was largely consistent with both the reconstructed UPGMA phylogenetic tree and with the Structure analysis (Fig 5). PC1 showed clear divergence between Lineages I and II, although population WH was separated from Lineage I. PC2 divided Lineage II into two clusters: Cluster I comprised populations YW, HH and NJ, and Cluster II comprised populations KO, KR, JT, CC, DQ, QQHE, and HEB.
Linkage disequilibrium decay curves of the 16 G. soja populations are given in Fig 6. Each colored line represents the observed LD data for a single population. A clear and rapid decline of LD is observed to occur with distance, with the LD in all populations decaying rapidly to half its initial value within about 250 kb.
Genetic differentiation and gene flow among populations
The total number of migrants (Nm) per generation was predicted to be 0.18 (Genepop analysis). Figure 3 shows the calculated pairwise population Wright's FST values and the pairwise gene flow (Nm) of the 16 studied G. soja populations. Genetic differentiation between populations, as calculated from the FST values, was found to be quite high. The DQ and JK populations were the most divergent, with an FST value of 0.67, and populations YW and JN were the least divergent with a value of 0.016. The pairwise population Nm values calculated from the Wright's FST values ranged between 0.125 and 2.108. The greatest gene flow (2.108) was between the JN and YW populations, followed by that between the SY and YW populations (2.002). The smallest level of gene flow (0.125) was calculated between the JK and DQ populations, followed by that between the JK and CC populations (0.141). Generally, the levels of gene flow between populations of wild soybean were quite low (Table 3).
To describe the historical relationships between these populations and to investigate potential migration events between them, we ran a TreeMix analysis on the 16 sampled wild soybean populations. The results obtained suggest that population splits have occurred and that there has been gene flow between populations. On the TreeMix output (Fig. 7), the DQ and HEB populations cluster together as one group, and there is strong gene flow from the CC population towards QQHE. Populations TJ, JN and WN clustered together as a single group, and there was strong historical gene flow from this cluster towards the QQHE and JT populations, and modern gene flow from the TJ to the SY population. Overall, the general trend in gene flow was from the south towards the north, with the populations TJ, JN and WN also contributing gene flow. In summary, the general migration patterns seem to have been from the south towards the north.