Sequencing and variation.
We re-sequenced 181 soybean cultivars in Huang-Huai-Hai region (HR) and produced approximately 15.2 billion 100bp pair-ended reads. The sample sequencing depth ranged from 6.64~16.22 fold of the reference genome, and the average sequencing depth was 10.38x (Supplementary Table 1, Figure 1A). Approximately 74.96~99.78% of the reads of each accession can be mapped onto the reference genome, with a mean mapping ratio and coverage rate of 94.87%, 95.48%,respectively (Supplementary Table 1, Figure 1B and 1C), indicating that the sequencing covered most of the reference genome. After comparing with the reference genome, we identified 11185589 SNPs in all 181-soybean materials, which was more than the number of mutations in other studies22-23, 41. These SNPs were evenly distributed on all chromosomes, among which the number of SNPs on chromosome 18 was the highest and so was the density, while chromosome 15 took the second place (Supplementary Table 2, Supplementary Figure 1, Figure 1D). The genome conversion/transversion ratio of 181 materials in HR was 1.89 (ts/tv ratio) (Figure 1G). After annotation, a total of 6136859 SNPs were detected in the intergenic region. In the gene region, we detected 370289 SNPs in exons and 172761 SNPs in UTR region (Supplementary Table 3, Figure 1E). The synonymous and non-synonymous mutation rate of 181 materials in HR was 1.60.
In this study, we detected a total of 2520208 InDel markers through mutation detection, which was more than the number of mutations detected in other studies22-23, 41. For InDel markers, their distribution on chromosomes was basically consistent with the previous description of SNPs (Supplementary Table 2, Supplementary Figure 1, Figure 1D). There were about 975395 InDels in intergenic region, 146578 in intron region, 70379 InDels in UTR region and 47724 InDels in exon region (Supplementary Table 3, Figure 1F).
Variations in the genome.
The variations found in HR cultivars′ genome led to a large number of codon changes in important gene regions. In this part of the present study, by comparing with the reference genome, we found a large number of genes affected by putative variations.
After statistics, we found that 50365 loci were mutated in all cultivars in HR, of which 30624 were SNPs, and 19741 were InDels.
A total of 3240 SNPs were identified in 1314 important gene regions in all accessions. According to the enrichment analysis of SoyBase (http://soybase.org), these 1314 genes were widely involved in biological processes such as metabolic process, cellular process, reproductive process, obsolete GTP catabolic process, regulation and so on (Supplementary Figure 2). In addition, some genes were annotated to involve in the gene localization process, which we thought might account for the adaption of varieties to the environment of HR.
Among these mutations, 1368 SNPs were located in exons of genes, and 473 were synonymous mutations, while the remaining 895 SNPs existing in 522 genes were non-synonymous mutations (Figure 2). Among them, two SNPs were annotated as “initiator_codon_variant”, and may cause the start codon loss of Glyma.01G016800 and Glyma.08G024100. In addition, 14 SNPs led to the early appearance of premature stop codons of 13 genes, leading to shortened polypeptides of Glyma.01G009500, Glyma.02G287400, Glyma.03G034400, Glyma.03G081100, Glyma.03G173800, Glyma.08G228400, Glyma.12G181000, Glyma.14G209700, Glyma.15G166300, Glyma.16G121600, Glyma.18G279800, Glyma.20G015100 and Glyma.20G240900, and these 13 genes were mainly participated in metabolism, response to external stimuli and some regulation processes. Moreover, another four SNPs were likely to lead to the loss of termination codons of Glyma.03G049800, Glyma.18G100200 (binding), Glyma.18G161900 (binding) and Glyma.20G135500, but unfortunately, the biological process annotation information of Glyma.03G049800, Glyma.20G135500 was not included in SoyBase (http://soybase.org) (Table 2).
A non-synonymous mutation analysis for the InDel was also carried out, which identified InDels showing up in important gene regions in HR genome, and from this part we found that InDels might have a greater impact on the HR genome than SNPs (Supplementary Figure 3). 2205 InDels in important gene regions were identified to have different degree of influence on 1241 genes (Figure 3), of which, 660 InDels were exonic variants. We annotated the 660 InDel variants and found that twelve of them were “conservative inframe deletion” variants, which can lead to the deletion of at least one complete codon in the CDS of Glyma.10G188800, Glyma.14G088700 and Glyma.15G074100. This change was likely to change the protein coding products of the corresponding genes. Contrary to the mutation of codon deletion, we found nine “conservative inframe insertion” InDel variants, which were thought to lead to the insertion of at least one codon into CDS regions of Glyma.03G248400, Glyma.04G086400, Glyma.07G140200, Glyma.07G230200, Glyma.15G070700, Glyma.16G001000, Glyma.16G066200, Glyma.19G152500 and Glyma.19G179300. In addition, we found that 622 InDels were “frameshift” type variants, which might cause at least one base to be deleted or inserted downstream of the mutation site, in hence may lead to changes in subsequent amino acid coding, and among which three frameshift mutations may lead to the loss of the starting codon of Glyma.01G013200, Glyma.02G022800 and Glyma.16G132100. Moreover, another seven frameshift mutations can result in the loss of stop codons of Glyma.04G030100, Glyma.04G110000, Glyma.07G078000, Glyma.09G146200, Glyma.11G097000, Glyma.15G187700 and Glyma.16G080500. Furthermore, there were three InDels may have the effect that can lead to the early appearance of stop codons of Glyma.09G135600, Glyma.15G234300 and Glyma.09G278400. Surprisingly, all the genes mentioned upward in this part were mainly involved in the metabolic process and had catalytic activity and binding function and only a small range of them participated in cellular process and localization. Notably, we also found an InDel, which was annotated as a “bidirectional gene fusion” type mutation, and its production can fuse Glyma.13G168900 and Glyma.13G169000 together. In addition, after GO annotation, we found that Glyma.13G168900 was involved in photosystem I assembly process, but the Glyma.13G169000 had no corresponding annotation message in SoyBase (http://soybase.org) (Table 3).
Population structure analysis.
Population structure was completed by using the core SNP set as described in the method part. First, we conducted a principal component analysis (PCA) analysis to reveal the population structure. PCA1 and PCA2 explained the variation of 19.97% and 13.35% of the population respectively (Figure 4A) but basing on the first two principal components (PCA1 and PCA2), the HR soybean population did not exhibit obvious subgroup clustering. The materials of SA, SB, SC, SD and SE all showed a distributed distribution. However, all the varieties of SA gathered in a relatively small range, and as the breeding period of each sub-population processing, the variation range of sub-populations gradually expanded and SE got the largest range of variation.
Next, the genetic relationship matrix among all the samples was computed by using the core SNP data set, and the NJ tree between samples was drawn (Figure 4B), which gave a similar result with PCA. NJ tree showed that although the population of cultivars bred in the HR showed a certain tendency of clustering, the materials of sub-populations in each breeding period did not present obvious clustering characteristics. Furthermore, a cluster analysis was carried out on the present population by using Terastructure (Figure 4D), and it was found that the value of the validation likelihood was in a climbing state until a peak occurred at k = 17. As the value of k increasing, the validation likelihood remained unchanged, and this result indicated that there might be more than 17 ancestral populations in the breeding populations in the HR.
Linkage disequilibrium decay analysis.
In this part, r2 was used to reflect the degree of LD of sub-populations. The half-decay distance (physical distance when r2 decays to half of the maximum value) of the entire HR population was 160 kb, and 290kb, 380kb, 753 kb, 182kb, and 227kb for five sub-populations respectively. SD exhibited the shortest half-decay distance, followed by SE, and SC showed an abnormally high semi-decay distance, which was almost 4 times the semi-decay distance of SD (Figure 4C).
Genetic Diversity among Sub-populations.
In this section, we calculate the nucleotide diversity (π) (Table 4, Supplementary Figure 4) and the polymorphic information content (PIC) (Table 4) of sub-populations at each breeding stage in order to understand the effect of artificial selection within sub-populations on total genetic diversity. The results of π showed that among all the five sub-populations, SA had the highest nucleotide diversity (1.54×10-3), which was significantly higher than that in the later breeding stage sub-populations, while SB (1.27×10-3) and SC (1.24×10-3) got the lowest. With the passage of breeding period, the nucleotide polymorphism of SD (1.41×10-3) and SE (1.38×10-3) gradually increased, and significantly higher than that of SB and SC. The results of PIC analysis were consistent with π. SA had the highest polymorphic information content (0.242), followed by SD (0.234), SB (0.197) and SC (0.190). The analysis of the above two aspects showed that SA had the highest genetic diversity, followed by SD, while SB and SC had the lowest genetic diversity.