Construction of a platinum-grade soybean reference genome
Genome assembly using PacBio SMRT data and Hi-C data resulted in a JD17 genome assembly with 411 contigs anchored to 20 pseudomolecules, with a total size of 965.8 Mb (Table 1), accounting for 97.1% of the total contigs (Fig. 1). The genome accuracy rate was evaluated as 99.999% by using resequencing data and GATK tools [13, 14] (See methods). BUSCO score of the assembly was 93.2% [15] (Table 2), indicating that the completeness of JD17 was higher than that of the Wm82, ZH13, and W05 genomes [16, 17].
Table 1
Assembly statistics of Glycine_max_JD17 (JD17), Glycine_max_v2.0 (Wm82), Gmax_ZH13 (ZH13) and W05.
Assembly Feature | JD17 | Wm82 | ZH13 | W05 |
Estimated genome size (by K-mer analysis) (Mb) | 1,109 | 1,115 | - | - |
Number of contigs | 446 | 16,311 | 1,528 | 1,870 |
Total size of contigs (Mb) | 995.0 | 955.1 | 1,007 | 998.6 |
Longest contig (Mb) | 31.8 | - | - | - |
Number of contigs > 1 Mb | 97 | - | - | - |
Number of contigs > 10 Mb | 39 | - | - | - |
N50 contig length (Mb) | 17.998(PacBio) | 0.189 | 3.46(PacBio + BioNano + Hi-C) | 3.3 |
L50 contig count | 21 | 1,492 | 66 | 58 |
Anchored contigs | | |
Number of chromosomes | 20 | 20 | 20 | 20 |
Number of contigs | 139 | - | - | 772 |
Total size (Mb) | 971 | 937.3 | 1,025 | 1,013.2 |
Number of gaps | 119 | 12,785 | 815 | 750 |
Table 2
BUSCO analysis for JD17, Wm82, ZH13 and W05.
| JD17 | W82 | ZH13 | W05 |
Genome |
Complete BUSCOs © | 1,342 | 1,340 | 1,280 | 1,340 |
Complete and single-copy BUSCOs (S) | 669 | 690 | 631 | 673 |
Complete and duplicated BUSCOs (D) | 673 | 650 | 649 | 667 |
Fragmented BUSCOs (F) | 16 | 15 | 29 | 16 |
Missing BUSCOs (M) | 82 | 85 | 131 | 84 |
Transcript |
Complete BUSCOs © | 1,400 | 1,393 | 1,393 | 1,401 |
Complete and single-copy BUSCOs (S) | 199 | 341 | 533 | 384 |
Complete and duplicated BUSCOs (D) | 1,201 | 1,052 | 860 | 1,017 |
Fragmented BUSCOs (F) | 9 | 13 | 11 | 6 |
Missing BUSCOs (M) | 31 | 34 | 36 | 33 |
Protein |
Complete BUSCOs © | 1,404 | 1,401 | 1,397 | 1,404 |
Complete and single-copy BUSCOs (S) | 205 | 341 | 533 | 405 |
Complete and duplicated BUSCOs (D) | 1,199 | 1,060 | 864 | 999 |
Fragmented BUSCOs (F) | 5 | 8 | 12 | 7 |
Missing BUSCOs (M) | 31 | 31 | 31 | 29 |
Total BUSCO groups searched | 1440 |
Annotation of JD17 genome
We identified 616.6 Mb (62.0% of the JD17 genome) repeat elements in JD17, which contain 1,138,787 intact transposable elements (TEs), including 864,771 class I RNA retrotransposons and 274,016 class II DNA transposons (Table S1).
We identified 58,738 protein-coding genes and 253,172 full-length transcripts in the JD17 genome. The average length of transcripts (2,525 bp) and 3’UTR (707 bp) in JD17 genome are comparatively longer than that in Wm82 (1,628 bp, 374 bp), ZH13 (1,512 bp, 338 bp), and W05 (1,650 bp, 405 bp) genome (Table 3).
Table 3
Comparison of genome annotation of JD17, Wm82, ZH13 and W05.
| JD17 | Wm82 | ZH13 | W05 |
Number of genes | 58,738 | 56,044 | 52,178 | 55,539 |
Number of transcripts | 253,172 | 88,647 | 58,017 | 89,477 |
Average number of transcripts per gene | 4.3 | 1.58 | 1.11 | 1.61 |
Average length of transcript* (bp) | 2,525 | 1,628 | 1,512 | 1,650 |
Average exons number per transcript* | 8.4 | 5.33 | 5.62 | 5.36 |
Average length of 5' UTR* (bp) | 591 | 388 | 183 | 252 |
Average length of 3' UTR* (bp) | 707 | 374 | 338 | 405 |
Numbers of genes with gap | 0 | 1,063 | 12 | 0 |
* mean based on transcript with the longest CDS. |
Table 4
Genes related to important agronomic traits.
| Gene Name | Wm82.a2.v1 Gene id | JD17 Gene ID | Genotype in JD17 | Speculated phenotype | Reference |
Flowering related genes | FLS1/gmfls1 | Glyma.13G082300 | GmJD_13G0078400 | WmWm | purple flower | (Takahashi et al., 2007) |
F3’5’H | Glyma.13G072100 | GmJD_13G0068800 | W1 | Light purple flowers | (Takahashi et al., 2010) |
DFR1 | Glyma.14G072700 | GmJD_14G0066900 | W3 | Purple flower | (Park et al., 2015) |
DFR2 | Glyma.17G252200 | GmJD_17G0274600 | W4 | Purple flower | (Yan et al., 2014) |
FT2c | Glyma.02G069200、Glyma.02G069500 | GmJD_02G0063800 | GmFT2c | Days About 32 in short day (SD) and 46 in long day (LD) to flowering | (Wu et al., 2017) |
J gene | Glyma.04G050200 | GmJD_04G0047400 | J | Promoting flowering under short days | (Lu et al., 2017) |
E1 | Glyma.06G207800 | GmJD_06G0201400 | e1-as | Early-Flowering | (Xia et al., 2012) |
Seed related genes | GmIRCHS | Glyma.08G110300 | GmJD_08G0106200 | I/I | Inhibiting seed coat pigmentation | (Kasai et al., 2007) |
GmHs1-1 | Glyma.02G269500 | GmJD_02G0283800 | Gmhs1-1 | Permeable seed coat | (Sun et al., 2015) |
SoyWRKY15a | Glyma.05G096500 | GmJD_05G0121200 | GmWRKY15a (haplotypes H1) | Heavier seeds | (Gu et al., 2017) |
FATB1a | Glyma.05G012300 | GmJD_05G0012300 | No deletion, homozygous FATB1a | Relatively high palmitic acid levels of seed | (Goettel et al., 2016) |
ZF351 | Glyma.06G290100 | GmJD_06G0292100 | GmZF351 | Relatively high lipid content of seed | (Li, QT et al., 2017) |
GmDREBL | Glyma.12G103100 | GmJD_12G0097600 | variation type 2 | Relatively higher oil content of seed | (Zhang, YQ et al., 2016) |
GA20OX | Glyma.07G081700 | GmJD_07G0078300 | cultivated GmGA20OX | Relatively heavy 100 seeds weight of seed | (Lu et al., 2016) |
NFYA | Glyma.02G303800 | GmJD_02G0316700 | GmNFYA | Relatively higher oil content of seed | (Lu et al., 2016) |
GmSWEET10a | Glyma.15G049200 | GmJD_15G0049100 | Haplotypes H_III-3 | Relatively heavy 100-seed weight, higher fatty acid content and lower protein content of seed | (Shoudong et al., 2020) |
GmSWEET10b | Glyma.08G183500 | GmJD_08G0177700 | Haplotypes H_III-1 | Relatively heavy 100-seed weight, higher fatty acid content and lower protein content of seed | (Shoudong et al., 2020) |
Other important traits related genes | GmphyA2 | Glyma.10G141400 | GmJD_10G0172700 | E4 | Normal phytochrome A gene (phyA) | (Liu et al., 2008) |
Dt2 | Glyma.18G273600 | GmJD_18G0289500 | dt2/dt2 | indeterminate | (Ping et al., 2014) |
SHAT1-5 | Glyma.16G019400 | GmJD_16G0022500 | GsSHAT1-5 | Thinner FCC secondary walls | (Dong et al., 2014) |
Evolutionary relationship between JD17 and other soybean lines
We compared the genome sequencing results of the JD17 genome and 68 lines (including 10 wild soybean lines from different regions) [2]. SNVs and small Indels were obtained (Table S2) through aligning the Illumina sequencing results of 68 soybean strains to the JD17 reference genome by using BAW [18], followed by calling variations using GATK [19]. In order to determine the evolutionary status of JD17, all these variations were merged, obtaining a total of 18,156,237 SNV sites and 2,307,757 small Indel (≤ 10 bp) sites. After quality filtering, 5,544,995 SNV sites were used to construct a neighbor-joining tree. The results showed that JD17 was closely related to Williams (PI548631). This result is consistent with that Williams was the parental ancestor of JD17 [12]. JD17 was also closely related to the cultivated varieties Fen Dou No.85, Cang Dou-11 and Jin Da No.52, which were all cultivars in the Huang-Huai-Hai region (Fig. 2).
Characterization of flower color related genes in JD17
In order to determine the genes of important agronomic traits in JD17, we investigated a lot of literature and found 140 genes in total (Table S3). Among them, 25 genes related to soybean flowering were discovered (Table S3). Six of them control and regulate soybean flower color, including W1, W2, W3, W4, Wm and Wp [20]. Cloning results showed that these six locus were F3’5’H [21], MYB [22], DFR1 [23],DFR2 [24]༌FLS1/gmfls1 [25] and F3H [26], respectively (Fig. 1). The flower color of soybeans with dominant W1W3W4 genotype is dark purple, the flower color of W1W3w4 genotype is light purple, the flower color of W1w3W4 genotype is purple, and W1w3w4 has nearly white flowers [27]. Subsequent research showed that w1w3W4 has white flowers [23]. The F3’5’H gene was cloned in Clark and B09121 strains [21]. The flower colors of the two soybean lines are purple (W1) and light purple (w1-lp). Compared with the wild-type W1, the difference between the mutant-type w1-lp is that a single base G is replaced with A at the 653 bp after the initial codon, while the mutant-type w1 has a 65 bp insertion (Fig. 3a). At the W1 locus, the sequence of JD17 is exactly the same as Clark-w1 (white flower), and there is such a 65 bp sequence insertion, indicating that the genotype of JD17 is w1w1, which is consistent with the white flower phenotype of JD17. For the two genes that regulate flower color, W3 (DFR1) and W4 (DFR2), the genotype of w3w4 is characterized by nearly white flowers, while the genotype of W3w4 is characterized by purple throat flowers [23]. The DFR1 gene was first cloned in L70-4422, L68-1774, Harosoy and Williams 82 [23]. Comparative analysis found that the genotype of JD17 was consistent with Wm82 except that there was no loss of 32 bp (Fig. 3b). Although they have this difference in sequence, their flower color is still the same, indicating that the 32 bp deletion has no effect on flower color. The W4 (DFR2) gene was cloned in Clark and kw4, Bay and E30-D-1 [24]. The comparative analysis found that the key site of JD17 gene sequence is exactly the same as that of Bay (W4) (Fig. 3c), which means that the genotype of JD17 is also W4. Therefore, we determined that the gene of JD17 is w1w3W4 with white flowers, which is the same as Wm82 [23].
The Wm (FLS/gmfls1) gene was first cloned in Harosoy (WmWm) and Harosoy-wm (wmwm). The homozygous wild-type Harosoy (WmWm) has purple flowers, while the homozygous mutant Harosoy-wm (wmwm) is magenta [25]. The change in flowers color here is due to the deletion of a single base G from FLS1 in Harosoy. In the Wm82 strain, the FLS1 gene is exactly the same as Harosoy, and they are all homozygous WmWm without single-base G deletion [25]. The gene sequence of JD17 is consistent with Wm82 and Harosoy, except for a single base A insertion in the intron region (Fig. 3d). It is inferred that JD17 should also be a white flower, which is the same as Wm82.
Characterization of flowering time-related genes in JD17
Among the genes related to flowering (Table S3), there are 19 genes related to the flowering cycle, including 3 genes FT2c [28], J gene [29], and E1 gene [30] DNA mutation sites were determined at the molecular level (Fig. 1). The comparison found that the FT2c gene in the JD17 strain is the same as the allele carrying wild-type GsFT2c (Fig. 4a). The flowering time of plants with wild-type GsFT2c was about 2.8 days earlier than that of Wm82 under short-day light, and about 2 days later under long-day light [28]. The mutations at this locus speculate that JD17 should have a short flowering cycle under short-day light, about 29 days, and about 45 days under long-day light.
The J gene was cloned in PI 159925, BR121 and Harosoy [29]. In these strains, PI 159925(j) has a single base C deletion at 1,352 bp after the start codon compared with Harosoy(J), which leading to the early termination of encoding (Fig. 4b). However, BR121(j) compared with Harosoy(J) has a 10 bp deletion at 587–596 bp after the initial codon, which resulting in a frameshift mutation. The variation in both cases delayed the flowering time under short-day sunlight, from about 32 days to about 45 days [29]. The sequence of this gene in JD17 is consistent with Wm82 and Harosoy (J), so it can be speculated that JD17 should behave to promote flowering under short-day light.
The E1 gene was first cloned in the Harosoy-E1 strain [30]. Compared with Wm82, the coding sequence of E1 gene in the Harosoy-E1 strain is G at 44 bp after the initial codon (Fig. 4c), while the mutant e1-as genotype is shown here as a missense mutation caused by a C base substitution, the genotype of e1-fs at 49 bp is a frameshift mutation caused by A base insertion, and the mutant e1-as genotype is shown here as a missense mutation caused by a C base substitution [30]. The comparative analysis showed that the genotype in the JD17 genome was consistent with e1-as, which suggested that JD17 should have a shorter flowering cycle.
Characterization of seed traits-related genes in JD17
Soybean seeds have many traits, including seed coat color, seed size, grain weight, oil content, protein content, etc., which are all very important agronomic traits.
The GmIRCHS gene is an important gene that controls the color of the seed coat. Compared with the TM strain, the study found that THM (A pigmented seed coat mutant Toyohomare) mainly caused the pigmentation of the seed coat through a 3.3-kb region deleted (Fig. 5a), from normal yellow to brown [31]. The result of sequence alignment revealed that the 3.3Kb was not lost in JD17, but about 56 kb was inserted between the adjacent J and CHS3, and the seed coat color of JD17 was still yellow [11].
The GmHs1-1 gene controls the cracking of the seed coat and is closely related to water penetration. The genotype is distinguished based on mutations at 7 sites in the promoter region of the gene [32]. The sequence comparison evidence of JD17 showed that it was completely consistent with the Wm82 sequence (Fig. 5b), the genotype was Gmhs1-1, and the phenotype was permeable seed coat.
In the seed weight study, the SoyWRKY15a gene was identified in cultivated soybean Suinong14 (SN14) and wild soybean ZYD00006. Through 6 polymorphic loci, 4 genotypes were defined, H1, H2, H3 and H4. The indel at -61 bp in the 5'UTR is a major and independent cis-motif variation (Fig. 5c) [33]. JD17 has the same genotype as Wm82, and can be determined to be H1. According to studies, the phenotype of this genotype has large seeds and relatively large leaves.
The FATB1a gene is related to the palmitic acid content of seeds. Compared with the Jack strain, there is a 254-kb deletion in the mutant N0304-303-3 (Fig. 5d), which directly affects the low-level palmitic acid content of soybean seeds, about 5% [34]. Through the result of sequence alignment, we found that there is such a sequence in JD17 (GmJD_05G0012300). The genotype is No deletion, Homozygous FATB1a, and it is speculated that the palmitic acid content phenotype of JD17 may be around 10%.
The ZF351 gene is closely related to the oil content of soybean seeds, which is mainly determined by the A/T SNP at 767 bp upstream of the gene start codon (Fig. 5e), and the expression of the GmZF351 allele with A at -767 bp is significantly higher than that of the GsZF351 allele with T at -767 bp, and promotes the increase of oil content [35]. The result of sequence alignment shows that the genotype of JD17 is GmZF351, which is consistent with Wm82, and it is speculated that the oil content is higher. GmDREBL is another gene related to oil content, and its genotype is determined by the 7 sites upstream of the gene [36]. The sequence of JD17 shows that its genotype is consistent with Wm82 (variation type 2) (Fig. 6a). Since the two genes closely related to oil content in JD17 are the same as Wm82, it is inferred that the oil content of JD17 is also the same as Wm82, which is 18.2%.
GA20OX mainly affects the 100-seed weight of seeds, and its genotype is mainly determined by 29 sites in the promoter region (Fig. 6b) [37]. Sequence alignment showed that the genotype in JD17 was completely consistent with Wm82, and it was inferred that its 100-seed weight was 15–20 g, which is consistent with the phenotype [12].
NFYA gene is also related to oil content, and the main difference in its genotype comes from whether there is an insertion of ~ 1,500 bp in the promoter region (Fig. 6c) [36]. While the sequence of JD17showed that there is such an insertion, it is speculated that the oil content of JD17 is also similar to Wm82, which is 18–22% [12].
The GmSWEET10a and GmSWEET10b genes are related to the seed 100-kernel weight, fatty acid content, and protein content. They were identified by GWAS association analysis of more than 800 soybean lines [38]. Among them, the GmSWEET10a gene distinguishes different genotypes by 5 sites upstream of the gene and 5 sites of the gene body (Fig. 6d). The sequence of JD17 is showed be H_III-3 type through aligned with the region of Wm82. GmSWEET10b mainly defines different genotypes based on the differences of 17 sites (Fig. 6e). According to the sequence characteristics of JD17, its genotype was identified as H_III-1 type. According to the phenotypic statistics in the article [38], we can roughly speculate that the seed weight of JD17 is about 18 g, the fatty acid content is about 20%, and the protein content is about 36%[12].
Characterization of genes underlying other important traits
There are also genes related to other important traits, such as the gene GmphyA2 related to pigment synthesis [39], the gene Dt2 related to stem growth [40], the gene SHAT1-5 (Fig. 1), which is closely related to fruit pod dehiscence [41]. And the genotypes and related phenotypes of JD17 were identified and predicted respectively based on the characteristics of these published genotypes.
In TK780 and wild soybean Hidaka 4 (H4), the insertion of the repetitive sequence GTTTT changed the GmphyA2 genotype from E4 to e4 (Fig. 7a), resulting in yellow flowers appearing on plants under continuous far-red light [39]. By determining that the genotype of JD17 is consistent with the wild type H4, it is inferred that its phenotype is also consistent with that of the wild type H4.
The genotype of Dt2 is mainly determined by the differences in the three sites within the three genes (Fig. 7b) [40]. Although there are some differences between the sequence of JD17 and Wm82 in these regions, these three key sites are consistent. It is concluded that the genotype of JD17 is consistent with Wm82 as dt2/dt2, and the predicted stem growth phenotype is indeterminate.
SHAT1-5 gene is a gene related to the control of pod dehiscence identified in cultivated soybean HEINONG44 and wild soybean ZYD00755, mainly by promoting the thickening of the secondary wall of fibrous cap cells to inhibit pod dehiscence [41]. The genotype of JD17 was inferred to be GmSHAT1-5 based on the deletion of its sequence (Fig. 7c). It is speculated that the pod phenotype of JD17 is the same as that of cultivated soybean HEINONG44, and it is easy to crack.