Correlation between wet chemistry and energy-dispersive X-ray fluorescence method
To validate our chosen analytical method (energy-dispersive X-ray fluorescence, ED-XRF), thirty samples were analyzed by both ED-XRF and flame atomic absorption spectroscopy (FAAS) for Ca and K and by spectrophotometry for P content. As can be seen in Figure 1, the correlation coefficients (r) between both methods were positive and highly significant (P < 0.001) and ranged from 0.91 (Ca) to 0.94 (P). These results demonstrated that the ED-XDF was appropriate for the quantification of Ca, P and K content in soybean seeds.
Figure 1: Pearson correlation between wet chemistry and ED-XRF for Ca, K, P and S content on a dry-weight basis among 30 soybean seed samples.
Phenotypic variation and correlations among traits
The concentrations of Ca, K, P and S on a set of 137 soybean lines grown on two sites (two replicates/site) in 2013 were estimated using an ED-XRF device. The frequency distributions exhibited an approximately normal distribution and thus appeared to be quantitatively inherited (Figure 2). As shown in Table 1, the range of seed mineral content varied for the four elements: from 1.6 to 2.4 mg/g for Ca, 17 to 21 mg/g for K, 4.5 to 6.5 mg/g for P and 3.5 to 5.5 mg/g for S content on a dry-weight basis. Across all 137 lines, the means were 1.8, 18.7, 5.3 and 4.3 mg/g respectively for Ca, K, P and S content. The least significant difference (LSD) between two genotype means was 0.03 mg/g for Ca, 0.44 mg/g for K, 0.28 mg/g for P and 0.09 mg/g for S content. A high broad-sense heritability was observed and ranged from 81% (K) to 99% (S). The presence of a fairly large phenotypic variation and high heritability suggested that these traits and association panel would be well suited to uncover the genetic architecture of these traits.
Figure 2: Distribution of Ca, K, P and S content in the seed of 137 Canadian soybean lines.
Table 1: Descriptive statistics for Ca, K, P and S content across two sites (two replicates per site) in the seed of 137 Canadian soybean lines
Traits
|
Range
|
Mean
|
LSD
|
H2 (%)
|
Ca
|
1.6 - 2.4
|
1.8
|
0.03
|
84
|
K
|
17.0 - 21.0
|
18.7
|
0.44
|
81
|
P
|
4.5 - 6.5
|
5.3
|
0.28
|
83
|
S
|
3.5 - 5.5
|
4.3
|
0.09
|
99
|
As illustrated in Table 2, an analysis of variance showed that both the genotype and environment had a highly significant effect (P £ 0.001) on phenotypic variation for all traits except for Ca where the genotypic effect was the sole significant source of variation. No significant genotype x environment interactions were observed for any of the traits. The observed phenotypic values were significantly (p < 0.001) correlated between the two experimental sites, with correlations ranging between 0.75 and 0.98. The seed content in the different minerals also proved to be correlated (Additional file 1: Table S1, in bold). All such pairwise comparisons were statistically significant (p < 0.05) and the highest correlations were observed between K and S (r2 = 0.67, p < 0.001) as well as between P and K (r2 = 0.65, p < 0.001).
Table 2: ANOVA results for Ca, K, P and S content across two sites (two replicates per site) in seed of 137 Canadian soybean lines.
Nutrient
|
Source of variation
|
df
|
F values
|
p-values
|
Ca
|
Genotype
|
136
|
4.69
|
< 0.0001***
|
Environment
|
1
|
0.47
|
= 0.4900 ns
|
Genotype x Environment
|
136
|
0.17
|
= 1.0000 ns
|
K
|
Genotype
|
136
|
2.72
|
< 0.0001***
|
Environment
|
1
|
23.88
|
< 0.0001***
|
Genotype x Environment
|
136
|
0.34
|
= 1.0000 ns
|
P
|
Genotype
|
136
|
15.32
|
< 0.0010 **
|
Environment
|
1
|
11.01
|
< 0.0010 **
|
Genotype x Environment
|
136
|
0.08
|
= 1.0000 ns
|
S
|
Genotype
|
136
|
19.46
|
< 0.0001***
|
Environment
|
1
|
15.32
|
< 0.0001***
|
Genotype x Environment
|
136
|
0.10
|
= 1.0000 ns
|
*** and ** = Significant, p < 0.0001 and 0.001; ns = not significant, p > 0.05.
Genotyping and SNP calling
The lines of the association panel were initially genotyped via a GBS approach that yielded a total of 56K high-quality SNPs. In a second step, a reference panel of 4.3 M SNPs was used to perform missing loci imputation onto the original set of GBS-derived SNPs. After removing InDels, markers with a MAF < 0.05 and heterozygosity > 0.1, a total of 2.18M SNPs were retained, offering an average marker density of 1 SNP every 435 bases across the entire genome. The physical distribution of these 2.18M SNPs across the soybean 20 chromosomes is illustrated in Additional file 2: Figure S1. The genotypic data thus obtained was then used to characterize population structure within this panel and to look for marker-trait associations.
Population structure
The population structure of this core set of 137 Canadian soybean lines was initially inferred using fastSTRUCTURE and the optimum delta K was observed to range between 6 and 9 subpopulations, with trivial differences between these estimates; therefore, the optimum delta K was chosen to be 7 (Figure 3a). In addition, as it can be seen in Figure 3b and 3c, both results of the phylogenetic tree analysis and the PCA-based population structure analysis were consistent with fastSTRUCTURE. Together, these results suggested that K = 7 provided a good assessment of population structure and the corresponding Q matrix was used for GWAS.
Figure 3:Models-based population structure in a core set of 137 Canadian soybean lines.
a: Classification into seven populations using fastSTRUCTURE where each individual (from 1 to 137) is represented by a single vertical line and each color represents one cluster. b: Bootstrap consensus phylogenetic tree (2,000 replicates) constructed using MEGA 7; each colour represents a subgroup and seven subgroups were found in total. c: PCA eigenvalues computed using GAPIT. As presented the total variance explained by each principal component (PC) decreased from PC1 to PC7 and, beyond PC7, the variance explained by each further PC remained low and stable.
Genome-wide association scan for mineral elements content in soybean seeds
To discover chromosomal regions that contribute to the phenotypic variation, we used three analytical tools to measure marker-trait associations: FarmCPU, CMLM and MLMM. Each of these used a single value (BLUP) for each trait and took into account both population structure and relatedness between the lines (Q and K matrices). As shown in the quantile-quantile (QQ) plots in Additional file 2: Figure S2, all three models successfully limited the confounding effects as the observed p-values only diverged from the diagonal (expected p-values) at the most extreme values (beyond 3E-03 for almost all traits).
To identify distinct QTLs, we characterized haplotype blocks around peak SNPs and concluded that all markers falling within the same haplotype block as the peak SNP essentially marked the same QTL. Within each block, the most highly associated SNP was chosen to represent the QTL. When two different peak SNPs were detected within the same haplotype block (by different models), the most common peak SNP was chosen. In addition, when the same peak SNP was detected by at least two models the lowest uncorrected p-value was reported.
The results of these association analyses are presented as Manhattan plots for FarmCPU, CMLM and MLMM in Figure 4. Based on the threshold for false discovery rate (blue horizontal line, FDR £ 0.05), we detected 32 QTLs of which seven were associated with Ca content, ten with K, five with P and ten with S content (Additional file 1: Table S2). Interestingly, one shared QTL contributing to both both K (K_#1) and P (P_#1) was observed. The uncorrected p-values of these QTLs ranged from 1.35E-06 to 2.84E-21 for Ca, from 1.89E-05 to 8.05E-19 for K, from 1.17E-06 to 3.61E-12 for P and from 1.75E-05 to 6.63E-15 for S content.
In total, among these 32 QTLs, eight QTLs were co-identified by at least two models (Figure 5) and the features of these eight QTLs are summarized in Table 3. The portion of phenotypic variance explained (R2) ranged from 20 to 21% for Ca, from 17 to 31% for K, 22% for P and from 18 to 23% for S. The magnitude of allelic effects varied between 0.06 to 0.07 mg/g, 0.30 to 0.57 mg/g, 0.30 mg/g and 0.15 to 0.46 mg/g for Ca, K, P and S, respectively. The genetic variance (additive) explained as the narrow-sense heritability (h2) was 41 % for Ca, 82% for K, 78% for P and 93% for S.
Figure 4: Manhattan plots for (a) calcium (b) potassium, (c) phosphorus and (d) sulfur content in a core set of 137 Canadian soybean accessions using three models for measuring marker-trait associations. Each dot indicates the degree of association between a single marker and a trait (y-axis) while the x-axis shows the physical position of each marker. A blue horizontal bar indicates the significance threshold (FDR £ 0.05) and significantly associated markers are coloured in red for FarmCPU, (+) for CMLM and (*) for MLMM.
Figure 5: Venn diagram for the 32 identified QTLs through three analytical approaches.
Table 3: List of QTLs for mineral element content identified by at least two statistical models in 137 Canadian soybean lines. The most highly associated SNP within each QTL is indicated along with the associated statistics. For each trait, a measure of its heritability (h2) is provided. The models that detected a significant marker-trait association are abbreviated as follows: C for CMLM, M for MLMM and F for FarmCPU.
Gm
|
Peak SNP
|
QTL N°
|
p-value
|
FDR
|
R2%
|
Effect
|
h2%
|
Models
|
06
|
3,354,869
|
Ca_#3
|
2.94E-08
|
4.5E-03
|
20
|
-0.06
|
41
|
C/M
|
09
|
6,092,970
|
Ca_#4
|
3.70E-08
|
4.5E-03
|
21
|
-0.07
|
C/M/F
|
04
|
49,071,552
|
K_#1
|
1.75E-06
|
6.1E-03
|
17
|
-0.30
|
82
|
C/F
|
10
|
1,925,709
|
K_#3
|
4.31E-10
|
4.9E-05
|
31
|
-0.57
|
C/M/F
|
04
|
49,071,286
|
P_#1
|
6.12E-08
|
1.5E-02
|
22
|
-0.30
|
78
|
C/M/F
|
10
|
1,602,998
|
S_#4
|
2.84E-08
|
4.0E-03
|
23
|
0.46
|
93
|
C/M
|
15
|
3,986,243
|
S_#7
|
2.80E-07
|
2.3E-02
|
19
|
0.15
|
M/F
|
20
|
39,076,484
|
S_#10
|
9.13E-07
|
9.7E-03
|
18
|
0.20
|
C/M/F
|
Validation of the eight co-identified QTL across three environments
To verify the stability of each of the eight QTLs detected by at least two models, data from three additional trials were obtained. Given the difference in original purpose and experimental design of this second set of trials, we divided the population into two groups according to the allelic genotype at each QTL (peak SNP) and tested whether the mean phenotypes of the two genotypic classes were significantly different using a t-test (Additional file 2: Figure S3 and Additional file 1: Table S3). Overall, across the three new environments, seven QTLs were validated in at least two environments (I_17 and N_18). Only QTL#4 for Ca could not be validated in any of the three new environments. The I_18 environment saw the lowest rate of validation with five QTLs being successfully detected in this environment (Figure 6). Of the 24 possible QTL-environment combinations (8 QTLs x 3 environments), 18 resulted in a significant difference between the mean phenotype of lines contrasting for the peak SNP. These results indicate that the identified QTLs are robust across a wide range of environments.
Figure 6: Stability of the eight QTLs detected by at least two models for Ca, K, P and S content.
The core set of 137 early Canadian soybean accessions were grown in three additional environments (in 2017 or 2018, with or without supplemental irrigation). The phenotypic mean was calculated for the subsets of lines contrasting for the peak SNP at each of 8 QTLs previously detected by at least two of the three GWAS models. Each colored dot represents the p-value for the contrast observed in one environment. The y-axis shows the -log10(p-value) of each test while the x-axis shows the reported QTLs associated with each trait. A red horizontal bar indicates the Bonferroni significance threshold at alpha £ -log10 (0.05/n), where n = number of co-identified QTLs per trait. (e.g. 0.05/2 for Ca).
Refinement of the GWA scan for co-identified QTL
To more deeply explore variants in these robust QTLs, we extracted all SNPs falling within the haplotype blocks surrounding the seven most robust QTLs from the larger catalogue of 2.2M SNPs. These were merged with the pruned data (243K) set to perform the GWAS with three models again. In six of these seven instances, stronger association signals were observed and the physical distance between the previous and the new peak SNP ranged from 1 to 311 kb (Additional file 1: Table S4), but always resided within the same haplotype block.
Prediction of candidate genes within the robust QTL regions
Based on the GWAS results, we investigated the genes annotated in the soybean genome in order to identify putative candidate genes from loci significantly associated with each trait. To establish a list of candidate genes, we focused only on those residing within a region delimited by the left-most and right-most flanking markers that were in perfect LD (D’ =1) with the peak SNP for the seven QTLs described above. These genomic regions (ranging in size between 32 and 360 kb) were extracted from Wm82.a2.v1 and their GO annotations examined. An example of this approach is illustrated in Figure 7. The number of genes residing (fully or in part) in each region varied between 4 and 43 and the full list of these genes and their annotations are provided in Additional file 1: Table S5.
To identify a candidate gene, we looked for genes that met either of the two following criteria: 1) genes annotated as being involved in the transport of the given mineral element and expressed in roots, shoots or leaves or 2) genes annotated as being involved in the uptake, translocation, and/or homoeostasis of the element of interest and mainly expressed in seeds. In total, three promising candidate genes involved either in the transport or assimilation of these mineral elements were identified. We first discovered Glyma.06G046000 (132 kb upstream of the peak SNP in Ca_#3), Glyma.10G020000 (222 kb downstream of the peak SNP in K_#3). These two genes were both annotated as being involved in transport and expressed in roots tips and roots hairs. In addition, Glyma.06G046000 was expressed in young leaves, flowers, main roots, pods as well as in seeds (Additional file 2: Figure S4b and 4d.). Finally, Glyma.20G151500 (32 kb downstream of S_#10) was annotated as being involved in sulfate assimilation and expressed in flowers, root, nodule and seeds (Additional file 2: Figure S4f). No candidate gene falling within the defined LD blocks and meeting our criteria was found for QTLs K_#1, P_#1, S_#4 and S_#7.
Figure 7: Identification of a candidate gene underlying QTL S_#10 within the haplotype block on chromosome 20. Top panel: marker-trait associations within a ~ 80-kb interval (39.027 – 39.106 Kb) of Gm20. Middle panel: position and orientation of four gene models present in the 35-kb region that is defined by the left-most (Gm20: 39,042,071) and right-most (Gm20: 39,076,880) markers that are in perfect LD with the peak SNP (Gm20:39,076,484). The most likely candidate gene (Glyma.20G151500, Sulfate assimilation) is highlighted with a green asterisk. Bottom panel: pairwise LD among markers falling within the defined genomic region of interest. LD is indicated as D’x100 and the empty squares indicate complete LD (D’=1). The position of the peak SNP (blue arrow) and candidate gene (green arrow) are shown.
Table 4: Identification of candidate genes for seven QTLs associated with mineral element content in a core set of 137 Canadian soybean lines. For each robust QTL (detected using multiple models in many environments), a region of interest was delimited by flanking markers in perfect LD with the peak SNP. The identifier and annotation of candidate genes residing within the relevant genomic regions are provided.
Gm
|
QTL
|
Peak SNP
|
Size of LD
block
|
# of
genes
|
Candidate
gene
|
Relevant
annotation
|
06
|
Ca_#3
|
3,354,869
|
199kb
|
30
|
Glyma.06G046000
|
Calcium ion transport
|
04
|
K_#1
|
49,071,552
|
32kb
|
4
|
NA
|
NA
|
10
|
K_#3
|
1,966,469
|
360kb
|
43
|
Glyma.10G020000
|
Potassium ion transport
|
04
|
P_#1
|
49,071,286
|
32kb
|
4
|
NA
|
NA
|
10
|
S_#4
|
1,602,998
|
162kb
|
18
|
NA
|
NA
|
15
|
S_#7
|
3,986,243
|
158kb
|
20
|
NA
|
NA
|
20
|
S_#10
|
39,076,484
|
35kb
|
04
|
Glyma.20G151500
|
Sulfate assimilation
|
Structural and nucleotide variation within candidate genes and their predicted functional impact
To determine if genetic (structural or nucleotide) variation within or overlapping the candidate gene could constitute causal variants, we examined a catalogue of such variation established from the whole-genome sequencing data available for a subset of 56 lines. No structural variant (> 51 bp) was identified as overlapping in full or in part with these three candidate genes. As for nucleotide variants, a total of 18 SNPs were found within the coding regions of two genes (one within Glyma.06G046000 and 17 within Glyma.20G151500). All of these variants were predicted as a having a “modifier” or “low” impact on protein function. It is therefore unlikely that the observed phenotypic variation is due to a loss of function of these candidate genes.