Phenotypic variation in maturity, plant height, seed weight and yield in soybean germplasm
Data on maturity were collected over 3 years on a total of 250 soybean genotypes. Average maturity was slightly stable over years. Average maturity values were 118.1 days, 106.4 days, and 103.8 days in 2008, 2009, and 2010, respectively (Table 1). Average maturity was in the range of 77.3 days to 139.0 days with an average of 109.5 days for the 3-year study (Table 1) (Fig. 1). Maturity was significantly different between years (F value = 600, p-value < 0.0001) and between genotypes (F value = 5.67, p-value < 0.0001) (Table 2). The top 5 soybean genotypes with the longest average maturity were Hualvdou (123.9 days), Miyangxiaozihuang (122.8 days), Miyangbaidou (121.9 days), Huanggandou (120.4 days), and Zhongnongshi-1 (119.8 days), whereas the ones with the shortest average maturity were Huangdou 2 (92.7 days), Xiaohuangdou (92.3 days), Zhongdou 36 (91.6 days), Shuyangchunheidoubing (88.7 days), and PI159764 (79.8 days) (Table 3).
Plant height was 90.5 cm, 101.8 cm, and 76.1 cm in 2008, 2009, and 2010, respectively (Table 1). Plant height ranged between and 18.5 cm and 162.9 cm over 3 years (Table 1) (Fig. 1). Significant differences in plant height were found between years (F value = 496.08, p-value < 0.0001) and between genotypes (F value = 18.51, p-value < 0.001) (Table 2). Based on the 3-year pooled data, the tallest genotypes were Lvpihuangdou (152.7 cm), Miyunlaoyelian (143.3 cm), Jindou 21 (139.5 cm), Xiaheidou (138.4 cm), and Sijiaoqi (133.7 cm), whereas the shortest ones consisted of Elf (38.3 cm), Gnome 85 (35.9 cm), Kaohsiung 1 (31.6 cm), Zhonghuang 18 (30.0 cm), and PI159764 (20.2 cm) (Table 3).
Average data on seed weight were, in g per 100 seeds, 17.3, 18.8, and 16.8 in 2008, 2009, and 2010, respectively (Table 1) (Fig. 2). Seed weight varied from 18.5 (g/100 seeds) to 162.9 (g/100 seeds) over 3 years. Seed weight was significantly different between years (F value = 135.67, p-value < 0.0001) and between genotypes (F value = 28.49, p-value < 0.0001) (Table 2). Overall, the genotypes with the highest seed weight were Zhongpindaheidou (34.0 g/100 seeds), KD01 (33.7 g/100 seeds), Lv 75 (32.6 g/100 seeds), Lvrouheipidou (31.6 g/100 seeds), and Verde (29.8 g/100 seeds), whereas those with the lowest seed weight were Liushiribaidou (9.5 g/100 seeds), Qingdou (9.2 g/100 seeds), Xiaoheidou (8.3 g/100 seeds), Shuyangchunheidoubing (6.9 g/100 seeds), and Banyesheng (4.3 g/100 seeds) (Table 3).
Average soybean yield was 2309.3 kg/hm2, 2207.9 kg/hm2, and 2212.1 kg/hm2 in 2008, 2009, and 2010, respectively (Table 1). Over the 3 years, the highest recorded yield was 4286.6 kg/hm2, whereas the lowest one was 649.4 kg/hm2 (Table 1). Significant differences in yield were identified between years (F value = 5.32, p-value = 0.0052) and between genotypes (F value = 4.2, p-value < 0.0001) (Table 2). The high-yielding genotypes were Delsoy 4710 (3406.1 kg/hm2), Franklin (3204.9 kg/hm2), Lu 99 − 1 (3195.8 kg/hm2), Newton (3134.7 kg/hm2), and Lu 99 − 7 (3127.9 kg/hm2), whereas the accessions with the lowest yield for the 3-year average were Xiaoheidou (1251.8 kg/hm2), Banyesheng (1173.5 kg/hm2), Verde (1114.1 kg/hm2), PI159764 (784.4 kg/hm2), and Kim (655.1 kg/hm2) (Table 2).
Correlation between maturity, plant height, seed weight, and yield
Pearson’s correlation coefficients (r) between maturity, plant height, seed weight, and yield were computed. Maturity was almost not correlated with seed weight (r = 0.054) and yield (r = 0.019), respectively (Table 4). However, a medium correlation was found between maturity and plant height (r = 0.439). A low correlation was identified between seed weight and yield (r = 0.080). Seed weight was negatively correlated with plant height (r=-0.257). Plant height was not correlated with yield (r=-0.004) (Table 4).
For each trait, Pearson’s correlation coefficients were also calculated between the three years. The correlation between years was high for maturity, seed weight, and plant height, and relatively moderate for yield. For maturity, the highest correlation was between data from 2008 and 2010 (r = 0.763) (Table 4), whereas the lowest one was between 2009 and 2010 (r = 0.657) (Table 4). For seed weight, the data were highly consistent over years since the lowest correlation was 0.876, corresponding to seed weight in 2008 and 2010 (Table 4). Yield data varied from year to year where the highest correlation was found between 2009 and 2009 (r = 0.604), and the lowest one between 2008 and 2010 (r = 0.484) (Table 4). Similarly to seed weight, data were also consistent over years for plant height where the lowest correlation was 0.848 for the data obtained in 2009 and 2010 (Table 4).
Population structure and genetic diversity analysis
SNP filtering yielded a total of 10,259 high-quality SNPs that were used for GWAS and GS. SNP number per chromosome varied from 292 (chromosome 12) to 785 (chromosome 18), with an average SNP number of 513 per chromosome (Table 5). There was also a large variation in the SNP density per chromosome with chromosome 16 having the lowest SNP density (73.29 kb between two adjacent SNPs) and chromosome 12 having the highest one (136.50 kb) (Table 5). The average distance between two adjacent SNPs was 94.86 kb. The average minor allele frequency (%) per chromosome ranged between 16.24% and 24.73%, with an average of 21.02% (Table 5). The average percentage of heterozygous SNPs per chromosome varied from 0.42–0.74%, with an average of 0.59%. There was not significant discrepancy in the average percentage of missing SNPs per chromosome (Table 5).
STRUCTURE Harvester indicated a delta K peak at K equal to 2 (Fig. 3A), indicating that the panel involving the 250 soybean genotypes consisted of two subpopulations. The bar plot from STRUCTURE 2.3.4 showed the two-well differentiated subpopulations where the subpopulation 1 was shown in red and the subpopulation shown in 2 in green (Fig. 3B). The first cluster consisted of 146 genotypes, accounting for 57.2% of the total population. The second subpopulation involved a total of 103 genotypes, which corresponded to 40.4% of the whole panel. In addition to the two distinct subpopulations, an admixture consisting of 6 genotypes was also identified and accounted for only 2.4% of the soybean panel used in this study. The mean inbreeding coefficients of the subpopulation relative to the total population were 0.565 and 0.043 for the subpopulation 1 and subpopulation 2, respectively (Table S2). The average distances between individuals in the same cluster were 0.332 (subpopulation 1) and 0.154 (subpopulation 2) (Table S2). The overall proportions of membership of a genotype within each cluster were 0.567 and 0.433 for subpopulation 1 and subpopulation 2, respectively (Table S2). The average allele frequency divergence among populations was 0.095 (Table S2).
A genetic diversity analysis was also carried out along with population structure as shown in Fig. 3C. In the phylogenetic tree, the subpopulation 1 was displayed using solid red circles, the subpopulation 2 was shown using solid green circles, and the admixture using solid blue circles (Fig. 3C). A good correlation was found between the genetic diversity analysis and the structure analysis. A large number of genotypes belonging to the subpopulation 1 was placed on the left-hand side of the phylogenetic tree (Fig. 3C), whereas most of the genotypes grouped into subpopulation 2 were clustered on the right part of the phylogenetic tree (Fig. 3C). The admixture genotypes were randomly scattered in the phylogenetic tree.
Genome-wide association study (GWAS) in maturity, plant height, seed weight and yield
Maturity
The significant SNPs associated with maturity varied between years (Fig. 4). When the maturity data were combined over three years, a total of 20 SNPs associated with maturity were found (Table 6). These SNPs were distributed across the soybean genome (Fig. 4D) with chromosome 20 having the most significant SNPs associated with maturity (Table 6). The top five SNPs with the highest LOD values were Chr10_45903960 (LOD = 10.47, MAF = 38.31%), Chr13_33362588 (LOD = 6.83, MAF = 6.61%), and Chr09_30962080 (LOD = 6.53, MAF = 16.60%), Chr08_3672982 (LOD = 6.32, MAF = 22.53%), and Chr01_10725106 (LOD = 6.22, MAF = 14.04%), which were located on chromosomes 10, 13, 9, 8, and 1, respectively (Table 6). The t-test analysis conducted to compare the genotypic class from the aforementioned SNPs was significant expect for Chr13_33362588 (Figs. 8A-E). The SNP Chr10_45903960 was the only significant SNP which was consistent across three years with LOD values equal to 9.95, 3.22, 3.86, and 10.47 in 2008, 2009, 2010, and the combined data, respectively (Tables 6 and S3).
A total of 21, 16, and 33 SNPs associated with maturity were found in 2008, 2009, and 2010, respectively (Figs. 4A-4C) (Table S3). The top five SNPs found in 2008 were Chr20_43647960 (LOD = 11.87, MAF = 8.27%), Chr10_45903960 (LOD = 9.95, MAF = 38.31%), Chr01_9100366 (LOD = 7.92, MAF = 15.26%), Chr02_33386127 (LOD = 6.31, MAF = 27.60%), and Chr13_23406473 (LOD = 5.68, MAF = 5.60%), and located on chromosomes 20, 10, 1, 2, and 13, respectively (Fig. 4A) (Table S3). In 2009, the most significant SNPs associated with maturity were Chr04_46043518 (LOD = 7.77, MAF = 6.75%), Chr17_7918542 (LOD = 6.97, MAF = 13.01%), Chr03_35339536 (6.92, MAF = 34.44%), Chr16_5522373 (LOD = 5.93, MAF = 12.16%), and Chr08_19444425 (LOD = 5.52, MAF = 6.07%), which were found on chromosomes 4, 17, 3, 16, and 8, respectively (Fig. 4B) (Table S3). The data from 2010 provided with the largest number of SNPs among the three years. The top SNPs found from the 2010 maturity were Chr17_11801370 (LOD = 4.10, MAF = 20.78%), Chr03_2051343 (LOD = 3.90, MAF = 33.73%), Chr04_42265893 (LOD = 3.90, MAF = 15.97%), Chr10_45903960 (LOD = 3.87, MAF = 38.31%), and Chr12_38319250 (LOD = 3.87, MAF = 11.65%), located on chromosomes 17, 3, 4, 10, and 12, respectively (Fig. 4C) (Table S3). Despite the data from 2010 having the largest number of significant SNPS, the LOD values were slightly lower than the 2 other years. The SNP Chr17_7918542 was found in both 2008 and 2010. Of the 21 SNPs found in 2008, 6 were located on chromosomes 11 and 17 with 3 SNPs each (Table S3). In 2009, chromosome 4 harbored the most SNP (Table S3), and 6 SNPs were located on chromosome 6 for the data from 2010 (Table S3).
Plant height
The number and location of SNPs associated with plant height varied between years (Fig. 5). The data from 2009 suggested a total of 43 SNP markers associated with plant height, whereas the one from 2008 had a total of 18 significant SNPs (Table S3). On average, the significant SNPs associated with the 2010 plant height data had higher LOD values than the two other years (Table S3). The top five significant SNPs associated with plant height in 2008 were Chr05_4382157 (LOD = 11.80, MAF = 12.65%), Chr06_20321899 (LOD = 10.12, MAF = 46.64%), Chr19_44603046 (LOD = 7.56, MAF = 10.36%), Chr03_35230252 (LOD = 7.13, MAF = 22.13%), and Chr19_45769759 (LOD = 5.71, MAF = 23.53%), which were found on chromosomes 5, 6, 19, 3, and 19, respectively (Fig. 5A) (Table S3). Out of the 18 significant SNPs found for the 2008 data, 4 were mapped on chromosome 10 within a 600-Kb distance. In 2009, the top five SNPs with the highest LOD values were Chr05_4341777 (LOD = 12.56, MAF = 18.11%), Chr19_45769759 (LOD = 8.70, MAF = 23.53%), Chr02_33386127 (LOD = 6.72, MAF = 27.60%), Chr05_27107594 (LOD = 6.32, MAF = 6.53%), and, Chr19_44603046 (LOD = 6.04, MAF = 10.36%), located on chromosomes, 5, 19, 2, 5, and 19, respectively (Fig. 5B). A total of 8 SNPs out of the 43 found in 2009 were located on chromosome 5. The SNPs with the highest LOD values associated with plant height in 2010 were Chr19_45359939 (LOD = 22.63, MAF = 20.24%), Chr05_4341777 (LOD = 9.72, MAF = 18.11%), Chr19_45270675 (LOD = 7.02, MAF = 14.06%), Chr08_3672982 (LOD = 6.85, MAF = 22.53%), and Chr02_11936018 (LOD = 6.78, MAF = 14.52%), which were located on chromosomes 19, 5, 19, 8, and 2, respectively (Fig. 5C) (Table S3). The two SNPs on chromosome 19 were within a distance of 80 Kb. The SNP marker Chr20_34864638 was the only one which was consistent over three years (LOD_2008 = 3.41, LOD_2009 = 4.59, LOD_2010 = 3.40, MAF = 21.37%). The SNPs Chr19_45769759, Chr19_44603046, and Chr05_27107594 were significant in both 2008 and 2009, and the SNP Chr05_4341777 was significant in both 2009 and 2010 (Table S3).
When the plant height data were combined over three years, a total of 31 SNPs were identified (Table 6). The top five SNPs associated with the combined data were Chr19_45769759 (LOD = 11.78, MAF = 23.53%), Chr05_4341777 (LOD = 7.97, MAF = 18.11%), Chr20_43000412 (LOD = 6.76, MAF = 43.83%), Chr19_45270675 (LOD = 6.60, MAF = 14.06%), and Chr19_44603046 (LOD = 6.21, MAF = 10.36%), which were found on chromosomes 19, 5, 20, 19, and 19, respectively (Fig. 5) (Table S3). The t-test analysis conducted to compare the data from the two genotypic classes defined by the aforementioned SNPs was significant (Figs. 8F-J). Despite the lack of consistency between the SNPs associated with plant height over three years, the region 44 Mb- 45 Mb on chromosome 19 displayed significant SNPs regardless of the year (Fig. 5), indicating that this region could contain important QTL(s) for plant height in soybean.
Seed weight
Similarly to the pattern of SNPs found for maturity and plant height, the SNP number and location varied between years for seed weight (Fig. 6). For the combined data (2008, 2009, and 2010), a total of 37 SNPs were found to be associated with seed weight (Table 7). The SNP Chr04_36949349 was the only one which was consistent across three years (LOD_2008 = 13.55, LOD_2009 = 9.01, LOD_2010 = 26.76, MAF = 16.47%) (Table S3). The top five significant SNPs were Chr04_36949349 (LOD = 22.58, MAF = 16.47%), Chr09_42124679 (LOD = 12.97, MAF = 35.90%), Chr01_42908212 (LOD = 10.47, MAF = 5.14%), Chr17_15088391 (LOD = 9.68, MAF = 23.08%), and Chr08_47483065 (LOD = 7.85, MAF = 25.83%), which were located on chromosomes 4, 9, 1, 17, and 8, respectively (Fig. 6) (Table 7). T-test analysis was significant for genotypic class comparison for these SNPs (Figs. 8K-O). Out of the 37 SNPs associated with the average seed weight over three years, 10 were located on chromosome 10 (Table 7).
A total of 29 SNPs were significantly associated with seed weight in 2008 (Table S3). Of the 29 SNPs, 7 were located on chromosome 10 with a genomic region of about 7 Mb. The top significant SNPs were Chr09_42124679 (LOD = 14.27, MAF = 35.90%), Chr04_36949349 (LOD = 13.55, MAF = 16.47%), Chr11_18308497 (LOD = 7.82, MAF = 16.61%), Chr18_54032147 (LOD = 6.66, MAF = 6.69%), and Chr08_47483065 (LOD = 6.44, MAF = 6.69%), located on chromosomes 9, 4, 11, 18, and 8, respectively (Fig. 6A) (Table S3). The data from 2009 suggested a total of 30 SNPs associated with seed weight. The SNPs with the highest LOD values were Chr09_3931828 (LOD = 26.87, MAF = 6.91%), Chr02_23396978 (LOD = 17.05, MAF = 5.67%), Chr18_53835731 (LOD = 14.09, MAF = 14.96%), Chr13_34849223 (LOD = 13.62, MAF = 11.62%), and Chr11_32073461 (LOD = 13.27, MAF = 33.60%), found on chromosomes 9, 2, 18, 13, and 11, respectively (Fig. 6B) (Table S3). Interestingly, chromosome 9 harbored the largest number of SNPs for 2008, whereas chromosome 10 had the largest number of SNPs for the data from 2008 (Figs. 6A and 6B). The results from 2010 revealed the highest number of SNPs among the three years, where a total of 42 SNPs were found to be associated with seed weight (Fig. 6C). The top significant SNPs were Chr04_36949349 (LOD = 26.76, MAF = 16.47%), Chr02_19239630 (LOD = 15.27, MAF = 5.98%), Chr07_38110956 (LOD = 14.65, MAF = 33.05%), Chr09_42124679 (LOD = 12.65, MAF = 35.90%), and Chr08_47483065 (LOD = 10.16, MAF = 25.83%), which were found on chromosomes 4, 2, 7, 9, and 8, respectively (Table S3). The SNPs Chr09_42124679 and Chr08_47483065 were consistent in both 2008 and 2010, and the SNP Chr17_15088391 overlapped between the data from 2009 and 2010 (Table S3).
Yield
The number of significant SNPs associated with yield was consistent over years. A total of 17, 17, and 16 significant SNPs were found in 2008, 2009, and 2010, respectively (Table S3) (Fig. 7). However, the SNP locations were different between years. Only two significant SNPs were overlapping between years for the yield phenotype. The SNP Chr14_191942 was found in both 2009 and 2010, and the SNP Chr16_30202844 was identified in both 2008 and 2009. For the 2008 yield data, the top significant SNPs were Chr01_33020051 (LOD = 14.74, MAF = 9.02%), Chr18_1543178 (LOD = 13.59, MAF = 5.69%), Chr19_44507961 (LOD = 10.95, MAF = 24.80%), Chr13_7416534 (LOD = 7.25, MAF = 30.57%), and Chr07_7610107 (LOD = 7.13, MAF = 49.17%), located on chromosomes 1, 18, 19, 13, and 7, respectively (Fig. 7A).
For the 2009 data, the top SNPs associated with yield were Chr10_44639359 (LOD = 8.78, MAF = 29.41%), Chr11_10192448 (LOD = 5.27, MAF = 22.75%), Chr05_4106987 (LOD = 5.16, MAF = 14.22%), Chr18_51842219 (LOD = 4.98, MAF = 10.55%), and Chr16_30202844 (LOD = 4.93, MAF = 17.14%), which were found on chromosomes 10, 11, 5, 18, and 16, respectively (Table 7) (Fig. 7B). The average LOD values of the significant SNPs found in 2009 was lower than that of found in 2008 (Table S3).
The average LOD values of the significant SNPs in 2010 was also lower than that of found in 2008 but was almost similar to the 2009 results (Table S3). The SNP markers for the 2010 yield data were scattered across the soybean genome (Fig. 7C). The top significant SNPs associated with yield in 2010 were Chr08_43763537 (LOD = 9.83, MAF = 35.18%), Chr05_5758793 (LOD = 8.64, MAF = 6.69%), Chr18_7828843 (LOD = 7.90, MAF = 34.71%), Chr03_20676666 (LOD = 7.29, MAF = 14.16%), and Chr14_191942 (LOD = 6.38, MAF = 13.65%) (Table S3), which were located on chromosomes 8, 5, 18, 3, and 14, respectively (Fig. 7C).
When the yield data over three years were combined, a total of 23 significant SNPs were identified (Table 7) (Fig. 7D). The top significant SNPs for the combined data were Chr09_3204462 (LOD = 8.92, MAF = 25.98%), Chr19_9586720 (LOD = 8.52, MAF = 27.97%), Chr08_47747059 (LOD = 8.43, MAF = 41.43%), Chr10_2089552 (LOD = 7.02, MAF = 24.31%), and Chr07_7610107 (LOD = 6.69, MAF = 49.17%), which were located on chromosomes 9, 19, 8, 10, and 7, respectively (Table 7). The variation between each genotypic class defined by the aforementioned SNPs was visualized in Figs. 8P-T. Of the 23 significant, 6 were located on chromosome 10 (Fig. 7D). Of the 6 significant SNPs found on chromosome 10, 3 were mapped within a 3-Mb genomic region (Table 7), indicating a strong likelihood of QTL(s) affecting soybean yield in this region.
Candidate genes selection
Candidate genes found within a 10-kb genomic region spanning a significant SNP associated with the combined data over three years for maturity, plant height, seed weight, and yield were investigated. For maturity, a total of 20 significant SNPs were identified (Table 6). Of which, 14 were mapped to genomic regions harboring annotated genes in Soybase (www.soybase.org). The annotated genes had a wide variety of functions. Leucine-rich repeat (LRR) domain was prevalent as shown in Table 6. The candidate genes associated with the most significant SNPs were Glyma.10g228900, Glyma.13g220200, Glyma.08g046800, Glyma.17g100600, and Glyma.08g176000, which encoded for LRR receptor-like protein kinase, F-box/leucine rich repeat protein, sensor histidine kinase, malate and lactate dehydrogenase, and replication protein A-related, respectively (Table 6). A candidate gene involved in epigenetics such as O-methyltransferase was also identified.
Of the 31 significant SNPs associated with the combined data for plant height, 25 were found in the vicinity of annotated genes (Table 6). Functional annotations of the candidate genes were diverse and mainly included transcription factors, kinases, and biomolecule transporters. The candidate genes found near the most significant SNPs were Glyma.19g200800, Glyma.05g048800, Glyma.19g195500, Glyma.19g187900, and Glyma.06g134200, encoding for transcription factor NF-Y alpha-related, cleavage and polyadenylation specificity factor, ubiquitin, PHD-finger, and serine/threonine-protein kinase, respectively (Table 6).
A total of 24 candidate genes were found for seed weight. Of which, 19 had functional annotations and one has an uncharacterized function (Table 7). GWAS for seed weight revealed a high probability of QTL(s) on chromosome 10; however, most of the SNPs found on the chromosome 10 were not mapped in the vicinity of an annotated gene (Table 7). Functional annotations related to the candidate genes had diverse functions. The candidate genes with defined functional annotations and associated with the most significant SNPs were Glyma.09g196700, LOC100801248, Glyma.20g181100, Glyma.02g161100, and Glyma.18g251800, which encoded for ring finger domain-containing, protein TIC 56 chloroplastic-like, LRR receptor-like protein kinase, camp-response element binding protein-related, and putative serine/threonine protein kinase, respectively (Table 7).
A total of 15 annotated genes were found near the significant SNPs associated with the combined data for yield over three years. Of which, 14 had functional annotations. Similarly to what was found for seed weight, only one candidate gene was identified on chromosome 10, which harbored 10 SNPs associated with yield (Table 7). Most of the candidate genes encoded for transpiration factors and transferases. The candidate genes with functional annotations and found close to the most significant SNPs for yield were Glyma.09g038300, Glyma.08g366900, Glyma.07g082800, Glyma.18g021100, and Glyma.01g093300, encoding for calmodulin-binding transcription activator (CAMTA), zinc finger protein with krab and scan domains, homo-oligomeric flavin containing cys decarboxylase family, gamma-glutamyltransferase, and RNA recognition motif, respectively (Table 7).
Genomic selection (GS) in maturity, plant height, seed weight and yield
GS accuracy for maturity, plant height, seed weight, and yield were estimated. Using a 5-fold cross-validation within the 250-soybean accession panel, GS accuracy was found to be trait-dependent (Fig. 9) (Table S4). On average, the accuracy of GS was the highest for seed weight (0.84) and was the lowest for maturity (0.47) (Table S4). Variations in GS accuracy were found across years (Fig. 9). The between-year variation was important for yield where the highest accuracy was 0.72 in 2008 and the lowest one was 0.50 in 2010 (Fig. 9). The estimation of GS accuracy was also conducted using the two subpopulations derived from structure analysis. Cross-validation using all 250 soybean accessions and samples from the Q1 group provided similar trend as shown in Fig. 8. However, GS accuracy significantly dropped down for maturity when samples from the Q1 cluster were used (Fig. 9) (Table S4). Interestingly, GS averaged 0.64 with a less variation between traits and across years when samples from subpopulation 2 were used to estimate GS accuracy (Fig. 9). In addition, accuracy for predicting maturity was the best using Q2 samples (Fig. 9) (Table S4). The ANOVA results indicated statistically significant interaction effect between year and population type on the accuracy of GS for maturity (F value = 5.77, p-value = 0.0001), plant height (F value = 11.47, p-value < 0.0001), seed weight (F value = 76.17, p-value < 0.0001), and yield (F value = 12.82, p-value < 0.0001) (Table 8). The results indicated that year and population structure were important factors to take into account when incorporating GS in a breeding program.
GS accuracy using datasets from different years had similar trends to that of obtained from a within-year cross validation (Figs. 9 and 10). Overall, there is lack of consistency between GS accuracy and the training year across traits and population types. This could be explained by the significant interaction effect of population structure and years on GS of plant height (F value = 2.86, p-value = 0.0091), seed weight (F value = 4.56, p-value = 0.0001), and yield (F value = 17.37, p-value < 0.0001) (Table 9). Yield in 2010 was better predicted using dataset from 2009 than using yield data obtained in 2008 when all samples within the panel and individuals from Q2 were used for cross-validation, respectively (Fig. 10) (Table S5). However, plant height in 2010 was better predicted by the dataset recorded in 2008 for all sample- and Q1 sample-cross validation. Different results were found for plant height when cross-validation was carried out using individuals from the Q2 group (Fig. 10). Genomic prediction appeared to be year-independent but subpopulation-dependent for maturity (Fig. 10) (Table S5). ANOVA results showed that there was no significant interaction effect between population structure and year in predicting maturity (F value = 1.68, p-value = 0.1229) (Table 9). In addition, year effect on GS of maturity was not significant (F value = 0.88, p-value = 0.4519) (Table 9). However, GS for maturity was heavily influenced by population structure (F value = 1232.94, p-value < 0.0001) (Table 9). Maturity could be better predicted than the other traits when samples from the group Q2 were used for cross-validation (Fig. 10). Despite the stability of maturity prediction between training and testing year as shown in Fig. 9, GS performed poorly for this trait when all samples and samples from the subpopulation 1 were used, respectively (Fig. 10) (Table S5).
GS was also conducted using samples from subpopulation 1 as training set and samples from subpopulation 2 as testing set and vice versa. For cross-validation using data from the same year (Table S6), discrepancy in GS accuracy was found when samples from one subpopulation was used to predict the ones from the other group (Fig. 11). Overall, selection accuracy was slightly higher for most traits when the GS model was trained using samples from subpopulation 1 (Fig. 11). This difference was substantial for maturity. The average GS accuracy for maturity was 0.49 and 0.20 for Q1-based training set and Q2-based population, respectively (Table S6). Unlike the two previous approaches where cross-validation was carried out using within-subgroup samples, variation of selection accuracy between years was more pronounced when prediction was done across subpopulations (Fig. 11). In addition, the interaction effect of population structure and year on GS was significant for maturity (F value = 89.30, p-value < 0.0001), plant height (F value = 14.02, p-value < 0.0001), seed weight (F value = 56.77, p-value < 0.0001), and yield (F value = 184.92, p-value < 0.0001) (Table 10). When the training set was taken from subgroup 1 in order to predict yield of subpopulation 2, selection accuracy was 0.72 and 0.41 for 2009 and 2008, respectively (Table S6). A similar trend was found for yield when the training samples were derived from subpopulation 2 (Fig. 11).
In addition to performing a within-year GS using two subpopulations, trait prediction for other years of a genetically- distant population subset was also conducted. Overall, results indicated a similar trend to what was found using a within-year GS approach (Figs. 11 and 12). The interaction effect of population structure and year on GS was significant for maturity (F value = 13.18, p-value < 0.0001), plant height (F value = 14.70, p-value < 0.0001), seed weight (F value = 264.27, p-value < 0.0001), and yield (F value = 25.23, p-value < 0.0001) (Table 11). Prediction accuracy for most traits was slightly higher when the samples from Q1 were used to train the GS model (Table S7). Training the model on Q1 resulted in prediction accuracy of yield being 0.62, 0.40, 0.36, and 0.41 when the training/testing year pair was 2008/2009, 2008/2010, 2009/2010, average 2008_2009/2010, respectively (Table S7). When the GS model was trained using samples from Q2, the prediction accuracy for yield was 0.28, 0.13, 0.31, and 0.21 corresponding to the training/testing year pair s2008/2009, 2008/2010, 2009/2010, average 2008_2009/2010, respectively (Table S7). Updating data on plant height, maturity date, and seed weight each year helped increase GS when the model was trained under Q2. The update was achieved by averaging the training set data from 2008 and 2009 to predict the testing set in 2010. GS accuracy was 0.47, 0.07, and 0.60 for plant height, maturity date, and seed weight, respectively, using the 2008 data from the Q2 samples to predict the 2010 data from the Q1 samples (Fig. 11) (Table S7). By taking the average data from 2008 and 2009 to establish the training set, plant height, maturity date, and seed weight of the Q2 samples were predicted with an accuracy of 0.52, 0.09, and 0.68, respectively (Table S7). These results indicated that GS accuracy was impacted by multiple factors such as population structure and the variable year from which the training set was established.