Review of reported GRs for wheat yield-related traits identified by GWAS
Twenty-three GWAS publications on wheat yield-related traits were collected (Table S1), and their identified GRs were summarized. Generally, those with a significance threshold of -log10 (p) > 4 were selected, and those showed consistency within the study or with previously reported GRs were also included. For GRs in some studies where the marker physical positions were not indicated due to the lack of reference genome information at the time of publication, their available SNP sequences (either from known SNPs such as wheat 90K Illumina iSelect array or from the marker sequences given in the paper) were searched for sequence similarity using BLAST against the Chinese Spring (CS) reference genome, International Wheat Genome Sequencing Consortium (IWGSC) RefSeq v1.0, to locate their physical positions (https://wheat-urgi.versailles.inra.fr/) (Appels et al. 2018). A MQTL study of GRs for wheat yield traits was also incorporated, where the physical positions of the major GRs were indicated (Liu et al. 2020). The distribution of the summarized GRs on wheat chromosomes were visualized using R package RIdeogram (Hao et al. 2020).
Plant materials, phenotyping and statistical analyses of phenotypic data
A total of 228 diverse spring wheat (Triticum aestivum L.) varieties were used as the association panel and were field-trialled under six different environments in Australia and China (Table S2). Most of these varieties have long been cultivated in either of the countries. Among them, 150 lines were trialled at The University of Western Australia (UWA) Shenton Park Field Station (31.9588° S, 115.8053°E) in 2019, and in different Australian fields at Brocklesby, NSW (35.8738°S, 146.6817° E), Cunderdin, WA (31.4926°S, 117.4928°E), Lock, SA (33.5973°S, 135.7110°E) and Kalkee, VIC (36.5640°S, 142.1874°E) in 2020; and 133 lines were trialled at the Inner Mongolia (IM) Academy of Agricultural & Animal Husbandry Sciences (IMAAAS) (43.3782° N, 115.0595° E) in 2019. There were 55 overlapped varieties between the Australian trials and the IM trial. Seeds were obtained from Australian Winter Cereals Collection and wheat breeding companies/institutions including InterGrain Pty Ltd, Australian Grain Technologies, LongReach Plant Breeders, Edstar Genetics Pty Ltd, Chinese Academy of Agriculture Sciences, IMAAAS, and Gansu Academy of Agricultural Sciences. Field was ploughed two weeks prior to seed sowing and fertilizers were added with standard rates. All plots were under rainfed conditions.
Not all the traits were recorded in all the environments. In UWA Shenton Park trial, three blocks, each with 150 plots for 150 varieties (Table S2), were used as three replicates for the individual varieties which were planted in generalized randomized block design. Each plot had a size of 1.0 m × 0.6 m with three rows. Sixty seeds were sown in each plot, with 20 seeds (spaced 5 cm apart) in each 1 m row. Distance between rows was 0.2 m, and distance between plots was 0.4 m. Buffer zones were two rows surrounding the whole field. Seeds were sown in mid-May in 2019. Trials were phenotyped for plant height (PH), above ground biomass per plant (BM), spike number per plant (SN), grain number per plant (GN), thousand grain weight (TGW) and yield (YLD) per square meter. PH was recorded as the average of three values for each plot measured in centimeters from the soil surface to the tip of the spike excluding awns. BM was measured as above ground crop cuts of each line, which were dried in a 60°C oven before being weighed to assess the dry biomass. SN, GN, TGW and YLD were counted or measured at maturity when whole plots were harvested.
In 2020 Australian trials, seeds of 150 varieties (Table S2) were sown at different locations in WA, SA, NSW and VIC in mid-May, 2020. A randomized complete block design was used for all locations. At Brocklesby (NSW), each variety was grown in a 1.14 m × 4 m plot with two replications; at Cunderdin (WA), each variety was grown in a 1.52 m × 4 m plot with two replications; at Lock (SA), each variety was grown in a 1.65 m × 4 m plot with two replications; and at Kalkee (VIC), each variety was grown in a 1 m × 4 m plot with two replications. YLD was measured at maturity for all the trials.
In 2019 IM trials, seeds of 133 varieties (Table S2) were sown in late-March at IMAAAF in 2019. A randomized complete block design was used and every wheat line was planted in a 1.2 m × 0.2 m plot with three replicates. Fifty seeds were sown in each plot. Five buffer lines were used for the whole field. Phenotyping were done for PH, fresh BM, SN, GN, TGW and YLD per plant. Except fresh BM which were measured as fresh weight of the above ground crop cuts of each plant, other traits were measured using the similar methods as the UWA trial.
Correlations of the traits in individual environments were obtained using the Pearson’s product–moment estimator. Frequency distributions and correlation plots of the traits were generated using R package PerformanceAnalytics (https://github.com/braverock/PerformanceAnalytics).
Genotyping by sequencing (GBS) and data filtering
Genomic DNAs of the panel plants were extracted from three-leaf stage seedling leaves using a modified CTAB method as described previously (Wang et al. 2019). The yield, purity and integrity of the DNAs were checked using NanoDrop 2000 (Thermo Fisher Scientific Inc., Australia) and agarose gel electrophoresis. The Illumina platform was used for genotyping at Beijing Genomics Institute (BGI), and a double digest restriction-site associated DNA (ddRAD) GBS approach was adopted to obtain genome-wide single nucleotide polymorphism (SNP) and insertion-deletion (InDel) markers for all the lines. The protocols of DNA double digestion, library construction and raw sequencing data filtering were the same as described in the previous study (Liu et al. 2020). BWA (Li and Durbin 2009) was used to map the clean reads to the reference genome IWGSC RefSeq v1.0. Marker polymorphisms were called using Samtools (Li et al. 2009), Picard-tools (http://broadinstitute.github.io/picard/), Reseqtools (He et al. 2013), and GATK unifiedGenotyper (DePristo et al. 2011).
The variant call format (vcf) data files of SNP markers generated by BGI were further filtered using VCFtools (http://vcftools.sourceforge.net) for downstream analyses with the following criteria: 1) only keeping variants that were successfully genotyped in 80% of individuals, with a minor allele frequency (MAF) > 5% and a minimum quality score of 30; 2) removing varieties with more than 30% missing data to get rid of individuals that did not sequence well; and 3) after missing data were imputed with Beagle v5.1 (Browning et al. 2018), the markers were further pruned using plink v1.9 (Purcell et al. 2007) with a cut-off linkage disequilibrium (LD) value r2 > 0.8.
Population structure analyses, principal components analyses (PCA), kinship analyses and phylogenetic tree constructions
Population structure was analysed with Admixture 1.3.0 (Alexander and Lange 2011). The number of presumed subpopulations (K) was set to 2 to 15 and the cross validation (CV) errors for the different admixture models were calculated. The K value with the smallest CV error were considered the best fitting model. R package LEA (Frichot and François 2015) was used to generate structure plots for the varieties trialled in Australia and IM respectively.
The first three PCs and a kinship matrix were generated using TASSEL v5 (Bradbury et al. 2007). Neighbor-joining trees for the phylogenetic relationships of the varieties in individual environments (including Australian and IM environment) were drawn using Mega X software (Kumar et al. 2018).
Genome-wide association mapping
Genome-wide association mapping between SNP markers and phenotypic data in individual environments was conducted with TASSEL v5 using a mixed linear model (MLM) that accounts for both kinship coefficients and population structure. Marker-trait associations (MTAs) were claimed as significant with a p value threshold of − log10 > 4 for traits of PH, BM, TGW, SN and GN, and a p value threshold of − log10 > 3 for YLD. Manhattan and QQ plots were drawn using R package CMplot (Yin et al. 2021).
Candidate gene identification for major MTAs
The trait-associated SNP markers, especially those showed consistency under different environments in this study, were compared with previously reported markers/genes or MQTL near the significant MTAs, and were searched in the reference genome RefSeq v1.0 to identify their overlapped or closely located high-confidence genes as associated candidate genes. Those genes with functions previously reported as important for yield (Liu et al. 2020) were especially considered as key candidates.
Comparison of MTAs for yield to previously reported MTAs for maturity
As days to maturity (maturity in short) can significantly affect yield traits under different environments, the major (-log10 (p) > 4) MTAs for maturity identified in a previous study (Juliana et al. 2019) were summarized and visualized using RIdeogram. These reported maturity MTAs were considered representative because they were identified across different seasons under multiple environments including irrigated, drought stressed, early heat stressed and late heat stressed conditions. The GRs for yield traits reported previously and in this study were compared to the maturity MTAs to explore the possible influence of maturity to yield traits.