Genome variation map
To generates a comprehensive genome variation map in mungbean, we resequenced a total of 558 Chinese mungbean landraces selected from 16 provinces in North China, the Huang-Huai-Hai region, and the Yangtze River region (Figure 1a and Table S1). Approximately 16.93 billion 150-bp paired-end reads (2.54 Tb clean data) were generated, resulting in 97.10% of reads mapped and 89.93% genome coverage. The average sequencing depth was 9.83-fold, ranging from 5.97- to 19.19-fold, based on the mungbean VC1973A reference genome (Kang et al., 2014). Genotype coverages at six-, 12- and 18-fold averaged 66.05, 16.00 and 3.58%, respectively (Table S1). After alignment of the reads to the reference genome, variant calling and filtering, we identified a final set of 2,582,180 high-quality single-nucleotide polymorphisms (SNPs) and 412,999 indels (ranging from 1 to 244 bp in length) (Table S2). The distribution of variants across the genome was variable, depending on genome context and gene density (Figure 1b). A total of 69,992 SNPs (2.71%) and 3,652 indels (0.88%) were located in coding regions, among which 4,259 showed potentially large effects: 1,597 SNPs affected 1,152 genes by causing start codon changes, premature stop codons or elongated transcripts, and 2,662 indels led to frame shifts, gain of stop codons, or other disruptions of protein-coding capacity in 1,549 annotated genes (Tables S2 and S3). The ratio of non-synonymous to synonymous SNPs (N/S) and transition to transversion SNPs (Ts/Tv) is 0.74 and 1.95, respectively (Table S2). We identified 27 genomic regions containing 673 genes with N/S ratio > 2.5 in all accessions (Tables S4 and S5). Overall, we have generated a comprehensive mungbean genome variation dataset in which we identified numerous relevant SNPs and indels from diverse landraces.
Population structure and linkage disequilibrium decay
To better infer the phylogenetic relationship and population structure of Chinese mungbean, phylogenetic analyses (Figure 2a, Figure S1), principal component analyses (PCA, Figure 2b and Table S6), and population structure analyses (Figure 2c, Table S7) were performed on the basis of the population genotype information. We found that these 558 accessions were divided into two distinct subpopulations, which exhibited strong geographic distribution patterns (Figure 1a). Based on the geographical origins of the accessions within each subpopulation, we termed these the ‘Northern subpopulation’ (NS, N = 211) and ‘Southern subpopulation’ (SS, N = 347). The SS consisted mainly of landraces from provinces in the Yangtze River region and the NS included mainly landraces from provinces in North China and the Huang-Huai-Hai region (Figure 1a, Table S1). Linkage disequilibrium (LD) dropped to half of its maximum value at 65.41 kb for all samples (Figure S2), similar to those of pigeonpea (70 kb, Varshney et al., 2017) and Chinese soybean landraces (58 kb, Li et al., 2020). However, the decline was more rapid than that in chickpea landraces (180 kb, Varshney et al., 2019).
Population differentiation
Diverse mungbean landraces were bred for specific regions and local cropping systems. Compared with the SS accessions, the NS accessions flower earlier, are shorter, and produce smaller seeds and pods with higher seed starch and lower protein contents (Figures S3-S10). To elucidate the genomic regions that might underlie the phenotypic differences between SS and NS, we calculated fixation statistics (Fst) and nucleotide diversity (π) for the two subpopulations. The π was estimated at 2.77 × 10-3 for all sampled landraces (Table S8). The SS had 32.16% higher nucleotide diversity than NS (π = 2.63 × 10-3 versus 1.99 × 10-3, Table S8). This result indicated that Chinese mungbean may have initially been cultivated in the Yangtze River region before later expanding into the Huang-Huai-Hai region and North China. We next investigated the diversity patterns within the two subpopulations. We detected 228 chromosomal regions with significant differences in nucleotide diversity between SS and NS (top 5% πSS/πNS ≥ 4.44, or top 5% πNS/πSS ≥ 2.32, Table S9). Of these 228 chromosomal regions, 24 overlapped with 35 loci identified in the GWAS, including loci associated with all traits phenotyped except plant height (Figure 3a and Table S10). The measure of population differentiation, Fst, between SS and NS was estimated at 0.18, suggesting a moderate level of differentiation within Chinese mungbean landraces. Fst analysis identified 102 genomic regions with significant genetic divergence (top 5% Fst ≥ 0.44) covering 25.56 Mb of the genome (Figure 3b and Tables S11 and S12). To confirm the population differentiation signals and narrow down the locations of potential candidate genes, we overlapped the data obtained through the fixation and diversity analyses and identified 36 genomic regions harboring 577 candidate genes (Table S13). Thus, these 577 candidate genes may be responsible for the phenotypic differences between SS and NS.
GWAS results
We measured nine traits in the 558 accessions from four agroecologically diverse locations, ranging from Mid-China to southern China, in 2019-2021 (Table S14), although because of resource constraints, we did not measure all nine traits in all six locations. These traits were days to flowering time (DFT), pod length (PL), pod width (PW), seeds per pod (SP), 100-seed weight (HSW), plant height (PH), branch number (BN), seed protein content (SPC), and seed starch content (SSC), all of which are crucial for the improvement of mungbean yield and end use. We observed diverse phenotypic variations for these traits (Table S15). Based on the 2,582,180 SNPs identified, we used a total of 37 sets of phenotypes assessed in six environments to perform GWAS using the genome-wide efficient mixed-model association (GEMMA, Zhou and Stephens, 2012) method. Manhattan plots and quantile-quantile plots of all nine traits from varied environments are shown in Figures S3-S10. In total, we identified 110 significant association signals (P < 3.87E-07, –log10 P = 6.41) for the nine traits in the mungbean genome (Figure 4a and Table S16). Among them, 12 association signals for the same traits were shared between at least two phenotyping environments (Table S16). Only a few candidate genes underlying agronomic traits have been identified in mungbean so far, thus, we integrated the GWAS approach with functional annotation of the orthologs in model plants to rapidly identify candidate genes associated with seeds per pod, pod length (Figure 4), days to flowering time (Figure 5), and seed protein content (Figure 6).
Seeds per pod is one of the main determinants of seed yield in mungbean and is positively correlated with pod length. Out of 110 GWAS signals for all nine traits, 10 and 18 signals were associated loci for seeds per pod and pod length, respectively. We identified a major locus responsible for both seeds per pod and pod length at Mb 0.16-0.28 on chromosome 5 in three environments (Figure 4a), which we designated SEEDS PER POD ON CHROMOSOME 5 (SP5). Based on the peak SNP (5_203751, C/T) of the association signal, two haplotypes were identified in all 558 accessions. Accessions carrying the CC allele exhibited increased seeds per pod and pod length compared to accessions carrying the TT allele (~14.38% and ~21.64% greater SP and PL, respectively) (Figure 4b). However, pod width and 100-seed weight did not significantly differ between accessions with CC vs. TT alleles. After carefully analyzing the 13 genes in this region of chromosome 5 (Table S17), we identified a candidate gene encoding a leucine-rich repeat serine-threonine/tyrosine-protein kinase (Vradi05g00200). Leucine-rich-repeat receptor-like kinases are involved in polar auxin transport in plants (Afzal et al., 2008, Zou et al., 2013), and auxin regulates the silique length of rapeseed and the kernel number per row of maize (Liu et al., 2015, Jia et al., 2020, Li et al., 2021). We therefore propose Vradi05g00200 as the key candidate gene for SP5.
The number of days to flowering time is critical for modern crop production and a major trait associated with crop adaptation. This trait has been reported to be highly sensitive to environmental temperature and photoperiod in crops (Hung et al., 2012, Lu et al., 2020, Wei et al., 2020). In the present study, we identified a strong GWAS signal for days to flowering time at Mb 7.79-8.14 on chromosome 3 using phenotype data from 2021_EZ and 2021_GC (Figure 5a, b). This GWAS signal showed weak associations in the other environments (Figure S3). The ~200 kb (Mb 7.93-8.13) LD block surrounding the peak SNP (3_7937039, T/A) contains 22 gene models (Figure 5c, Table S18). The 3_7937039 SNP generated two haplotypes, TT and AA, and resides 5 kb downstream of Vradi03g06500, a gene encoding a Calvin cycle protein, CP12-2, whose orthologue in Arabidopsis regulates flowering time (Singh et al., 2008, López-Calcagno et al., 2017). The peak SNP 3_7937039 exhibited an opposite direction of effect between accessions carrying the TT allele and AA allele in different environments (Figure 5d). Landraces carrying the TT allele showed significantly earlier flowering time than those carrying the AA allele under long day conditions (2019_EZ, 2021_EZ and 2021_GC), but significantly later flowering in environments with short days (2020_LS and 2021_WH). Through further analyses assessing the frequency of different haplotypes between two sub-populations, we found that 71% of landraces in NS carried the AA genotype, while 95% of landraces in SS carried the TT genotype (Figure 5e). The distribution of 3_7937039 alleles thus reflects genetic differentiation between NS and SS. We also found that this genomic region harbored a significant difference in nucleotide diversity between SS and NS (Figure 5f). These results suggest that Vradi03g06500 may be a strong candidate for the flowering time locus.
Protein and starch are the two most abundant components of mungbean seeds. GWAS results showed that both SPC and SSC were associated with SNPs in one genomic region ranging from Mb 16.027 to 16.073 on chromosome 4 (Figure 6a, b). Four candidate genes, Vradi04g07800, Vradi04g07810, Vradi04g07820, and Vradi04g07830, were found in this association region, and the peak SNP (4_16046710, C/T) was located within the ninth intron of Vradi04g07810 (Figure 6c). Vradi04g07810 encodes a serine carboxypeptidase whose orthologues in Arabidopsis and tobacco mediate brassinosteroid signaling and has an impact on cell elongation (Li et al., 2001, Bienert et al., 2012). The peak SNP generated two haplotypes: CC and TT. Accessions carrying the TT allele had significantly higher SPC but lower SSC than those with the CC allele (Figure 6d, e). Additionally, we found that the TT allele correlated with larger pod size (PL × PW) and higher HSW, and that the CC allele correlated with smaller pod size and lower HSW (Figure 6f-i). Given that protein and starch account for ~80% of mungbean seed content, it is reasonable that genes regulating both SPC and SSC will also influence seed weight. Moreover, we found that this genomic region displayed significant differences in nucleotide diversity between SS and NS (Figure 6j), and that the TT haplotype was present in most accessions (90% for NS and 96% for SS) (Figure 6k). These results indicate that Vradi04g07810 could be a candidate gene for this pleiotropic locus.