Principles of RapMap
Here we first introduce a method called Rapid Mapping (RapMap) that is fast, powerful and comprehensive enough to simultaneously discover sufficient QTLs controlling a trait in easy-to-perform steps. The principle and method of RapMap (Fig. 1) are explained using rice as the example. To clone a QTL, the first priority is to obtain a mapping population with Mendel's single gene segregation. To most probably achieve this major goal, we first collect a rice mini-core collection with large genetic diversity across the world, phenotype them for a target trait, divide accessions into several gradient groups with similar trait values for each group and minor phenotype differences between adjacent groups, and select gradient parents (from P1 to Pn) from each gradient group (Left panel of Fig. 1a). Then, using gradient parents, a series of gradient crosses, such as P1 with P2, P2 with P3, P3 with P4, and so on, are made to most probably minimize the number of segregating loci responsible for phenotypic variations of a genetic population and to discover more QTL genes for the trait using all crosses. Then the subsequent F2 gradient populations (F2GPs) are constructed and independently phenotyped for their target traits (Right panel of Fig. 1a). Because F2 progenies of each F2GP are derived from a cross between trait-adjacent bi-parents, most genes responsible for the trait in F2 are homologous or function-similar alleles in F1, and the number of segregating loci responsible for their phenotypic variations is minimal, in most cases one, even though the bi-parents with closely-related phenotypes could be far diverse in other genome regions. Unequivocal phenotype segregation caused by the one locus can be found in F2 progenies and the corresponding QTL can be easily mapped and cloned. Therefore, RapMap has the simplicity and rapidity of constructing genetic populations over current approaches (such as RIL, DH, NAM and MAGIC26), coupled with the flexible and comprehensive selection of target traits and parents (Fig. 1a). To detect the QTL region of the causal gene for each F2GP, then bulked segregant analysis (BSA) is performed using two phenotype-extreme DNA pools by bulking > 30 individuals for each pool from segregating progenies of each gradient population, with two phenotype extremes in the top and bottom 15%27. The SNPs or InDels responsible for target-phenotype variations are homozygous in at least one extreme pool (Fig. 1b). Therefore, by subjecting two DNA pools of each population to SNP chip or sequencing, a set of clustered SNPs or InDels between two pools are very likely to be tightly linked to the causal QTL gene controlling the phenotypic variations (Fig. 1b).
To find a general quality-control standard of confirming whether a gradient population has a Mendel's single-locus segregation pattern, in RapMap, we updated the concept of Mendel's single gene segregation into 12 single-locus genetic models totally depending on three different genetic effects (Fig. 1c): an allele could be a low-value or high-value complete dominant gene for the phenotypic distribution with a discrete (with no environmental or other effects) or partially overlapping (due to environmental or other effects) boundary and with a ratio of 3:1 or 1:3 (Upper panel of Fig. 1c), or could be a semi-dominant or additive gene for the phenotypic distribution with two discrete or partially overlapping boundaries and with a ratio of 1:2:1 (Middle panel of Fig. 1c), or could be an over-dominant gene for the phenotypic distribution with two discrete or partially overlapping boundaries and with a ratio of 1:1:2 or 2:1:1 (Lower panel of Fig. 1c). Therefore, the common and general nature of the 12 single-locus genetic models is the co-segregation standard that the phenotypes of two homozygous genotypes (AA and aa) can be distinguished (Fig. 1c), regardless of heterozygous genotypes. The traditional judgment in textbooks and literatures that a single locus or multiple loci control is determined by a bimodal (or 3:1) or normal distribution of progeny phenotypes, respectively, should be rationally updated to the co-segregation standard. The co-segregation standard is also the quality-control standard for the simultaneous verification of a candidate QTL and its near isogenic line-like lines (NIL-LLs) in RapMap: if two homozygous genotypes of both flanking molecular markers of a candidate QTL region in progenies from any randomly segregating population can observably differentiate target phenotypes, the region is confirmed as a real QTL for this trait, and the heterozygous lines for the QTL in each gradient population are the corresponding NIL-LLs suitable for fine mapping and cloning of the QTL coupled with various candidate-gene methods (Fig. 1c and d). Therefore, mapping a QTL, verifying its effect and obtaining its NIL-LLs, three decisive rate-limiting steps of cloning a QTL in traditional methods, are a three-in-one step featured in RapMap (Fig. 1a–d), which is integrated and ensured by the co-segregation standard and greatly improves the efficiency and accuracy of QTL identification over traditional methods (Fig. 1e). The genetic population not fitting the co-segregation standard of single-locus genetic models could be replaced by constructing a new gradient population using bi-parents of a similar gradient level, or replaced by a phenotype-well-segregating F3 or F4 family derived from a randomly selected low-, medium- or high-value F2 or F3 line, respectively, to most probably make minor-effect loci homozygous and one major-effect locus heterozygous through selfing and selecting phenotype-well-segregating F2 or F3 progenies (Fig. 1f). To sum up, the co-segregation standard as the necessary and sufficient condition of single-locus genetic models is the cornerstone of RapMap, while the construction of multiple F2 gradient populations to most probably meet the quality-control standard is the core technique of the method. Thus, the method could be immediately generalizable to any crops that are easy to crossbreed and reproduce.
Identification of four QTL genes for grain-length variations in rice by RapMap, including a novel GL1 gene
As a proof-of-principle experiment, we applied RapMap to map and isolate QTL genes (QTGs) for natural variations of grain length and width in rice. We have collected, maintained and sequenced a rice mini-core collection of more than 541 accessions, including both landraces and improved varieties from 59 countries, representing the high genetic diversity in this species28, 29 (Supplementary Table 1). For grain length, we selected 12 gradient parents, varying from the shortest one of 5.58 mm to the longest one of 13.76 mm, and developed 8 gradient populations using these trait-adjacent parents with minor grain-length differences within 12 months (Fig. 2a). After measuring grain length of about 200 F2 progenies of each population and obtaining their DNA, we constructed a high-value bulk (about the top 15%) and a low-value bulk (about the bottom 15%) for each population (Fig. 2b). Two bulked DNA pools for each of the 6 populations (Cross 1–3 and 5–8) were subjected to the commercial rice whole-genome SNP arrays (RICE6K)30. SNPs between two extreme pools detected by the SNP arrays were liable to be tightly linked to the causal QTL gene for grain-length variations (Fig. 2c). Alternatively, following the principle of the SNP-index estimation in QTL-seq analysis31, whole-genome sequencing of two DNA bulks from Cross 4 progenies identified a cluster of SNPs with SNP index close to 1 for the locus on the short arm of chromosome 3 (Fig. 2c), but nearby 0.5 for the other loci (Supplementary Fig. 1), and the SNP cluster was most likely near the causal QTG. To confirm a candidate QTL, verify the QTL effect and select its corresponding NIL-LLs suitable for map-based cloning, we first designed 2–4 InDel markers in about 2–4 Mb region covering each candidate QTL region (Fig. 2c), employing the database of rice genomic variations32, and then genotyped all the individuals of each gradient population using the InDel markers, and finally checked the co-segregation standard for every F2GP and confirmed two boundaries of each QTL based on the recombinant events (Fig. 2d). If two homozygous genotypes of two flanking molecular markers of each SNP region could observably differentiate grain length of each population at the individual level (Fig. 2d), the SNP region was verified as a real QTL for the grain-length variations, and the heterozygous lines of the QTL region could be the corresponding NIL-LLs. In fact, the Cross 5 population did not fit the co-segregation standard of single-locus genetic models in F2 (Fig. 1f) and was replaced by a grain length-well-segregating F4 family derived from a selected medium-value F3 line (Fig. 2d). Finally, we verified each QTL per gradient population and confirmed the minimum QTL regions on chromosome 3, 3, 3, 3, 7, 1, 7 and 2 for Cross 1 to Cross 8, respectively (Fig. 2e).
For fine mapping of each grain-length QTL identified by each gradient population in RapMap procedures, enough recombinants in progenies of corresponding heterozygous NIL-LLs were screened using the KASP high-through genotyping technique33 based on two flanking markers and the rapid DNA extraction technique in the greenhouse. Using both high-density marker genotype and grain-length phenotype information of these recombinants, we resolved grain-length QTLs to regions of 20 kb, 18 kb, 14 kb, 54 kb, 24 kb, 19 kb, 56 kb and 3.7 kb for Cross 1 to Cross 8, respectively, in which there is only one to three predicted open reading frames (ORFs) (Fig. 2e). To find out if the known gene in each fine-mapping region is the causal gene, comparative sequencing analyses of their alleles between two parents of corresponding gradient populations were conducted and known functional variations of all three grain-length genes (GS3, GL7 and GS2) reported in references34–39 were indeed discovered in Cross 1 to Cross 8, except Cross 6 (Fig. 2e). Thus, a transgenic approach was not needed to reassess a known gene underlying the QTL with the same functional variation. Self-pollinating the heterozygous NIL-LLs for each QTL fragment produced two NIL-LLs and their two homozygous genotypes determined by the reported functional marker of each cloned gene could distinguish the long-grain and the short-grain progenies very well (Fig. 2f), further confirming the three causal genes for grain-length variations of gradient populations.
It should be noted that two novel alleles of the major grain-length gene GS3, GS3-5 from the Cross 1 and 2 population, and GS3-6 from the Cross 3 population, were identified (Fig. 2 and Supplementary Fig. 2). The GS3-5 allele has a stronger function (shorter grain) than that of the strongest allele of GS3, the GS3-4 allele from Chuan 7 (Cross 1 of Fig. 2f and Supplementary Fig. 2a)36. The GS3-6 allele has a stronger function (shorter grain) than that of the GS3-1 allele (Cross 3 of Fig. 2f and Supplementary Fig. 2a). Compared with the known four alleles of GS3, GS3-5 and GS3-6 have totally different functional variations with reported ones, with 10 bp and 317 bp deletion in the last exon, respectively, resulting in much shorter frameshift C terminal than that of the GS3-4 allele (Fig. 2, Supplementary Fig. 2b and c).
A novel grain-length gene GL1 was discovered in the RapMap programs using the Cross 6 population (Fig. 2). The only one ORF (LOC_Os01g63930) in the 19-kb fine-mapping region (Fig. 2e) was determined as the reliable candidate gene of GL1. We sequenced the 2.3-kb promoter region, the coding region and the 3’-UTR of GL1 in WZ1, 9311 and Zhonghua 11 (ZH11), and found many variations among them (Supplementary Fig. 3a). Compared to ZH11, there is a 2-bp CT insertion in WZ1 and a 751-bp deletion at 3’-UTR in 9311. Compared with NIL(GL1), the grains of NIL(gl1) were 9.4% longer and 11.3% heavier (Fig. 2g). To further confirm the function of the candidate gene for the GL1 QTL, we prepared the vectors for gene editing by CRISPR and complementation transformation. Because of the fully dominant feature of the long-grain WZ1 allele as revealed by analyzing the F2 population (Fig. 2d) and the difficulty in regenerating plants from the indica calli of both WZ1 and 9311, we performed complementation transformation by introducing the WZ1 allele into the japonica variety ZH11 with short grains. We first knocked out the GL1 allele in ZH11 background by CRISPR with double target positions, and twenty positive lines of total 26 transgenic plants were generated. However, no significant differences in grain length between negative and positive plants with single or double target positions were found (Fig. 2h, i and Supplementary Fig. 3b, c). Therefore, the GL1 allele in ZH11 was a non- or weak-functional one, and we transformed the complementation vector contained the coding region of GL1 in WZ1 driven by its 2.3-kb native promoter into ZH11. Thirty of 36 complementation plants were positive lines, and most of them showed increased grain length and grain weight (Fig. 2h, i, j and Supplementary Fig. 3d). Therefore, these results suggest that only one ORF in the QTL region is the right candidate for GL1.
Cloning of four QTL genes for grain-width variations in rice by RapMap, including a new GW5.1 gene
We also applied RapMap to clone QTGs of rice grain width (Fig. 3). First, we selected 10 gradient parents, varying from the narrowest one of 1.99 mm to the widest one of 3.43 mm, and constructed 7 gradient populations using these gradient parents with adjacent or near width values (Fig. 3a). It should be noted here that if the grain-width difference between the two parents was not very big, the cross would be compatible and suitable for RapMap (Fig. 3a). Similar to the application case of grain length, after RICE6K array or genome sequencing analyses of two bulked DNA pools of each grain-width population (Fig. 3b, c and Supplementary Fig. 4), 7 real QTLs from 7 populations were determined by the co-segregation standard, which were located on chromosome 8, 8, 8, 7, 5, 5 and 5 for Cross 1 to Cross 7, respectively (Fig. 3d). To clone each grain-width QTL gene, we screened sufficient recombinants for each real QTL. Consequently, using the marker genotypes and the grain-width phenotypes of these recombinants, we resolved the grain-width QTLs to regions of 168 kb, 36 kb, 168 kb, 24 kb, 28 kb, 18 kb and 18 kb for Cross 1 to Cross 7, respectively, which contain only few ORFs (Fig. 3e) for each region after some candidate-gene analyses for two bigger fine-mapping regions. To check whether the known grain-width genes GW8, GW7 and GW5 within these regions were the causal genes, known functional variations of them40–44 were finally discovered by comparative sequencing analyses of their alleles (Fig. 3e). The corresponding NIL-LLs determined by the functional markers of each grain-width gene further confirmed the three causal genes for grain-width variations of these gradient populations (Fig. 3f).
A new grain-width gene GW5.1 was identified in the RapMap processes using the Cross 5 population (Fig. 3). In the 28-kb fine-mapping region, there are only two ORFs annotated by the RAP-DB website. We identified their full-length cDNAs corresponding to ORF1 and ORF2 (NCBI accessions AK121364 and AK065965, respectively). By searching the public expression database of Zhenshan 97 and Nipponbare (http://crep.ncpgr.cn/; http://ricexpro.dna.affrc.go.jp/), very low expression of ORF2 was detected in panicles. Moreover, the second exon of the ORF1 allele from Zhenshan 97 or the AA genotype was disrupted by an 11.8-kb LTR-retrotransposon insertion as exhibited by comparative sequencing of the two alleles from the two parents (Fig. 3, Supplementary Fig. 5a), which results in loss-of-function of the ORF1 allele (Fig. 3g). Compared with NIL(GW5.1), the grains of NIL(gw5.1) were 14.9% wider and 12.9% heavier (Supplementary Fig. 5b). Therefore, ORF1 (LOC_Os05g25350) was determined as the reliable candidate gene of GW5.1. To validate the candidate gene, gene editing by CRISPR/Cas9 and complementation transformation strategies were carried out. Three independent frame-shift mutants (Fig. 3h, Supplementary Fig. 5c and d) all produced wider grains than that of ZH11, a japonica variety easy in regenerating plants from its calli and with the same functional allele as Iksan438 (Fig. 3i). Consistent with mutant results, the positive plant carrying the complementation vector produced the narrower grains than that of AA (Fig. 3i), suggesting ORF1 is a negative regulator of grain width in rice. Taken together, these data indicate that ORF1 is the right candidate for GW5.1.
Genetic contributions of the eight genes to grain size and shape in a rice mini-core collection
To assess the power of RapMap, we investigated the effect size and predictive ability of the eight genes cloned using a single round procedure of RapMap. Using the genotypes of the functional variations of the eight genes as the predictor variables and the phenotypes of 541 accessions, we performed multiple linear regression analyses to assess the genetic contributions of the eight genes to grain size and grain shape (length-width ratio) in the rice mini-core collection (Fig. 4a–f and Supplementary Table 1). The four grain-length genes GL1, GS2, GS3 and GL7, and four grain-width genes GW5, GW5.1, GW7 and GW8, totally explained 67.4% and 66.8% phenotypic variations in the rice mini-core collection, respectively (Fig. 4a and b). Among the eight grain-size genes, GS3 and GW5 accounted for 62.5% grain-length and 44.3% grain-width variation, respectively, while other genes showed diverse effects on grain length or width (Fig. 4a and b). Because grain shape is an important quality trait in most rice planting countries, we also investigated the contributions of these eight genes for grain shape (Fig. 4c). Totally 77.2% variations of grain shape could be explained by the eight genes, of which GW5 and GS3 explained 47.5% and 55.0% phenotypic variations, respectively (Fig. 4c). These results imply we have dissected most of the genetic variation for grain size and shape, although there are still nearly one third phenotypic effects unrevealed due to other genetic variations that could be identified by selecting other gradient accessions, coupled with environmental effects and large-scale experimental errors. We investigated the prediction performance of the grain-size genes based on their functional variations (Fig. 4d–f). The correlation coefficient between the observed phenotypes and the predicted values was 0.82 for grain length (Fig. 4d), 0.82 for grain width (Fig. 4e) and 0.88 for grain shape (Fig. 4f), showing a very high prediction power. In short, through a single round of RapMap with 15 crosses, more than two thirds of the molecular genetic architecture could be revealed for rice grain size and shape, which displayed a huge potential of RapMap for discovering genes and comprehensively understanding the molecular genetic architecture of a trait in rice.
Directional selection of grain-size genes during rice domestication and improvement
Grain size has been recognized as a typical domestication trait. However, how human domesticated grain size of rice was not clear. Selective signatures from domestication include altered allele frequency and a reduction in nucleotide diversity in domestication loci45. Here we first investigated the allele proportion of eight grain-size genes identified by RapMap in a large and geographically diverse natural population45, including wild, landrace and cultivar rice of 446, 2462 and 784 accessions, respectively (Supplementary Table 3 and 4). For each of the eight genes, we discovered the stepwise accumulation with different extent of all the long-grain alleles (Fig. 4g) and all the slender-grain alleles (Fig. 4h) from wild to landrace and from landrace to cultivar during rice domestication and improvement (Fig. 4i), indicating the directional and differential selection and enrichment of preexisting genetic variants rather than the acquisition of mutations after domestication for all genes except GS2. The extent of the directional selection of these long-grain and slender-grain alleles in indica subspecies is much higher than that in japonica subspecies, especially for major-effect genes GS3, GW5 and GL7/GW7 (Supplementary Fig. 6a–e). The average grain length and width in cultivar rice are much longer and slenderer than those of landrace rice, respectively (Fig. 4j). The average grain shape in cultivars is also much slenderer and longer than that in landraces (Fig. 4j). The results were consistent with the significant selection signals of the nucleotide diversity reduction detected for each grain-size locus from wild to landrace and from landrace to cultivar (Fig. 4k). These results demonstrated the enrichment of preferred alleles of all grain-size genes coupled with their nucleotide diversity reduction during rice domestication and improvement.
To understand the phenotype effect of different allele combinations (haplotypes) of eight grain-size genes and guide the molecular design breeding, we identified nine haplotypes for grain length (Supplementary Fig. 7a) and 16 haplotypes for grain width (Supplementary Fig. 7b) based on the responsible functional variations in the two diverse panels of landrace and cultivar rice. Three haplotypes with the most proportion of cultivars for grain length, Hap-L7 (78%, 9.50 mm), Hap-L8 (83%, 8.98 mm) and Hap-L9 (100%, 9.05 mm), were higher than the average value of 8.56 in landraces (Supplementary Fig. 7a). Three haplotypes with the most proportion of cultivars for grain width, Hap-W14 (35%, 2.80 mm), Hap-W15 (50%, 3.08 mm) and Hap-W16 (50%, 2.78 mm), were lower than the average value of 3.16 in landraces (Supplementary Fig. 7b). The grain shape of all the six haplotypes with the most proportion of cultivars is much higher than the average level in landraces (Supplementary Fig. 7a and b). It should be noted that the haplotypes with most proportion of cultivars predominantly owned indica subspecies with much slenderer and longer grains than those of landrace rice, but not the case for japonica subspecies (Supplementary Fig. 7c–e). The slender- and long-grain haplotypes and indica subspecies are enriched in low latitude regions with warm temperature, while the wide- and short-grain haplotypes and japonica subspecies are enriched in high latitude regions with low temperature (Supplementary Fig. 8a–c). The landraces broadly distributing worldwide include most japonica rice, while cultivars distributing most in low to moderate latitude regions include most indica rice (Supplementary Fig. 8d). These observations suggest that rice grain size shifted toward slenderer and longer grain shape in indica subspecies, which might be due to the preferential and unparallel combinatorial selection and differentially geographical distribution of these grain-size alleles between indica and japonica rice (Supplementary Fig. 6–8). Furthermore, these results provided a basis and flexibility for human to select or design diverse rice grain shapes.
Selection sweeps of grain-size gene loci and correlation among their selection intensities, DNA variations and grain-shape effects
To find the selection evidence of the grain-size genes, we scanned each sweep region which is always generated by the linkage disequilibrium of the selection target and its surrounding loci, and expected to affect the genetic diversity (Fig. 5a–h and Supplementary Fig. 9). By comparing the ratio of the silent-site nucleotide diversity of each gene in a 500-kb genome region of the GS3 locus between long/short allele accessions in cultivar population (Fig. 5a–h), we discovered a distinct valley of reduced nucleotide diversity for GS3 which contributes to grain length mostly, spanning ~263 kb, among accessions carrying the long gs3 allele (Fig. 5a). For GW5, the most contributor for grain width, there is a ~307-kb valley with reduced nucleotide diversity among accessions carrying the wide gw5 allele (Fig. 5e), which has been recognized as a domestication-related gene located at a 4.8Mb selection sweep by comparing the nucleotide diversity of the wild and cultivated rice previously45. Similarly, there is a typical valley for GW7/GL7, the second important gene for grain width and grain length, spanning a ~229-kb region (Fig. 5b and f). However, there is no evident valley with reduced nucleotide diversity and linkage disequilibrium discovered for other genes with small grain-size effects (Fig. 5c–d, g–h and Fig.4a–c). To confirm whether this reduced nucleotide diversity across the target region of GS3, GW5 and GW7/GL7 is an artifact of demographic history, we compared the long-to-short and wide-to-slender π ratios of the target gene loci. The average π ratios of GS3, GW5 and GW7/GL7 within their corresponding 500-kb regions are 0.14, 0.27 and 0.42, compared to 0.74, 0.85 and 1.73 at the whole chromosome level, respectively; these more than three times’ differences are statistically significant (P<2.2×10-16). Thus, artificial and directional selection but not demographic factors played an important role for the reduced nucleotide diversity in these grain-size gene loci during rice domestication. In addition to cultivar, we also investigated the selection sweeps of these gene loci in wild (Supplementary Fig. 9a–h) and landrace rice (Supplementary Fig. 9i–p). We found that, for GS3, GW5, GW7, GS2 and GL1, the genetic diversity decreased gradually with the domestication process from wild to landrace or cultivar, while GW5.1 and GW8 did not display clear decreasing (Supplementary Fig. 9 and Fig. 5a–h). The selection sweeps of major genes GS3 and GW5 have been shaped during the landrace period, indicating the early domestication has fixed their selection signatures (Supplementary Fig. 9a, i, e, m and Fig. 5a, e). These results together showed the differential extent of selection on the eight genes is likely to be related with the phenotype effect size in the large diverse population.
To further understand the relationship among the size of phenotype effect, the extent of artificial selection and nucleotide variations during grain-size domestication and improvement, we analyzed the pairwise correlation for the eight grain-size genes (Fig. 5i and j). The results showed the strong positive correlation among nucleotide variations, selection intensities, grain-shape effects and phenotype association significance, and this correlation is much higher through selection than that not through selection (Fig. 5i and j). The genes (such as GW5, GS3 and GW7/GL7) under stronger artificial selection due to their location in stronger selection sweep regions (Fig. 5a–h) had more genetic contribution to the whole grain-shape variations (Fig. 5i), usually with a large mount of natural variations between wild and cultivar rice. Although GS2 has a huge effect for grain length in bi-parental linkage populations (Fig.2)38, 39, it has little contribution to the whole phenotype variations in the large natural population due to its fewest nucleotide variations and weak selection (Fig.5i and Fig.4g). However, the favored natural variation of GS2, already present in cultivars, also had been identified by RapMap. Thus, selection plays an essential role in increasing the correlation between DNA variations and grain-size phenotypes.