DNA variability is the basis of species diversity and determines the progress of crop improvements1. Major genomic variants include single nucleotide polymorphisms (SNPs), small insertions and deletions (S-indels), large insertions and deletions (L-indels), structural variations (SVs, including insertions, deletions, inversions, duplications, and translocations), copy number variations (CNVs), transposon insertion polymorphisms (TIPs), and presence-absence variations (PAVs). However, most recent studies have focused on SNPs or SVs, which represent only a very small portion of total polymorphisms in genomes2,3,4,5. Moreover, previous studies revealed that most phenotypic diversity is controlled by several types of variants4,6,7. For instance, the yellow flesh color of peach is controlled by at least three types of variants, including SNP, S-indel, and TIP8. Therefore, it is necessary to construct a genomic variation map with a comprehensive view of all types of genomic variations (named the panvariome) and to associate this diverse set of variants with phenotypic variability.
Peach (Prunus persica L.) is one of the most cosmopolitan fruit species worldwide. In view of its small genome (~224.7 Mb)9, self-compatible mating system, and short juvenile phase (2~3 years), peach is considered a model plant species for Rosaceae family. Although several studies have identified genome-wide SNPs and SVs in peach3,4,10,11,12,13,14, the complete set of variants across the genome keeps largely unclear. As peach is a globally cultivated fruit species, the genetic relationships of global peaches and its route of global spread worldwide need further investigation. In addition, functional gene mining for perennial woody fruit trees always takes many years, resulting in only a few genes controlling agronomic traits have been described in depth15, which significantly limits the development of molecular breeding.
Approximately 2,000 peach accessions are available in the peach gene bank in China16. To enable more efficient utilization of these accessions, we sequenced more than 1,000 peach genomes. Here we report construction and analyses of panvariome and pangenome of 1,020 accessions. Genomic variations, genetic relationships, evolutionary history, and introgression features of this collection are analyzed. A panvariome-based genome-wide association study (GWASPV) is proposed and performed on more than 1900 traits, which significantly improved the power of association studies, precisely identifying conferring genes and causal mutations via ‘one-step GWAS’, thereby accelerating gene mining. Our results provide broad insights into genomic research of plants, animals, and humans.
A full-spectrum panvariome map of peach genomes
A total of 330 TB of sequencing data for 1,020 accessions were used in this study (Supplementary Table 1). These accessions encompassed a collection of peach with global representations, covering all continents, including 80 wild relatives, 331 landraces, and 609 improved cultivars (Supplementary Table 1). Reads were aligned against the ‘Lovell’ reference genome9 with a resulting average coverage of 94.7% and depth of 12.0×. To construct the panvariome, we developed a new framework that enabled rapid and synchronous identification and integration of multiple types of genome-wide variants, including SNPs, S-indels (≤ 6 bp), L-indels (> 6bp), SVs (≥ 30 bp), CNVs, TIPs, and PAVs (Fig. 1a). This framework provides a new solution for rapid and comprehensive identification of genome-wide DNA variants.
Using this framework, we identified 20,324,264 initial SNPs that included all SNPs detected in previous studies3,4,10; 15,344,005 (~75.5%) SNPs were newly identified. Filtering based on a missing rate > 80%, multiallelism, and a minor allele frequency < 0.005 reduced this set to a final set of 6,646,034 SNPs. For multinucleotide variants (MNVs), a total of 1,226,182 S-indels, 1,998,010 L-indels, 431,838 SVs, 8,758 CNVs, and 174,041 TIPs were identified (Fig. 1b-1c). Among the SVs, deletions and translocations were dominant, and inversions were minor (Fig. 1d). By combining variants of different lengths, we generated an initial panvariome of peach comprising 10,484,836 variants, resulting in 46.1 variants per 1 kb and 390.2 variants per gene, representing the variation map with the highest density in peach and the first panvariome in plants. Based on panvariome, we found that most accessions harbored a high ratio of homozygosity (Supplementary Fig. 1), even cultivars from artificial hybridization, implying the narrow genetic background in global peach. Using panvariome, we constructed a world peach core collection of 124 accessions for 1,020 accessions and a culti-core collection for 938 landraces and improved accessions (Supplementary Tables 2 and 3).
Pangenome and gene PAVs in peach genomes
The genome for each accession was de novo assembled, producing a total of 174.8 Gb of contigs longer than 500 bp with an N50 of 3,709 bp (Supplementary Tables 4 and 6). All assembled contigs were compared with ‘Lovell’ reference genome, which produced a nonreference genome comprising a total of 321,253 nonredundant novel contigs (344 Mb) with an N50 of 1,113 bp and an identity < 90% to reference genome (Supplementary Table 5). Approximately 45.5% of the nonreference genome comprised repetitive elements, a slightly greater proportion than reference genome (37.1%) and less than that of nonreference sequences of tomato (78.2%)17. Wild accessions contained many more novel sequences than cultivated accessions (Fig. 2a), with P. davidiana containing the most (average ~5.9 Mb), followed by P. kansuensis (~ 3.0 Mb) and P. mira (~0.95 Mb), indicating a close relationship between P. mira and domesticated peaches.
A total of 3,289 protein-coding genes were predicted in the nonreference genome (Supplementary Table 7). Upon mapping deep sequencing data (40.3×) of ‘Lovell’ against the pangenome, 18 reference genes were not perfectly covered, while 59 novel genes were covered. Notably, the density of novel genes in nonreference sequences in peach was lower than that in tomato (351 Mb, 4,873 genes) and rice (268 Mb, 12,465 genes)2,17. By integrating with the ‘Lovell’ reference genome, the final peach pangenome was formed from a total of 571.4 Mb of sequences and 30,162 protein-coding genes. A total of 2,567 (78.0%) novel genes contained the best hits in the nucleotide sequence database or Pfam domain (Supplementary Table 8). Among these genes, 1,700 showed high sequence similarity with genes in other Prunus fruit species (Fig. 2b, Supplementary Table 9), including 1,417 (83.3%) with almond (P. dulcis)18, 143 (10.0%) with sweet cherry (P. avium)19, and 106 (7.5%) with Japanese apricot (P. mume)20, indicating shared ancestries or interspecific introgressions between Prunus species. Novel genes were clustered into various biological processes, including photosynthesis, stress response, and development, etc (Supplementary Table 10). Novel genes identified in wild accessions included abundant stress resistance genes, and 61 leucine-rich (LRR) family genes, including 28 plant disease resistance genes and four virus resistance genes, were identified (Supplementary Table 10). For instance, we identified a novel gene, PpRPW8, in powdery mildew (PM) resistant accessions (Fig. 2c), located in a previously mapped major quantitative trait locus (QTL)21. A gene homologous to PpRPW8 has been found to confer resistance to PM in Arabidopsis22, making PpRPW8 a strong candidate for PM resistance in peach.
We categorized genes in the pangenome based on their frequency: 22,523 (74.7%) core genes shared by all accessions and 7,639 dispensable genes present in only some of the collection—1,785 (5.9%) softcore, 3,474 (11.5%) shell, and 2,380 (7.9%) cloud genes shared by 99-100%, 1-99% and less than 1% of the accessions, respectively (Fig. 2d). Modeling of pangenome size by iterative random sampling revealed an open pangenome with a continuous decrease in core genome along with an increase in the pangenome (Fig. 2e), suggesting an abundance of variants and a large portion of dispensable genes in peach genome. For novel genes, 42 were present in all of the accessions, and 4, 863, and 2,380 were identified as softcore, shell and cloud genes, respectively. The proportion of core genes in the peach pangenome was similar to that in tomato (74.2%)17 and Arabidopsis (70.0%)23, lower than that in apple (81.3–87.3%)24 and cotton (85.8%)25, and higher than that in Brassica napus (62.0%)26 and rice (54.0%)2. The ‘Lovell’ reference genome contained 100% of core genes but only 57.8% of dispensable genes. The genotypes of PAVs for 30,162 genes in 1,020 accessions were integrated into the initial panvariome, producing a final version of panvariome composed of a total of 10,515,025 variants (Fig. 1b).
Significant gene gains during peach breeding
Genomes of landraces and improved accessions encoded significantly more genes than wild accessions, whereas improved accessions contained slightly more genes than landraces (Fig. 2f), suggesting a general trend of gene gain during peach domestication and subsequent improvement, different from the usual gene loss during domestication in other species, such as tomato17 and cotton25. Furthermore, more genes were gained during domestication than improvement. However, a different trend was found by comparisons using only novel genes, suggesting loss of novel genes and gain of reference genes during domestication and improvement (Fig. 2g). To eliminate the impacts of PAV identification, we further confirmed the conclusion of gene gains by identifying PAVs with different exon coverage thresholds (Fig. 2h). For instance, a reference gene, PpFBX92, encoding an F-box-containing protein involved in regulation of leaf size27, was present in 98.4% of landraces but was rare in wild accessions (0.06%, Supplementary Table 11), implying the gain of an agronomically favorable gene during domestication.
We identified PAVs under selection during the history of peach breeding using two sets of comparisons, between landrace and wild accessions (domestication) and between improved and landrace accessions (improvement) (Supplementary Fig. 2). In total, we identified 805 favorable and 409 unfavorable genes during domestication (Fig. 2i; Supplementary Tables 12 and 13) and 70 favorable and 236 unfavorable genes during improvement (Fig. 2j; Supplementary Tables 14 and 15). Approximately 70% of improvement favorable genes were novel, while only 9.8% of domestication favorable genes were novel, suggesting ongoing selection on novel genes during improvement. Notably, most of unfavorable genes of domestication (95.6%) and improvement (90.3%) were novel. These results suggest that more genes were selected for than selected against during domestication but opposite during improvement, revealing a positive selection during domestication and a negative or purifying selection during improvement. Improvement favorable gene was absent on chromosomes 4 and 5, suggesting that genes on these chromosomes have been fixed during domestication and selections on genes on other chromosomes conferring the improvement. Only 6 (0.8%) favorable and 63 (15.4%) unfavorable genes were shared by domestication and improvement, suggesting their distinct selection targets. Enrichment analysis indicated that plant-pathogen interaction pathway was enriched in domestication favorable genes and improvement unfavorable genes, suggesting that landraces contained certain resistance but lost during improvement subsequently (Fig. 2k).
Reconstruction of the evolutionary history of world peach
To understand the evolutionary patterns of global peach, the 1,020 peach accessions were classified into nine subgroups according to the NJ tree using panvarime (Fig. 3a), which showed high congruence with geographic origins. Nine subgroups were detected: wild relatives (P. mira, P. davidiana, P. kansuensis) (WR); ornamental accessions and landraces from South China (OST), the middle and lower reaches of the Yangtze River (YT), the Yun-Gui Plateau in Southwest China (YG), Northeast China (NE), the North Plain of China (NP), and Northwest China (NW); improved cultivars from Western countries (WI); improved cultivars from Eastern countries (EI). The WI and EI groups showed mixed patterns, as they shared parents during breeding.
Although previous studies addressed the phylogeny of peach3,10,13, the wild progenitors of domesticated peach and relationships among wild relatives remain largely unclear. By analyzing novel sequences and genes in wild relatives, we found that P. davidiana had the most novel sequences, while P. mira had the fewest (Fig. 2a). P. davidiana included a total of 3,071 novel genes, with an average of 341.2 and 25,760.1 novel and reference genes per accession, respectively. Further analysis found that the majority of novel genes in P. davidiana were derived from almond (Fig. 3b), suggesting introgression from almond in the origin of P. davidiana, which is consistent with phenotypic similarities between P. davidiana and almond, such as inedible thin flesh, cracked fruit suture lines, stone textures, and early blooming (Supplementary Fig. 3). Moreover, a gene flow event from almond to P. davidiana also provided strong evidence (Supplementary Fig. 4). In addition, the considerable phenotypic differences between peach and P. davidiana also suggested the presence of a “nonpeach” contributor to its origin (Supplementary Fig. 5). We found a novel gene, PpDOT1, in P. davidiana, almond and the interspecific almond × peach rootstock ‘GF677’ (Fig. 3c-3e), but was absent in P. persica, providing evidence for the introgression. P. mira contained the fewest novel sequences (Fig. 2a), making it the most likely wild progenitor of domesticated peach.
The evolutionary history of global peach was reconstructed based on D-statistics and NJ tree (Fig. 3d; Supplementary Table 16). We detected significant gene flow between P. mira and P. davidiana, further supporting a close relationship between these two species (Supplementary Table 17). The most significant introgression event between wild and domesticated accessions was observed between WR and YT groups, suggesting that YT group was initial domestication group of peach, consistent with fossil evidence28. After domestication, two different spread events to North and South China occurred, generating YG and NP groups. Subsequently, the NP and YG groups became the origin subcenter of North and South China, respectively, with the former derived the NE and NW groups.
Gene flow across the ancient Silk Road (ASR) was analyzed to track the travel history of peach from China to Europe and world (Fig. 3d; Supplementary Table 16). Significant gene flow was detected following the ASR within China, from Shaanxi to Gansu (Z-score=7.14, P < 0.01) and from Gansu to Xinjiang (Z-score=6.40, P < 0.01). Subsequently, peach was transported to Central Asia from Xinjiang over the Pamir Plateau, showing highly significant gene flow events from Xinjiang to Central Asia (Z score=3.14, P < 0.01). Finally, peach arrived in Mediterranean countries from Central Asia (Z score=3.29, P < 0.01) and later traveled from Western Europe to America and Africa. We found the strongest level of private allele sharing with WI group for landraces from NW group (81.0%) (Supplementary Fig. 6), implying accessions from Europe derived from Northwest China via the ASR. Collectively, our data provide genomic evidences for the global evolution pattern of peach and confirm the key role of the ASR in the movement of peach from the East to the West.
Gene flow for peaches from major peach production countries was analyzed, including China, Europe, the United States (US), South Africa, Japan, and South Korea, covering more than 85% of world’s total production (Fig. 3d; Supplementary Table 16). We found extensive pairwise bidirectional gene flows among accessions from different countries, as shared parents during breeding. For US cultivars, major gene flow was observed from European cultivars and landrace from China. For Japanese and South Korean cultivars, significant gene flow was observed from improved cultivars from China, Europe, and America. For South African cultivars, most of genetic background was derived from Europe and North America. For Chinese cultivars, the major introgressions of genetic background were from Europe and South Korea, followed by North America and Japan.
Global introgression of peach
To further understand genome connections of global peach, a genome-wide introgression analysis of 1,020 accessions was performed. In total, we identified 4,942,029 pairwise identical-by-descent (IBD) segments with an average length of 892.2 kb among the 1,020 accessions, covering the entire genome. Only 60,000 (1.2%) cases of IBD were identified between wild and cultivated accessions (Fig. 4a; Supplementary Table 18), indicating that crosses with wild relatives were rare during domestication and improvement. Among domesticated accessions, the ornamental group inherited more IBD segments (4.66 Mb per accession) from wild accessions than rootstocks (2.64 Mb), landraces (0.75 Mb) and improved cultivars (0.35 Mb) (Fig. 4b). A total of 48,582 (0.9%) shared IBD segments were found between wild relatives and a group of accessions including edible landraces and improved cultivars (Fig. 4C), with the most segments derived from P. mira (120.8 Mb), followed by P. kansuensis (105.0 Mb) and P. davidiana (64.9 Mb) (Fig. 4c). The average length of shared IBD segments between cultivated accessions and P. mira (113.8 kb) was longer than that between cultivated accessions and P. kansuensis (89.4 kb) or P. davidiana (34.8 kb) (Fig. 4d). These results further supported P. mira as the most likely wild progenitor of cultivated peach. A total of 2,924 (9.9 Mb) IBD segments were shared by P. mira and P. davidiana, 934 IBD segments (5.3 Mb) shared by P. davidiana and P. kansuensis, and 500 IBD segments (2.2 Mb) shared by P. mira and P. kansuensis (Fig. 4e-4f), further supporting a close genetic relationship between P. mira and P. davidiana.
A previous study has indicated that P. ferganensis is indistinguishable from domesticated peaches9, but its origin is still unclear. Most wild-origin IBD segments in P. ferganensis were derived from P. kansuensis (61.4%), followed by P. mira (27.3%) and P. davidiana (11.3%), suggesting introgression from P. kansuensis in its origin or direct domestication from P. kansuensis (Fig. 3e), consistent with their close geographical distributions and similar stone phenotypes (Supplementary Fig. 7). Moreover, the introgressions from almond also be found in P. ferganensis (Supplementary Fig. 4).
The proportion of introgressed segments from wild accessions in domesticated genomes ranged from 0.000022 to 0.48, with an average of 0.034 (Supplementary Table 19). Ornamental Landraces (average 0.16) had many more wild-origin IBD segments than did edible Landraces (0.028). Landraces harbored a greater percentage of wild-origin IBD segments (average 0.035) than did improved accessions (0.019). A total of 10 introgression hotspots were identified on chromosomes 1, 2, 3, 4, 6, and 7, including six wild introgression hotspots on chromosomes 2, 3, 4, and 6 (Fig. 4g; Supplementary Table 20). A total of 698 genes were located in introgression hotspots, with an average gene number of 87 per Mb, which was lower than genome level (118 genes per Mb), indicating that wild introgressions may be biased toward regions with fewer genes. Stress-related genes were enriched in wild introgression hotspots, including 20 LRR proteins, 38 NB-ARC proteins, and 8 PPR proteins (Supplementary Table 21). Cultivated accessions with higher portion of wild-origin IBDs provided materials with great values for resistant breeding and expansion of genetic background (Supplementary Table 22). For instance, we identified a PpRCI2A gene involved in cold resistance in IBD between P. kansuensis (cold resistant) and cultivated peach in an IBD hotspot on chromosome 6 (Fig. 4g)29. Moreover, we found that expression of PpRCI2A was induced by cold treatment (-24°C) (Fig. 4g), providing new insights for cold resistance gene mining.
We further explored the new utilizations of IBDs in phenotype prediction and gene mining based on shared overlapping IBDs. We found that a total of 11 accessions shared IBD segment containing the Rm3 gene conferring resistance to the green peach aphid from resistant accession ‘Hong Shou Xing’30. Further phenotyping supported the resistance of all 11 accessions (Fig. 4h), providing new insights into IBD-based marker development for traits with known QTLs but unclear causative genes. Another example was the chilling requirement (CR) trait, which is a key trait for peach adaptation. A major QTL (qCR1) for CR has been mapped to chromosome 131,32. Low CR is an essential breeding target for peach in low-altitude regions. To explore the donor gene responsible for low CR, we tracked shared IBD segments harboring qCR1 among low-CR accessions. Finally, a total of 255 shared IBD segments covering qCR1 were identified, with an overlap of 499.9 kb sequences (Pp01:43,409,702-43,909,637) (Fig. 4i). Using overlapping sequences, we constructed an NJ tree and found that the landrace accession ‘Nan Shan Tian Tao’ from South China was the progenitor of low-CR allele (Supplementary Fig. 8).
Large MNVs in the peach genome
Large MNVs (> 6 bp) are crucial polymorphisms for evolution and trait variability but are still poorly studied. A total of 2,626,686 large MNVs were included in our panvariome (Fig. 5a), comprising L-indels, SVs, CNVs, TIPs, and PAVs (Fig. 5b-5e). The TIPs consisted of 34,968 reference and 136,767 nonreference novel TIPs, which could be divided into five major superfamilies, namely, 38,997 terminal inverted repeats (TIRs), 35,211 Copia long terminal repeats (LTRs), 32,532 Gypsy LTRs, 29,995 MITEs, and 36,608 unclassified LTRs. Of these, 112,734 TIPs (72.0%) occurred in the bodies (14,232, 53.0%) or regulatory regions (11,632, 43.3%) of 25,864 peach genes (96.2%). Several known causative TIPs underlying agronomic traits were identified, for instance, an LTR insertion in PpMYB25 conferring a nectarine phenotype (G locus, PpMYB25; LTR, Pp05: 15,893,165)33 and an LTR in PpCCD4 underlying yellow flesh (Y locus, PpCCD4; LTR, Pp01: 26,614,904)8. However, these two causal variants could not explain phenotypic variations of all 1,020 accessions. For flesh color, a rarely new Copia LTR insertion (Pp01: 26,615,680) in the third exon resulting in loss of function of PpCCD4 underlying yellow flesh was identified in five accessions (Supplementary Fig. 9). The previously reported TIP underlying nectarine was absent in 15 nectarine accessions, and a new causal nonsynonymous SNP located at a highly conserved site of PpMYB25 (Pp05: 15,893,290, A>G) was identified (Supplementary Fig. 10). Intriguingly, all the 21 accessions were from Northwest China, suggesting private causal variants and novel origin events of these two traits. The similar situation was also observed in fruit flesh texture gene, Hd (Supplementary Fig. 11). TIPs showed a distinct pattern compared with other MNVs, showing greater abundance in landrace and improved accessions but a lower abundance in wild accessions and suggesting the generation of novel TIPs during breeding (Fig. 5d).
Among the SVs, 229,569 (~49.7%) of which were more abundant than in previous works3,4, 244,068 (56.5%) were found in fewer than 10 accessions (<1%), and 347,439 (80.5%) were observed in fewer than 50 accessions (<5%) (Supplementary Fig. 12). A total of 34,847 (31.6%) and 28,089 (25.5%) SVs were longer than 5 kb and 1 Mb, respectively. Wild relatives harbored more SVs than domesticated accessions (Fig. 5e), with different patterns for DELs, INSs, DUPs, INVs, and TRAs (Fig. 5f-5j). INVs have been reported to impact phenotypes, fertility, and recombination in humans and plants34,35,36. We identified 4,982 INVs in 1,020 accessions, with an average length of 627.6 kb, and 256 (5.3%) were longer than 1 Mb. Wild accessions contained fewer INVs than landraces and improved cultivars, suggesting the generation of new INVs during domestication and improvement (Fig. 5j). Most of INVs had a low frequency, and 85.9% had a frequency lower than 1% (Fig. 5k). The distribution patterns of INVs and SNPs across the genome were often opposite (Fig. 5l), implying that the generation of SNPs is limited by INVs.
Previously, we identified a 1.67 Mb INV that impacted the function of PpOPF1 underlying flat vs. round fruit shape3,4. Using an INV-based GWAS, the 1.67 Mb INV (Pp06: 26.85-28.52 Mb) was also found to be associated with fruit shape in this study (Fig. 5m-5n). We found this INV did not alter gene expressions but induced a strong long-distance differentiation (Fig. 5o-5p). Moreover, the linkage disequilibrium (LD) within this INV in flat accessions (LD decay 13.5 kb) was significantly greater than that in round accessions (LD decay 4.0 kb), and a long block with high LD (16.7-30.6 Mb) was identified (Fig. 5q), suggesting that recombination is strongly suppressed by this INV. To further verify this inference, we investigated recombination rate using a round × flat peach cross, and the suppressed recombination was observed around this INV (Fig. 5p). Upon construction of an NJ-tree for cross population, strong separation between round- and flat-fruit accessions was found (Fig. 5r). We also found that flat-fruit accessions with this INV were clustered on relatively independent evolutionary branches in NJ tree (Supplementary Fig. 13). These results suggest that INVs have extensive impacts on genome landscape and contributed to the formation of a new horticultural type within peach. Overall, the large-scale panvariome has contributed to the mining of new or rare natural functional variants.
GWASPV, panvariome based GWAS, enabled the precise mapping of trait-conferring genes via “one-step GWAS”
The panvariome enabled an updated version of GWAS, which we named GWASPV, and significantly improved statistical power, making it possible to efficiently identify key genes and causal mutations involved in major gene-based traits with only one step, which achieved a shortening of 8 years for gene mining in maximum (Fig. 6a). GWASPV of 40 agronomic traits, 1,858 metabolic traits, and 51 environmental variables (only for landraces) was performed, and more than 2,000 novel associations were identified (Supplementary Tables 24 and 26). The causal variants and genes underlying the phenotypic variability of 6 well-characterized traits were directly identified by using the top signals with GWASPV, namely, fruit shape (1.67 Mb INV, Pp06: 26,847,156, S, PpOFP1)4(Fig. 6b), flesh color (2 bp S-indel, Pp01: 26,614,083, Y, PpCCD4)8(Fig. 6c), fruit hairiness (6.0 kb TIP, Pp05: 15,893,169, G, PpMYB25)37, flesh texture (70.5 kb DEL, Pp04: 19,026,186, M, PpPGF)3, flesh adhesion (PAV, Pp04: 19,081,325, F, PpPGM)38, and weeping habit (1.37 kb DEL, Pp03: 20,945,671, Pl, PpWEEP)39, validating the precise mapping of causal genes using GWASPV with only one step (Supplementary Table 24).
Trait-conferring genes and causal mutations for traits that have been mapped but not identified were discovered using GWASPV. Previous study has mapped a locus associated with kernel taste (bitter/sweet) 40, but causal gene keeps unknown. In this study, GWASPV revealed that a 6,492 bp TIP was strongly associated with kernel taste (Fig. 6d; Supplementary Fig. 14a), resulting in loss of function of PpbHLH14, the homology of which has been reported to underlie kernel taste in almond41. Similarly, for pollen sterility gene (Ps), the causal gene remains unknown, but has been mapped42. Using GWASPV, we found that Ps was defined by PpRLK1.1, encoding a G-type lectin S-receptor-like protein kinase (Fig. 6e), which is homologous to genes that participate in regulation of male sterility in Arabidopsis and several crops43. Another example is flower color, we identified a strong candidate gene, PpWD40.1, using GWASPV (Supplementary Fig. 14b).
Previous studies on brachytic dwarfism have revealed a locus (Dw) and a candidate gene, PpGID1c, encoding a gibberellin receptor, with a nonsynonymous SNP and a stop-gain SNP in dwarf accessions44,45. We genotyped the two SNPs in 20 dwarf accessions and found that dwarf phenotypes could not be completely explained by PpGID1c variations. For instance, ‘Le Yuan’, a dwarf accession from selfed offspring of ‘Early Red 2’ (normal growth), was expected to be homozygous for these SNPs44,45 because the dwarf allele is recessive, but heterozygous was observed, suggesting inability of these SNPs to explain dwarfism. Using GWASPV, we identified a nonsynonymous SNP located in the third exon of PpMEF20 associated with dwarfism (Fig. 6f), including in ‘Le Yuan’. Moreover, mutations in homology of PpMEF20 in Arabidopsis and rice lead to slow-growing phenotypes46, further supporting the hypothesis that PpMEF20 defined Dw locus.
Other examples include three color related traits: fruit skin color (H), flesh color surrounding the stone (Cs), and leaf color (Gr), which have been mapped10,47,48. Using GWASPV, we found that a 486 bp DEL in the promoter of PpMYB10.1 was associated with flesh color surrounding the stone, and the presence of a 486 bp DEL resulted in a red flesh color surrounding the stone (Fig. 6g). For fruit skin color, an associated SNP 1,319 bp upstream of PpMYB10.2 were identified (Fig. 6h). For leaf color, a large TRA downstream of PpMYB10.4 on chromosome 6 hijacked new sequences from chromosome 8 were identified (Fig. 6i), underlying the genetic variation in red leaves.
Maturity date is an important breeding target for market life of peach. A major gene (MD) was identified on chromosome 449, with a 9 bp L-indel in a candidate gene, PpNAC5. By analyzing of 1,020 accessions, we found that the 9 bp L-indel had poor power in explaining phenotypic variation (accuracy of 36.8%). In this study, two phenotypes were measured: ripen date and fruit development period. GWASPV on these two measurements revealed a 210 bp INV and a 486 bp DEL associated with the trait, and both variants were located upstream of PpNAC1 (Fig. 6j-6k). Moreover, the agreement rates (75.7% and 80.3%) between phenotype and genotype of the two variants were greater than that of the 9 bp L-indel. Therefore, our results support PpNAC1 as causal gene for maturity date, which is consistent with findings of a recent study on an early-ripening bud sport of peach50.
In peach, the dominant sugar and acid are sucrose and malic acid, which confer most of sweetness and acidity to fruit flavor, but key genes remain unclear. Using GWASPV, the top signal for malic acid content was mapped on chromosome 5, with a nonsynonymous SNP in the third exon of the PpTST1 gene, leading to a high acid content (Fig. 6l). The biological function of PpTST1 in the regulation of organic acids was verified in our previous study51. For sucrose content, we identified a new locus at the top of chromosome 2, PpNCED1, with an 8 bp L-indel associated with this trait (Fig. 6m). The expression of this gene in fruit flesh at the mature stage was greater in sweet-fruit cultivars (Supplementary Fig. 15). PpNCED1 is a key gene involved in synthesis of abscisic acid (ABA), which contributes to accumulation of sugars via interactions between ABA signal transduction and PpSPS112. In addition, we also identified a gyp1p superfamily gene (Prupe.4G179600) and its 11 bp L-indel underlying fruit weight, supporting by gene expression (Supplementary Fig. 16) and similar function of homologous gene in rice52.
In view of traits always conferred by multiple genes or multiple types of variations and the higher quality of SNP calling than MNV calling using short reads, the SNP associations might be overrepresented and real associations of SV might be depressed in GWASPV; therefore, a separate GWAS based on different types of variations within panvariome, which we named GWASPVMulti, was also included as an optional and complementary method in GWASPV. Using GWASPVMulti, more than 50,000 associations for 41 agronomic, 1,858 metabolic, and 51 environmental traits were identified (Supplementary Tables 27 and 29), providing abundant candidates for gene mining. For example, the top association signal from GWASPV for CR was a SNP at Pp01: 43,717,948, which was 244.4 kb from the strong candidate gene PpDAM632 (Supplementary Fig. 18). However, using GWASPVMulti, the key gene PpDAM6 and a 30 bp casual SV in its promoter were successfully identified (Supplementary Fig. 17).
Collectively, GWASPV significantly improved gene mapping power compared with that of conventional GWAS based on a single type of variant, making it possible to determine causative genes and causal polymorphisms via ‘one-step GWAS’. The critical reason for high efficiency of GWASPV is the use of full-spectrum panvariome. We found that the genome-wide LD level for panvariome (half LD decay distance of 21 bp, r2= 0.209) was significantly lower than that for SNPs (3.0 kb, r2= 0.259) (Supplementary Fig. 18), suggesting that many more recombination events were considered in GWASPV, improving the precision of associations. Furthermore, the number of LD blocks estimated by panvariome was greater than that of SNPs (Supplementary Fig. 18), which also improved the power of GWAS. The identified associations provided abundant of functional variations and valuable markers for genomic selection and genome design breeding.