Panvariome and pangenome of 1,020 global peach accessions shed light on evolution pattern, hidden natural variation and efficient gene discovery

doi:10.21203/rs.3.rs-4407657/v1

https://doi.org/10.21203/rs.3.rs-4407657/v1

This work is licensed under a CC BY 4.0 License

Version 1

Natural variation is the basis of crop improvement. However, genomic variability remains largely understudied. We present the full-spectrum panvariome and pangenome of 1,020 peach accessions, comprising 10.5 million SNPs, indels, SVs, CNVs, TIPs, and PAVs, and uncovering 70.6% novel variants and 3,289 novel genes. Analysis of the panvariome reconstructs the global evolutionary history of peach and identifies several rare trait-causal variants. Landraces and improved accessions encode more genes than wild accessions, suggesting gene gains during evolution. Global introgression patterns reveal new uses of introgression in phenotype prediction and gene mining and suggest that the most likely wild progenitor of domesticated peach is Prunus mira and that almond was involved in the origin of Prunus davidiana. We develop a novel panvariome-based solution for association studies, GWASPV, that achieves rapid and precise identification of trait-conferring genes using only one-step GWAS. Our study provides a novel solution for gene mining, with important implications for accelerating plant breeding.

Biological sciences/Plant sciences/Plant genetics

Biological sciences/Genetics/Population genetics

DNA variability is the basis of species diversity and determines the progress of crop improvements¹. Major genomic variants include single nucleotide polymorphisms (SNPs), small insertions and deletions (S-indels), large insertions and deletions (L-indels), structural variations (SVs, including insertions, deletions, inversions, duplications, and translocations), copy number variations (CNVs), transposon insertion polymorphisms (TIPs), and presence-absence variations (PAVs). However, most recent studies have focused on SNPs or SVs, which represent only a very small portion of total polymorphisms in genomes^2,3,4,5. Moreover, previous studies revealed that most phenotypic diversity is controlled by several types of variants^4,6,7. For instance, the yellow flesh color of peach is controlled by at least three types of variants, including SNP, S-indel, and TIP⁸. Therefore, it is necessary to construct a genomic variation map with a comprehensive view of all types of genomic variations (named the panvariome) and to associate this diverse set of variants with phenotypic variability.

Peach (Prunus persica L.) is one of the most cosmopolitan fruit species worldwide. In view of its small genome (~224.7 Mb)⁹, self-compatible mating system, and short juvenile phase (2-3 years), peach is considered a model plant species for the Rosaceae family. Although several studies have identified genome-wide SNPs and SVs in peach^3,4,10,11,12,13,14, the complete set of variants across the genome remains largely unclear. As peach is a globally cultivated fruit species, the genetic relationships among global peaches and the routes of its worldwide spread need further investigation. In addition, functional gene mining in perennial woody fruit trees takes many years, so only a few genes controlling agronomic traits have been described in depth¹⁵, which significantly limits the development of molecular breeding.

Approximately 2,000 peach accessions are available in the peach gene bank in China¹⁶. To enable more efficient utilization of these accessions, we sequenced more than 1,000 peach genomes. Here we report the construction and analysis of the panvariome and pangenome of 1,020 accessions. Genomic variations, genetic relationships, evolutionary history, and introgression features of this collection are analyzed. A panvariome-based genome-wide association study (GWASPV) is proposed and performed on more than 1,900 traits. It significantly improved the power of association studies and precisely identified trait-conferring genes and causal mutations via ‘one-step GWAS’, thereby accelerating gene mining. Our results provide broad insights into genomic research of plants, animals, and humans.

A full-spectrum panvariome map of peach genomes

A total of 330 TB of sequencing data for 1,020 accessions were used in this study (Supplementary Table 1). These accessions form a globally representative peach collection covering all continents and include 80 wild relatives, 331 landraces, and 609 improved cultivars (Supplementary Table 1). Reads were aligned against the ‘Lovell’ reference genome⁹, resulting in an average coverage of 94.7% and depth of 12.0×. To construct the panvariome, we developed a new framework that enabled rapid and synchronous identification and integration of multiple types of genome-wide variants, including SNPs, S-indels (≤ 6 bp), L-indels (> 6 bp), SVs (≥ 30 bp), CNVs, TIPs, and PAVs (Fig. 1a). This framework provides a new solution for rapid and comprehensive identification of genome-wide DNA variants.

Using this framework, we identified 20,324,264 initial SNPs that included all SNPs detected in previous studies^3,4,10; 15,344,005 (~75.5%) SNPs were newly identified. Removing SNPs with a missing rate > 80%, multiple alleles, or a minor allele frequency < 0.005 reduced this set to a final set of 6,646,034 SNPs. For multinucleotide variants (MNVs), a total of 1,226,182 S-indels, 1,998,010 L-indels, 431,838 SVs, 8,758 CNVs, and 174,041 TIPs were identified (Fig. 1b-1c). Among the SVs, deletions and translocations were dominant, and inversions were rare (Fig. 1d). By combining variants of different lengths, we generated an initial panvariome of peach comprising 10,484,863 variants (the sum of the six classes above), corresponding to 46.1 variants per 1 kb and 390.2 variants per gene, representing the highest-density variation map in peach and the first panvariome in plants. Based on the panvariome, we found that most accessions harbored a high ratio of homozygosity (Supplementary Fig. 1), even cultivars derived from artificial hybridization, implying a narrow genetic background in global peach. Using the panvariome, we constructed a world peach core collection of 124 accessions from the 1,020 accessions and a culti-core collection from the 938 landraces and improved accessions (Supplementary Tables 2 and 3).
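The SNP filtering step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the `SnpSite` record and the `passes_filters` function are hypothetical names, and the exact handling of values that fall precisely on a threshold is assumed.

```python
from dataclasses import dataclass


@dataclass
class SnpSite:
    """Hypothetical per-site summary; field names are illustrative only."""
    n_alleles: int       # distinct alleles observed at the site (2 = biallelic)
    missing_rate: float  # fraction of accessions without a genotype call
    maf: float           # minor allele frequency among called genotypes


def passes_filters(site, max_missing=0.80, min_maf=0.005):
    """Keep biallelic sites with missing rate <= 80% and MAF >= 0.005.

    The thresholds come from the text; treating the boundary values as
    passing is an assumption.
    """
    return (
        site.n_alleles == 2
        and site.missing_rate <= max_missing
        and site.maf >= min_maf
    )


sites = [
    SnpSite(n_alleles=2, missing_rate=0.10, maf=0.20),   # kept
    SnpSite(n_alleles=3, missing_rate=0.10, maf=0.20),   # multiallelic, removed
    SnpSite(n_alleles=2, missing_rate=0.90, maf=0.20),   # too much missing data
    SnpSite(n_alleles=2, missing_rate=0.10, maf=0.001),  # too rare, removed
]
kept = [s for s in sites if passes_filters(s)]
```

With 1,020 diploid accessions, a MAF cutoff of 0.005 corresponds to roughly ten copies of the minor allele, so variants private to a handful of accessions can still survive this filter.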

Pangenome and gene PAVs in peach genomes

The genome of each accession was de novo assembled, producing a total of 174.8 Gb of contigs longer than 500 bp with an N50 of 3,709 bp (Supplementary Tables 4 and 6). All assembled contigs were compared with the ‘Lovell’ reference genome, which produced a nonreference genome comprising a total of 321,253 nonredundant novel contigs (344 Mb) with an N50 of 1,113 bp and an identity < 90% to the reference genome (Supplementary Table 5). Approximately 45.5% of the nonreference genome comprised repetitive elements, a slightly greater proportion than in the reference genome (37.1%) and less than that of the nonreference sequences of tomato (78.2%)¹⁷. Wild accessions contained many more novel sequences than cultivated accessions (Fig. 2a), with P. davidiana containing the most (average ~5.9 Mb), followed by P. kansuensis (~3.0 Mb) and P. mira (~0.95 Mb), indicating a close relationship between P. mira and domesticated peaches.

A total of 3,289 protein-coding genes were predicted in the nonreference genome (Supplementary Table 7). Upon mapping deep sequencing data (40.3×) of ‘Lovell’ against the pangenome, 18 reference genes were not perfectly covered, while 59 novel genes were covered. Notably, the density of novel genes in nonreference sequences in peach was lower than that in tomato (351 Mb, 4,873 genes) and rice (268 Mb, 12,465 genes)^2,17. By integrating the nonreference genome with the ‘Lovell’ reference genome, the final peach pangenome was formed from a total of 571.4 Mb of sequences and 30,162 protein-coding genes. A total of 2,567 (78.0%) novel genes had best hits in the nucleotide sequence database or contained a Pfam domain (Supplementary Table 8). Among these genes, 1,700 showed high sequence similarity with genes in other Prunus fruit species (Fig. 2b, Supplementary Table 9), including 1,417 (83.3%) with almond (P. dulcis)¹⁸, 143 (10.0%) with sweet cherry (P. avium)¹⁹, and 106 (7.5%) with Japanese apricot (P. mume)²⁰, indicating shared ancestries or interspecific introgressions between Prunus species. Novel genes were clustered into various biological processes, including photosynthesis, stress response, and development (Supplementary Table 10). Novel genes identified in wild accessions included abundant stress resistance genes: 61 leucine-rich repeat (LRR) family genes, including 28 plant disease resistance genes and four virus resistance genes, were identified (Supplementary Table 10). For instance, we identified a novel gene, PpRPW8, in powdery mildew (PM) resistant accessions (Fig. 2c), located in a previously mapped major quantitative trait locus (QTL)²¹. A gene homologous to PpRPW8 has been found to confer resistance to PM in Arabidopsis²², making PpRPW8 a strong candidate for PM resistance in peach.

We categorized genes in the pangenome based on their frequency: 22,523 (74.7%) core genes shared by all accessions and 7,639 dispensable genes present in only some of the collection. The dispensable genes comprised 1,785 (5.9%) softcore, 3,474 (11.5%) shell, and 2,380 (7.9%) cloud genes, shared by 99-100%, 1-99%, and less than 1% of the accessions, respectively (Fig. 2d). Modeling of pangenome size by iterative random sampling revealed an open pangenome with a continuous decrease in the core genome along with an increase in the pangenome (Fig. 2e), suggesting an abundance of variants and a large portion of dispensable genes in the peach genome. For novel genes, 42 were present in all of the accessions, and 4, 863, and 2,380 were identified as softcore, shell, and cloud genes, respectively. The proportion of core genes in the peach pangenome was similar to that in tomato (74.2%)¹⁷ and Arabidopsis (70.0%)²³, lower than that in apple (81.3–87.3%)²⁴ and cotton (85.8%)²⁵, and higher than that in Brassica napus (62.0%)²⁶ and rice (54.0%)². The ‘Lovell’ reference genome contained 100% of core genes but only 57.8% of dispensable genes. The genotypes of PAVs for 30,162 genes in 1,020 accessions were integrated into the initial panvariome, producing a final version of the panvariome composed of a total of 10,515,025 variants (Fig. 1b).
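The frequency-based gene categories above can be expressed as a small classifier. The sketch below is illustrative only: the function name `classify_gene` is hypothetical, and how the study treated genes whose frequency falls exactly on the 1% or 99% boundary is an assumption. The second half simply re-derives the reported percentages from the reported counts.

```python
def classify_gene(n_present, n_total=1020):
    """Assign a pangenome category from the number of accessions carrying a gene.

    Category definitions follow the text (core: all accessions; softcore:
    99-100%; shell: 1-99%; cloud: <1%). Boundary handling is an assumption.
    """
    freq = n_present / n_total
    if n_present == n_total:
        return "core"
    if freq >= 0.99:
        return "softcore"
    if freq >= 0.01:
        return "shell"
    return "cloud"


# Gene counts per category as reported in the text.
reported = {"core": 22_523, "softcore": 1_785, "shell": 3_474, "cloud": 2_380}
total_genes = sum(reported.values())
dispensable = total_genes - reported["core"]
percentages = {k: round(100 * v / total_genes, 1) for k, v in reported.items()}
```

The four categories add up to the 30,162 pangenome genes, and the three non-core categories add up to the 7,639 dispensable genes, so the reported breakdown is internally consistent.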

Significant gene gains during peach breeding

Genomes of landraces and improved accessions encoded significantly more genes than wild accessions, whereas improved accessions contained slightly more genes than landraces (Fig. 2f), suggesting a general trend of gene gain during peach domestication and subsequent improvement, different from the usual gene loss during domestication in other species, such as tomato¹⁷ and cotton²⁵. Furthermore, more genes were gained during domestication than improvement. However, a different trend was found by comparisons using only novel genes, suggesting loss of novel genes and gain of reference genes during domestication and improvement (Fig. 2g). To eliminate the impacts of PAV identification, we further confirmed the conclusion of gene gains by identifying PAVs with different exon coverage thresholds (Fig. 2h). For instance, a reference gene, PpFBX92, encoding an F-box-containing protein involved in regulation of leaf size²⁷, was present in 98.4% of landraces but was rare in wild accessions (0.06%, Supplementary Table 11), implying the gain of an agronomically favorable gene during domestication.

We identified PAVs under selection during the history of peach breeding using two sets of comparisons: between landraces and wild accessions (domestication) and between improved accessions and landraces (improvement) (Supplementary Fig. 2). In total, we identified 805 favorable and 409 unfavorable genes during domestication (Fig. 2i; Supplementary Tables 12 and 13) and 70 favorable and 236 unfavorable genes during improvement (Fig. 2j; Supplementary Tables 14 and 15). Approximately 70% of improvement-favorable genes were novel, while only 9.8% of domestication-favorable genes were novel, suggesting ongoing selection on novel genes during improvement. Notably, most of the unfavorable genes of domestication (95.6%) and improvement (90.3%) were novel. These results suggest that more genes were selected for than against during domestication, whereas the opposite held during improvement, revealing positive selection during domestication and negative or purifying selection during improvement. No improvement-favorable genes were found on chromosomes 4 and 5, suggesting that genes on these chromosomes were fixed during domestication and that selection on genes on other chromosomes drove improvement. Only 6 (0.8%) favorable and 63 (15.4%) unfavorable genes were shared by domestication and improvement, suggesting distinct selection targets. Enrichment analysis indicated that the plant-pathogen interaction pathway was enriched in domestication-favorable genes and improvement-unfavorable genes, suggesting that landraces carried certain resistance genes that were subsequently lost during improvement (Fig. 2k).

Reconstruction of the evolutionary history of world peach

To understand the evolutionary patterns of global peach, the 1,020 peach accessions were classified into nine subgroups according to the NJ tree built from the panvariome (Fig. 3a), which showed high congruence with geographic origins. The nine subgroups were: wild relatives (P. mira, P. davidiana, P. kansuensis) (WR); ornamental accessions and landraces from South China (OST), the middle and lower reaches of the Yangtze River (YT), the Yun-Gui Plateau in Southwest China (YG), Northeast China (NE), the North Plain of China (NP), and Northwest China (NW); improved cultivars from Western countries (WI); and improved cultivars from Eastern countries (EI). The WI and EI groups showed mixed patterns, as they shared parents during breeding.

Although previous studies addressed the phylogeny of peach^3,10,13, the wild progenitors of domesticated peach and the relationships among wild relatives remain largely unclear. By analyzing novel sequences and genes in wild relatives, we found that P. davidiana had the most novel sequences, while P. mira had the fewest (Fig. 2a). P. davidiana included a total of 3,071 novel genes, with an average of 341.2 and 25,760.1 novel and reference genes per accession, respectively. Further analysis found that the majority of novel genes in P. davidiana were derived from almond (Fig. 3b), suggesting introgression from almond in the origin of P. davidiana, which is consistent with phenotypic similarities between P. davidiana and almond, such as inedible thin flesh, cracked fruit suture lines, stone textures, and early blooming (Supplementary Fig. 3). Moreover, a gene flow event detected from almond to P. davidiana provided strong supporting evidence (Supplementary Fig. 4). In addition, the considerable phenotypic differences between peach and P. davidiana also suggested the presence of a “nonpeach” contributor to its origin (Supplementary Fig. 5). We found a novel gene, PpDOT1, that was present in P. davidiana, almond, and the interspecific almond × peach rootstock ‘GF677’ (Fig. 3c-3e) but absent in P. persica, providing further evidence for the introgression. P. mira contained the fewest novel sequences (Fig. 2a), making it the most likely wild progenitor of domesticated peach.

The evolutionary history of global peach was reconstructed based on D-statistics and the NJ tree (Fig. 3d; Supplementary Table 16). We detected significant gene flow between P. mira and P. davidiana, further supporting a close relationship between these two species (Supplementary Table 17). The most significant introgression event between wild and domesticated accessions was observed between the WR and YT groups, suggesting that the YT group was the initial domestication group of peach, consistent with fossil evidence²⁸. After domestication, two separate spread events, to North and South China, occurred, generating the NP and YG groups. Subsequently, the NP and YG groups became the origin subcenters of North and South China, respectively, with the former giving rise to the NE and NW groups.

Gene flow across the ancient Silk Road (ASR) was analyzed to track the travel history of peach from China to Europe and the rest of the world (Fig. 3d; Supplementary Table 16). Significant gene flow was detected along the ASR within China, from Shaanxi to Gansu (Z-score = 7.14, P < 0.01) and from Gansu to Xinjiang (Z-score = 6.40, P < 0.01). Subsequently, peach was transported from Xinjiang to Central Asia over the Pamir Plateau, as shown by highly significant gene flow from Xinjiang to Central Asia (Z-score = 3.14, P < 0.01). Finally, peach arrived in Mediterranean countries from Central Asia (Z-score = 3.29, P < 0.01) and later traveled from Western Europe to America and Africa. Landraces from the NW group showed the strongest level of private allele sharing with the WI group (81.0%) (Supplementary Fig. 6), implying that accessions from Europe were derived from Northwest China via the ASR. Collectively, our data provide genomic evidence for the global evolution pattern of peach and confirm the key role of the ASR in the movement of peach from the East to the West.
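For readers unfamiliar with how a D-statistic yields the Z-scores reported above, the following Python sketch shows one common formulation (the ABBA-BABA test on derived-allele frequencies, with a delete-one-block jackknife for the standard error). It is an illustration under assumptions, not the procedure used in the study: the text does not state which software or block scheme was applied, the function names are hypothetical, the outgroup is assumed to be fixed for the ancestral allele, and the two-sided normal P value is likewise an assumption.

```python
import math


def d_statistic(p1, p2, p3):
    """Patterson's D from derived-allele frequencies in populations P1, P2, P3.

    Positive D indicates excess allele sharing between P2 and P3.
    """
    abba = sum((1 - a) * b * c for a, b, c in zip(p1, p2, p3))
    baba = sum(a * (1 - b) * c for a, b, c in zip(p1, p2, p3))
    return (abba - baba) / (abba + baba)


def jackknife_z(p1, p2, p3, n_blocks):
    """Z-score of D using a delete-one-block jackknife over equal-sized blocks.

    Inputs are Python lists ordered along the genome; trailing sites that do
    not fill a whole block are kept in every replicate.
    """
    size = len(p1) // n_blocks
    leave_one_out = []
    for b in range(n_blocks):
        lo, hi = b * size, (b + 1) * size
        leave_one_out.append(
            d_statistic(p1[:lo] + p1[hi:], p2[:lo] + p2[hi:], p3[:lo] + p3[hi:])
        )
    mean = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum((d - mean) ** 2 for d in leave_one_out)
    return d_statistic(p1, p2, p3) / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided P value for a Z-score under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Toy data: four ABBA sites and two BABA sites across three blocks of two sites.
toy_p1 = [0, 0, 1, 0, 1, 0]
toy_p2 = [1, 1, 0, 1, 0, 1]
toy_p3 = [1, 1, 1, 1, 1, 1]
toy_d = d_statistic(toy_p1, toy_p2, toy_p3)
toy_z = jackknife_z(toy_p1, toy_p2, toy_p3, n_blocks=3)
```

Under this normal approximation, each of the Z-scores quoted in the paragraph (3.14 to 7.14) corresponds to a two-sided P value below 0.01.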

Gene flow among peaches from the major peach-producing countries and regions was analyzed, including China, Europe, the United States (US), South Africa, Japan, and South Korea, which together account for more than 85% of the world’s total production (Fig. 3d; Supplementary Table 16). We found extensive pairwise bidirectional gene flow among accessions from different countries, because parents were shared during breeding. For US cultivars, major gene flow was observed from European cultivars and landraces from China. For Japanese and South Korean cultivars, significant gene flow was observed from improved cultivars from China, Europe, and America. For South African cultivars, most of the genetic background was derived from Europe and North America. For Chinese cultivars, the major introgressions were from Europe and South Korea, followed by North America and Japan.

Global introgression of peach

To further understand the genomic connections among global peach, a genome-wide introgression analysis of the 1,020 accessions was performed. In total, we identified 4,942,029 pairwise identical-by-descent (IBD) segments with an average length of 892.2 kb among the 1,020 accessions, covering the entire genome. Only 60,000 (1.2%) cases of IBD were identified between wild and cultivated accessions (Fig. 4a; Supplementary Table 18), indicating that crosses with wild relatives were rare during domestication and improvement. Among domesticated accessions, the ornamental group inherited more IBD segments (4.66 Mb per accession) from wild accessions than rootstocks (2.64 Mb), landraces (0.75 Mb), and improved cultivars (0.35 Mb) (Fig. 4b). A total of 48,582 (0.9%) shared IBD segments were found between wild relatives and a group of accessions including edible landraces and improved cultivars, with the most segments derived from P. mira (120.8 Mb), followed by P. kansuensis (105.0 Mb) and P. davidiana (64.9 Mb) (Fig. 4c). The average length of shared IBD segments between cultivated accessions and P. mira (113.8 kb) was longer than that between cultivated accessions and P. kansuensis (89.4 kb) or P. davidiana (34.8 kb) (Fig. 4d). These results further supported P. mira as the most likely wild progenitor of cultivated peach. A total of 2,924 IBD segments (9.9 Mb) were shared by P. mira and P. davidiana, 934 IBD segments (5.3 Mb) by P. davidiana and P. kansuensis, and 500 IBD segments (2.2 Mb) by P. mira and P. kansuensis (Fig. 4e-4f), further supporting a close genetic relationship between P. mira and P. davidiana.

A previous study indicated that P. ferganensis is indistinguishable from domesticated peaches⁹, but its origin is still unclear. Most wild-origin IBD segments in P. ferganensis were derived from P. kansuensis (61.4%), followed by P. mira (27.3%) and P. davidiana (11.3%), suggesting introgression from P. kansuensis in its origin or direct domestication from P. kansuensis (Fig. 3e), consistent with their close geographical distributions and similar stone phenotypes (Supplementary Fig. 7). Moreover, introgression from almond was also found in P. ferganensis (Supplementary Fig. 4).

The proportion of introgressed segments from wild accessions in domesticated genomes ranged from 0.000022 to 0.48, with an average of 0.034 (Supplementary Table 19). Ornamental landraces (average 0.16) had many more wild-origin IBD segments than did edible landraces (0.028). Landraces harbored a greater percentage of wild-origin IBD segments (average 0.035) than did improved accessions (0.019). A total of 10 introgression hotspots were identified on chromosomes 1, 2, 3, 4, 6, and 7, including six wild introgression hotspots on chromosomes 2, 3, 4, and 6 (Fig. 4g; Supplementary Table 20). A total of 698 genes were located in introgression hotspots, with an average gene density of 87 per Mb, which was lower than the genome-wide level (118 genes per Mb), indicating that wild introgressions may be biased toward regions with fewer genes. Stress-related genes were enriched in wild introgression hotspots, including 20 LRR proteins, 38 NB-ARC proteins, and 8 PPR proteins (Supplementary Table 21). Cultivated accessions with a higher proportion of wild-origin IBD segments are valuable materials for resistance breeding and for broadening the genetic background (Supplementary Table 22). For instance, we identified PpRCI2A, a gene involved in cold resistance²⁹, in an IBD segment shared between P. kansuensis (cold resistant) and cultivated peach within an IBD hotspot on chromosome 6 (Fig. 4g). Moreover, we found that expression of PpRCI2A was induced by cold treatment (-24°C) (Fig. 4g), providing new insights for cold resistance gene mining.

We further explored new uses of IBD in phenotype prediction and gene mining based on shared overlapping IBD segments. We found that a total of 11 accessions shared an IBD segment containing the Rm3 gene, which confers resistance to the green peach aphid, with the resistant accession ‘Hong Shou Xing’³⁰. Further phenotyping confirmed the resistance of all 11 accessions (Fig. 4h), providing new insights into IBD-based marker development for traits with known QTLs but unclear causative genes. Another example was the chilling requirement (CR) trait, which is a key trait for peach adaptation. A major QTL (qCR1) for CR has been mapped to chromosome 1^31,32. Low CR is an essential breeding target for peach in low-altitude regions. To explore the donor of the gene responsible for low CR, we tracked shared IBD segments harboring qCR1 among low-CR accessions. A total of 255 shared IBD segments covering qCR1 were identified, with a common overlap of 499.9 kb (Pp01:43,409,702-43,909,637) (Fig. 4i). Using the overlapping sequences, we constructed an NJ tree and found that the landrace accession ‘Nan Shan Tian Tao’ from South China was the progenitor of the low-CR allele (Supplementary Fig. 8).
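The idea of narrowing a QTL to the region shared by all IBD segments that cover it can be sketched as an interval intersection. The code below is only an illustration: the function name and the three example segments are hypothetical (they were chosen so that their intersection matches the interval reported in the text), and treating coordinates as 1-based and inclusive is an assumption.

```python
def common_overlap(segments):
    """Return the (start, end) interval shared by all segments, or None.

    Each segment is a (start, end) pair on the same chromosome, with
    1-based inclusive coordinates (an assumption of this sketch).
    """
    start = max(s for s, _ in segments)
    end = min(e for _, e in segments)
    return (start, end) if start <= end else None


# Hypothetical IBD segments on Pp01 that all span the qCR1 region.
example_segments = [
    (43_000_000, 43_909_637),
    (43_409_702, 44_500_000),
    (43_200_000, 44_000_000),
]
shared = common_overlap(example_segments)
shared_length_kb = (shared[1] - shared[0] + 1) / 1000
```

For the interval reported in the text, the shared region is 499,936 bp under inclusive coordinates, which rounds to the stated 499.9 kb.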

Large MNVs in the peach genome

Large MNVs (> 6 bp) are crucial polymorphisms for evolution and trait variability but are still poorly studied. A total of 2,626,686 large MNVs were included in our panvariome (Fig. 5a), comprising L-indels, SVs, CNVs, TIPs, and PAVs (Fig. 5b-5e). The TIPs consisted of 34,968 reference and 136,767 nonreference novel TIPs, which could be divided into five major superfamilies, namely, 38,997 terminal inverted repeats (TIRs), 35,211 Copia long terminal repeats (LTRs), 32,532 Gypsy LTRs, 29,995 MITEs, and 36,608 unclassified LTRs. Of these, 112,734 TIPs (72.0%) occurred in the bodies (14,232, 53.0%) or regulatory regions (11,632, 43.3%) of 25,864 peach genes (96.2%). Several known causative TIPs underlying agronomic traits were identified, for instance, an LTR insertion in PpMYB25 conferring the nectarine phenotype (G locus, PpMYB25; LTR, Pp05: 15,893,165)³³ and an LTR in PpCCD4 underlying yellow flesh (Y locus, PpCCD4; LTR, Pp01: 26,614,904)⁸. However, these two causal variants could not explain the phenotypic variation of all 1,020 accessions. For flesh color, a rare, new Copia LTR insertion (Pp01: 26,615,680) in the third exon, resulting in loss of function of PpCCD4 and underlying yellow flesh, was identified in five accessions (Supplementary Fig. 9). The previously reported TIP underlying nectarine was absent in 15 nectarine accessions, in which a new causal nonsynonymous SNP located at a highly conserved site of PpMYB25 (Pp05: 15,893,290, A>G) was identified (Supplementary Fig. 10). Intriguingly, all the 21 accessions were from Northwest China, suggesting private causal variants and novel origin events for these two traits. A similar situation was also observed for the fruit flesh texture gene, Hd (Supplementary Fig. 11). TIPs showed a distinct pattern compared with other MNVs, with greater abundance in landraces and improved accessions but lower abundance in wild accessions, suggesting the generation of novel TIPs during breeding (Fig. 5d).

Among the SVs, 229,569 (~49.7%) of which were more abundant than in previous works^3,4, 244,068 (56.5%) were found in fewer than 10 accessions (<1%), and 347,439 (80.5%) were observed in fewer than 50 accessions (<5%) (Supplementary Fig. 12). A total of 34,847 (31.6%) and 28,089 (25.5%) SVs were longer than 5 kb and 1 Mb, respectively. Wild relatives harbored more SVs than domesticated accessions (Fig. 5e), with different patterns for DELs, INSs, DUPs, INVs, and TRAs (Fig. 5f-5j). INVs have been reported to impact phenotypes, fertility, and recombination in humans and plants^34,35,36. We identified 4,982 INVs in the 1,020 accessions, with an average length of 627.6 kb; 256 (5.3%) were longer than 1 Mb. Wild accessions contained fewer INVs than landraces and improved cultivars, suggesting the generation of new INVs during domestication and improvement (Fig. 5j). Most INVs had a low frequency: 85.9% had a frequency lower than 1% (Fig. 5k). The distribution patterns of INVs and SNPs across the genome were often opposite (Fig. 5l), implying that the generation of SNPs may be limited by INVs.

Previously, we identified a 1.67 Mb INV that impacted the function of PpOFP1 underlying flat vs. round fruit shape^3,4. Using an INV-based GWAS, the 1.67 Mb INV (Pp06: 26.85-28.52 Mb) was also found to be associated with fruit shape in this study (Fig. 5m-5n). We found that this INV did not alter gene expression but induced strong long-distance differentiation (Fig. 5o-5p). Moreover, the linkage disequilibrium (LD) within this INV in flat accessions (LD decay 13.5 kb) was significantly greater than that in round accessions (LD decay 4.0 kb), and a long block with high LD (16.7-30.6 Mb) was identified (Fig. 5q), suggesting that recombination is strongly suppressed by this INV. To further verify this inference, we investigated the recombination rate using a round × flat peach cross, and suppressed recombination was observed around this INV (Fig. 5p). In an NJ tree constructed for the cross population, round- and flat-fruit accessions were strongly separated (Fig. 5r). We also found that flat-fruit accessions with this INV clustered on relatively independent evolutionary branches in the NJ tree (Supplementary Fig. 13). These results suggest that INVs have extensive impacts on the genome landscape and contributed to the formation of a new horticultural type within peach. Overall, the large-scale panvariome has contributed to the mining of new or rare natural functional variants.

GWASPV, a panvariome-based GWAS, enabled the precise mapping of trait-conferring genes via “one-step GWAS”

The panvariome enabled an updated version of GWAS, which we named GWASPV. It significantly improved statistical power, making it possible to efficiently identify key genes and causal mutations for traits controlled by major genes in a single step, which shortened gene mining by up to 8 years (Fig. 6a). GWASPV of 40 agronomic traits, 1,858 metabolic traits, and 51 environmental variables (only for landraces) was performed, and more than 2,000 novel associations were identified (Supplementary Tables 24 and 26). The causal variants and genes underlying the phenotypic variability of 6 well-characterized traits were directly identified from the top GWASPV signals, namely, fruit shape (1.67 Mb INV, Pp06: 26,847,156, S, PpOFP1)⁴ (Fig. 6b), flesh color (2 bp S-indel, Pp01: 26,614,083, Y, PpCCD4)⁸ (Fig. 6c), fruit hairiness (6.0 kb TIP, Pp05: 15,893,169, G, PpMYB25)³⁷, flesh texture (70.5 kb DEL, Pp04: 19,026,186, M, PpPGF)³, flesh adhesion (PAV, Pp04: 19,081,325, F, PpPGM)³⁸, and weeping habit (1.37 kb DEL, Pp03: 20,945,671, Pl, PpWEEP)³⁹, validating the precise mapping of causal genes using GWASPV in only one step (Supplementary Table 24).

GWASPV also uncovered trait-conferring genes and causal mutations for traits whose loci had been mapped but whose genes had not been identified. A previous study mapped a locus associated with kernel taste (bitter/sweet)⁴⁰, but the causal gene remains unknown. In this study, GWASPV revealed that a 6,492 bp TIP was strongly associated with kernel taste (Fig. 6d; Supplementary Fig. 14a), resulting in loss of function of PpbHLH14, a homolog of which has been reported to underlie kernel taste in almond⁴¹. Similarly, the pollen sterility locus (Ps) has been mapped⁴², but the causal gene remains unknown. Using GWASPV, we found that Ps was defined by PpRLK1.1, encoding a G-type lectin S-receptor-like protein kinase (Fig. 6e), which is homologous to genes that participate in the regulation of male sterility in Arabidopsis and several crops⁴³. Another example is flower color, for which we identified a strong candidate gene, PpWD40.1, using GWASPV (Supplementary Fig. 14b).

Previous studies on brachytic dwarfism have revealed a locus (Dw) and a candidate gene, PpGID1c, encoding a gibberellin receptor, with a nonsynonymous SNP and a stop-gain SNP in dwarf accessions^44,45. We genotyped the two SNPs in 20 dwarf accessions and found that dwarf phenotypes could not be completely explained by PpGID1c variations. For instance, ‘Le Yuan’, a dwarf accession from the selfed offspring of ‘Early Red 2’ (normal growth), was expected to be homozygous for these SNPs^44,45 because the dwarf allele is recessive, but heterozygosity was observed, suggesting that these SNPs cannot explain dwarfism. Using GWASPV, we identified a nonsynonymous SNP located in the third exon of PpMEF20 associated with dwarfism (Fig. 6f), including in ‘Le Yuan’. Moreover, mutations in homologs of PpMEF20 in Arabidopsis and rice lead to slow-growing phenotypes⁴⁶, further supporting the hypothesis that PpMEF20 defines the Dw locus.

Other examples include three color-related traits: fruit skin color (H), flesh color surrounding the stone (Cs), and leaf color (Gr), which have been mapped^10,47,48. Using GWASPV, we found that a 486 bp DEL in the promoter of PpMYB10.1 was associated with flesh color surrounding the stone, and the presence of this DEL resulted in a red flesh color surrounding the stone (Fig. 6g). For fruit skin color, an associated SNP 1,319 bp upstream of PpMYB10.2 was identified (Fig. 6h). For leaf color, a large TRA downstream of PpMYB10.4 on chromosome 6, which hijacked new sequences from chromosome 8, was identified (Fig. 6i), underlying the genetic variation in red leaves.

Maturity date is an important breeding target for the market life of peach. A major gene (MD) was identified on chromosome 4⁴⁹, with a 9 bp L-indel in a candidate gene, PpNAC5. By analyzing the 1,020 accessions, we found that the 9 bp L-indel had poor power in explaining phenotypic variation (accuracy of 36.8%). In this study, two phenotypes were measured: ripening date and fruit development period. GWASPV on these two measurements revealed a 210 bp INV and a 486 bp DEL associated with the trait, and both variants were located upstream of PpNAC1 (Fig. 6j-6k). Moreover, the agreement rates between phenotype and genotype for the two variants (75.7% and 80.3%) were greater than that of the 9 bp L-indel. Therefore, our results support PpNAC1 as the causal gene for maturity date, which is consistent with the findings of a recent study on an early-ripening bud sport of peach⁵⁰.
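An agreement rate between genotype and phenotype, as quoted above, can be computed as the fraction of accessions whose observed phenotype matches the phenotype predicted from the variant. The text does not define the exact calculation, so the sketch below is an assumption: the function names, the toy genotypes, and the prediction rule (any copy of the deletion predicts early maturity) are all hypothetical.

```python
def agreement_rate(genotypes, phenotypes, predict):
    """Fraction of accessions whose observed phenotype equals the prediction.

    Accessions with a missing genotype or phenotype (None) are skipped.
    """
    pairs = [
        (g, p) for g, p in zip(genotypes, phenotypes)
        if g is not None and p is not None
    ]
    if not pairs:
        raise ValueError("no accessions with both genotype and phenotype")
    return sum(predict(g) == p for g, p in pairs) / len(pairs)


def predict_from_deletion(genotype):
    """Hypothetical rule: carrying the deletion predicts early maturity."""
    return "early" if "DEL" in genotype else "late"


# Toy data for five accessions; the last one has no genotype call.
toy_genotypes = [("DEL", "DEL"), ("REF", "DEL"), ("REF", "REF"), ("REF", "REF"), None]
toy_phenotypes = ["early", "early", "late", "early", "late"]
toy_rate = agreement_rate(toy_genotypes, toy_phenotypes, predict_from_deletion)
```

In this toy set, three of the four genotyped accessions match the prediction. A variant that explains a trait well should give a rate close to 1, whereas a rate near the reported 36.8% indicates that the variant alone is a poor predictor across a diverse collection.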

In peach, the dominant sugar and acid are sucrose and malic acid, which confer most of the sweetness and acidity of fruit flavor, but the key genes remain unclear. Using GWASPV, the top signal for malic acid content was mapped on chromosome 5, with a nonsynonymous SNP in the third exon of the PpTST1 gene, leading to a high acid content (Fig. 6l). The biological function of PpTST1 in the regulation of organic acids was verified in our previous study⁵¹. For sucrose content, we identified a new locus at the top of chromosome 2, PpNCED1, with an 8 bp L-indel associated with this trait (Fig. 6m). The expression of this gene in fruit flesh at the mature stage was greater in sweet-fruit cultivars (Supplementary Fig. 15). PpNCED1 is a key gene involved in the synthesis of abscisic acid (ABA), which contributes to the accumulation of sugars via interactions between ABA signal transduction and PpSPS1¹². In addition, we identified a gyp1p superfamily gene (Prupe.4G179600) and its 11 bp L-indel underlying fruit weight, supported by gene expression (Supplementary Fig. 16) and the similar function of a homologous gene in rice⁵².

Because traits are often conferred by multiple genes or multiple types of variations, and because SNP calling from short reads is of higher quality than MNV calling, SNP associations might be overrepresented and real SV associations might be suppressed in GWASPV. Therefore, a separate GWAS based on each type of variation within the panvariome, which we named GWASPVMulti, was also included as an optional and complementary method in GWASPV. Using GWASPVMulti, more than 50,000 associations for 41 agronomic, 1,858 metabolic, and 51 environmental traits were identified (Supplementary Tables 27 and 29), providing abundant candidates for gene mining. For example, the top association signal from GWASPV for CR was a SNP at Pp01: 43,717,948, which was 244.4 kb from the strong candidate gene PpDAM6³² (Supplementary Fig. 18). However, using GWASPVMulti, the key gene PpDAM6 and a 30 bp causal SV in its promoter were successfully identified (Supplementary Fig. 17).

Collectively, GWASPV significantly improved gene mapping power compared with conventional GWAS based on a single type of variant, making it possible to determine causative genes and causal polymorphisms via ‘one-step GWAS’. The critical reason for the high efficiency of GWASPV is the use of the full-spectrum panvariome. We found that the genome-wide LD level for the panvariome (half LD decay distance of 21 bp, r²= 0.209) was significantly lower than that for SNPs (3.0 kb, r²= 0.259) (Supplementary Fig. 18), suggesting that many more recombination events were considered in GWASPV, improving the precision of associations. Furthermore, the number of LD blocks estimated from the panvariome was greater than that estimated from SNPs (Supplementary Fig. 18), which also improved the power of GWAS. The identified associations provide abundant functional variations and valuable markers for genomic selection and genome design breeding.

In summary, the first plant panvariome was constructed in this study, representing the full spectrum of genomic variations. The panvariome contains more comprehensive and realistic genomic information, such as the footprints of evolution, phenotype-genotype associations, and genetic diversity characteristics. The application of the panvariome could provide a more comprehensive understanding of genomes and help confirm and supplement the conclusions of previous studies, especially for the identification of functional rare variants. Most critically, a new gene mining solution, GWASPV, was developed and verified to be efficient and accurate. Our GWASPV solution could accelerate not only gene mining in plants and animals but also the discovery of key genes and causal variants associated with disease risk in humans. Our study provides a paradigm for comprehensive research on complete genomic variations (Supplementary Fig. 19).

Sequencing data

Sequencing data for a collection of 1,020 peach accessions were used in this study. Of these, 737 were obtained from our previous studies^3,4,10,12, and the remaining 283 were newly sequenced. All accessions were collected from the National Horticulture Germplasm Resource Center (NHGRC, Zhengzhou, China). At least 5 g of young leaves was sampled for each accession. DNA was sequenced using the Illumina Novaseq 6000 platform, following the manufacturer’s protocol (Illumina Inc.). Paired-end sequencing libraries with an insert size of approximately 300 bp or 500 bp were constructed, generating sequencing data with a read length of 125 bp or 150 bp for each accession. The passport information and sequencing data for all accessions are summarized in Supplementary Table 1.

SNP calling

Paired-end reads were aligned against the peach “Lovell” reference genome using BWA-MEM (version 0.7.17-r1188)⁵³. The mapped reads were sorted, and duplicates were removed using Picard tools (version 1.136) (http://broadinstitute.github.io/picard/). To obtain accurate SNPs, reads with alignment quality < 20 were excluded from further analyses. The genomic regions around indels were realigned using the RealignerTargetCreator and IndelRealigner tools in the Genome Analysis Toolkit (GATK) (release 4.1.2.0)⁵⁴. Variants for each accession were detected using GATK HaplotypeCaller, resulting in a GVCF file for each accession. Population-level SNPs were called using GATK GenotypeGVCFs based on the GVCF files, followed by hard filtration with the following parameters: QUAL < 40, QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < -12.5, ReadPosRankSum < -8.0. Further SNP filtration was performed using VCFtools (version 0.1.16) and PLINK (version 1.9)^55,56.
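
The hard-filtration step can be pictured as a simple predicate over a site's annotations. The following is a minimal sketch, assuming that each listed expression is an exclusion criterion (a site matching any one of them is removed), which is the usual GATK hard-filtering convention. The function name, the dictionary layout, and the handling of missing annotations are hypothetical and are not taken from the authors' pipeline.

```python
# Illustrative only: thresholds are those listed in the text; everything else is assumed.
SNP_FAIL_RULES = {
    "QUAL": lambda v: v < 40,
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def passes_snp_hard_filter(annotations):
    """Return True if none of the exclusion rules fires for this site.

    `annotations` maps annotation names to numeric values. Annotations that
    are absent are skipped (an assumption; rank-sum tests are often missing
    at sites without heterozygous calls).
    """
    for name, fails in SNP_FAIL_RULES.items():
        value = annotations.get(name)
        if value is not None and fails(value):
            return False
    return True
```

For example, a site with QD = 1.5 would be removed regardless of its other annotations, whereas a site lacking MQRankSum is judged on the remaining fields only.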

Indel calling

Indel calling was performed using the same pipeline as SNP calling, since GATK is capable of calling SNPs and indels simultaneously⁵⁴. To reduce false positives, we also applied a hard filter to the raw indels using GATK VariantFiltration with the following parameters: QD < 2.0, FS > 200.0, ReadPosRankSum < -20.0. Insertions and deletions ≤ 6 bp were defined as small indels (S-indels). Insertions and deletions > 6 bp and < 30 bp were termed large indels (L-indels).
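
The size classes defined above can be illustrated with a short function. This is a sketch under stated assumptions: indel length is taken as the absolute difference between the REF and ALT allele lengths of a biallelic VCF record, and events of 30 bp or more are labelled "other" here because the text does not assign them to an indel class. The function name is hypothetical.

```python
def classify_indel(ref, alt):
    """Classify a biallelic indel by size, using the cutoffs given in the text.

    Length is assumed to be |len(ALT) - len(REF)|.
    """
    length = abs(len(alt) - len(ref))
    if length == 0:
        return "not an indel"   # same-length alleles (SNP or substitution)
    if length <= 6:
        return "S-indel"        # small indel: <= 6 bp
    if length < 30:
        return "L-indel"        # large indel: > 6 bp and < 30 bp
    return "other"              # >= 30 bp: outside both indel classes
```

Under this definition, the 9 bp variant in PpNAC5 mentioned in the Results falls into the L-indel class.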

SV calling

DELLY (version 0.8.5)⁵⁷ and LUMPY (version 0.2.13)⁵⁸ were used to detect SVs. For DELLY, mapped paired-end reads in BAM format, generated by BWA-MEM and Picard tools after sorting and marking PCR duplicates, were used as input. SVs for each accession were identified and genotyped using the DELLY call command with default parameters. SV files in VCF format for all accessions were merged into a population-level VCF file using bcftools⁵⁹. For LUMPY, read alignment and extraction of split and discordant read pairs were performed with SpeedSeq (version 0.1.2)⁶⁰. SVs were jointly called and then genotyped using LUMPY lumpyexpress and SVTyper (version 0.7.1)⁶⁰ based on the pre-extracted split and discordant reads. Comparison of SV abundance between groups was performed using accessions with sequencing depth < 10×.

TIP calling

TIPs were identified using TEFLoN, which can detect both reference and non-reference transposable element insertions⁶¹. Paired-end reads were aligned against the “Lovell” pseudo-reference genome, with known TE sequences separated out, using BWA-MEM with the -Y parameter. TIP detection for each accession was performed by TEFLoN using sorted aligned reads in BAM format. Finally, population-level genotyping of TIPs for the 1,020 accessions was conducted using the teflon_genotype.py module in TEFLoN with default parameters.

CNV calling

CNVnator (v0.4.1)⁶² and CNVcaller (version 1.0.0)⁶³ were used for CNV detection. For CNVcaller, a new specific reference genome for CNV calling was generated with a window size of 2,000 bp and a step of 1,000 bp. Aligned, sorted, and PCR duplicate-marked BAM files processed by BWA-MEM and Picard tools were used as input. Reads in each window were counted from the BAM file, and the boundaries of CNV regions were detected using normalized mean read depth with the following parameters: -f 0.05, -h 3, -r 0.1. Finally, the CNV genotypes for all accessions were clustered with the input sample using a Gaussian Mixture Model. For CNVnator, the same aligned BAM files as for CNVcaller were used. The CNVs were detected and genotyped with a bin size of 10,000 bp and a length > 1,000 bp.
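
To make the windowing scheme concrete, the sketch below enumerates overlapping windows of 2,000 bp with a 1,000 bp step along one chromosome. It is an illustration only: the 0-based half-open coordinates and the truncation of the last window at the chromosome end are assumptions, and the function is hypothetical rather than part of CNVcaller.

```python
def sliding_windows(chrom_length, window=2000, step=1000):
    """Return (start, end) windows covering a chromosome.

    Coordinates are 0-based and half-open. The final window is truncated at
    the chromosome end (an assumption about edge handling).
    """
    windows = []
    start = 0
    while start < chrom_length:
        end = min(start + window, chrom_length)
        windows.append((start, end))
        if end == chrom_length:
            break               # the chromosome end has been reached
        start += step
    return windows
```

With these settings, every position away from the chromosome ends is covered by two windows, so read depth is sampled at a 1,000 bp resolution while each count still spans 2,000 bp.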

Pangenome construction

We utilized a "map-to-pan" strategy to construct the pangenome of the 1,020 peach accessions, following the pipeline used for the tomato pangenome^17,64. To improve the accuracy of genome assembly, we first filtered out low-quality sequences using Trimmomatic (version 0.33)⁶⁵ with the parameters ‘SLIDINGWINDOW:4:20 MINLEN:50’. The resulting high-quality Illumina reads from each accession were de novo assembled using Megahit (version 1.2.9)⁶⁶ with default parameters. Only assembled contigs with lengths > 500 bp were used for further analyses. The contigs were aligned to the peach reference genomes, including the nuclear genome (version 2.0.a1) and chloroplast genome (RefSeq: NC_014697.1), using the nucmer module in the MUMmer package (version 4.0.0rc1)⁶⁷. Contigs with alignments shorter than 300 bp and sequence identity lower than 80% were kept as unaligned contigs. For contigs containing alignments longer than 300 bp with identity higher than 80%, but with continuous unaligned regions longer than 500 bp, these regions were also extracted as unaligned sequences and termed partially unaligned contigs. Redundant unaligned contigs and partially unaligned sequences were removed (identity > 0.9) using CD-HIT (version 4.6.5)⁶⁸. The non-redundant sequences were searched against the NCBI GenBank nucleotide database using blastn (version 2.5.0+)⁶⁹. Contigs with best hits from outside the green plants, or covered by other known plant mitochondrial or chloroplast genomes, were considered possible contaminants and were eliminated. The final cleaned, non-redundant, non-reference sequences (novel sequences) and the reference genome were merged as the pangenome of peach.
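
The contig triage described above can be sketched as follows. This is one possible reading of the criteria, not the authors' implementation: an alignment is treated as reliable only if it is longer than 300 bp and has identity above 80%, a contig with no reliable alignment is kept whole as unaligned, and otherwise any stretch longer than 500 bp that is not covered by a reliable alignment is extracted. The function name and the tuple layout are hypothetical.

```python
def unaligned_regions(contig_length, alignments,
                      min_len=300, min_identity=80.0, min_gap=500):
    """Classify a contig and list its unaligned stretches.

    `alignments` holds (start, end, identity) tuples in 0-based half-open
    contig coordinates, with identity expressed in percent.
    """
    # Keep only alignments that pass both the length and identity cutoffs.
    reliable = sorted((s, e) for s, e, ident in alignments
                      if e - s > min_len and ident > min_identity)
    if not reliable:
        return "unaligned", [(0, contig_length)]

    gaps, cursor = [], 0
    for start, end in reliable:
        if start - cursor > min_gap:
            gaps.append((cursor, start))    # uncovered stretch before this hit
        cursor = max(cursor, end)
    if contig_length - cursor > min_gap:
        gaps.append((cursor, contig_length))  # uncovered tail of the contig
    return ("partially_unaligned" if gaps else "aligned"), gaps
```

For a 2,000 bp contig whose first 1,000 bp align at 95% identity, this returns the label `partially_unaligned` together with the interval covering the last 1,000 bp.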

Pangenome annotation

Novel protein-coding genes were predicted from the nonreference genome following the MAKER2 pipeline (version 2.31.8)⁷⁰. In this pipeline, ab initio gene prediction was performed using GeneMark⁷¹, Augustus⁷², and SNAP⁷³. In view of the absence of a standard ‘peach’ model, the ‘Arabidopsis’ and ‘rice’ models were selected for GeneMark and Augustus prediction, and the ‘Arabidopsis’ model was selected for SNAP prediction. RNA-Seq data of fruit flesh at the mature stage from 185 accessions in our previous study⁷⁴ were used as transcript evidence, following the pipeline in Gao et al.¹⁷. Protein homology-based gene predictions were performed by comparison with the high-quality protein sequences in the UniProt database. Finally, gene predictions based on ab initio approaches and protein evidence were integrated using the MAKER2 pipeline. The predicted gene models were checked against the InterPro domain database using InterProScan (version 5.54-87.0)⁷⁵. Gene functions for the novel predicted genes were annotated by comparing their protein sequences against the GenBank nonredundant database and the multiple domain databases in InterProScan. GO and KEGG annotation and enrichment analyses were performed using KOBAS (version KOBAS-i)⁷⁶. Repeat sequences were also predicted by RepeatMasker (www.repeatmasker.org) and RepeatRunner⁷⁷ following the MAKER2 pipeline, based on the repeat sequence library from the Repbase database⁷⁸.

PAV analysis

To eliminate the impact of sequencing depth, only accessions with depth > 10× were used for PAV analyses. Paired-end reads were aligned to the pangenome using BWA-MEM with default parameters. Gene body coverage and CDS coverage of each gene were calculated using the geneCov module in EUPAN (version 0.44)⁶⁴ with sorted BAM files. The presence or absence of each gene in each accession was determined using geneExist in EUPAN⁶⁴. In brief, for a given gene in a given accession, if more than 80% of its exon regions were covered, the gene was treated as present in that accession; otherwise, it was considered absent. To further eliminate the impact of sequencing depth, we also identified PAVs with different exon coverage thresholds, including 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, and 100%.
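
The presence/absence rule can be written down compactly. The sketch below is illustrative and does not reproduce EUPAN's geneExist: it assumes that exon coverage is available as covered and total base counts, applies the "more than 80%" rule as a strict inequality, and repeats the call over a list of thresholds. Function and argument names are hypothetical.

```python
def gene_present(covered_exon_bp, total_exon_bp, threshold=0.80):
    """Call a gene present when more than `threshold` of its exon bases are covered."""
    if total_exon_bp <= 0:
        raise ValueError("total_exon_bp must be positive")
    return covered_exon_bp / total_exon_bp > threshold


def pav_calls_across_thresholds(covered_exon_bp, total_exon_bp, thresholds):
    """Return {threshold: present?} so calls can be compared across cutoffs."""
    return {t: gene_present(covered_exon_bp, total_exon_bp, t) for t in thresholds}
```

A gene with 850 of 1,000 exon bases covered is called present at the 80% cutoff but absent at 90%, which is the kind of threshold sensitivity the multi-threshold analysis is meant to expose.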

Population genetic analysis

The genetic distance matrix of all accessions was computed with PLINK⁵⁶ using the 10,515,025 variants of the peach panvariome. The unweighted neighbor-joining tree was constructed based on the distance matrix using PHYLIP software (version 3.696)⁷⁹ with 1,000 bootstrap replicates. The population structure was analyzed using ADMIXTURE software (version 1.3.0)⁸⁰ based on the SNP set. The best cluster number (K) was determined to be 9 by running K from 2 to 20 with 10,000 iterations.

D-statistics

The ABBA-BABA test was performed using the qpDstat module in AdmixTools (version 662)⁸¹, and considered all triplets of the WGS tree, using wild relative accessions (WR groups) as the outgroup. We assessed significance through a block-jackknifing approach as implemented in AdmixTools, and applied a Bonferroni correction to assign significance at the 95% confidence level.
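
For readers unfamiliar with the test, the sketch below shows the standard form of Patterson's D computed from ABBA and BABA site-pattern counts, together with a Bonferroni-adjusted Z-score cutoff. It is not the qpDstat implementation: the sign convention, the two-sided test, and the function names are assumptions, and the Z-score itself would come from the block jackknife (D divided by its jackknife standard error).

```python
from statistics import NormalDist


def d_statistic(n_abba, n_baba):
    """Patterson's D: (ABBA - BABA) / (ABBA + BABA)."""
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("no informative sites")
    return (n_abba - n_baba) / total


def bonferroni_z_cutoff(n_tests, alpha=0.05):
    """Two-sided |Z| cutoff after Bonferroni correction over n_tests triplets."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    per_test_alpha = alpha / n_tests
    return NormalDist().inv_cdf(1 - per_test_alpha / 2)
```

A D value of zero indicates no excess allele sharing, and the cutoff grows with the number of triplets tested, so larger trees require stronger evidence per triplet.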

Identity By Descent (IBD) identification

For IBD identification, genotype imputation and phasing were performed using BEAGLE software (version 5.4)⁸² with the following parameters: window=50000, impute-its=10. The pairwise IBD segments among the 1,020 accessions were detected for the eight chromosomes separately using phased SNPs with IBDseq (version 2.0) and default parameters⁸³.

Trait phenotyping

Phenotyping of agronomic traits was performed based on the protocols proposed by the RosBREED project (https://www.rosbreed.org/breeding/peach). The metabolic and environmental traits were from our previous studies^12,74. Resistance to green peach aphid was evaluated by artificial infestation: 30 wingless adult aphids (M. persicae) of similar body size were placed on plant shoot tips, with three replicates per accession. The cross population between round and flat peach [‘September free’ (round) × ‘Zhong You Pan No. 9’ (flat)] was used to analyze the impact of the 1.67 Mb INV on the genome landscape.

Panvariome based GWAS (GWASPV)

GWASPV contains two major modules: GWASPV, based on the whole panvariome, and GWASPVMulti, based on the different types of variations analyzed separately. For GWASPV, to minimize false positives, population structure was taken into account using the kinship matrix estimated with the emmax-kin program of Efficient Mixed-Model Association eXpedited (EMMAX) (version beta)⁸⁴. To further control for population structure, principal components estimated by GCTA software (version 1.94.1)⁸⁵ were also included as covariates. The EMMAX program, based on the mixed linear model (MLM), was used to carry out the GWAS analyses. For GWASPVMulti, kinship and PCA were also optional to improve statistical power. GWASPVMulti runs multiple GWAS using the MLM in EMMAX based on SNPs, L-indels, S-indels, SVs, CNVs, TIPs, and PAVs successively and separately. The genome-wide significance cutoff was the Bonferroni threshold, set as 0.05/total number of variants.
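
As a worked illustration of the significance cutoff, the sketch below computes 0.05 divided by the number of tested variants. Using the 10,515,025 panvariome variants as the denominator is an assumption made for illustration; for GWASPVMulti the denominator would presumably be the number of variants of the type being tested. The function name is hypothetical.

```python
import math


def bonferroni_threshold(n_variants, alpha=0.05):
    """Genome-wide P-value cutoff: alpha divided by the number of tested variants."""
    if n_variants <= 0:
        raise ValueError("n_variants must be positive")
    return alpha / n_variants


# Assumed denominator: the total panvariome variant count reported in the text.
PANVARIOME_VARIANTS = 10_515_025
threshold = bonferroni_threshold(PANVARIOME_VARIANTS)
print(f"P-value cutoff: {threshold:.3e}")
print(f"-log10(P) cutoff: {-math.log10(threshold):.2f}")
```

With roughly ten million variants the cutoff lies between 10^-9 and 10^-8, which is considerably stricter than the threshold of a SNP-only analysis on a smaller marker set.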

Core collection

The core collection that captured most of the allelic diversity of the 1,020 peach accessions was selected using GenoCore⁸⁶ software with the following parameters: -d 0.001, -cv 99. In addition, a core collection for cultivated peach (culti-core) was generated from the 875 domesticated accessions using the same methods. The core collection and cultivated core collection were further evaluated by PCA, using the same methods described above for the entire collection.

Gene expression and RNA-seq data

The expression data for PpNCED1 and PpYPT were obtained from the RNA-seq data in PRJNA762288.

To verify the candidate gene for cold resistance, cold treatments at -16, -20, -24, -28, and -32 °C were performed on ‘Tai Nong 2’ (cold sensitive) and ‘Zhou Xing Shan Tao’ (cold resistant). We performed qRT-PCR of the candidate gene PpRCI2A in samples from the different treatments using the LightCycler System (Roche LightCycler 480; Roche Diagnostics), following the manufacturer’s protocol. Relative expression levels were estimated by the 2^−ΔΔCT method.
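
The 2^−ΔΔCT calculation can be sketched as below. This is the general form of the method, assuming one target gene, one reference (internal control) gene, and one calibrator sample; the reference gene and calibrator used in this study are not stated in the text, and the CT values in the usage note are invented for illustration.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_calibrator, ct_ref_calibrator):
    """Fold change by the 2^-ddCT method.

    dCT = CT(target) - CT(reference gene) within each sample;
    ddCT = dCT(sample) - dCT(calibrator).
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_calibrator = ct_target_calibrator - ct_ref_calibrator
    return 2.0 ** -(delta_sample - delta_calibrator)
```

For instance, with hypothetical CT values of 22 (target) and 18 (reference) in a treated sample and 24 and 18 in the calibrator, ΔΔCT is -2 and the estimated fold change is 4.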

Acknowledgments

We thank Dr. Josep Casacuberta and Dr. Raúl Castanera from the Centre for Research in Agricultural Genomics at the Universidad Autónoma de Barcelona for their assistance in the identification of transposable element polymorphisms. This work was supported by the National Key Research and Development Program (2023YFE0105400, 2019YFD1000200), National Natural Science Foundation of China (32341042), Central Public-interest Scientific Institution Basal Research Fund (No. Y2022QC23, 1610192023310), Agricultural Science and Technology Innovation Program (CAAS-ASTIP-2024-ZFRI-01), Natural Science Foundation of Henan (232300421042), National Science and Technology Major Project of Yunnan (202302AE090005-3), and Crop Germplasm Resources Conservation Project (2016NWB041).

Contributions

L.W. and K.C. conceived the project; Y.L., J.W., and W.F. collected the plant samples and generated sequencing data; G.Z., K.C., C.C., and X.W. performed the phenotyping. Y.L. performed the data analyses. Y.L. and J.W. performed the experiments. Y.L. drafted the manuscript. P.A. and L.W. revised the manuscript. All authors read and approved the manuscript.

Data Deposition

The sequencing data for the 1,020 accessions have been deposited in the Genome Sequence Archive database under accession number PRJCA025647.

Competing interests

The authors declare no competing interests.

References

Liang, Y., Liu, H., Yan, J., & Tian, F. Natural variation in crops: realized understanding, continuing promise. Annu. Rev. Plant Biol. 72, 357-385 (2021).
Wang, W., et al. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 557, 43-49 (2018).
Li, Y., et al. Genomic analyses of an extensive collection of wild and cultivated accessions provide new insights into peach breeding history. Genome Biol. 20, 36 (2019).
Guo, J., et al. An integrated peach genome structural variation map uncovers genes associated with fruit traits. Genome Biol. 21, 36 (2020).
Coe, K., et al. Population genomics identifies genetic signatures of carrot domestication and improvement and uncovers the origin of high-carotenoid orange carrots. Nat. Plants 9, 1643-1658 (2023).
Soyk, S., et al. Duplication of a domestication locus neutralized a cryptic variant that caused a breeding barrier in tomato. Nat. Plants 5, 471-479 (2019).
Alonge, M., et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell 182, 145-161 (2020).
Falchi, R., et al. Three distinct mutational mechanisms acting on a single gene underpin the origin of yellow flesh in peach. Plant J. 76, 175-187 (2013).
Verde, I., et al. The Peach v2.0 release: high-resolution linkage mapping and deep resequencing improve chromosome-scale assembly and contiguity. BMC Genomics 18, 25 (2017).
Cao, K., et al. Genome-wide association study of 12 agronomic traits in peach. Nat. Commun. 7, 13246 (2016).
Guan, J., et al. Genome structure variation analyses of peach reveal population dynamics and a 1.67 Mb causal inversion for fruit shape. Genome Biol. 22, 13 (2021).
Li, Y., et al. Genomic analyses provide insights into peach local adaptation and responses to climate change. Genome Res. 31, 592-606 (2021).
Yu, Y., et al. Genome re-sequencing reveals the evolutionary history of peach fruit edibility. Nat. Commun. 9, 5404 (2018).
Yu, Y., et al. Population-scale peach genome analyses unravel selection patterns and biochemical basis underlying fruit flavor. Nat. Commun. 12, 3604 (2021).
Aranzana, M.J., et al. Prunus genetics and applications after de novo genome sequencing: achievements and prospects. Hortic. Res. 6, 58 (2019).
Wang, L., Zhu, G., & Fang, W. China Peach Resources. Beijing (2012).
Gao, L., et al. The tomato pan-genome uncovers new genes and a rare allele regulating fruit flavor. Nat. Genet. 51, 1044-1051 (2019).
Alioto, T., et al. Transposons played a major role in the diversification between the closely related almond and peach genomes: results from the almond genome sequence. Plant J. 101, 455-47 (2020).
Wang, J., et al. Chromosome-scale genome assembly of sweet cherry (Prunus avium L.) cv. Tieton obtained using long-read and Hi-C sequencing. Hortic. Res. 7, 122 (2020).
Groppi, A., et al. Population genomics of apricots unravels domestication history and adaptive events. Nat. Commun. 12, 3956 (2021).
Verde, I., Quarta, R., Cedrola, C., & Dettori, M.T. QTL analysis of agronomic traits in a BC1 peach population. Acta Hortic. 592, 291-297 (2002).
Xiao, S., et al. Broad-spectrum mildew resistance in Arabidopsis thaliana mediated by RPW8. Science 291, 118-120 (2001).
Contreras-M, B., et al. Analysis of plant pan-genomes and transcriptomes with GET_HOMOLOGUES-EST, a clustering solution for sequences of the same species. Front. Plant Sci. 8, 184 (2017).
Sun, X., et al. Phased diploid genome assemblies and pan-genomes provide insights into the genetic history of apple domestication. Nat. Genet. 52, 1423-1432 (2020).
Li, J., et al. Cotton pan-genome retrieves the lost sequences and genes during domestication and selection. Genome Biol. 22, 119 (2021).
Hurgobin, B., et al. Homoeologous exchange is a major cause of gene presence/absence variation in the amphidiploid Brassica napus. Plant Biotechnol. J. 16, 1265-1274 (2017).
Joke, B., et al. F-Box protein FBX92 affects leaf size in Arabidopsis thaliana. Plant Cell Physiol. 58, 962-975 (2017).
Zheng, Y., Crawford, G.W., & Chen, X. Archaeological evidence for peach (Prunus persica) cultivation and domestication in China. PLoS One 9, e106595 (2014).
Capel, J., Jarillo, J.A., Salinas, J., & Martínez-Zapater, J.M. Two homologous low-temperature-inducible genes from Arabidopsis encode highly hydrophobic proteins. Plant Physiol. 115, 569-576 (1997).
Pan, L., et al. NLR1 is a strong candidate for the Rm3 dominant green peach aphid (Myzus persicae) resistance trait in peach. J. Exp. Bot. 73, 1357-1369 (2022).
Fan, S., et al. Mapping quantitative trait loci associated with chilling requirement, heat requirement and bloom date in peach (Prunus persica). New Phytol. 185, 917-930 (2010).
Zhao, Y., et al. MADS-box protein PpDAM6 regulates chilling requirement-mediated dormancy and bud break in peach. Plant Physiol. 193, 448-465 (2023).
Vendramin, E., et al. A unique mutation in a MYB gene cosegregates with the nectarine phenotype in peach. PLoS ONE 9, e112032 (2014).
Fransz, P., et al. Molecular, genetic and evolutionary analysis of a paracentric inversion in Arabidopsis thaliana. Plant J. 88, 159-178 (2016).
Giner-Delgado, C., et al. Evolutionary and functional impact of common polymorphic inversions in the human genome. Nat. Commun. 10, 4222 (2019).
Zhou, Y., et al. Pan-genome inversion index reveals evolutionary insights into the subpopulation structure of Asian rice. Nat. Commun. 14, 1567 (2023).
Yang, Q., et al. Two R2R3-MYB genes cooperatively control trichome development and cuticular wax biosynthesis in Prunus persica. New Phytol. 234, 179-196 (2022).
Gu, C., et al. Copy number variation of a gene cluster encoding endopolygalacturonase mediates flesh texture and stone adhesion in peach. J. Exp. Bot. 76, 1993-2005 (2016).
Hollender, C.A., et al. Loss of a highly conserved sterile alpha motif domain gene (WEEP) results in pendulous branch growth in peach trees. Proc. Natl. Acad. Sci. U S A 115, E4690-4699 (2018).
Bliss, F.A., et al. An expanded genetic linkage map of Prunus based on an interspecific cross between almond and peach. Genome 45, 520-529 (2002).
Sánchez-Pérez, R., et al. Mutation of a bHLH transcription factor allowed almond domestication. Science 364, 1095-1098 (2019).
Dirlewanger, E., et al. Genetic linkage map of peach (Prunus persica (L.) Batsch) using morphological and molecular markers. Theor. Appl. Genet. 97, 888-895 (1998).
Zhu, L., et al. Receptor-like kinases and their signaling cascades for plant male fertility: loyal messengers. New Phytol. doi: 10.1111/nph.19527 (2024).
Hollender, C.A., Hadiarto, T., Srinivasan, C., Scorza, R., & Dardick, C. A brachytic dwarfism trait (dw) in peach trees is caused by a nonsense mutation within the gibberellic acid receptor PpeGID1c. New Phytol. 210, 227-239 (2016).
Cheng, J., et al. A single nucleotide mutation in GID1c disrupts its interaction with DELLA1 and causes a GA-insensitive dwarf phenotype in peach. Plant Biotechnol. J. 17, 1723-1735 (2019).
Andrés-Colás, N., et al. Multiple PPR protein interactions are involved in the RNA editing system in Arabidopsis mitochondria and plastids. Proc. Natl. Acad. Sci. U S A 114, 8883-8888 (2017).
Yamamoto, T., Shimada, T., Imai, T., & Bliss, F.A. Characterization of morphological traits based on a genetic linkage map in peach. Breeding Sci. 51, 271-278 (2001).
Bretó, M.P., Cantin, C.M., Iglesias, I., Arús, P., & Eduardo, I. Mapping a major gene for red skin color suppression (highlighter) in peach. Euphytica 213, 14 (2017).
Pirona, R., et al. Fine mapping and identification of a candidate gene for a major locus controlling maturity date in peach. BMC Plant Biol. 13, 166 (2013).
Zhou, H., et al. A large-scale behavior of allelic dropout and imbalance caused by DNA methylation changes in an early-ripening bud sport of peach. New Phytol. 239, 13-18 (2023).
Wang, Q., et al. Multi-omics approaches identify a key gene, PpTST1, for organic acid accumulation in peach. Hortic. Res. 9, uhac026 (2022).
Zhang, Y., Xiong, Y., Liu, R., Xue, H.W., & Yang, Z. The Rho-family GTPase OsRac1 controls rice grain size and yield by regulating cell division. Proc. Natl. Acad. Sci. U S A 116, 16121-16126 (2019).
Li, H., & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760 (2009).
McKenna, A., et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297-1303 (2010).
Danecek, P., et al. The variant call format and VCFtools. Bioinformatics 27, 2156-2158 (2011).
Purcell, S., et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559-575 (2007).
Tobias, R., et al. Delly: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333-339 (2012).
Layer, R.M., Chiang, C., Quinlan, A.R., & Hall, I.M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
Danecek, P., et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Chiang, C., et al. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat. Methods 12, 966-968 (2015).
Adrion, J.R., Song, M.J., Schrider, D.R., Hahn, M.W., & Schaack, S. Genome-wide estimates of transposable element insertion and deletion rates in Drosophila melanogaster. Genome Biol. Evol. 9, 1329-1340 (2017).
Abyzov, A., Urban, A.E., Snyder, M., & Gerstein, M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21, 974-984 (2011).
Wang, X., et al. CNVcaller: highly efficient and widely applicable software for detecting copy number variations in large populations. Gigascience 6, 1-12 (2017).
Hu, Z., et al. EUPAN enables pan-genome studies of a large number of eukaryotic genomes. Bioinformatics 33, 2408-2409 (2017).
Bolger, A.M., Lohse, M., & Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114-2120 (2014).
Li, D., et al. MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods 102, 3-11 (2016).
Kurtz, S., et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
Li, W., & Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659 (2006).
Camacho, C., et al. BLAST+: Architecture and applications. BMC Bioinformatics 10, 421 (2009).
Holt, C., & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
Besemer, J., & Borodovsky, M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 33, W451-W454 (2005).
Mario, S., Mark, D., Robert, B., & David, H. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637-644 (2008).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Cao, K., et al. Combined nature and human selections reshaped peach fruit metabolome. Genome Biol. 21, 46 (2022).
Quevillon, E., et al. InterProScan: protein domains identifier. Nucleic Acids Res. 33, W116-W120 (2005).
Bu, D., et al. KOBAS-i: intelligent prioritization and exploratory visualization of biological functions for gene enrichment analysis. Nucleic Acids Res. 49, W317-W325 (2021).
Smith, C.D., et al. Improved repeat identification and masking in Dipterans. Gene 389, 1-9 (2007).
Bao, W., Kojima, K.K., & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6, 11 (2015).
Felsenstein, J. PHYLIP-phylogeny inference package (version 3.2). Cladistics 5, 164-166 (1989).
Alexander, D.H., Novembre, J., & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655-1664 (2009).
Patterson, N., et al. Ancient admixture in human history. Genetics 192, 1065-1093 (2012).
Browning, B.L., & Browning, S.R. Genotype imputation with millions of reference samples. Am. J. Hum. Genet. 98, 116-126 (2016).
Browning, B.L., & Browning, S.R. Detecting identity by descent and estimating genotype error rates in sequence data. Am. J. Hum. Genet. 93, 840-851 (2013).
Kang, H.M., et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348-354 (2010).
Yang, J., Lee, S.H., Goddard, M.E., & Visscher, P.M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76-82 (2011).
Jeong, S., et al. GenoCore: A simple and fast algorithm for core subset selection from large genotype datasets. PLoS ONE 12, e0181420 (2017).
