3.1 Genome assembly and annotation
To generate the genome assembly, we sequenced a total of 63.0 Gb of Illumina short reads with about 137 x coverage (Table S1). After filtering, a total of 61.3 Gb of clean data was retained for further analysis (Table S1). To gain insight into the characteristics of C. pseudocerasu genome, a genome survey analysis was conducted employing the K-mer method (Marçais and Kingsford 2011; Ranallo-Benavidez et al. 2020; Simpson et al. 2009), the results represented that it was an autotetraploid genome (the AAAB genome structure rate was 70%), with estimated genome size, heterozygosity rate, and repetition rate of 459.67 Mb, 1.61%, and 1.75%, respectively (Fig. 1, Table S2). The final assembly size of ‘Huangguo’ genome was 340.99Mb, consisting of 261,760 scaffolds. Additionally, the scaffold N50 was 1.56Kb and the GC content rate was determined to be 37.58% (Table 1). The quality and completeness of ‘Huangguo’ genome were assessed through BUSCO (Manni et al. 2021) in eukaryota_odb10 database. BUSCO analysis revealed that the genome completeness was 50.6% (Table S3).
For genome annotation, a total of 51,264 coding genes were predicted in ‘Huangguo’ genome with a mean coding sequence length of 572 bp (Table 1, Table S4). Moreover, the genome contained approximately 158.78 Mb (46.57%) of repetitive sequences, comprising 143.1 Mb of dispersed repeats and 15.68 Mb of tandem repeats (Table 1, Table S5). In genome function annotation, 92.67% of genes were annotated by the NR, GO, COG, KEGG, or Swiss-Prot databases (Table S6).
(A) ‘Huangguo’ Genome K-mer frequency distribution map. (B) The smudge plot analysis in C. pseudocerasu genome.
Table 1
Summary of ‘Huangguo’ genome assembly
Genomic feature
|
Huangguo
|
Estimated genome size (Mb)
|
459.67
|
Assembled genome size (Mb)
|
340.99
|
Total number of scaffolds
|
261760
|
Scaffold N50 (Kb)
|
1.56
|
Complete BUSCOs (%)
|
50.6
|
Number of genes
|
51,264
|
Percentage of repeat sequences (%)
|
46.57
|
3.2 Genome resequencing and SNPs/Indels calling
To clarify the evolutionary relationships of 8 cultivated Chinese cherries (‘Changbing’, ‘Duanbing’, ‘HXxinzhong’, ‘HXyaoduan’, ‘Yongzhuyihao’, ‘HXyaohuang’, ‘Heizhenzhu’, and ‘Hongfei’), we generated a total of 388.27 Gb of raw Illumina sequence data with an average of 48.53 Gb per individual (Table S7). Approximately 375.99 Gb of high-quality sequencing reads were aligned to the ‘Huangguo’ genome, yielding a mapping rate between 99.19% and 99.39% (Table S7, Table S8). Moreover, sequencing coverage depths varied from 78 to 121 x (Table S8). Consequently, we called 27,169,381 SNPs and 3,733,525 Indels for eight Chinese cherry cultivars (Fig. 2A, Table S9). Among the 8 cultivars, ‘Yongzhuyihao’ has the fewest SNP/Indel variant sites, while ‘Changbing’ has the most (Fig. 2A, Table S9). Additionally, Type C: G change to T: A and type T: A change to C: G constituted the majority of the SNP mutation types (Fig. 2B).
In all identified SNPs, a total of 78.29% (21.3 million) were located in the intergenic regions, while only 5.6% (1.5 million) were situated within the exonic regions (Fig. 2C, Table S10). For these SNPs in coding regions, we detected 727,248 synonymous SNPs and 778,697 non-synonymous SNPs, resulting in a non-synonymous/synonymous ratio of 1.07 (Table S10). Similarly, concerning all identified Indels, 80.63% (3.0 million) were located in the intergenic regions, and only 1.2% (0.04 million) were detected within the exonic regions (Fig. 2D, Table S11).
(A) Statistics on the various numbers of SNPs and InDels in eight samples. (B) Distribution of SNP mutation type in eight samples. The x-axis delineates the type of SNP mutation, and the y-axis quantifies the number of SNPs. (C) Annotation distribution of SNPs variation. Different colors show the SNPs distribution of intergenic region, upstream, downstream, exon, intron, and splice. (D) Annotation distribution of InDels variation. Different colors show the InDels distribution of intergenic region, upstream, downstream, exon, intron, splice, and UTR5.
3.3 Phylogenetics and population structure
In order to explore the genetic relatedness among eight domestic Chinese cherry varieties, we constructed an unrooted neighbor-joining phylogenetic tree based on the filtered SNP data. The result indicates that the 8 individuals were divided into two major subclades. The first major subclade consists of ‘Changbing’, ‘Duanbing’, and ‘HXxinzhong’, with ‘Changbing’ and ‘Duanbing’ clustering on an internal node, suggesting that these two individuals have the closest evolutionary relationship. Similarly, the second major subclade includes ‘Yongzhuyihao’, ‘HXyaohuang’, ‘Heizhenzhu’, ‘Hongfei’, and ‘HXyaoduan’, with ‘HXyaohuang’ and ‘Yongzhuyihao’ clustering on an internal node, inferring that these two individuals have the closest genetic relationship within this branch (Fig. 3A). To further estimate different ancestral proportions in 8 individuals, population structure analysis was conducted by assuming a given number (K) of ancestral populations. When K = 2, the CV (Cross validation) error value was minimized (Fig. 3B), and the 8 cherry accessions were divided into two groups, which was consistent with the subclades of the phylogenetic tree. Additionally, ‘HXxinzhong’ and ‘HXyaohuang’ were shared ancestry between these two groups. When K = 4, there was a division or shared ancestry between each individual except ‘Changbing’ and ‘Duanbing’ (Fig. 3C). Further combined with the analysis of the G matrix, the obtained G values for the ‘Changbing’ and ‘Duanbing’ cultivars reached up to 0.85 (Fig. 3E), indicating a high degree of homology or closely related breeding histories between them. Moreover, The Principal Component Analysis (PCA) also yielded similar results in the phylogenetic tree among the individuals. (Fig. 3D).
(A) Unrooted neighbor-joining tree constructed with a bootstrap value of 1000 between individuals. (B) Cross-validation error results from different K values. (C) ADMIXTRUE analysis among the accessions (K = 2–4). (D) The principal component plot of 8 individuals. (E) Genetic relationship (G) matrix analysis between pairs of individuals, where the larger G value indicates the closer relationship between the samples.
3.4 SVs identification and annotation
A total of 2,829,651 SVs were identified in 8 domestic Chinese cherry varieties. Among them, the individual ‘Duanbing’ had the fewest variant sites with 259,237, while ‘Yongzhuyihao’ had the highest number of variant sites with 407,093, which was inconsistent with the variation trend for SNPs and InDels. In terms of structure variation types, there were 2,700,528 CTX types, 102,302 DEL types, 16,879 INV types, and 9,942 DUP types in all 8 samples. The proportion of CTX (95.44%) is extremely high among the different types of structure variations, potentially attributable to the limitation of genome scaffold assembly. Therefore, we took DEL, INV, and DUP types in 8 samples for further analysis.
After merged by SURVIVOR (Jeffares et al. 2017), there were 16,878 DEL types, 1,883 DUP types, and 753 INV types were found at least in two individuals. We also analyzed the distribution of these SVs throughout the genome and determined that 27.5% of SVs coincide with genic regions, including CDS, introns, and promoters (Table S12).
3.5 Secondary genotyping of heterozygous variants in SVs
As is well-known, there are three strategies to detect structural variation: discordant read pair, split reads, and read depth (Alkan et al. 2011). But for the DEL type, no matter which detection method is used, there is no doubt that the read depth of the variation region is always lower than that at wild-type sites (Cai et al. 2019). Therefore, the core idea of the secondary typing of the DEL variant type is to determine whether the heterozygous genotype (0/1) is 0001, 0101, or 0111 according to the ratio of the reads depth at the heterozygous mutation locus (that is, the number of reads supporting reference) to the reads depth within a specified interval size (such as 100bp) on the both sides of the variant without any variation.
The workflow includes four steps and is described in materials and methods in detail. Since these sites have been identified as heterozygous genetype (0/1) of DEL variations by Lumpy (Layer et al. 2014) software algorithms, we employ boxplots to examine the distribution pattern and remove outliers within the radio data that we acquired in 8 individuals (Table 2, Fig. 4). In the 8 sample boxplots, the upper quartile, median, and lower quartile distributions were 0.762–0.773, 0.631–0.658, and 0.476–0.510, respectively. And the proportion of outliers in each file ranges from 2.29–3.06% (Table 2).
Table 2 The statistics of the original radio data in 8 individuals.
Count: total number of heterozygous genotypes in DEL variation. Std: standard deviation. 25%: the lower quartile. 50%: the same as the median. 75%: the upper quartile. Count_outliers: the number and proportion of outliers filtered by the boxplot.
(A) Boxplots of the raw ratio data in 8 individuals. (B) Boxplot of the ratio excluding outliers in 8 individuals. The y-axis indicates the radio of Vrd/Srd. Vrd: the average read depth of the variation; Srd: the average reads depth of its surrounding area.
Subsequently, we use a threshold range to determine secondary typing on each individual file, and replace the heterozygous loci of each DEL type in the merged file based on the results with the secondary typing type. After obtaining the population vcf file modified into secondary typing, we selected the genotypes that exhibit a similar trend based on the phenotype data of fruit peel color from 8 samples. In 8 individuals, the ‘Changbing’ variety showed the reddest fruit peel color, with an average of 26.736 in a* value, which was more than twice that of the yellow sample ‘HXyaohuang’ (10.042) (Fig. 5A). It indicates that the genotype of fruit peel color in the two samples was significantly different. Given that it is uncertain whether the variant exerts a dominant or recessive effect on the phenotype, it can only be extrapolated that if one individual exhibits a genotype of 1111, the corresponding individual must invariably display a genotype of 0000, and the other samples show an increasing or decreasing trend of dominance. Based on this, 2 loci correlated with fruit peel color have been discerned within 10 scaffold locations (Fig. 5C-D, Table S13). For the phenotypic data of fruit weights, we utilized the same strategy and identified a locus associated with fruit weights from 11 scaffold positions (Fig. 5E, Table S14).
(A) Comparison of fruit peel color in 8 samples. The y-axis indicates the a* value in fruit peel color, the higher the value, the redder the peel color. (B) Comparison of fruit weight in 8 samples. ∗∗∗∗ P-value < 0.0001, t-test. The sample is arranged in ascending order of values. (C) The reads depth of the candidate scaffold site (‘scaffold7622’) related to fruit peel color in 8 samples. (D) The reads depth of the candidate scaffold site (‘scaffold112378’) related to fruit peel color in 8 samples. (E) The reads depth of the candidate scaffold site (‘scaffold45003’) related to fruit weight in 8 samples. The red dashed line represents the DEL-type SV interval in 8 samples.