Genome sequencing and sequence assembly
To avoid the influence of potential heterozygosity, we extracted DNA from the leaves of a single common vetch plant for library construction (Fig. 1). After filtering out low-quality data, we obtained approximately 79.84 Gbp of high-quality data from the sequencing library, approximately 51 times the estimated genome size. The Q20 and Q30 values of the obtained data were greater than 97% and 92%, respectively, indicating the reliability of the genome survey sequencing. We then assembled all of the high-quality data de novo (K-mer = 75) using the de Bruijn graph-based SOAPdenovo software. A total of 4,227,942 raw contigs were obtained, with a total length of 1,475,990,986 bp and a contig N50 of 1,245 bp (Table 1). Finally, the assembled common vetch genome consisted of 3,754,145 scaffolds with a total length of 1,516,858,186 bp and a scaffold N50 of 3,556 bp (Table 1).
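As an illustration of the two summary figures used above, the following minimal sketch computes an N50 value and the sequencing depth. The function name `n50` and the toy scaffold lengths are hypothetical; only the 79.84 Gbp and 1,568 Mbp figures come from the text, and this is not the assembler's own implementation.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths (hypothetical, not taken from the assembly).
toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)

# Sequencing depth from the figures reported in the text.
clean_data_bp = 79.84e9        # ~79.84 Gbp of high-quality data
estimated_genome_bp = 1568e6   # ~1,568 Mbp estimated genome size
depth = clean_data_bp / estimated_genome_bp

print(toy_n50, round(depth, 1))
```

With the reported values the depth comes out at roughly 51-fold, matching the coverage stated above.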
Genomic characteristics
Based on the K-mer analysis (K-mer = 17), the peak K-mer depth and the number of K-mers were calculated as 45 and 70,575,281,718, respectively. The genome size of common vetch was estimated at 1,568 Mbp, and the heterozygosity rate of the genome was 0.4345%, suggesting that common vetch is a self-pollinating species (Fig. 2a).
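The genome size estimate follows from dividing the total number of K-mers by the peak K-mer depth. The short sketch below reproduces that arithmetic with the two values reported above; the function name `estimate_genome_size` is a hypothetical helper, and any corrections applied by the original analysis pipeline are not modeled.

```python
def estimate_genome_size(kmer_count, peak_depth):
    """Estimate genome size (bp) as total K-mer count / peak K-mer depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return kmer_count / peak_depth


kmer_count = 70_575_281_718   # number of 17-mers reported in the text
peak_depth = 45               # peak K-mer depth reported in the text

genome_size_bp = estimate_genome_size(kmer_count, peak_depth)
genome_size_mbp = genome_size_bp / 1e6
print(f"{genome_size_mbp:.0f} Mbp")
```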
To investigate the guanine plus cytosine (GC) content of the common vetch genome, we built a scatterplot using scaffolds larger than 500 bp, which also reveals potential bias in the sequencing data (Fig. 2a). The results showed that the GC content of the common vetch genome was 35%, consistent with the main peak in the scatterplot. Moreover, the confidence area (shown in red) was centered around the peak at 35%, suggesting that the DNA sample used for genome survey sequencing was not contaminated by DNA from other species.
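A per-scaffold GC calculation of the kind underlying such a scatterplot can be sketched as follows. The function names, the toy sequences, and the decision to exclude ambiguous bases (such as N) from the denominator are assumptions for illustration; the text only specifies the 500 bp length cutoff.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def gc_per_scaffold(scaffolds, min_length=500):
    """Return {name: GC fraction} for scaffolds longer than min_length bp."""
    return {
        name: gc_content(seq)
        for name, seq in scaffolds.items()
        if len(seq) > min_length
    }


# Toy scaffolds (hypothetical sequences, not from the assembly).
toy_scaffolds = {
    "scaf_a": "ATGC" * 150,      # 600 bp, 50% GC
    "scaf_b": "AATTG" * 120,     # 600 bp, 20% GC
    "scaf_short": "GGCC" * 10,   # 40 bp, below the length cutoff
}
gc_values = gc_per_scaffold(toy_scaffolds)
print(gc_values)
```

Plotting each scaffold's GC fraction against its sequencing depth would then give a scatterplot comparable in spirit to the one described above.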
Genomic SSR marker development
The assembled scaffolds were searched for genomic SSRs using the MISA software (http://pgrc.ipk-gatersleben.de/misa/misa.html). A total of 76,810 putative SSRs were identified in 58,373 scaffolds, of which 12,050 contained more than one SSR. Among the identified putative SSRs, 4,932 were present in compound form. The most abundant SSR type was dinucleotide, accounting for 44.94% of the total SSRs, followed by tri- (35.82%), tetra- (13.22%), penta- (4.47%) and hexanucleotide (1.54%) SSRs (Fig. 3). The density of SSRs identified in the assembled common vetch genome was one SSR per 20.41 kb.
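To make the idea of an SSR search concrete, the sketch below finds perfect di- to hexanucleotide repeats with regular expressions. It is not MISA and does not reproduce its behavior (for example, compound SSRs are not detected); the minimum repeat counts in `MIN_REPEATS` are hypothetical, since the thresholds used in the study are not stated here.

```python
import re

# Hypothetical minimum repeat counts per motif length (assumed values;
# the thresholds used in the study are not given in the text).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A unit of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip e.g. "ATAT" reported as a tetranucleotide motif.
            if is_primitive(motif):
                repeats = len(match.group(0)) // unit_len
                hits.append((match.start(), motif, repeats))
    return sorted(hits)


# Toy sequence (hypothetical): an (AG)7 and an (AAT)5 repeat.
toy_seq = "GGC" + "AG" * 7 + "TTCA" + "AAT" * 5 + "GC"
toy_hits = find_ssrs(toy_seq)
print(toy_hits)
```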
The SSRs were categorized by their repeat motifs. The most abundant repeats were AG/CT (17.29%) and AC/GT (15.54%), followed by AT/AT (12.02%), AAC/GTT (10.07%), AAT/ATT (9.93%), AAG/CTT (9.37%) and AAAT/ATTT (4.95%). The most abundant pentanucleotide repeats were AAAAT/ATTTT (1.27%) and AAACC/CGTTT (0.79%) (Fig. 4). Furthermore, we designed primers for 58,175 SSRs using the Primer 3.0 software. The detailed primers are shown in Table S3.
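Motif classes such as AG/CT pool a motif with its rotations and with its reverse complement, so that GA, TC and CT are all counted together with AG. A minimal sketch of that grouping is given below; the function names and the choice of the alphabetically smallest rotation as the class label are assumptions for illustration, not a description of the software used.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Reverse complement of a DNA motif."""
    return motif.translate(_COMPLEMENT)[::-1]


def smallest_rotation(motif):
    """Alphabetically smallest cyclic rotation of the motif."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def motif_class(motif):
    """Label such as 'AG/CT' shared by all rotations on both strands."""
    motif = motif.upper()
    forward = smallest_rotation(motif)
    reverse = smallest_rotation(reverse_complement(motif))
    first, second = sorted([forward, reverse])
    return f"{first}/{second}"


# Toy list of observed motifs (hypothetical).
observed = ["GA", "TC", "CT", "TA", "TTC", "GAA"]
class_counts = Counter(motif_class(m) for m in observed)
print(class_counts)
```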
Genetic diversity and cluster analysis of Chinese common vetch
Ten polymorphic SSR markers were selected randomly to investigate the genetic diversity of 68 Chinese common vetch accessions. In total, we obtained 76 alleles from the 10 SSR loci (Table 2). For each SSR locus, the number of different alleles (Na) and the effective number of alleles (Ne) ranged from 3 (SSR-12) to 16 (SSR-13) and from 1.2786 (SSR-5) to 6.1286 (SSR-13), respectively. The mean Na and Ne were 7.6 and 3.4905, respectively. The observed heterozygosity (Ho) and expected heterozygosity (He) ranged from 0 (SSR-12) to 0.1765 (SSR-10) and from 0.3195 (SSR-5) to 0.8430 (SSR-13), with averages of 0.0632 and 0.6438, respectively. The polymorphism information content (PIC) ranged from 0.217802 (SSR-5) to 0.836845 (SSR-13), with an average of 0.639076. Another parameter, Shannon’s information index (I), ranged from 0.5341 (SSR-5) to 2.1759 (SSR-5), with an average of 1.3387. Taken together, SSR-13 showed the highest polymorphism, followed by SSR-14, and SSR-5 showed the lowest (Table 2). These results suggested that the 68 common vetch accessions from China harbored high genetic diversity.
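For readers unfamiliar with these indices, the sketch below computes them for a single locus from diploid genotypes using common textbook definitions: Ne = 1/Σp², He = 1 − Σp², I = −Σp ln p, and PIC = 1 − Σp² − ΣΣ2pᵢ²pⱼ². These definitions, the function name `locus_stats`, and the toy genotypes are assumptions; the software and exact estimators used in the study (for example, any sample-size correction) are not stated here, so values need not match Table 2.

```python
import math
from collections import Counter


def locus_stats(genotypes):
    """Diversity indices for one locus.

    genotypes: list of (allele1, allele2) tuples, one per diploid individual.
    """
    allele_counts = Counter(a for genotype in genotypes for a in genotype)
    total = sum(allele_counts.values())
    freqs = [count / total for count in allele_counts.values()]
    sum_sq = sum(p * p for p in freqs)
    # Cross term of the PIC formula, summed over all allele pairs i < j.
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return {
        "Na": len(allele_counts),                        # number of alleles
        "Ne": 1 / sum_sq,                                # effective alleles
        "Ho": sum(a != b for a, b in genotypes) / len(genotypes),
        "He": 1 - sum_sq,                                # uncorrected He
        "I": -sum(p * math.log(p) for p in freqs),       # Shannon's index
        "PIC": 1 - sum_sq - cross,
    }


# Toy genotypes (hypothetical allele sizes in bp).
toy_genotypes = [(150, 150), (154, 154), (150, 154), (150, 154)]
toy_stats = locus_stats(toy_genotypes)
print(toy_stats)
```

A low Ho alongside a high He, as reported above, is the pattern expected when most individuals are homozygous but carry different alleles.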
In addition, we constructed a hierarchical tree of the Chinese common vetch accessions based on dissimilarity data to infer the phylogenetic relationships among the 68 accessions. Unweighted neighbor-joining analysis produced a dendrogram with two main subgroups (A and B) containing 6 and 10 clusters, respectively (Fig. 5). In detail, subgroup A consisted of 33 accessions, most of which were wild accessions or landraces; in contrast, subgroup B was composed of 35 accessions, of which only 17 were wild accessions (4) or landraces (13). Moreover, the clusters could hardly be linked to the geographic origins of the accessions, suggesting that more markers should be used in further population structure analyses.
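The dissimilarity coefficient behind the tree is not specified in the text. As one possible illustration, the sketch below assumes a simple-matching dissimilarity for diploid SSR genotypes, d = 1 − (matched alleles) / (ploidy × number of loci); the function names and toy accessions are hypothetical, and the neighbor-joining step itself is not implemented.

```python
def shared_alleles(genotype_a, genotype_b):
    """Number of alleles matched between two genotypes at one locus."""
    remaining = list(genotype_b)
    matches = 0
    for allele in genotype_a:
        if allele in remaining:
            remaining.remove(allele)   # each allele can be matched once
            matches += 1
    return matches


def simple_matching_dissimilarity(ind_a, ind_b, ploidy=2):
    """Dissimilarity between two individuals genotyped at the same loci."""
    if len(ind_a) != len(ind_b) or not ind_a:
        raise ValueError("individuals need the same, non-zero number of loci")
    matched = sum(shared_alleles(a, b) for a, b in zip(ind_a, ind_b))
    return 1 - matched / (ploidy * len(ind_a))


def dissimilarity_matrix(individuals):
    """Return {(name_a, name_b): dissimilarity} for all ordered pairs."""
    return {
        (a, b): simple_matching_dissimilarity(individuals[a], individuals[b])
        for a in individuals
        for b in individuals
    }


# Toy accessions genotyped at two loci (hypothetical allele sizes).
toy_accessions = {
    "acc1": [(150, 150), (200, 204)],
    "acc2": [(150, 154), (200, 204)],
    "acc3": [(154, 154), (210, 210)],
}
toy_matrix = dissimilarity_matrix(toy_accessions)
print(toy_matrix[("acc1", "acc2")], toy_matrix[("acc1", "acc3")])
```

A matrix of this kind is the usual input to a tree-building method such as neighbor-joining.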