Identification and characterization of SSRs in the transcriptome. Microsatellites can be divided into perfect SSRs, imperfect SSRs and composite SSRs29. In this study, the perfect and composite SSRs in the full-length transcriptome of C. chekiangoleosa were statistically analyzed, and summary microsatellite information is shown in Table 1. A total of 97510 SSRs (including 17690 composite SSRs) were retrieved from 65215 unigene sequences with a total length of 188333521 bp, and 48281 of these unigene sequences contained SSRs. The frequency of occurrence of SSRs (the proportion of unigenes containing at least one SSR) was 74.03%, with an average of one SSR every 1.93 kb. The frequencies of the SSR repeat types differed markedly in the full-length transcriptome of C. chekiangoleosa. Mononucleotide repeats were the main repeat type, accounting for 51.29% of all perfect SSRs, followed by dinucleotide (34.36%), trinucleotide (11.24%), tetranucleotide (1.44%), hexanucleotide (1.11%) and pentanucleotide repeats (0.56%) (Table 1).
Motif analysis of the main SSR repeat types (Supplementary Table S1) identified 2, 4, 12 and 30 motifs for the mono-, di-, tri- and tetranucleotide repeat types, respectively. Mononucleotide repeats were dominated by A/T repeat motifs, which accounted for 49.95% of all perfect SSRs, while C/G repeat motifs were relatively rare (1.34%). Among dinucleotide repeats, AG repeats (23.66%) were the most abundant, followed by AT (7.89%) and AC repeats (2.67%), while CG repeats (0.04%) were the least abundant. Among the trinucleotide repeats, AAG repeats (2.29%) accounted for the largest proportion, followed by ACC (1.83%) and ATT (1.28%), and the proportions of the other nine repeat motifs were all low. Among the tetranucleotide repeats, A/T-rich repeat motifs (AAAT, AAAG, AAAC, AACT, AATC, AATG, AATT, AGAT, ATAC, ATTT) accounted for 1.24% of all perfect SSRs, while G/C-rich motifs were relatively rare. Thus, the major repeat motifs of every SSR repeat type were rich in A/T nucleotides.
Table 1. The number and frequency of SSRs in C. chekiangoleosa
Characteristic | Transcript sequences
Total number of sequences examined | 65215
Total size covered by examined sequences (bp) | 188333521
Total number of SSRs identified | 97510
Number of compound microsatellites | 17690
Number of SSR-containing sequences | 48281
Total frequency of occurrence | 0.74
Average distance (bp) | 1931.43
Mononucleotide repeats (MNRs) | 40942 (51.29%)
Dinucleotide repeats (DNRs) | 27428 (34.36%)
Trinucleotide repeats (TNRs) | 8974 (11.24%)
Tetranucleotide repeats (TTNRs) | 1146 (1.44%)
Pentanucleotide repeats (PTNRs) | 442 (0.56%)
Hexanucleotide repeats (HXNRs) | 889 (1.11%)
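The summary statistics in Table 1 can be recomputed from the raw counts. The following is a minimal sketch, assuming that the frequency of occurrence is the share of SSR-containing sequences, that the average distance is the total sequence length divided by the total number of SSRs, and that the repeat-type percentages use the perfect (non-compound) SSRs as the denominator. The variable names are illustrative and are not taken from the original analysis.

```python
# Counts copied from Table 1.
n_sequences = 65215        # unigene sequences examined
total_bp = 188333521       # total length of the examined sequences (bp)
n_ssrs = 97510             # all SSRs, including compound SSRs
n_compound = 17690         # compound (composite) SSRs
n_ssr_sequences = 48281    # sequences containing at least one SSR
n_mono = 40942             # mononucleotide repeats

# Share of examined sequences that contain at least one SSR.
frequency = n_ssr_sequences / n_sequences

# Mean spacing between SSRs across all examined sequence.
avg_distance_bp = total_bp / n_ssrs

# Assumption: repeat-type percentages are relative to perfect SSRs only.
n_perfect = n_ssrs - n_compound
mono_share = n_mono / n_perfect

print(f"frequency of occurrence: {frequency:.2%}")
print(f"average distance: {avg_distance_bp:.2f} bp")
print(f"mononucleotide share of perfect SSRs: {mono_share:.2%}")
```

Under these assumptions the three values agree with the 74.03%, 1931.43 bp and 51.29% reported in Table 1, which suggests that the repeat-type percentages exclude compound SSRs.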
The length variation of SSRs differed markedly among repeat types in the full-length transcriptome of C. chekiangoleosa (Fig. 1). In Figure 1, SSR lengths with a frequency ≤ 1% were merged into a single black section. The number of sections in each pie chart represents the variation in SSR length: the more sections there are, the higher the polymorphism of the SSRs. Based on the number of sections, the highest degree of length variation was found for mononucleotide repeats and the lowest for pentanucleotide repeats. From mononucleotide to pentanucleotide repeats, the length variation of the SSRs decreased as the length of the repeat unit increased.
According to the statistics on the SSR distribution in unigenes in the C. chekiangoleosa full-length transcriptomic SSR database, the proportions of SSRs in the 5'UTR and 3'UTR were 43.62% and 37.54%, respectively, and only a small fraction of SSRs (10.76%) were located in the CDS region (Fig. 2(a)). Among the perfect SSRs located in the CDS and UTR regions, the proportions of the repeat types in both the 3'UTR and the 5'UTR decreased in the following order: mono-, di-, tri-, tetra-, hexa- and pentanucleotide (Fig. 2(b)). In the CDS region, trinucleotide repeats were the main type of SSR (42.95%), followed by dinucleotide repeats (37.39%), while pentanucleotide repeats were the least common, accounting for only 0.36% of the SSRs (Fig. 2(b)). Figure 2(c) shows that mononucleotide, tetranucleotide and pentanucleotide repeats were mainly located in the 3'UTR, which contained 50.61%, 54.81% and 49.88% of the SSRs of each type, respectively. Trinucleotide (44.80%) and hexanucleotide repeats (38.06%) were mainly located in the CDS region, whereas 56.26% of the dinucleotide repeats were located in the 5'UTR and only 12.95% in the CDS region.
Functional analysis and transcription factor prediction based on transcripts containing SSRs. A total of 65215 unigenes (48323 containing SSRs) were compared with the GO and KEGG databases. The distribution of SSR-containing unigenes across the annotated GO functional groups and KEGG metabolic pathways was highly significantly correlated with that of all unigenes (P<0.01). A total of 31382 unigenes (69.93% containing SSRs) were annotated in the GO database (Supplementary Table S2, Fig. 3). A total of 35095 (70.64% containing SSRs), 38455 (69.55%), and 49670 (69.70%) unigenes were classified into the cellular component, molecular function and biological process categories, respectively. Within the cellular component category, cells and cell parts (6393 unigenes, 70.20% containing SSRs) constituted the largest group of unigenes, followed by membrane structure (5799, 72.03%), whereas the nucleoid (3, 33.33%) constituted the smallest group. In this category, the highest proportion of unigenes containing SSRs was associated with cell junctions (100.00%), and the lowest with the nucleoid (33.33%). Similarly, in the molecular function category, the unigenes involved in binding (19147, 70.15%) constituted the largest group, whereas very few unigenes were related to obsolete signal transmitter activity (5, 20.00%) or cargo receiver activity (2, 100.00%); these two groups also had the lowest and highest proportions of SSR-containing unigenes in this category, respectively. Most unigenes in the biological process category were annotated to metabolic process (14921, 69.33%) and cellular process (13520, 69.55%). All unigenes annotated to the nitrogen utilization, pigmentation and obsolete mitochondrial respiratory chain complex IV biogenesis groups contained SSRs. By contrast, in the carbohydrate utilization and cell killing functional groups, only 33.3% of unigenes contained SSRs.
A total of 54366 unigenes (72.63% containing SSRs) were annotated in the KEGG database; these unigenes were involved in 6 categories (metabolism, genetic information processing, cellular processes, environmental information processing, organismal systems and human diseases) and 357 metabolic pathways (Supplementary Table S3, Fig. 4). The greatest number of unigenes was related to metabolism (12920), followed by human diseases (9212), and the smallest number was related to cellular processes (4296). The proportions of unigenes containing SSRs involved in metabolism, genetic information processing, cellular processes, environmental information processing, organismal systems and human diseases were 73.35%, 70.72%, 75.54%, 75.63%, 74.73%, and 71.79%, respectively. Four metabolic pathways were related to oil: fatty acid metabolism (272 unigenes, 76.10% containing SSRs), fatty acid biosynthesis (188, 77.66%), unsaturated fatty acid biosynthesis (100, 71.00%) and alpha-linolenic acid metabolism (107, 74.77%). In addition, some pathways were related to glycolysis (357, 77.31%), the phosphatidylinositol signaling system (212, 72.64%), plant hormone signal transduction (508, 78.94%), the MAPK signaling pathway (103, 64.08%), the AMPK signaling pathway (345, 81.16%) and the calcium signaling pathway (108, 71.30%) (Supplementary Table S3, Fig. 4). We predicted that 3091 unigenes encoded transcription factors (TFs), among which 74.60% also contained SSRs (Supplementary Tables S4a and S4b). These TFs were divided into 86 TF families, among which the main families were SNF2 (149, 5.84%), C3H (140, 5.48%), MYB-related (102, 4.00%), PHD (98, 3.84%), SET (96, 3.76%) and C2H2 (93, 3.64%) (Supplementary Table S4c, Fig. 5).
SSR primer screening and polymorphism verification. For the di-, tri-, tetra- and pentanucleotide repeat types, 30 (60.00%), 34 (68.00%), 33 (66.00%) and 36 (72.00%) primer pairs, respectively, were amplifiable, and 28 (56.00%), 20 (40.00%), 19 (38.00%) and 31 (62.00%) pairs were polymorphic; the proportions of polymorphic primers among the amplifiable primers were therefore 93.33%, 58.82%, 57.58% and 86.11%, respectively (Table 2, Fig. 6A). Polymorphic primers in the ≥ 20 bp length class accounted for 73.33%, 29.41%, 57.58% and 86.11% of the amplifiable primers for the four repeat types, respectively (Table 2). Overall, 580 pairs of SSR primers were synthesized and tested. After screening, 300 pairs (51.72%) amplified clear bands, among which 155 pairs (26.72% of all tested pairs) were polymorphic (Supplementary Table S5), so the overall proportion of polymorphic primers among the amplifiable primers was 51.67%. A total of 360 primer pairs targeting the 3'UTR (120 pairs), 5'UTR (120 pairs) and CDS regions (120 pairs) were randomly selected from the 580 synthesized pairs of SSR primers. For the primers targeting the 3'UTR and 5'UTR, the amplification efficiencies were 62.50% and 54.17%, the development efficiencies of polymorphic primers were 33.33% and 25.00%, and the proportions of polymorphic primers were 53.33% and 46.15%, respectively. For the primers targeting the CDS region, the amplification efficiency, polymorphic primer development efficiency and proportion of polymorphic primers were 50.83%, 20.83% and 40.98%, respectively (Fig. 6B).
Table 2. Experimental results for di-, tri-, tetra- and pentanucleotide repeat SSR markers.
Repeat type | Primer development efficiency | Proportion of polymorphic primers (overall) | Proportion of polymorphic primers (< 20 bp) | Proportion of polymorphic primers (≥ 20 bp)
DNRs | 60.00% | 93.33% | 20.00% | 73.33%
TNRs | 68.00% | 58.82% | 29.41% | 29.41%
TTNRs | 66.00% | 57.58% | 0 | 57.58%
PTNRs | 72.00% | 86.11% | 0 | 86.11%
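The three primer statistics reported in this section are ratios of the same three counts. The sketch below is illustrative only: the function name `primer_metrics` and its dictionary keys are hypothetical, and the figure of 50 tested dinucleotide primer pairs is inferred from 30 amplifiable pairs corresponding to 60.00% rather than stated in the text.

```python
def primer_metrics(tested, amplified, polymorphic):
    """Return the three percentages used to describe a primer set.

    tested      -- number of primer pairs synthesized and tested
    amplified   -- number of pairs that amplified clear bands
    polymorphic -- number of amplifiable pairs that were polymorphic
    """
    return {
        # Amplifiable pairs as a share of all tested pairs.
        "amplification_efficiency": 100 * amplified / tested,
        # Polymorphic pairs as a share of all tested pairs.
        "polymorphic_development_efficiency": 100 * polymorphic / tested,
        # Polymorphic pairs as a share of the amplifiable pairs.
        "proportion_polymorphic": 100 * polymorphic / amplified,
    }


# Overall counts stated in the text: 580 tested, 300 amplifiable, 155 polymorphic.
overall = primer_metrics(tested=580, amplified=300, polymorphic=155)

# Dinucleotide repeats; 50 tested pairs is an inferred value (30 pairs = 60.00%).
dinucleotide = primer_metrics(tested=50, amplified=30, polymorphic=28)

for name, metrics in (("overall", overall), ("dinucleotide", dinucleotide)):
    summary = ", ".join(f"{key}={value:.2f}%" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

Separating the two denominators (tested pairs versus amplifiable pairs) makes it clear why the overall development efficiency of polymorphic primers (26.72%) is much lower than the proportion of polymorphic primers (51.67%).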
Effectiveness of the SSR primers based on population analysis. We selected 44 samples of C. chekiangoleosa to further evaluate the polymorphism of 27 primer pairs. A total of 103 alleles were obtained; the number of alleles (Na) ranged from 2 to 7 per locus, with an average of approximately 4 alleles per locus (Table 3). The observed heterozygosity (Ho) and expected heterozygosity (He) ranged from 0 to 0.795 and from 0.087 to 0.782, with mean values of 0.402 and 0.585, respectively. The LP and XP populations showed the highest (0.504) and lowest (0.390) genetic diversity, respectively (Supplementary Table S6). The polymorphism information content (PIC) of the 27 SSR markers ranged from 0.083 to 0.748, with an average value of 0.528. Based on the UPGMA clustering method, the 44 C. chekiangoleosa genotypes were clearly divided into four clusters (Fig. 7). All individuals in the XP, WYS and WY populations were grouped into cluster I, cluster II, and cluster III, respectively. The fourth cluster was mixed and included three populations (KH, DXY and LP), with almost no boundary between the DXY and LP populations.
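For readers unfamiliar with the diversity statistics above, the following is a minimal sketch of how Ho, He and PIC are commonly computed for a single locus. It assumes the standard definitions (He = 1 − Σp_i², and PIC as defined by Botstein et al.); the software and any sample-size corrections used in the original analysis are not specified here, and the genotypes and allele frequencies below are hypothetical illustration data, not values from this study.

```python
from itertools import combinations


def observed_heterozygosity(genotypes):
    """Fraction of diploid individuals carrying two different alleles."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def expected_heterozygosity(freqs):
    """He = 1 - sum(p_i^2), without any small-sample correction."""
    return 1.0 - sum(p * p for p in freqs)


def polymorphism_information_content(freqs):
    """PIC = 1 - sum(p_i^2) - sum over pairs i<j of 2 * p_i^2 * p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


# Hypothetical locus with three alleles at frequencies 0.5, 0.3 and 0.2.
example_freqs = [0.5, 0.3, 0.2]
# Hypothetical genotypes (allele sizes in bp) for four individuals.
example_genotypes = [(150, 150), (150, 154), (154, 158), (150, 158)]

ho = observed_heterozygosity(example_genotypes)
he = expected_heterozygosity(example_freqs)
pic = polymorphism_information_content(example_freqs)
print(f"Ho={ho:.3f}, He={he:.3f}, PIC={pic:.3f}")
```

PIC is always at most He for the same allele frequencies, which is consistent with the mean PIC (0.528) being lower than the mean He (0.585) reported above.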