Illumina sequencing, de novo assembly, and functional annotation
Illumina sequencing of eleven libraries for seed, leaf, and stem tissues generated an average of 48,945,309 raw reads and 46,614,613 clean reads, and high-quality clean reads were used for subsequent analyses (Table 1). Transcripts amounting to 450,361 were assembled with an N50 length of 1346 bp and an N90 length of 281 bp, ranging from 201 to14,257 bp with a mean of 766 bp. The transcripts were then clustered into a total of 257,773 unigenes with 1129 bp, 1629 bp, and 531 bp for mean length, N50, and N90, respectively (Table 2 & Fig. S1). The different expression patterns and normal distributions of the density distributions of FPKM in the eleven samples are shown in Fig. S2. A total of 193,082 (74.90%) of the unigenes were annotated against public protein databases, with 160,662 (62.32%), 155,087 (60.16%), 119,080 (46.19%), 113,501 (44.03%), 60,321 (23.4%), 44,775 (17.36%), and 121,098 (46.97%) unigenes having at least one hit from the NCBI Nr, Nt, PFAM, Swiss-Prot, KEGG, KOG, and GO databases, respectively (Fig. 1a). Unigenes amounting to 25,739 (9.98%) were annotated in all seven databases, and 64,691 unigenes (25.1%) did not have any matches, and are potentially novel genes that arose during the evolution of the B. catharticus genome. Based on Nr database annotation, compared to more than 700 species, unigenes of B. catharticus were shown with the top match of Hordeum vulgare (24.6%), followed by Aegilops tauschii (19.9%), Triticum urartu (12.0%), Brachypodium distachyon (11.7%), and Triticum aestivum (5.9%; Fig. 1b). GO enrichment analysis of unigenes showed that cellular processes, metabolic processes, and single-organism processes were highly enriched, especially “binding” (66,849) and “catalytic activity” (53,923) in the molecular function category, and “cell” (38,314) and “cell part” (38,293) among cellular components (Fig. S3a). Among the five categories of KEGG pathways, metabolism was the largest, with 11 subgroups and 28,653 genes, including the most significant subgroup of transcription with 7,828 genes (Fig. S3b).
Table 1
Summary of transcriptome sequencing data for B. catharticus
Sample IDs
|
Raw reads
|
Clean reads
|
Clean bases (G)
|
Q20 (%)
|
GC content (%)
|
S1
|
49,551,462
|
47,034,074
|
7.06
|
96.27
|
57.56
|
S2
|
44,389,500
|
41,857,858
|
6.28
|
95.96
|
55.53
|
S3
|
47,129,920
|
44,433,520
|
6.67
|
96.23
|
56.39
|
F1
|
50,042,628
|
47,488,618
|
7.12
|
96.30
|
53.44
|
F2
|
56,344,998
|
53,995,100
|
8.10
|
96.19
|
54.39
|
L1
|
41,792,200
|
39,406,238
|
5.91
|
96.26
|
56.06
|
L2
|
46,567,082
|
44,247,074
|
6.64
|
96.31
|
55.51
|
L3
|
48,262,052
|
46,279,372
|
6.94
|
96.42
|
55.47
|
T1
|
46,451,132
|
44,600,276
|
6.69
|
96.17
|
55.56
|
T2
|
53,009,106
|
50,781,390
|
7.62
|
96.14
|
56.32
|
T3
|
54,858,318
|
52,637,220
|
7.90
|
96.29
|
55.14
|
Mean
|
48,945,309
|
46,614,613
|
6.99
|
96.23
|
55.58
|
Table 2
Statistics of transcripts and unigenes obtained via Trinity assembly.
|
Min length
|
Mean length
|
Median length
|
Max length
|
N50
|
N90
|
Total nucleotides
|
Transcript (bp)
|
201
|
766
|
402
|
14,257
|
1,346
|
281
|
344,828,318
|
Unigene (bp)
|
201
|
1,129
|
808
|
14,257
|
1,629
|
531
|
290,927,105
|
Analysis of DEGs in tissues of B. catharticus
In order to identify tissue-specific genes, we performed comparative analysis between samples S1, S2, F1, and F2 (seeds, named as S), L1, L2, and L3 (leaf, as L), and T1, T2 and T3 (stem, as T). We conducted a comparative analysis among S, L, and T for DEGs (p < 0.05), and determined tissue-specific genes by overlapping upregulated DEGs in the two comparisons. A total of 1803 genes were found to be specifically expressed in developing and germinating seeds (overlapping DEGs in “S vs L” and “S vs T,” Fig. 2a), and 3,030 and 1,570 genes were specifically expressed in leaves (overlapping DEGs in “L vs S” and “L vs T,” Fig. 2b) and stems (overlapping DEGs in “T vs S” and “T vs L,” Fig. 2c), respectively. We found genes specifically expressed in developing and germinating seeds were significantly enriched in the KEGG pathway of “ribosome” (Fig. 2d). Leaf-specific genes were significantly enriched in genes involved in carbon metabolism, photosynthesis, glyoxylate, and dicarboxylate metabolism, and in carbon fixation in photosynthetic organisms (Fig. 2e). Stem-specific genes were significantly enriched in genes involved in phenylpropanoid biosynthesis, plant-pathogen interaction, amino acid biosynthesis, and the MAPK signaling pathway (Fig. 2f).
Identification of EST-SSR markers and genetic diversity
A total of 37,288 SSRs were identified in 32,370 (14.47%) unigenes, in which more than one SSR was present in 4,238 transcripts. Tri-nucleotide SSR repeats were the most abundant (15,208), followed by mono-nucleotides (12,892) and di-nucleotides (8,174). A or T repeats (10,513) were four or five times more common than C or T repeats (2,379) among mono-nucleotide repeats. CCG/CGG (3,554), AGG/CCT (1,894) and AGC/CTG (1,638) comprised 46.59% of all the tri-nucleotide repeat motifs among SSRs. AAAG/CTTT, ATCCG/ATCGG, and ACCTCC/AGGTGG repeats dominated the tetra-, penta-, and hexa-nucleotide repeats, respectively (Fig. 3), although they were rare. The longest repetition type (24 tandem repeats) was C/G repeats (Table S2). Of SSRs, 27,540 were successfully used to design > 80,000 primers; a total of 420 primer pairs were randomly selected to determine polymorphisms and performance, and finally 350 primer pairs and their PCR products were used successfully to amplify genomic DNA fragments across 24 B. catharticus accessions. To further validate the sequences of the SSR loci, ten PCR products from nine accessions for two SSR markers were sequenced. In all of the cases, the sequenced alleles from the selected accessions were homologous to the original locus from which the primers were designed, which effectively guarantees acceptable downstream analysis (Fig. S4).
We further selected 52 effective primer pairs that were designed based on the di-, tri-, and tetra-nucleotide SSRs with 5 ~ 10 repeats (Table S3). The mean PP, PIC, Rp, and MI values were 91.60%, 0.254%, 1.14%, and 0.67%, respectively. Among the three repeat types, tetra-nucleotide SSRs showed relatively higher PIC (0.290), Rp (1.29), and MI (0.73) values, indicating a high discriminatory power. Primers from five-repeat SSRs showed higher PIC values (0.285), while seven-repeat SSRs showed higher Rp (1.29) and MI (0.86) values (Table 3). A total of 152 fragments were used to estimate Nei’s genetic distance of the 24 accessions. The genetic distance values ranged from 0.008 to 0.790, with an average distance of 0.364. The maximum value (0.790) was observed between accession PI442077 and PI309955, and the minimum coefficient value (0.008) was observed between PI187000 and PI595114. To study the genetic relationship and population structure of B. catharticus, UPGMA clustering analysis and STRUCTURE analysis were performed (Fig. 4). Based on maximum likelihood and delta K values (K = 6), the accessions in this study were divided into five STRUCTURE subgroups (represented by five colors: blue-green, purple, orange, green, and light yellow). The result of four UPGMA clusters (Cluster I ~ IV) was consistent with STRUCTURE analysis. Cluster II contained two STRUCTURE subgroups (purple and orange). Both Clusters I and II consisted of 10 accessions, and only two accessions were found in Clusters III and IV. In addition, 14 accessions (> 50%) had high membership coefficients (Q-value > 80%), including 7, 2, 2, and 3 accessions from South America, Asia, Europe, and North America, respectively (Fig. 4), indicating rare cross pollination events in B. catharticus. Genetic diversity in the subpopulation mostly matched the geographical locations. The genetic diversity index (He) among the four continents varied from 0.173 (Europe) to 0.238 (Asia), with an average of 0.209 (Table 4). Additionally, accessions collected from the wild had relatively high genetic diversity (0.259), followed by cultivar materials (0.251), and then materials of uncertain breeding status (0.179).
Table 3
Marker parameters calculated for each SSR type based on 52 primer pairs.
SSR type
|
NP
|
ANF
|
APF
|
PP (%)
|
PIC
|
Rp
|
MI
|
Repeat types
|
Di-
|
5
|
3.20
|
3.00
|
95.00%
|
0.153
|
0.58
|
0.45
|
Tri-
|
18
|
3.22
|
2.89
|
89.35%
|
0.225
|
1.06
|
0.64
|
Tetra-
|
29
|
3.17
|
2.90
|
92.41%
|
0.290
|
1.29
|
0.73
|
Number of repeats
|
10
|
1
|
5.00
|
5.00
|
100.00%
|
0.168
|
1.00
|
0.84
|
9
|
1
|
3.00
|
3.00
|
100.00%
|
0.192
|
0.67
|
0.58
|
8
|
2
|
3.00
|
2.50
|
87.50%
|
0.126
|
0.46
|
0.27
|
7
|
6
|
3.33
|
3.33
|
100.00%
|
0.238
|
1.29
|
0.86
|
6
|
15
|
3.27
|
2.87
|
87.22%
|
0.232
|
1.06
|
0.62
|
5
|
27
|
3.07
|
2.78
|
91.85%
|
0.285
|
1.23
|
0.69
|
Total
|
52
|
3.19
|
2.90
|
91.60%
|
0.254
|
1.14
|
0.67
|
NP- Number of primer pairs; ANF- Average number of fragments; APF- Average number of polymorphic fragments; PP- Percent of polymorphic fragments; PIC- Polymorphic information content; Rp- Resolving power; MI- Marker index. |
Table 4
Genetic diversity estimates for groups based on geographical distribution and breeding status
|
Pop
|
N
|
Na
|
Ne
|
I
|
He
|
Geographical distribution
|
South America
|
8
|
1.461
|
1.398
|
0.354
|
0.234
|
|
Asia
|
6
|
1.447
|
1.414
|
0.358
|
0.238
|
|
Europe
|
4
|
1.086
|
1.297
|
0.256
|
0.173
|
|
North America
|
6
|
1.263
|
1.323
|
0.285
|
0.189
|
|
Mean
|
6
|
1.314
|
1.358
|
0.313
|
0.209
|
Breeding status
|
Cultivar
|
6
|
1.520
|
1.430
|
0.378
|
0.251
|
|
Uncertain material
|
8
|
1.125
|
1.308
|
0.266
|
0.179
|
|
Wild material
|
10
|
1.743
|
1.435
|
0.397
|
0.259
|
|
Mean
|
8
|
1.463
|
1.391
|
0.347
|
0.230
|
N- Number of accessions; Na- Number of different alleles; Ne- Number of effective alleles; I- Shannon' information index; He- Expected heterozygosity. |
Phenotypic variation and Mantel analysis
The basis for evaluation of plant germplasm resources was the phenotype. In this study, 11 quantitative traits in 24 B. cartharticus accessions were investigated to estimate breeding potential and phenotypic variation. The descriptive statistics showed variation coefficients ranged from 2.21–24.87%, with an average of 11.61%, indicating a high level of phenotypic variance (Table S4). Analysis of UPGMA clusters and PCA showed that all 24 accessions were divided into four groups (A, B, C, and D), which showed a weak relationship with geographical distribution (Fig. 5). Group A contained only one accession (PI 442077) from Asia, with low biological yield, PH, SD, TN, and wide flag leaf. Group B consisted of four accessions from South America and two from Europe and Asia, and showed relatively high PH, SD, leaf width, and low TN and LFI. Group C contained almost all the accessions from North America (5) and Europe (3), and other accessions from Asia (4) and South America (3), and showed high TN and LFI, and low SD and leaf width. There were two accessions in Group D, PI308506 from South America and PI595118 from North America, which both had a particularly high TN, and low SD with a dwarf phenotype (Fig. 5a, Table S4). Moreover, PI308506 showed narrow leaves and late heading, and PI595118 showed broad leaves and early heading. The PCA analysis also indicated that the first two components accounted for 91.97% of the variance (76.45% and 15.53%, respectively), and that there were four groups (Fig. 5b).
Mantel analysis was performed to understand the relationship between genetic distance and 11 traits based phenotypic distance, which showed a high correlation (r = 0.612, P < 0.01) (Fig. 6). While Mantel analysis was also used to analyze the correlation between the Euclidean distance of 11 phenotypic traits and genetic distance of 24 accessions respectively (Table 5). The results showed that the coefficients ranged from 0.340 to 0.666, and four traits (PH, FLW, SD, and FMY) had high significant correlation (r > 0.6, P < 0.001). The genetic data could mainly explain the phenotypic difference. For example, the low biological yield, PH, SD, TN, and wide flag leaf of PI442077 in Group A might be due to particular genetic information in Cluster IV (Figs. 4 & 5, Table S4). All ten accessions in Cluster I were found in Group C, and all six accessions in Group B were in Cluster II. Both PI308506 and PI595118 in Group D showed dwarf phenotype, with more tillers and thinner stems (Fig. 5, Table S4), but there were significant differences in heading date and leaf width between PI308506 in Cluster IV and PI595118 in Cluster III, which might be explained by a split in genetic clustering.
Table 5
Correlation coefficients between Euclidean distances based on 11 traits and genetic distances of 24 accessions
|
Euclidean distance
|
DMY
|
FLL
|
FLW
|
HD
|
LFI
|
PH
|
PLL
|
PLW
|
SD
|
TN
|
FMY
|
r
|
0.664
|
0.412
|
0.612
|
0.597
|
0.369
|
0.656
|
0.340
|
0.532
|
0.666
|
0.428
|
0.664
|
P value
|
0.010
|
0.019
|
0.001
|
0.003
|
0.020
|
0.001
|
0.038
|
0.005
|
0.001
|
0.007
|
0.001
|
Note: r is calculated as correlation coefficient based on the genetic distances of 24 accessions and Euclidean distances of 11 traits. |