Transcriptome sequencing and de novo assembly
We obtained averages of 48,786,977 and 55,202,249 raw reads for the rhizome (HSYK_1, HSYK_2, and HSYK_3) tissues and leaf (HSYY_1, HSYY_2, and HSYY_3) tissues, respectively, which consisted of 7,366,833,477 bp and 8,335,539,549 bp. We obtained 46,982,031 and 53,689,620 clean reads, including 7,053,884,682 bp and 8,067,994,222 bp clean bases, respectively. The sequences can be found in NCBI SRA with the accession number PRJNA628588. Moreover, the average GC content of rhizome and leaf tissues was 46.8% and 46.7%, respectively. The k-mer size of 25, 27, 30, 31 and 32 were selected to evaluate the assembly results based on the assembly size, number of contigs, N50 and complete BUSCOs. The results showed that the k-mer size of 30 for the final assembly results in the best final assembly. Trinity assembly resulted in 50,765 unigenes with fragment size ranging from 201 bp to 16,991 bp, and the average length of each unigene was 1049.6 bp. In addition, all of the unigenes consisted of a total of 53,283,125 bp. The unigenes were retained with an N50 length of 1825 bp and N90 length of 2797 bp. The TransRate assembly and Benchmarking UniversalSingle-Copy Orthologs (BUSCO) completeness assessment score were 0.26405 and 68.9%, respectively. The detailed sequencing and assembly results for the leaf and rhizome tissues of D. opposita are shown in Table 2.
Table 2
Optimized assembly results of unigenes from D. opposita rhizome and leaf tissues
Type | Unigene |
Total sequence num | 50,765 |
Total sequence base (bp) | 53,283,125 |
Largest (bp) | 16,991 |
Smallest (bp) | 201 |
Average length(bp) | 1,050 |
N50 (bp) | 1,825 |
N90 (bp) | 2,797 |
TransRate score | 0.26405 |
Complete single copy BUSCOs | 68.9% |
Functional annotation of D. opposita unigenes
The functional annotation results indicated that a total of 24,840 unigenes were annotated (Table 3). Among them, different unigenes were matched in different databases, for example, 23,848 (46.98%) unigenes were matched in the NR database.In addition, 18,531 (36.50%) unigenes were matched in the Swiss-Prot database.
Table 3
The annotation statistics of D. opposita
Type | Unigene number (percent%) |
NR | 23,848(46.98) |
Swiss-Prot | 18,531(36.50) |
Pfam | 18,805(37.04) |
COG | 21,973(43.28) |
GO | 20,014(39.42) |
KEGG | 10,617(20.91) |
Total_annot | 24,840(48.93) |
Total | 50,765(1) |
The length distributions of annotated and unannotated unigenes results indicated that most annotated and unannotated unigenes were distrbuted in the 200–499 bp range (annotated unigenes: 5970/24,840; unannotated unigenes: 16,040/25,925), followed by the 500–799 bp ranges (annotated unigenes: 3446/24,840; unannotated unigenes:5566/25,925), and 800–1099 bp range (annotated unigenes: 2394/24,840; unannotated unigenes: 2194/25,925). Long sequences of unigens were clearly more likely to be annotated than those with short sequences. A total of 7151 unigenes measuring over 2000 bp in length were annotated, and only 371 unigenes were unannotated.
Functional classification results of unigenes
To further evaluate the annotated unigene data, the unigene sequences were searched against the cluster of orthologous groups of proteins (COG) databases. In total, 23,139 unigenes assigned to 23 COG categories under the following types: information storage and processing, cellular processes and signaling, metabolism, and poorly characterized (Fig. 1). Among them, alphabetized“Posttranslational modification, protein turnover, chaperones”(O) (1373; 5.93 %) was the categories with the most members in the COG databases.
Using the Blast2GO database, the 20,014 unigenes in the gene ontology (GO) analysis could be summarized into three main GO categories biological process (BP), molecular function (MF), and cellular component (CC) on two levels. In the BP category, the cellular process (7345;36.70%), metabolic process (6480; 32.389%), and biological regulation (2125; 10.62%) were represented the most. Likewise, cell part (7756; 38.75%), membrane part (7018; 35.07%),organelle (4292; 21.44%) and organelle part (3154; 15.76%)were the most abundant in the CC category. Furthermore, the MF category mainly consisted of binding (10,949; 54.71%), catalytic activity (10,794; 53.93%), transporter activity (1074; 5.37%), and structural molecule activity (730; 3.65%) (Fig. S1).
In order to perform an in-depth analysis of unigene function based on the GO database,all of the three categories (BP, MF, and CC) were subdivided into level 3 and level 4 terms in order to determine their function. For example, cellular process is the main level 2 term from the BP category. In the level 3 terms, cellular process was assigned to the following specific functions: cellular metabolic process (GO: 0044237; 5615 unigenes; 28.06%), cellular component organization (GO:0016043; 1147 unigenes; 5.73%), cellular response to stimulus (GO:0051716; 555 unigenes; 2.78%), and signal transduction (GO:0007165; 526 unigenes; 2.63%), etc. (Fig. S2).
Through the KEGG database BLAST, 10,617 unigene sequences were assigned to 145 pathways (Table S1). Among them, the three pathways with the greatest number of unigene sequences were ribosome (map03010; 605 unigenes; 5.70%), plant-pathogen interaction (map04626; 281 unigenes; 2.65%),protein processing in endoplasmic reticulum (map04141; 276 unigenes; 2.60%), and thermogenesis (map04714; 248 unigenes; 2.34%) (Table S1). In addition, all the pathways were divided into five main types of KEGG metabolic pathways (Fig. 2).
Expression analysis of unigenes from leaf and rhizome tissues
In order to explore the differential expression in leaf and rhizome tissues of D. opposita, the clean reads from the leaf and rhizome library were compared with the unigene database by using the RNA-seq by RSEM package. The results indicated that 41,692,444 reads, respectively 77.62% of the original clean reads (53,689,620), were aligned in leaf tissues (HSYY: average reads of HSYY_1, HSYY_2, and HSYY_3). Likewise, 37,978,278 reads, representing 80.83% of the original clean reads (46,982.031), were aligned in rhizome tissues (HSYK: average reads of HSYK_1, HSYK_2, and HSYK_3).
In this study, the sample correlation analysis was performed using RSEM software (Tables S1). A high correlation coefficient was obtained among the three biological replicates of each tissue. The correlation coefficients of rhizome_1 (HSYK_1) with rhizome_2 (HSYK_2) and rhizome_3 (HSYK_3) were 0.935 and 0.913, respectively. The correlation coefficients of leaf_1 (HSYY_1) with leaf_2 (HSYY_2) and leaf_3 (HSYY_3) were 0.919 and 0.935, respectively. The correlation coefficient between two different of leaf and rhizome tissues was significantly lower than that among the three biological replicates of each leaf and rhizome tissue. The results indicated that all three biological replicates of each tissue are all clustered together, thus suggesting that the sequencing data are suitable for further biological analysis (Fig. S3).
Venn analysis of samples with a unigene expression greater than 2 was performed using RSEM software (Fig. S4). A total of 14,627 unigenes were shared among the six samples. Moreover, a total of 1552, 2295, 1271, 1195, 1588, and 1409 unigenes were specifically expressed in leaf_1 (HSYY_1), leaf_2 (HSYY_2), leaf_3 (HSYY_3), rhizome_1 (HSYK_1), rhizome_2 (HSYK_2), and rhizome_3 (HSYK_3), respectively. Overall, more unigenes were assembled in leaf tissues than in the rhizome tissues.
Principal component analysis (PCA) of the expression levels of total transcriptome (Fig. S5), revealed that the leaf (HSYY_1, HSYY_2, HSYY_3) and rhizome (HSYK_1, HSYK_2, HSYK_3) treatment groups could be obviously distinguished. In other words, the three biological replicates of each sampling group clustered well together. Hence, the sequencing data can be used for further analysis.
Identification and functional enrichment analysis of DEGs
To identify the differentially expressed genes (DEGs), we compared the expression unigenes in leaf and rhizome tissues based on the false discovery rate (FDR) through the DESeq2 software. The results showed that 13,259 significant DEGs (FDR< 0.05 and |log2FC|≥1) were identified in the leaf versus rhizome tissues, which included 5536 up-regulated unigenes and 7723 down-regulated unigenes (Fig. 3, Table S2).
A total of 13259 DEGs were enriched to 179 GO terms based on thresholds of FDR < 0.05, |log2FC| ≥ 1, p-value < 0.05 and p-value_corrected < 0.05 (Table S2). Fig. 4 indicates the top 20 enriched GO enrichment terms of DEGs. Among them, the membrane part (2964/8239), intrinsic component of membrane (2887/8239), integral component of membrane (2882/8239), and oxidoreductase activity (778/8239) were the most enriched GO terms of the DEGs (Fig. 4). By contrast, oxidoreductase activity, acting on iron-sulfur proteins as donors (13/8239), ferredoxin-NADP+reductase activity (10/8239) and omega peptidase activity (10/8239) were the least-enriched GO terms (p-value_corrected < 0.05).
To thoroughly understand the functions and interactions of DEGs between leaf and rhizome tissues, all DEGs were subjected to KEGG pathway enrichment analysis by mapping against the KEGG database using KEGG orthology based annotation system (KOBAS) software. The results indicated that 13,259 DEGs were involved in 144 metabolic pathways. There 23 pathways were enriched based on the p-value _uncorrected< 0.05 (Table S3). We found that these pathways were mainly related to the biosynthesis of other secondary metabolites, genetic information processing, and signal transduction, such as tropane, piperidine and pyridine alkaloid biosynthesis (map00960;28/4059), DNA replication (map03030; 12/4059), and phosphatidylinositol signaling system (map04070; 22/4059).
The top 20 enriched pathways with a p-value of < 0.05 are presented in Fig. 5. We found that phenylpropanoid biosynthesis (map00940; 76/4059; 1.87%), photosynthesis (map04626; 54/4059; 1.33%), glutathione metabolism (map00480; 51/4059; 1.26%), glycine, serine and threonine metabolism (map00260; 49/4059; 1.21%), and flavonoid biosynthesis (map00941; 40/4059; 0.99%) were the highest enriched pathways.
Detection and Characterization of SSR marker
Using MISA 1.0 software, a total of 20,607 potential simple sequence repeats (SSRs) were identified across 13,770 unigenes (27.12%), of which 3464 unigenes contained more than one SSR. Among the 20,607 SSRs, the mono- and di-nucleotide repeats were the most abundant types (10,031, 48.68%; 5312, 25.78%, respectively), followed by tri- (4603, 22.34%), tetra- (440, 2.14%), hexa - (128, 0.62%), penta - (93, 0.45%), and nucleotide repeats. In addition, the repetitions of different SSRs types were calculated for four levels of repeat numbers 1–5, 6–10, 11–15, and >15 and are presented in Fig. S6. The results indicated that most mono-nucleotide repetitions were distributed in repeat number 6–10 (4385), followed by mono-nucleotide in 6–10 (4385), di-nucleotide in 6–10 (4095), tri-nucleotide 6–10 (2344), tri-nucleotide 1–5 (1991), di-nucleotide in 11–15 (874), and tetra-nucleotide 1–5 (290). Analysis of the SSR motif length distribution showed that most of the SSR motifs are less than 100 bp in length. The results indicated that 17,412,713,and 342 SSRs had motif length in the ranges of 10-50 , 51-100 and 101-150 bp, respectively.
Identification of unigenes involved in terpene biosynthesis
Comparison to the unigenes with the KEGG pathway database, indicated, a total of 174 unigenes related to terpene biosynthesis (Table S1), including terpenoid backbone biosynthesis (61), monoterpenoid biosynthesis (37), diterpenoid biosynthesis (19), sesquiterpenoid and triterpenoid biosynthesis (8), biosynthesis of terpenoids and steroids (3), and ubiquinone and other terpenoid-quinone biosynthesis (43). Many unigenes were involved in the pathway of terpenoid backbone biosynthesis, which belongs to the universal metabolic pathway map00900. Among these unigenes, DN2456_c0_g1 (DXS, K01662) is the first enzyme in the MEP pathway. DXS catalyze D-glyceraldehyde 3-phosphate (GA-3P) and pyruvate to form 1-deoxy-D-xylulose-5-phosphate (DOXP). The DXS gene is significantly less expressed in the rhizome than in the leaf. DN26885_c0_g1 (HMGR, K00021) is the main enzyme in the MVA pathway. It catalyzes 3-hydroxy-3-ethylglutaryl CoA (HMG-CoA) to form MVA. The expression of HMGR gene shows no difference in the leaf and rhizome tissues of D. opposita.
qRT-PCR validations of DEGs related to terpene biosynthesis
To further analyze the consistency of RNA-seq data in this study, the DN39197_c0_g1 and DN37057_c0_g12 genes involved in terpene biosynthesis were selected for qRT-PCR validation. The results showed that a similar expression pattern in qRT-PCR analysis was observed in RNA-seq data from the leaf and rhizome tissues of D. opposita (Fig. 6). Further analysis showed that DXR (DN4169_c0_g1, K00099) and GGPS (DN17971_c0_g1, K13789) were the rate-limiting enzymes in the MVP and MEP pathways, which are the basic processes of a terpene synthesis pathway.