Three cDNA libraries generated from RNAs of the whole A. roxburghii plants were subjected to paired-end sequencing on an Illumina platform. A de novo assembly strategy was used to construct A. roxburghii transcripts. In total, 165,643,100 raw reads with nucleotides of 15.43 Gb were generated from the sequencing platform (Table 1). After stringent quality check and data cleaning, 137,679,059 quality reads were obtained. The number of clean reads is 83.12% of raw reads (Table 1). The average length of the clean reads is 95.21 bp (Table 1) with a high Q20 value of 99.24%. All clean reads were assembled into 86,382 contigs, corresponding to 68,938 unigenes using Trinity. 74.23 % of the clean reads could be matched to assembled unigenes, and 87.13% of these clean reads are unique and match to unigenes (Table 1). BUSCO tool [14, 15] was used to assess the quality and completeness of A. roxburghii plant transcriptome, the results showed that 1008 transcripts (70 %) of 1440 BUSCO genes were complete, among these complete genes, 978 were single copy and 30 were duplicated, 144 (10 %) were fragmented and 288 (20 %) were missing. Further analysis showed that the average length of the contigs and unigenes is 421 bp and 404 bp separately, and N50 lengths of contigs and unigenes are 464 bp and 434 bp, respectively (Table 1). All contigs and unigenes are longer than 200 bp, among those, 4012 contigs and 2822 unigenes are between 1000 and 2000 bp, and 245 contigs and 160 unigenes are longer than 2500 bp (supplementary Figure 1s). A. roxburghii transcriptome is 27874926 bp (around 27Mb) with a GC content of 46.81 %, which is higher than that of Arabidopsis (42.5 %), soybean (40.9 %) and chickpea (40.3 %) but lower than that of rice (55 %) [16].
Function annotations All unigenes were aligned to five public protein and gene databases, i.e. COG,EST,Nr,Pfam and Uniprot. Blastx was performed against these databases, and proteins and genes sharing highest similarities (E≤10-5) to query unigenes were adopted to annotate unigenes. All 68,938 unigenes were annotated in this manner, among which 10,040 (14.56%), 29,442 (42.71%), 39,551 (57.37%),34,991 (50.76%) and 28,082 (40.73%) unigenes were annotated via COG,EST,Nr,Pfam and Uniprot, respectively.
Based on species distribution, the highest proportion of matching sequences among the unigenes annotated via Nr is from Vitis vinifera (19.60%), and the number of matched unigenes is 7,753, followed by Oryza sativa Japonica Group (7.62%), Populus trichocarpa (6.84%), Ricinus communis (5.62%), Sorghum bicolor (5.00%), Glycine max (4.48%), Brachypodium distachyon (4.19%), Zea mays (3.99%), Oryza sativa Indica Group (3.55%), Hordeum vulgare subsp. vulgare (2.49%) and Medicago truncatula (2.25%). The number of matched unigenes to these species are 3,014, 2,707, 2,225, 1,979, 1,772, 1,660, 1,579, 1,406, 987, and 890, respectively. The lowest proportion of matching sequences is originated from Arabidopsis thaliana (1.67%) with matched sequences of 663.
GO function categories GO annotation classified all 68,938 unigenes to three categories, i.e. biological process, molecular function and cellular function. The biological process contains 22 categories, among these, 43,206 unigenes were assigned to metabolic processes, corresponding to the highest number of unigenes. Genes related to secondary metabolites biosynthesis, nucleic acid metabolic process, protein metabolic process and cellular metabolic process were annotated by GO. Cellular function group includes 15 categories,in which catalytic processes hold the highest number of 46,978 unigenes. Cellular components can be divided into 18 categories, where cell part has the most unigenes of 20,951 (Figure 2). These GO annotations provide valuable information to investigate specific processes, molecular functions and cellular structures of A. roxburghii transcriptome.
KEGG pathways All unigenes were assigned to 316 KEGG pathways. These containing the most unigenes are metabolic pathways (2,693 unigenes, 5.35%, KEGG pathway ID: map01100). Within the KEGG metabolite pathway classification, secondary metabolite (such as glycoside, flavonoids, terpenoids and terpenoid backbones) biosynthetic pathways are of special interest due to their bioactivities in A. roxburghii plant. Terpenoid backbone biosynthesis (KEGG ID: map00900), diterpenoid biosynthesis (map00904), and sesquiterpenoid and triterpenoid biosynthesis (map00909) contain 61, 20 and 1 unigenes respectively. Here we mainly focus on terpenoid backbone biosynthesis, including the putative triterpenes biosynthetic pathways shown in Figure 1A. All the putative triterpenes biosynthetic genes have E value ≤1e-74 (supplementary Table S1), when blasted with best matching sequences in NCBI, except ArOSC3 (comp324881_c2) matches best to Ananas comosus, all putative triterpenes biosynthetic genes showed best matches to orchid plant, such as Dendrobium officinale, D. huoshanense, D. catenatum and Phalaenopsis equestris (supplementary Table S1).
Gene- to-metabolites correlation analysis
Triterpenoid accumulation
Triterpene accumulation was analyzed using oleanolic acid as a general evaluation index. triterpenoid content was determined by test oleanolic acid using HPLC. Triterpene accumulation in root, stem and leaf showed significant difference (P<0.05). Triterpenoid accumulation is the highest in root (3.75 ± 0.03 mg/g, fresh weight, FW), lowest in the leaves (1.42 ±0.05 mg/g, FW). The content is medium in the stems (1.60±0.04mg/g, FW).
Relative expression level of triterpene biosynthetic genes in A. roxburghii
RT-qPCR was used to analyze relative expression of putative triterpene biosynthetic genes, including ArHMGR in MEV pathway, ArDXS, ArDXR and ArHDR in MEP pathway and ArFDS, ArSS and ArOCS of triterpene biosynthesis, respectively. All tested genes expression showed significant difference (P≤0.05) or extremely significant difference (P≤0.01) in root, stem and leaf.
Relative expressions of HMGR, DXS, DXR and HDR
HMGR is a rate-limiting enzyme in MVA pathway [9]. In A. roxburghii, two unigenes were annotated as ArHMGR1-2 (Supplementary 1, Supplementary Table S1-2). As shown in Figure 1B, ArHMGR1 showed highest expression in root, moderate expression in stem and scarcely expressed in leaf, while ArHMGR2 showed different expression pattern (Figure 1B). Variable correlation coefficients between ArHMGR1-2 expression and triterpene accumulation are 0.877 and 0.051 respectively (Table 2). The data suggest that, ArHMGR1 may play roles in triterpene biosynthesis as its relative expression level correlates with triterpene accumulation, while ArHMGR 2 may be irrelevant with triterpene biosynthesis.
High plants retain both MVA and MEP pathways [17]. They have similar functions in ginsenoside biosynthesis in ginseng root but MEP expressed a much higher level than MEV in ginseng Leaf [18]. To understand whether MEP pathway has any influence on triterpene biosynthesis in A. roxburghii, DXS, DXR and HDR in MEP pathway were analyzed. There are ten unigenes annotated as ArDXS1-10. Among them, ArDXS1, ArDXS4 ArDXS5 and ArDXS10 were highly expressed in root, but scarcely expressed in stem and Leaf. (Figure 1B). The relative expression levels of ArDXS 8-9 are very low in root, stem and leaf. However, root contains more ArDXS 8-9 than in stem, least in leaf. Variable correlation coefficients between the individual gene expression level of ArDXS1, ArDXS4 ArDXS5, ArDXS8-10 and triterpene accumulation, are 0.930, 0.987, 0.981, 0.950, 0.921 and 0.929, respectively (Table 2). These high correlation coefficients suggest a strong relationship of triterpene accumulation with all the six ArDXS genes, and these ArDXSs are candidate genes in MEP pathway for triterpene biosynthesis. The levels of the remaining four genes ArDXS2, ArDXS3, ArDXS6 and ArDXS7, however, only exhibit a moderate correlation with triterpene accumulation. The calculated Pearson correlation coefficients are 0.497, 0.596 0.715 and 0.732, respectively (Table 2)
DXR and HDR are also rate-limiting enzymes in MEP pathway in plant, but their roles vary with plants and conditions [10]. In this study, two unigenes were annotated as ArDXR1 and ArDXR2 and another two unigenes as ArHDR1 and ArHDR2 (Supplementary 1, Supplementary Table S1-2). ArDXR1-2 and ArHDR1-2 showed relative higher expression in root than in stem and leaf. Variable correlation coefficients between triterpene accumulation and the individual gene expression of ArDXR1-2 and ArHDR1-2 are 0.941, 0.944, 0.944 and 0.861 respectively, suggesting that ArDXR1-2 and ArHDR1-2 are positively correlated with triterpene accumulation. Therefore, we believe that ArDXR1-2 and ArHDR1-2 are candidate genes contributing in triterpene biosynthesis.
Relative expression of FDS, SM and OSC
Farnesyl diphosphate synthase (FDS) catalyzes the formation of FDP (Figure 1A), a central precursor for biosynthesis of sesquiterpenes, triterpenes and sterols. Plants contain a small gene family encoding FDS [11] and A. roxburghii has only one. ArFDS showed similar expression in tested tissues. Relative to the reference genes, the levels are ~1.7 in root, ~1.2 in stem and 1.5 (Figure 1B) in leaf, respectively.
Cyclization of 2,3-oxidosqulene is the committed step of triterpene biosynthesis to produce different triterpene skeletons by OSC [19]. In this study, one gene was annotated as ArSM (Supplementary 1, Supplementary Table S1-2), which showed highest expression in root, moderate expression in stem and lowest expression in leaf. The correlation coefficient between ArSM expression levels and triterpene contents was up to 0.955 (Table 2), suggesting that ArSM plays a role in the accumulation of triterpene and is a candidate gene catalyzing the formation of 2,3-oxidosqulene.
Three genes were annotated as ArOCS1-3 (Figure 1B, Supplementary 1, Supplementary Table S1-2). Although ArOCS1 showed higher relative expression in root, moderate in stem and lower in leaf, its expression level is significantly correlated with triterpene content. Therefore, we believe ArOCS1 is a candidate enzyme that catalyzes triterpene skeletons formation.