Quantitative analysis of A. elata aralosides
The saponins present within A. elata primarily contain oleanolic acid and hederagenin aglycone [19], with total and monomer araloside accumulation varying widely between different tissues in these plants. Specifically, the leaves of A. elata have been found to contain the largest quantity of these saponins, with progressively lower levels found in root and stem tissues. The roots of these plants contained higher oleanolic acid levels than did the other tested tissues, whereas hederagenin levels were highest in leaves relative to samples roots and stems (Table 1; Additional file 1: Figure S1). Two selected oleanane-type saponins (chikusetsusaponin IV and araloside X) and one hederagenin-type saponin (araloside VII) were also detected in these A. elata samples, with root chikusetsusaponin IV levels being fairly high whereas they were minimal in leaf and stem tissues. In contrast, we detect aralosides VII, and X showed a high level in the leaves of these plants, suggesting that different glycosyltransferases were responsible for their generation. These tissue-specific saponin distribution results offer significant value as a reference source when identifying those CYP450s and other proteins involved in araloside production in A. elata.
De novo A. elata sequence assembly
In order to identify genes pertaining to saponin biosynthesis in these A. elata plants, we next employed an Illumina HiSeq 4000 platform to sequence the total RNA transcriptome in root, leaf, and stem tissue samples. In total this approach yielded 448,112,618 reads that were assembled into 82,238 contigs, with the longest being 16,016 bp, and with an average contig length of 1,058 bp. We were then able to assemble these contigs into 66,713 unigenes with a 1,846 bp N50 length (Table 2). Next, these unigenes were annotated with the KEGG, UniProt, NCBI nonredundant nucleotide (Nt), and Nr databases via use of the BLASTN and BLASTX algorithms, leading to the annotation of 35,232 (52.81%) unigenes (Additional file 2: Figure S2). These transcriptome sequence data have been deposited in the NCBI Short Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra/) under the accession number PRJNA555256.
Enrichment of terpenoid backbone and triterpenoid biosynthetic pathways
Following the annotation of 7,291 A. elata unigenes with the KEGG database, we were able to assign unigenes to the terpenoid backbone and sesquiterpenoid/triterpenoid biosynthesis pathways, which contained the upstream MVA and MEP pathways and the 2,3-oxidosqualene biosynthesis pathway (Fig. 1A). In total, we mapped 79 unigenes to the terpenoid backbone biosynthesis pathway, while 47 were mapped to the sesquiterpenoid and triterpenoid biosynthesis pathways. As shown in Fig 1A, 6 putative genes (AACT, HMGS, HMGR, MVK, PMVK, and MVD) and 8 putative genes (DXS, DXR, MEP-CT, COP-MEK, MECDPS, HMBPPS, HMBPPR, and IDI) were associated with the MVA and MEP pathways, respectively, while 5 putative genes (GPS, FPS, SS, SE, and bAS) were found to be associated with carbocyclic biosynthesis (Additional file 3: Table S1). For the majority of these unigenes, we were able to map >1 unigene to a given gene or gene family, suggesting that these sequences may correspond to different fragments and/or isoforms of a given gene [34]. Those unigenes that were expressed at the highest levels in each of these enzymatic steps were next selected and arranged into a heat map showing the differentially expressed genes associated with triterpene saponin biosynthesis (Fig. 1A). The expression of these genes differed substantially in a tissue-dependent manner, as confirmed via qRT-PCR, with those genes involved in the MEP synthesis pathway being highly expressed in leaf samples (Fig. 1B), and those involved in the MVP synthesis pathway being expressed at higher levels in different tissues (Fig. 1B).
A. elata CYP450 identification and classification
Through our transcriptome analysis, we were able to identify 111 full-length and 143 partial CYP450 genes in these A. elata samples. This number of total CYP450 unigenes (254) was lower than the number identified in a study of Panax. ginseng (484). We next aligned the 111 full-length CYP450s with the CYP450 database, using allelic, subfamily, and family variant cutoff values of 97%, 55%, and 40%, respectively [35]. Based on these sequence similarity findings, we were able to classify these CYP450s into 7 clades, 36 families and 64 subfamilies, with 53% being A-type CYP450s and 47% being non-A-type CYP450s (Additional file 4: Table S2). The CYP71 clan was the most highly represented in these samples, containing 59 genes belonging to 16 families (CYP71, CYP73, CYP75-CYP78, CYP80-CYP82, CYP84, CYP89, CYP92, CYP98, CYP701, CYP706, and CYP736). The next largest clan was CYP85, which contained 18 genes belonging to 8 families.
We next analyzed the primary characteristics of each of these CYP450 genes, including their predicted coding sequence (CDS) length, protein sequence length, protein molecular weight (MW), isoelectric point (pI), and subcellular localization (Additional file 4: Table S2). These 111 CYP450 proteins were predicted to be 410 - 620 amino acids long (average: 508), with MWs ranging from 46.8 - 69.5 kDa, and with pI values ranging from 6.01 - 9.55. We additionally calculated the instability index (II) of these proteins, revealing 59 of them to be unstable (stability factor >40), while 52 proteins were stable (stability factor < 40). The GRAVY values were negative for these proteins, indicating them to be hydrophilic. Predictive analyses of the subcellular localization of these CYP450s suggested that they were all localized exclusively to the ER in A. elata. Overall, the results of these predictive analyses were in line with prior studies of CYP450s and their subcellular localization.
A.elata CYP450 pathway enrichment analyses
To further understand the functional importance of the identified CYP450s in A. elata, we next conducted a KEGG pathway enrichment analysis of these 111 genes. Individual CYP450s were assigned to multiple KEGG pathways, with 34 (30.63%) being assigned to 18 metabolism pathways and six classes, including global and overview, biosynthesis of other secondary metabolites, metabolism of terpenoids and polyketides, lipid metabolism, amino acid metabolism, and metabolism of cofactors and vitamins (Additional file 5: Figure S3).The pathways most enriched for these CYP450s were the biosynthesis of secondary metabolites and metabolic pathways, which contained 21 and 19 CYP450s, respectively. In total, 14 CYP71 clan members were represented in 14 distinct pathways, suggesting that members of the CYP71 clan exhibit highly diverse biological activities. We further found that 6 CYP450s, all of which were in the CYP86 clan, were involved in cutin, suberine, and wax biosynthesis. Additionally, the CYP71D603 and CYP71D610 proteins from the CYP71 clan were involved in sesquiterpenoid and triterpenoid biosynthesis (Additional file 5: Figure S3).
CYP450 family member phylogenetic and conserved motif analyses
To further explore the functional roles and evolutionary relationships for these A. elata CYP450s, we next constructed a phylogenetic tree incorporating these 111 CYP540s along with 59 CYP450s from other plants, including A. thaliana (37), P. ginseng (20), Daucus carota (2) and Polygala tenuifolia (1). The resultant tree (Fig. 2) divided these CYP450s into 7 clades, including three single-family clans (CYP51, CYP97 and CYP711) and four multifamily clans (CYP85, CYP86, CYP72 and CYP71). Genes within the same clan clustered in a single clade, with, for example, the 59 CYP71 members that were assigned to 16 families forming a single clade along with 28 representative CYP450s, while the CYP85 clan, which contained 18 CYP450s in 3 families, clustered with 11 representative CYP450s. This analysis additionally identified two clans, CYP711 and CYP51, containing only a single member each. These 7 clades were further grouped into the A-type and non-A type clusters. All A-type CYP450s were grouped into the CYP71 clan, with all other clans being of the non-A-type.
We further sought to identify conserved motifs within these CYP450s using the MEME web server. This analysis identified 10 different conserved motifs, with motif 1, motif 2, motif 3, and motif 4 being evident within all A. elata CYP450s. The most similar CYP450s typically had the most similar motif composition (Additional file 6: Figure S4). For example, the adjacent CYP72 and CYP97 clans exhibited similar motif compositions. Members of the CYP71 clan contained 10 conserved motifs, with motifs 5 and 10 being unique to the members of this clan, suggesting that these motifs may have functional roles that are specific to these proteins. Interestingly, we found that motif 6 and motif 9 were present in both the CYP71 and CYP72 clans, potentially highlighting the functional divergence of CYP450 genes. Together the results of these two analyses served to reaffirm the accuracy of the classification of these A. elata family proteins into defined clans and families.
CYP450 multiple sequence alignment and secondary structural element assignment
The prototypical CYP450 heme-binding domain, K-helix region, PERF motif, and I-helix region were detected in all of these A. elata CYP450s, with the residues in some of these regions being identical across these CYP450s (Fig. 3). Overall, the majority of conserved residues were located in these found different conserved motif regions. For example, C residues in the heme-binding region were highly conserved, as were R residues in the PXRX motif and E and R residues in the EXXR motif. We further used ESPript to conduct homology alignments and secondary structure predictions for these CYP450s, revealing that overall the sequence homology of these proteins was fairly low (< 30%), whereas their secondary structures were very similar (Fig. 3). This analysis revealed that α-helices were the primary elements composing these proteins, with random coils, and β-sheets and turns being more dispersed throughout the protein. We additionally mapped Gotoh’s SRSs and putative substrate binding sites onto these CYP450s through alignment with the P450cam sequence [36], leading to the identification of 6 SRSs in these A. elata CYP450s, with the majority of these being distributed in different structural elements and associated with particular substrates, such as helices 4 and 6, and β-sheets 11 and 12.
A. elata CYP450 gene expression patterns
In order to understand the tissue-specific expression of CYP450s in A. elata, we next conducted an analysis of their expression in the three different tissues that had been subjected to RNA-seq analysis. Of these 111 CYP450s, we found that 97 (87.39%) exhibited patterns of differential expression across these tissues. Leaves expressed a high proportion (41.44%) of highly expressed CYP450s, whereas stem samples contained the smallest number of these genes (20.72%). A hierarchical clustering analysis was used to assess CYP450 coexpression patterns in these different analyses, with an expression profile heat map for these genes being constructed according to their RPKM normalized expression values. This clustering analysis led to the assignment of 111 CYP450s to six clustered (C1-C6; Fig. 4). The CYP450s that were most highly expressed in roots (17 genes), leaves (38 genes), and stems (7 genes) were grouped into clusters C1, C4, and C6, respectively. In addition, those CYP450s in clusters C2 (29 genes), C3 (3 genes), and C5 (17 genes) were expressed at the lowest levels in leaf, stem, and root tissues, respectively. A qRT-PCR approach was further used to validate these transcriptomic findings, with the expression of 12 randomly selected CYP450s representative of 7 clans being quantified (Fig. 5A, B).
Previous research has shown that members of the CYP72A and CYP716A subfamilies are the primary CYP450s involved in pentacyclic triterpenoid saponin biosynthesis. As such, we next specifically focused on the co-expression patterns of the 3 CYP716A and 6 CYP72A genes identified in the A. elata transcriptome with β-amyrin synthase (bAS), which is a oxidosqualene cyclase that catelyzes 2,3-oxidosqualene to form the β-amyrin skeleton (Fig. 6). The qRT-PCR profiles for these genes revealed that two CYP716A genes (CYP716A295 and CYP716A296) and two CYP72A genes (CYP72A762 and CYP72A764) exhibited a similar expression patterns to that of bAS, being expressed at high levels in leaf tissues relative to stems and roots. In contrast, CYP72A759 and CYP72A763 being expressed at higher levels in leaves and at lower levels in stems and roots, whereas CYP716A306, CYP72A760, and CYP72A761 were expressed at the highest levels in stems.
Identification of candidate CYP450s involved in triterpenoid biosynthesis
As previously described, hederagenin aglycone and oleanolic acid were the major sapogenins in A. elata, so we specially focused on CYP450s related to the biosynthesis of these two sapogenins. Up to date, a total of 36 CYP450s have been found to play roles in triterpenoid biosynthesis (Fig. 7A; Additional file 7: Table S3). The CYP716A and CYP72A subfamilies are the primary CYP450 gene families involved in pentacyclic triterpenoid saponin diversification, with the CYP716A family being the largest multifunctional C28-oxidase family involved in such oleanane-type triterpenoid saponins biosynthesis [32, 37, 38] (Fig. 7A). We were able to identify 3 CYP716A genes (CYP716A295, CYP716A296, and CYP716A306) and 6 CYP72A genes (CYP72A759-764) in the A. elata transcriptome. In order to identify the most relevant unigenes involved in pentacyclic triterpenoid saponin biosynthesis for further analysis, we conducted BLASTx searches that compared these A. elata CYP450s to those 36 CYP450s known to be involved in triterpenoid biosynthesis. This analysis revealed that the A. elata CYP716A295 and CYP716A296 exhibited 93.97% and 94.39% sequence identity with P. ginseng CYP716A52v2 respectively, which is a β-amyrin 28-oxidase enzyme involved in oleanolic acid production [39] (Fig. 7B). Moreover, CYP716A295 and CYP716A296 were highly expressed in leaves relative to stems and roots, cosistent with bAS expression pattern and oleanolic acid contents (Fig. 6, Table 1). As such, we selected CYP716A295 and CYP716A296 as the best candidate CYP450s likely to be involved in oleanolic acid biosynthesis in A. elata. Two CYP450s (CYP72A397 and CYP72A68v2) have thus far been identified as oleanolic acid 23-oxidases, catalyzing oleanolic acid oxidation into hederagenin [40, 41] (Fig. 7B). In the present study, we observed higher expression of CYP72A759 and CYP72A763 in leaf tissues, with progressively lower levels in stems and roots, consistent with the observed hederagenin content distribution. A BLASTx analysis further indicated that CYP72A759 encoded a protein with <50% identity to CYP72A397 and CYP72A68v2, and that CYP72A763 shared 52.49% identity with CYP72A397. Given these results, we further selected CYP72A763 as the CYP450 most likely to be involved in hederagenin biosynthesis, although further functional validation will be necessary.
Phylogenetic analyses revealed CYP716A295 and CYP716A296 to be grouped in the CYP716A subfamily and to be most closely related to P. grandiflorus CYP716A140v2 and P. ginseng CYP716A52v2, respectively, both of which encode β-amyrin 28-oxidase enzymes involved in oleanolic acid production [33, 39]. CYP72A763 clustered in the CYP72A group and was most closely related to M. truncatula CYP72A61v2, which can transform 24-OH-β-amyrin into soyasapogenol B [34] (Fig. 7).
Assessment of the subcellular localization of three CYP450:GFP fusion proteins
Almost all CYP450s are membrane-associated proteins that localize to the ER, with relatively few localizing to chloroplasts and mitochondria [12]. As detailed above, all 111 of the A. elata CYP450s identified in this study were predicted to localize to the ER. To confirm this prediction, we therefore conducted the PEG-mediated transient expression of Arabidopsis protoplasts co-transformed with CYP450-GFP reporter proteins and GHD7-RFP (a nuclear marker), with CYP716A295, CYP716A296, and CYP72A763 all being selected for this subcellular localization analysis. We observed no overlap between the CYP716A295, CYP716A296, and CYP72A763 GFP fluorescent proteins and GHD7-RFP fluorescence and , with these CYP450-GFPe proteins instead appearing as a reticular ribbon upon microscopic examination, consistent with their likely localization to the ER (Fig. 8).