According to the results from the comparison of transcript length distribution between Illumina and PacBio Sequel platforms, it was indicated that the transcripts assembled from the Illumina short reads by Trinity software [21] could not accurately represent full-length cDNAs in S. rebaudiana. The average length of assembled unigenes from the Illumina platform (mean 905.8 bp) was notably shorter than those from the PacBio Sequel platform (mean 1949.2 bp). In addition, approximately 69.4% of the assembled unigenes from NGS reads were < 1000 bp, whereas only 13.4% of the unigenes from PACBIO reads were < 1000 bp (Fig. 3). Nevertheless, from this study, it seemed that the SMRT reads further corrected by the NGS data could be considerably better than simply relying on IsoSeq protocols. Since Iso-Seq full-length transcripts are generated directly from sequencing without assembly, these data may be used as an ideal long-read reference transcriptome of S. rebaudiana.
Phylogenetic analysis of the UDP-glycosyltransferase multigene family
UDP-glycosyltransferases (UGTs) are defined by the presence of a C-terminal consensus sequence containing 44 amino acids and are responsible for transferring a glycosyl moiety from an activated donor to an acceptor molecule in all living organisms [22, 23]. In Arabidopsis thaliana, a molecular phylogenetic tree was constructed consisting of ninety-nine UGT sequences and a composite phylogenetic tree that also includes all of the additional plant UGTs with known catalytic activities [22]. This work has significantly promoted the prediction of the evolutionary history, substrate specificities and structure-function relationships of UGTs in Arabidopsis. Nevertheless, although many studies have been performed on the glycosyltransferases of S. rebaudiana, there are still no reports on its phylogenetic tree. Therefore, a comprehensive neighbour-joining tree with ninety-eight complete SrUGTs was constructed (Fig. 4). The alignment included twenty-six SrUGTs functionally characterized in this study. After bootstrap analysis with 1000 replicates, the SrUGTs were strongly divided into fourteen major groups, with each having a support greater than 95% in distance analysis excluding group I (66% bootstrap). The fourteen well-defined major groups of SrUGTs suggest that at one time, there were fourteen ancestral genes. One sequence (SrUGT78D2) with a long unique terminal branch, suggesting accelerated evolutionary rates, tends to distort phylogenetic analyses by reducing apparent bootstrap support for nearby clades. Therefore, the data were reanalysed without this sequence. This analysis provided stronger statistical confidence (bootstrap from 64–88%) to two of the ancestral genes, corresponding to groups M and N, which are likely to share a more recent common origin. Interestingly, a similar AtUGT78D1 gene has been found in Arabidopsis [22], indicating that the UGT78D genes in the glycosyltransferase family may have evolved more rapidly.
In an attempt to predict the structure-function relatedness of the SrUGT family, numerous UGTs identified from a wide range of plant species and having different biochemical functions were aligned with the ninety-eight SrUGTs and constructed a composite phylogenetic tree. Among these additional plant UGTs, seven of the corresponding UGTs were successfully clustered within the fourteen groups identified by this study (Fig. 5). Interestingly, PdUGT94AF1 and PdUGT94AF2 derived from Prunus dulcis involved in the formation 1,6-β-D-glucosidic linkage of Prunasin [24] were clustered in group A but had a long genetic distance between them and SrUGTs of this group, implying that the SrUGTs in this group may have no related specificity. Due to the lack of other glycosyltransferases capable of forming 1,6-glucosidic bonds, this tree cannot predict the glycosyltransferases involved in the synthesis of steviol glycosides containing 1,6-glucosidic bonds (such as Reb L). FeUGT79A8 and PhUGT79A1 identified from Fagopyrum esculentum and Petunia x hybrida could utilize UDP-rhamnose as sugar donors [25]; both UGTs belong to group A and are closely related to the SrUGT79 subfamily (100% bootstrap), demonstrating that the SrUGT79 subfamily may be the glycosyltransferases of UDP-rhamnose specificity in S. rebaudiana. In addition, EUGT11 from Oryza sativa and SrUGT91D2 are known to catalyse the 1,2-β-D-glucosidic linkage of steviol glycosides [7, 26]; therefore, it is reasonable to believe that the SrUGT91 subfamily is responsible for the formation of the 1,2-β-D-glucosidic linkage in stevia. UBGAT from Scutellaria baicalensis is one of the few glycosyltransferases identified in plants that could use UDP-GlcUA as the sugar donor to catalyse a glucuronosylation reaction [27]. In the composite tree, UBGAT was clustered in group C and had a closer relationship with the SrUGT88 subfamily; moreover, the similarity of the PSPG box between the UBGAT and SrUGT88 subfamily was more than 65%; therefore, we speculate that the SrUGT88 subfamily should be the UDP-glucuronic acid-recognizing glycosyltransferases in stevia. AtUGT78D1 identified from Arabidopsis thaliana could utilize UDP-xylose as a sugar donor [28]. In this tree, AtUGT78D1 and SrUGT78D2 are clustered in group L and have high bootstrap support (100%). Therefore, it is speculated that SrUGT78D2 should be a glycosyltransferase in stevia, which may recognize UDP-xylose and participate in the xylosylation of glycosides, such as RF. In 2005, Richman et al. (2005) identified and characterized three UGTs (SrUGT85C2, SrUGT74G1 and SrUGT76G1) involved in the synthesis of steviol glycosides: SrUGT85C2 and SrUGT74G1 glucosylate the C13-hydroxyl and C19-carboxylic acid functional groups of the steviol backbone, forming a β-D-glucoside, respectively, while SrUGT76G1 is capable of catalysing 1,3-β-D-glucosylation at both the C13 and C19 positions of steviol. Most SrUGTs in group N belong to the SrUGT85 subfamily and have a high support value, especially for SrUGT85C3 and SrUGT85C4; therefore, we hypothesized that the SrUT85C subfamily may be responsible for glucosylating the C13-hydroxyl position of the steviol backbone. Similar to the SrUGT85 subfamily, in group J, the SrUGT74G subfamily could theoretically be the glucosylated C19-carboxylic acid functional group of the steviol backbone. Furthermore, the SrUGT76G and SrUGT76I subfamilies not only have a high degree of support (100% bootstrap) but also have a close genetic distance, indicating that these two subfamilies may be responsible for the formation of the 1,3-β-D-glucosidic linkage.
WGCNA co-expression network analysis for the investigation of steviol glycoside biosynthesis
To date, many steps involved in the biosynthesis pathway of steviol glycosides have been successfully uncovered, especially for elucidating four UDP-dependent glucosyltransferases (UGTs) [6, 29], but several glucosylation steps of some glycosides that have not been resolved to date. In addition, for the large family of glucosyltransferases, we speculate that multiple enzymes with similar functions may participate in the same catalytic step in the glucosylation of steviol glycosides. Typically, the traditional method for differential expression analysis is constrained to paired sample analysis and thus unable to perform systematic analysis with large datasets from heterogeneous sources simultaneously [30, 31]. Therefore, in this study, one co-expression network approach named WGCNA, which was proved to be a powerful tool in systematically describing the correlation relationship between clusters of highly correlated genes or modules and external conditions or sample traits [31, 32], was used to analyse the potential UGTs involved in the glucosylation of steviol glycoside. First, we performed qRT-PCR analysis of the expression levels of nine UGTs (SrUGT71H1, SrUGT85B1-2, SrUGT91D2, SrUGT76G1-1, SrUGT91D3, SrUGT85C3, SrUGT79A2, SrUGT73G2 and SrUGT71I1) in the leaves of six genotypes to confirm the reliability of the transcriptome data, and the primers used for qRT-PCR are shown in Table S2. The results showed that the tendency of these genes to be expressed was similar between the qRT-PCR and the transcriptomic data, confirming that the transcriptomic results were reliable (Fig. 6). To avoid transcript loss, first, we compared the assembled transcripts (total number 71718) from NGS with those (total number 30859) sequenced from PACBIO using BLAST + software (http://blast.ncbi.nlm.nih.gov), and the results showed that more than ninety-nine percent of PACBIO isoforms were included in the assembled transcripts of NGS (Fig. 7), indicating that the unigenes assembled by NGS contain all transcripts in S. rebaudiana. Consequently, all 71718 transcripts assembled from NGS as input raw data for WGCNA co-expression network analysis. First, genes with low fluctuation expression (standard deviation ≤ 1) were filtered, and 14995 genes remained. When the power value of adjacency functions for weighted networks was 9, both the correlation coefficient and degree of gene connectivity could satisfy the requirement of scale-free network distribution to the greatest extent possible. Based on the selected power value, a weighted co-expression network model was established, and 14995 genes were eventually divided into fifteen modules, of which the grey module had no reference significance because of the failure to assign to any module. The hierarchical clustering dendrogram of gene networks is visualized in Fig. 8A.
To identify modules that are significantly associated with the traits of steviol glycoside content, fifteen generated modules were correlated with the traits. The modules related to each trait were screened according to the absolute value of correlation coefficient ≥ 0.3 and p-value < 0.05. The colour-coded table in Fig. 8B shows the full module-trait relationships. For each trait-related module, the correlation between the gene expression profile and the corresponding traits (Gene Significance, GS) and the correlation between the gene expression profile and the module eigengenes were calculated. The results showed that the genes in the module are both highly correlated with the traits and the eigengenes. For example, the module eigengenes of turquoise (r = 0.71, correlation p-value = 0.00092) and dark-orange (r=-0.78, correlation p-value = 0.00014) were significantly positively correlated or negatively correlated with RA, respectively. As a result, fourteen gene modules that are highly associated with steviol glycosides were identified. Among these genes in the fourteen modules, two genes belong to the acetylglucosaminyltransferase, fifty-five genes were annotated to be members of the plant UGT superfamily, including three SrUGT85C2, one SrUGT74G1, one SrUGT76G1, one SrUGT85A8 and one SrUGT91D2, which have already been reported to be involved in the glucosylation of steviol glycosides except for SrUGT85A8 [6, 7], illustrating the reliability of our results. Furthermore, the expression levels of these genes in the leaves of the six genotypes are shown in Fig. 9.
In S. rebaudiana, steviol glycosides have been derived from the tetracyclic diterpene steviol backbone [6], and the precursors of steviol are actually synthesized via a series of enzymes consisting of DXS, DXR, CMS, CMK, MCS, HDS, HDR, GGDPS, CPPS, KS, KO and KAH [8, 33, 34]. Accordingly, the genes encoding enzymes involved in steviol biosynthesis might be expected to exhibit a similar co-expression pattern with the SrUGTs involved in the synthesis of steviol glycosides. Therefore, we further performed a similar co-expression analysis between the fifty-five SrUGTs and the genes involved in steviol biosynthesis. Notably, forty-four SrUGTs, including SrUGT85C2, SrUGT74G1, SrUGT76G1 and SrUGT91D2, were then identified as being co-expressed with at least one of the upstream genes (Fig. 9), and it is reasonable to believe that these SrUGTs may be directly involved in the synthesis of the corresponding steviol glycosides, which warrants further research.