De novo transcriptome assembly of P. mirifica
To obtain nucleotide sequences of expressed genes in various tissues of P. mirifica, we constructed a cDNA library from pooled tissues including, young leaves, mature leaves, cortex-excised tubers, and tuberous cortices (Fig. 2a-2d). The library was processed using an Illumina Hiseq 2000 platform, yielding approximately 8.2 giga base pairs and a total of 7,386,137,640 clean nucleotides (nt). The de novo assembly resulted in 166,923 contigs (62,567,517 nt) and 104,283 unigenes (81,810,584 nt), the mean length/N50 for contigs and unigenes was 375/734 nt and 785/1558 nt, respectively. To assess the completeness of transcriptome data, we performed a BUSCO analysis compared to the 2,121 single-copy orthologs of the eudicot lineage. The de novo transcriptome assembly was complete to 87.4%, 6.9% of contigs were fragmented, and 5.7% of the transcriptome was missing (Supplementary Table S1). In addition, the assembly produced here was compared to that of other leguminous plants (summarized in Table 1).
Table 1 Comparison of P. mirifica de novo assembly to other leguminous plants.
Description
|
P. mirifica
|
P. lobata
[18]
|
Ammopiptanthus mongolicus [19]
|
Millettia pinnata
[20]
|
Medicago sativa [21]
|
Total Clean Reads
|
82068196
|
73360286
|
67287120
|
80212402
|
28790610
|
Total Clean Nucleotides (nt)
|
7386137640
|
6973955470
|
6055840800
|
7219116180
|
5642959560
|
Contig
|
|
|
|
|
|
Total Number
|
166923
|
335582
|
148797
|
108731
|
81277
|
Total Length (nt)
|
62567517
|
390188024
|
51308749
|
39658128
|
70966536
|
Mean Length (nt)
|
375
|
1162
|
345
|
365
|
873
|
N50 (nt)
|
734
|
1988
|
619
|
682
|
1323
|
Unigene
|
|
|
|
|
|
Total Number
|
104283
|
163625
|
84583
|
53586
|
40433
|
Total Length (nt)
|
81810584
|
111776902
|
57108594
|
42186440
|
32482946
|
Mean Length (nt)
|
785
|
683
|
675
|
787
|
803
|
N50 (nt)
|
1558
|
1153
|
1191
|
1204
|
1300
|
Functional annotation and classification of protein
Functional annotation of the assembled transcripts provides insights in potential metabolic functions and biological processes within an organism. The functional annotation of P. mirifica assembled transcripts was performed based on similarities with proteins or transcripts according to information that is available in various public databases. The statistics of functional annotation are summarized in Supplementary Table S2. Aligned unigenes showed significant homologies using the National Center for Biotechnology Information (NCBI) non-redundant protein database. Based on the BLAST similarity distribution, over 68% of unigenes exhibited similarities greater than 80%. A top-hit species distribution analysis showed unigenes with BLAST hits sharing high sequence similarities with Glycine max (85.3%), Medicago truncatula (6.2%), and Vitis vinifera (1.6%).
The clusters of orthologous groups (COG) indicated the functional classification of each unigene at the cellular level. In total, 1,091 unigenes were predicted to be involved in secondary metabolite biosynthesis (Supplementary Fig. S2). Gene ontology (GO), which classifies standardized gene function, is useful for annotating gene functions and gene products in various organisms. GO is based on three major dependent factor categories: biological processes, molecular functions, and cellular components. The 37,058 unigenes yielded a corresponding GO term that can be further classified into 56 sub-categories: 23 categories related to biological processes, 17 to cellular components, and 16 to molecular functions. Supplementary Figure S3 shows the substantial number of transcripts related to cellular components and metabolic processes. The remaining 67,225 unigenes produced no BLAST hits. These unannotated unigenes may be uncharacterized genes or assembled sequences that were too short to produce hits.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a comprehensive database for identifying biological pathways and for functional annotation of gene products. Pathway-based annotation helps produce an overview of the different metabolic processes that are active within an organism, and it helps improve our understanding of biological functions of unigenes. All unigenes were analyzed using the KEGG pathway with an e-value cutoff of < 10−5. We obtained 33,317 unigenes, 3,997 of which were related to biosynthesis of secondary metabolites and 8,103 to general metabolic pathways. The top-five ranking pathways were plant hormone signal transduction (2,061 unigenes), endocytosis (1,254 unigenes), RNA transportation (1,232 unigenes), glycerophospholipid metabolism (1,131 unigenes), and purine metabolism (1,113 unigenes). Regarding the crucial capacity of leguminous plants to accumulate functional flavonoids, 476 flavonoid biosynthetic unigenes and 167 isoflavonoid biosynthetic unigenes are shown in Supplementary Figure S4. These functional annotations were used for identifying genes involved in isoflavonoid biosynthesis in P. mirifica.
Proposed miroestrol biosynthetic pathway, differential accumulation of transcripts associated with isoflavonoids, and miroestrol biosynthesis
Miroestrol biosynthesis potentially shares a pathway with isoflavonoid biosynthesis [15]. We propose that daidzein, a common isoflavone aglycone, is hydroxylated by at least two cytochrome P450 enzymes at the 2′ and 3′ carbon of the B-ring to produce 2′,5′-dihydroxydaidzein; these enzyme may be members of the CYP81E subfamily which is known to use isoflavones as substrates [22-25]. Then, 2′,5′-dihydroxydaidzein would be reduced by isoflavone reductase and subsequently would be prenylated by a prenyltransferase using dimethylallyl diphosphate as a co-substrate to produce deoxymiroestrol and miroestrol (Fig. 1).
Based on a functional annotation of each unigene found in the P. mirifica de novo transcriptome assembly and phylogenetic analyses, a total of 14 putative genes involved in isoflavone biosynthesis were predicted as follows: two PAL genes (encoding phenylalanine ammonia lyase), one C4H gene (encoding cinnamate 4-hydroxylase), three 4CL genes (encoding 4-coumarate-CoA ligase), two CHS genes (encoding chalcone synthase), one CHR gene (encoding chalcone reductase), two CHI genes (encoding chalcone isomerase), one IFS gene (encoding IFS), and two HID genes (encoding 2-hydroxyisoflavanone dehydratase). In addition, a total of seven putative genes were identified in our proposed miroestrol biosynthetic pathway: three CYP81E genes, two IFR genes (encoding isoflavone reductase), and two PT genes (encoding prenyltransferase). To investigate gene expression levels of these unigenes across four tissues of P. mirifica, reads per kilobase million of transcripts were calculated from the RNA sequencing data generated using a NextSeq500 platform. An analysis of differentially expressed genes (DEG) on these unigenes (probability threshold q = 0.9) was visualized as a heat map based on our proposed miroestrol biosynthetic pathway (Fig. 1). Additionally, a total of 16 putative genes encoding UDP-glycosyltransferase that are potentially involved in this pathway were phylogenetically identified from the transcriptome data. Their DEGs across the four tissue types are shown as a heatmap in Supplementary Fig. S5.
To validate the DEG analysis obtained from RNA sequencing, candidate unigenes were selected and analyzed based on their expression levels in the four tissue types using quantitative real-time reverse transcription polymerase chain reaction (qRT-PCR). As expected, expression levels of the selected unigenes were positively correlated with DEGs identified by RNA sequencing (Fig. 3). Indeed, most of the isoflavone biosynthetic genes were highly expressed in tuberous cortices, compared to young leaves, mature leaves, and cortex-excised tubers; however, PmCYP81Es, PmIFRs, and PmPT higher expressed in mature leaves than in the other tissues.
Phytoestrogens and annotated constituents in P. mirifica
High-performance liquid chromatography (HPLC) coupled to tandem mass spectroscopy (MS) is a powerful tool with high selectivity and sensitivity. Seven phytoestrogens were identified in P. mirifica by comparison with standard compounds based on their retention times and MS fragmentation patterns. These phytoestrogens included daidzein (RT = 16.33 min, m/z = 254.05791), daidzin (RT = 10.16 min, m/z = 416.11073), genistein (RT = 18.78 min, m/z = 270.05282), genistin (RT = 13.70 min, m/z = 432.10565), 2′-hydroxydaidzein (RT = 14.74 min, m/z = 270.05282), 3′-hydroxydaidzein (RT = 15.18 min, m/z = 270.05282), and puerarin (RT = 10.40 min, m/z = 416.11073). Due to the lack of standard compounds, mirificin, 2′,5′-dihydroxydaidzein, deoxymiroestrol, and miroestrol were predicted on the basis of mass accuracy ranges and MS fragmentation patterns searches conducted on the METLIN metabolomics database [26] or/and comparisons with published data [27]. The relative abundance of major phytoestrogens found in young leaves, mature leaves, cortex-excised tubers, and tuberous cortices of P. mirifica are shown as a heat map in Figure 2e. Genistein, genistin, and annotated 2′, 5′-dihydroxydaidzein were highly accumulated in mature leaves, whereas daidzein, puerarin, annotated mirificin, and annotated deoxymiroestrol were highly accumulated in tuberous cortices, compared to the other tissues. Although miroestrol was not detected in any tissue, its precursor (annotated deoxymiroestrol; RT = 15.03 min, m/z = 342.14672) was observed only in P. mirifica tuberous cortex, which is the main accumulation site of miroestrol [10].
Isoflavone production in N. benthamiana leaves over-expressing P. mirifica isoflavone synthase
Recently, N. benthamiana has been used to identify transient expression of several plant genes to confirm gene functions [28]. To demonstrate that our transcriptome libraries contained candidate genes involved in P. mirifica phytoestrogen biosynthesis, we cloned P. mirifica isoflavone synthase (PmIFS) and conducted a functional characterization using transient (co-)expression in N. benthamiana, which does not produce any isoflavones. Transient expression of green fluorescent protein in five-week-old N. benthamiana leaves was used as a negative control. Co-expression of PmIFS and Arabidopsis R2R3 MYB12 transcription factor (AtMYB12), a regulator enhancing metabolic flux to flavonoid biosynthesis [29], generated two novel major peaks that were identified as genistein (Supplementary Figure S6), suggesting that PmIFS is involved in isoflavone biosynthesis.
Identification of MYB transcription factors involved in isoflavone biosynthesis
We identified 85 putative genes encoding MYB transcription factors in the P. mirifica transcriptome. All putative P. mirifica MYB transcription factors (PmMYB) were aligned to known MYB transcription factors of other plants such as Arabidopsis thaliana, Glycine max, and other species, and a phylogenetic tree was produced from this alignment (Fig. 4). Based on these analyses, we found six candidate PmMYBs that are potentially involved in the regulation of isoflavonoid biosynthesis. PmMYB18 was closely related to GmMYB29, which activates expression of IFS and CHS in soybean [30]. PmMYB23 and PmMYB24 clustered with GmMYB12B2, which regulates expression of CHS in soybean [31, 32]. PmMYB75, PmMYB76, and PmMYB77 clustered with GmMYB176, which is also involved in controlling CHS expression and isoflavonoid synthesis in soybean [33, 34]. Furthermore, also PmMYB24 and PmMYB77 were highly expressed in mature leaves, whereas PmMYB18, PmMYB23, PmMYB75, and PmMYB76 were highly expressed in tuberous cortices (Fig. 5).