U. gibba lncRNAs has a reduced number of intergenic lncRNAs
lncRNAs were obtained from sequencing data of 12 RNA-Seq libraries, 10 from vegetative and 2 from trap tissue from plants subjected to the contrasting abiotic stress or hormone treatments (Additional file 1: Dataset S1). We produced over 19 million mapped reads per library (95.37% mapping to the genome) for a total of 330,258,977 million mapped reads (Additional file 1: Dataset S1). To identify lncRNAs, assembled transcripts were translated into the three potential reading frames and filtered by homology to proteins encoded in the U. gibba genome and other plants in a non-redundant protein database [14], yielding 10,386 putative lncRNAs of at least 200-nt in length. Putative lncRNAs were then filtered to eliminate precursors of tRNAs, rRNAs, miRNAs among others. Putative lncRNAs were further filtered for coding potential in sense-direction, resulting in 4,295 putative lncRNAs (Additional file 2: Fig. S1). The vast majority of these lncRNA loci were relatively short, 89.05% were smaller than 500-nt, and only 1.81% longer than 800-nt (Fig. 1a). The lncRNA mean length was 336.9-nt and the largest one was 2134-nt (Additional file 2: Table S1; Additional file 3: Dataset S2). We also found that in U. gibba 86.51% of lncRNAs had a single exon structure and 13.49% had two or more exons, percentages like other plant species for which lncRNAs have been reported (Fig. 1b; Additional file 1: Dataset S1). When putative lncRNAs were mapped onto the U. gibba genome, we determined that 37.03% mapped to regions corresponding to exons of protein coding genes in antisense orientation, 10.89% to intronic regions, 25.26% overlapped gene bodies and IGRs, and 26.82% were located only in IGRs (Fig. 1c). As it is more complex to determine whether a non-coding transcript that completely overlaps with the transcript of a coding gene indeed corresponds to a lncRNA, we made only a comparison of intergenic lncRNAs (lincRNAs) with other plant species. By mapping directly RNA-Seq reads and assembled contigs, we found that this carnivorous plant, under all conditions tested, expresses a total of 1152 lincRNAs, which is significantly lower than that reported for other plant species that ranges between 1580 and 3100 lincRNAs (Additional file 2: Fig. S2). Additionally, were evaluated the expression profiles for coding and noncoding transcripts and we found a higher level of expression for mRNA than for the entire catalog of lncRNAs or lincRNAs (Fig. 1d). We found no structural differences between lncRNAs annotated as intergenic or intragenic; in both cases the predominant structure was 1 exon (Additional file 2: Fig. S3, S4). To visualize the distribution lincRNAs loci, reads were mapped onto the 18 largest contigs of the U. gibba genome (>1Mb), which included 4 putatively entire chromosomes. loci encoding lncRNAs were distributed across the entire genome with prevalence in high gene density regions, and low frequency in pericentromeric regions (Fig. 2), which can be more clearly seen in the four complete chromosomes (Fig. 2; dark grey color). lncRNAs density was similar in 17 of the largest contigs of the U. gibba genome [15], except in Unitig_8 (6.8 Mb) that has considerably lower density of lncRNA loci (Fig. 2).
U. gibba has an atypical abundance of 24-nt sRNAs compared to other angiosperms
To further characterize noncoding RNA diversity in U. gibba, we carried out small RNA-Seq analysis of RNAs extracted from green tissue and traps from plants grown under the same conditions described for lncRNA identification. We obtained a total of 23.6 million mapped reads, of which 19.9 million had lengths between 20 to 25-nt, which were selected for further analyses (Additional file 1: Dataset S1). Upon mapping the reads onto the U. gibba genome, we found that loci for 20 to 25-nt sRNAs are mainly located at pericentromeric regions, where TE density is higher, but with some peaks in regions of high gene content (Fig. 2). Similar results have been published previously for other plant species [16].
An interesting finding was that most sRNA sequencing reads corresponded to 21-nt sRNAs (52.8%) and only 14.4% to 24-nt sRNAs (Additional file 2: Fig. S5; Additional file 1: Dataset S1). This contrasts with sRNA size distribution for other angiosperms, for which the most abundant sRNA class is 24-nt [17,18]. To confirm that the sRNA size distribution in U. gibba differs from that of other angiosperms, we performed an analysis of sRNA size abundance in representative plants from different clades for which sRNAs have been characterized (Additional file 1: Dataset S1). In total, we analyzed sRNA datasets for 30 plant species, of which 21 were angiosperms and 9 were representative plants outside the angiosperms. Our results confirm that in both monocot and eudicot species, except for U. gibba, the most abundant sRNA class is 24-nt (Fig. 3a). For the case of green algae, the most prevalent sRNAs classes are of 21 to 23-nt, whereas in Volvox carteri the most abundant sRNAs are of 21-nt and 22-nt, in Chara coraline the most abundant are 22-nt and 23-nt (about 30% of each size), and for Chlamydomonas reinhardtii the 21-nt class predominates (Fig. 3a). In early-branching land plants (Marchantia polymorpha, Physcomitrella patens, and Marsilea quadrifolia) and gymnosperms (Picea abies, Gynkgo biloba) the most abundant size class of sRNAs is 21-nt, except for Cycas rumphii, which has similar amounts of 21 and 24-nt sRNAs (Fig. 3a).
A large number of 24-nucleotide small RNA loci produce a small proportion of sRNA reads and are associated preferentially with intergenic regions in U. gibba
In general, sRNA sequence distribution in angiosperms is characterized by a major 24-nt peak containing primarily unique reads, and a 21-nt peak comprising many redundant reads [19]. As expected this is also true in model angiosperm species such as Arabidopsis [20], tomato [21] and rice [22]. For the three samples we sequenced (Additional file 2: Fig. S6), 21-nt sRNAs had higher redundancy (up to 97% of the reads are redundant) in comparison with 24-nt sRNAs, of which 30% were unique and 70% redundant (Additional file 1: Dataset S1). To better classify sRNA loci, we performed an analysis with ShortStack V2.0 [23] to identify in silico which DICER-like (DCL) genes are involved in the biogenesis of miRNAs or siRNAs. ShortStack analysis identified 7478 siRNA loci and only 80 miRNA loci (Fig. 3b; Additional file 4: Dataset S3), which produce nearly 1.5 million and 8 million mapped reads respectively, suggesting that a large number of siRNA loci accumulate a low number of reads (Fig. 3c). Of the 80 miRNA loci identified, 78 already were annotated in the miRBase V22.0 catalog of eukaryotic miRNAs, and 2 represent putative Utricularia specific miRNAs (Additional file 4: Dataset S3; Additional file 2: Fig. S6). miRNA loci grouped into 17 families, of which the miR166, miR156, miR159, miR319 and miR858 families made up 94% of the miRNA reads (Additional file 2: Fig. S7). These miRNA families are conserved in most plant species and produce high levels of mature miRNAs [18].
sRNA loci annotation reveals that the 24-nt sRNA loci are preferentially located at IGRs with 80% of them, while the sRNA loci of 20-nt to 22-nt were located in similar proportions at genic and intergenic regions and for the 23-nt sRNA class, 57% of loci are located at IGRs and 31% in genic regions (Additional file 2: Fig. S8).The annotation is consistent with the distribution of sRNAs at the genome scale and can be clearly seen in unitig_0, in which 21-nt sRNA loci are distributed across the chromosome with significant peaks in regions with high gene density and 24-nt sRNA loci were found to be predominantly located in regions with high TE density, presumably pericentromeric regions (Fig. 3d).
sRNA biogenesis and the RdDM pathway in U. gibba
The unusual proportions of 24-nt and 21-nt sRNAs observed in U. gibba suggest that its sRNA production machinery could differ somehow from those of other angiosperms. To explore this possibility, we focused on the presence of genes involved canonical miRNA biogenesis, genes that are part of the subunits of RNA Pol-IV and RNA Pol-V, homologs of AGO, DCL, RDR, and other key genes involved in siRNA biogenesis and de novo DNA methylation in U. gibba. We searched based on sequence homology (transcript and protein), protein domain conservation, synteny, and through phylogenetic analysis. Furthermore, we performed a comparison with representative plant species (both angiosperm and non-angiosperm) for which key RdDM pathway genes [24] and RNA polymerase compositions were previously reported [25].
We found that key genes involved in canonical miRNA biogenesis such as DICER LIKE 1 (DCL1), SERRATE (SE), HYPONASTIC LEAVES 1 (HYL1), HASTY 1 (HST1) and ARGONAUTE 1 (AGO1) are conserved in U. gibba (Additional file 2: Table S2). Pol-IV and Pol-V are crucial in the RdDM pathway and are constituted by diverse DNA-DIRECTED RNA POLYMERASES IV AND V (NRPD/NRPE) proteins. Pol-IV and Pol-V differ from RNA polymerase II in their second, fourth, fifth, and seventh subunits (NRPD2, NRPD4, NRPE5, NRPD7, respectively). The U. gibba genome encodes NRPD1, NRPE1 and NRPD2 genes (Additional file 2: Fig. S9), which is consistent with the presence of siRNAs and their strong conservation in the land plant lineage. We also identified NRPD7 in the U. gibba genome (Additional file 2: Fig. S9), a subunit previously identified in green eukaryotes except in the algae Chlamydomonas, and NRPD5, found in gymnosperms and angiosperms but not in algae and ferns [25,26]. Only after an extensive search did we find evidence for an NRPD4 ortholog in U. gibba. Although this gene is classified as an orphan in Plaza 4.0 (HOM04D168668) for Arabidopsis thaliana, our analysis showed that this categorization is likely related to the Arabidopsis thaliana homolog being extremely divergent; even the A. lyrata ortholog was readily placed into a clear NRPD4 gene family (Additional file 2: Fig. S10). Interestingly, this subunit has been reported only in angiosperms and not in gymnosperms, early land plants or algae [25,26].
The AGO protein family in Arabidopsis has been subdivided into 4 clades: AGO2/3/7, AGO4/6/8/9, AGO1/10 and AGO5. The number of family members of this protein family ranges from 2 AGO proteins in C. reinhardtii to 10 in Arabidopsis and 20 in Z. mays [27]. We searched in the Phylome database v4 [28] for possible homologous genes in U. gibba and found evidence for at least one AGO for each clade, with the exception of AGO5 for which no homologs were found (Additional file 2: Fig. S11). Two genes were found for AGO clade 2/3/7 (Additional file 2: Fig. S12), two genes represented the AGO 4/6/8/9 clade (Additional file 2: Fig. S13), and 4 homologs grouped in the AGO 1/10 clade (Additional file 2: Fig. S14). Additionally, we performed our own exhaustive phylogenetic analysis with many plant genomes to assign each U. gibba AGO gene with more certainty into specific clades. We were able to identify the same 8 AGOs described above, wherein the 2 homologs of clade 2/3/7 apparently are AGO7 copies, the two 4/6/8/9 copies appear closer to AGO4 than AGO6. AGO8/9 are sister genes only in Brassicaceae. There is one U. gibba AGO10 and three homologs for AGO1 in the AGO1/10 clade (Additional file 2: Fig. S15).
In seed plants there are three RDR ortholog genes; two are conserved in all land plants, RDR1 and RDR6, and RDR2 is required for production of Pol-IV-siRNAs and is specific to seed plants. Aside from these RDRs, three additional members of this protein family, RDR3, RDR4, and RDR5, are present in Arabidopsis and other plants [29–31]. We found RDR6 (two copies), RDR1 and RDR2, but no evidence for the presence of RDR3/4/5 in the genome of U. gibba (Additional file 2: Fig. S16a). In angiosperms the DCL family has 4 members: DCL1, DCL2, DCL3, and DCL4 [32]; however, in lycophytes and ferns there is only evidence for the presence of DCL1, DCL3 and DCL4 [26]. Phylogenetic analysis of the DCL family permitted the identification in U. gibba of 4 DCL proteins (DCL1, DCL2, DCL3, DCL4), suggesting that has a DCL repertoire similar to other angiosperms (Additional file 2: Fig. S16b). Globally, were able to assign U. gibba homologs for the remaining key genes in the canonical RdDM pathway (Additional file 1: Dataset S1). However, although UgDCL3 phylogenetically groups very closely to tomato and Mimulus DCL3, UgDCL3 is missing its N-terminal region, where the conserved DEAD/DEAH, Helicase and Dicer dimerization domains are located (Fig. 4a; Additional file 2: Fig. S17). The absence of about 600 amino acids of the N-terminal region in UgDCL3 is evident in a multiple sequence alignment of DCL3 with other angiosperms (Fig. 4b; Additional file 2: Fig. S18). To confirm that the incomplete DCL3 does not represents mistake in the assemble/annotation of the U. gibba genome, we searched for the presence of DCL3 transcripts in the different RNA-Seq libraries and we confirm with a 5´ Rapid Amplification of cDNA Ends (5´RACE) that in both cases the sequence of transcripts and 5´ RACE sequence that the U. gibba DCL3 gene is missing the DEAD/DEAH domain (Additional file 2: Fig. S19).
Female gametogenesis in U. gibba is reminiscent of Arabidopsis mutants affected in the RdDM pathway
Mutations affecting most of the genes involved in the RdDM pathway have no obvious phenotype during the vegetative development of plants, but show defects in female gametogenesis, including the differentiation of supernumerary gametic precursors that often give rise to ectopic female gametophytes within the ovule [33,34]. To determine if the truncation of DCL3 and the unusual distribution of sRNAs could be related with female gametogenesis in U. gibba as has been reported for Arabidopsis, we first analyzed female gametogenesis in whole-mounted developing ovules, as no descriptions of ovule development have been previously reported for this species. Our results are described and illustrated in Fig. 4c-i and in Additional file 2: Table S3.
As for other species of Utricularia [35,36], the ovule of U. gibba is unitegmic, with a funiculus forming a raphe and merging into a voluminous placenta. The formation of differentiated gametes occurs after the formation of meiotically derived megaspores (megasporogenesis). Subsequent rounds of mitotic divisions give rise to the female gametophyte (megagametogenesis). Megasporogenesis occurs in ovules within ovaries having a diameter of 0.3 to 0.5 mm. Whereas 51.4% (n=142) of pre-meiotic ovules showed a single megaspore mother cell (MMC; Fig.4c), 42.9% showed from two to six differentiated cells resembling the MMC (Fig. 4d and e). In 5.7% of the ovules examined we could not identify a pre-meiotic precursor. Ovules contained in 0.5-1 mm ovaries had already undergone meiosis and often show a chalazal functional megaspore (FM; Fig. 4f) within which mitotic divisions will give rise to an 8-nucleated syncytium (Fig. 4g) that will cellularize before differentiating into a mature female gametophyte (FG). Although the degeneration of the FG prior to the end of megagametogenesis is not uncommon (13.1%; n=76), in most cases (40.7%; n=76) the ovule contains a FG in which the micropylar region containing the egg apparatus expands outside the integument and grows within the placenta (Fig. 4g and 4h). Interestingly, 22.4% of ovules examined showed supernumerary gametic cells in the chalazal region, independent of the developing FG (Fig. 4h), and containing two independently developing FGs (Fig. 4i), suggesting that supernumerary gametic precursors can give rise to female gametophytes that may or may not originate from a meiotically derived cell. The presence of supernumerary gametic precursor cells and ectopic female gametophytes in U. gibba is reminiscent of phenotypes found in Arabidopsis mutants dicer-like3 (dcl3), argonaute4 (ago4), argonaute9 (ago9), rna-dependent rna polymerase6 (rdr6), and nrpd1a, all affected in key components of the RdDM pathway.
U. gibba has a reduced levels of DNA methylation
In plants, 24-nt siRNAs from repetitive DNA and TEs that are loaded by AGO4/6 trigger DNA methylation, which results in histone modifications such as the H3K9me2 [37,38]. Because of the unusual distribution of 24-nt sRNAs and the low proportion of TEs and other repetitive DNA in U. gibba, we decided to explore preliminarily the global DNA methylation patterns in this carnivorous plant using long-read DNA sequencing data with the technology of SMRT-Seq, recently reported [15]. This technology measures each base addition as an interpulse duration (IPD) or retention time ratio. The IPD will depend on whether the new base is incorporated by pairing to a modified or non-modified base in the template and the nature of the modification [39]. Therefore, analysis of the IPDs during SMRT-Seq can allow the identification of m5C in the DNA template without the need for commonly used bisulfite DNA chemical conversion methods. Since there are no previous reports of using SMRT-Seq data to determine m5C global methylation in plants, we tested as preliminarily the m5C identifications in U. gibba taking advantage of PacBio data from its genome (http://merlion.scelse.ntu.edu.sg/shares/pbio_HGYDGSKAA23/). Raw SMRT-Seq data was aligned against the reference genome and base kinetics information analyzed using the program SMRT-link V4.0 to identify base modifications and the theoretical IPD value for non-modified bases which is 1 was used.
The genome-wide depth with SMRT sequencing was ~ 70X and the mean coverage for all bases was 34X. We identified 1,590,729 putative methylated cytosines and 1,088,032 high-confidence m5Cs (IPD >= 1.7), which represents 3.88% and 2.69% of the Cs in the U. gibba genome, respectively. DNA methylation corresponding to all methylated cytosines in high confidence m5Cs for each context was scored: 37.30% CG methylation, 22.22% in CHG context, and 40.45% in CHH (Additional file 2: Fig. S20a). These preliminarily results suggests that PacBio data could be useful to obtain a methylation landscape and we decided to explore the methylation levels and gene body methylation (GbM). The methylation levels for each context were 10% for CG, 6% for CHG and 2.4% for CHH (Additional file 2: Fig. S20b), which are reduced levels in comparison with other plant genomes [40]. On other hand, in the GbM we found lower DNA methylation density levels in upstream/downstream regions than in gene bodies, with major peaks located near start/stop codon sites (Additional file 2: Fig. S20c).
Moreover, to confirm these preliminary m5C identifications we performed BS-Seq for two replicates of the whole-plant of U. gibba. The bisulfite conversion rates in both replicates were greater than 99.85% and the mean coverage for base was 26X. After sequencing, the clean reads were mapped against the reference genome obtaining around 85% of mapping rate. During the base calling we found a total of 1,789,535 and 1,777,410, m5Cs for each replicate. Of these, we assigned as high-confidence 1,281,545 and 1,302,422 m5Cs, respectively. In percentages of total cytosines in the genome, the global methylation is ranged from 4.09%-4.057% to 2.92%-3.17%, from total and high-confidence m5C in both replicates, respectively. The distribution of m5Cs at the chromosome level shows peaks near centromeric regions for both replicates of BS-Seq and for SMRT-Seq (Fig. 5a) with very similar distributions, similar to that reported in other plant genomes but with one of the lowest global methylation rates reported for a land plant [11,12]. Of the total of m5Cs identified, for the methylation context CG correspond 39.1%, 38.67%, for CHG 27.4% and 27.1% and for CHH context 33.5% and 34.23% (Fig. 5b). When comparing the methylation levels of each context related with the total of these contexts sequence at the genome, the methylation levels were 12% for CG, 8% for CHG and 2% for CHH (Fig. 5c), reduced levels in comparison with other angiosperms including A. thaliana [40]. The GbM in U. gibba was calculated 800bp upstream, downstream and in genic region and the results indicate lowest methylation in upstream and downstream region compared with gene body region where the CG methylation is the highest (Fig. 5d). These results correlates with GbM densities previously reported for other plants [40,41]. The analysis of TE body methylation showed lower methylation levels near upstream and downstream regions such is in GbM and high methylation density is in CHH context (Fig. 5e), which can be explain the TE silencing in this genome.
To experimentally validate the predicted methylation densities, we performed Chop-qPCR analysis on two regions from one U. gibba chromosome (Unitig 0), one predicted to be highly methylated (GPoor, near the presumed centromeric region) and the other with lower methylation density at a proposed euchromatin region (GRich) (Additional file 2: Fig. S21 above). PCR amplification of undigested DNA was used as positive controls to contrast with the amplification obtained from digested samples, where a decrease in amplification levels reveals a higher degree of methylation at CG and CHG sites in the tested regions. We observed for both enzymes that DNA methylation at the analyzed sites was over 2-fold more frequent in the GPoor region than in the GRich region (Additional file 2: Fig. S21). Our findings suggest that the SMRT-sequencing technology could be used effectively to estimate global methylation patterns in plant genomes and BS-Seq confirm several results obtained with SMRT-Seq.