Identification of Post-transcriptional RNA modifications in three Poaceae
To determine the degree to which the epitranscriptome is conserved at a gene level, we performed poly-A enriched RNA-sequencing on soil grown seedlings of Zea mays, Sorghum bicolor, and Setaria italica (Maize, Sorghum, and Setaria, respectively, going forward) two weeks after germination. Paired-end, 150 base-pair RNA-sequencing was performed on aboveground tissue for two biological replicates (~ 20 million reads per replicate; Fig. 1A). Following read mapping, modified sites were inferred using HAMR and ModTect (see Methods). Both of these algorithms take advantage of the propensity for certain reverse transcriptases to misinterpret ribonucleotides that are modified along the canonical base-pairing edge, and as a result, incorporate non-reference nucleotides (i.e., SNPs) in the synthesized cDNA. This misincorporation is non-random, and both HAMR and ModTect use the resulting population of SNPs to infer modification class. RNA-seq data were fed into both algorithms and sites that were predicted by both algorithms, and both replicates were retained for subsequent analyses as high confidence Post-transcriptional RNA Modifications (RCMs). From this approach, 5,434-7,020 unique RCMs were identified in 1,944-2,542 transcripts in the three species (Fig. 1B, C, Additional File 1). These modifications represent the seven major classes detected by HAMR, with the m1A|m1I|ms2i6A class being the most common in all three species (approximately 33% of all identified RCMs; Fig. S1A). Additionally, we find that mRNAs marked by RCMs are present at a substantially higher abundance relative to non-modified mRNAs (modified median TPM = 78.5, n = 2521 mRNAs, not modified median TPM = 10.2, n = 16,537, p < 2.2e-16, Fig. S1E). Enriched gene ontology terms of modified RNAs in each Poaceae demonstrates significant over-representation in photosynthesis and cytoplasmic translation pathways (Fig. 1D). These data suggest that RCMs are targeted to conserved cohorts of photosynthesis and translation-associated RNAs in these grass seedlings.
While HAMR and ModTect are able to account for the non-random nature of plastid RNA editing, there is a possibility that the identified RCMs might be enriched on plastid-localized transcripts and thus be an artifact of plastid RNA editing. While modifications were only predicted on nuclear encoded transcripts, many organellar transcripts are encoded within the nucleus [46]. We assessed this possibility in Maize, using sequence homology of all annotated plastid proteins to the nuclear encoded proteome, to determine if organellar transcripts that have been integrated into the nuclear genome are enriched for potential artifactual RCMs. We find no over-representation of organellar to nuclear integrated RNAs marked with RCMs (24/2,249 maize RCMA marked RNAs are plastid RNA nuclear hits, p = .875 Hypergeometric test for over-representation, Fig. S1B). These findings suggest that RCMs are not artifacts of plastid RNA-editing, but reflect events on bona-fide nuclear encoded RNAs that are involved in photosynthetic pathways.
To determine if the conservation of RCM deposition goes beyond functional pathway and extends to the genes themselves, we next examined if RCMs are found on transcripts derived from orthologous genes using both sequence homology and synteny. Of the 1,912 possible modified orthologs (the smallest observed number of modified genes in Setaria), we observed 1,074 modified orthologs in one or both of the other species, and 474 orthologs that were modified in all three. This is significantly more orthologs than expected by chance, accounting for the number of annotated and expressed orthologs (p < 2.2e-16 multi-set hypergeometric test, Fig. S1C). We then assessed whether orthologous mRNAs are marked by a similar number of RCMs across species, allowing us to understand whether the RCM density of the targeted RNA is a conserved feature. Indeed, for all three combinations of comparisons between two sampled species, there is a significant positive relationship between the number of RCMs deposited on orthologous mRNAs (Sorghum:Maize r = .39, Sorghum:Setaria r = .43, Setaria:Maize: r = .32, p < 2.2e-16, Fig. S1D). These findings further demonstrate that the targeting and density of RCM deposition are conserved features in the sampled Grasses.
RCMs are dynamically deposited on mRNAs based on tissue and abiotic stress
Our previous analyses were limited to a single developmental time point across three species. Previous reports [14, 28, 47] suggest that RCMs may play important roles in post-transcriptional RNA regulation and thus would be dynamically deposited across development or environmental changes. Therefore, we chose to examine RCMs across diverse tissues and environmental contexts using RNA-seq from publicly available datasets [48–50]. We focused our efforts on Sorghum, first identifying RCMs in a large-scale tissue expression atlas by McCormick and co-authors [49] containing 137 sequencing samples across 48 tissues/stages/conditions (see Methods). From these data we identified 266,710 modifications on 6,805 unique transcripts, representing 19.3% of the expressed (TPM > = 1) Sorghum transcriptome. To determine whether the same repertoire of transcripts are being targeted with RCMs as our seedling data from Fig. 1, or if our data were biased towards identifying a distinct subset of RNAs, we compared the modified RNAs in each dataset. We found 1,813 of the 2,542 modified Sorghum seedling mRNAs are also modified in the McCormick et al., tissue expression atlas. This overlap is substantially more than expected by chance (p < 2.2e-16, Fig. S2A) and likely reflects similarity in molecular processes within tissues examined.
We next aimed to characterize the tissue specificity of RCMs in the McCormick Sorghum tissue expression atlas. Using a modified calculation for Tau, a value typically used to calculate tissue specificity based on RNA abundance [51], we observed that most transcripts were modified in a very context-specific manner (Fig. S2B). The transcripts with the lowest Tau value, and therefore modified under the broadest context, were mostly associated with core cellular processes such as translation, whereas those transcripts modified in the most specific context were associated with mRNA maturation and abiotic stress responses (Fig. S2C, D). We also clustered transcripts based on the tissue type in which they were modified, the number of modifications identified, and the average number of modifications per transcript (Fig. 2A). In our analysis we observed a strong clustering based on tissue similarity, with seed and roots being notable exceptions. We find a strong bias for RCMs in this dataset towards root samples where both the number of RCMs and number of mRNA targets is substantially higher than other tissues (Fig. 2A and Fig. S2E). In contrast, seeds displayed the fewest number of modified transcripts, but the average number of modifications per transcript was very similar to that seen in leaf and root tissues where the number of modified sites and transcripts were much higher (Fig. 2A). We observed a large number of transcripts that are modified in all tissues (n = 851; Fig. 2B and Fig. S3A), highlighting the existence of a core repertoire of RCM-targeted mRNAs. As expected, based on their presence across a broad tissue and developmental context, these mRNAs are enriched for terms associated with cytoplasmic mRNA translation (Fig. S3B). However, the majority of transcripts were more restricted in terms of the tissue or developmental context under which they were modified and were enriched for more specialized GO terms. For instance, root-specific modified mRNAs were enriched for rhizosphere-associated terms [52] such as oxidation management, generic methylation, and aromatic compound biosynthesis (Fig. S3C). Leaf-specific modified mRNAs were enriched for photosynthetic terms, whereas seed-specific mRNAs were enriched for lipid storage, ABA response, and cold acclimation terms (Fig. S3D, E). Importantly, the observed increase in context-specific modifications was not simply due to differences in the most abundant transcripts in each tissue. Indeed, we observed a low (although positive, r = .34) correlation between RNA abundance and modification levels (Fig. 2C; top) in the Sorghum tissue atlas. Thus, although there is a core set of modified Sorghum transcripts, most are targeted for RCMs in a context or developmentally-specific manner.
The conservation of RCMs on orthologous genes in our grass seedling data suggest that the developmental and tissue contexts under which these marks are deposited might also be conserved. To address this contextual conservation, we examined RCMs in two publicly available Arabidopsis thaliana tissue atlases [48, 50]. Both atlases examined similar developmental stages, but did so under slightly different conditions (e.g., constant light and sterile MS media for Mergner et al. [48] vs cycling light and soil for Klepikova et al. [50]). Importantly, while the Klepikova tissue atlas is primarily used by the community to examine transcript abundance in the Arabidopsis EFP browser [53], the work by Mergner et al. performed a paired assessment of transcript and protein abundance, which we used to examine the relationship between modifications and translation (see below). Due to differences in experimental design, we analyzed each of these atlases separately. A full breakdown of examined tissues, as well as total number of modifications identified, can be found in Additional File 2. Like Sorghum, clustering of Arabidopsis tissues by RCM density placed similar tissues together (Mergner et al: Fig. S4A, B, Klepikova et al: Fig. S5A, B). While Arabidopsis root tissues did not display a significantly elevated level of RCMs as in Sorghum, Arabidopsis seed transcripts had a reduced pool of very highly modified transcripts in both atlases. Similar to our observations in Sorghum, the seed-specific modified transcripts were enriched for lipid, nutrient, and ABA-response terms (Fig. S6A, B, C). Thus, as in Sorghum, the Arabidopsis epitranscriptome is diverse, highly context-specific, and appears to be associated with transcripts critical for cellular function.
Given the similar patterns of RCM abundance in Sorghum and Arabidopsis, we next examined whether orthologous mRNAs between Sorghum and Arabidopsis, which last shared a common ancestor ~ 150–250 million years ago [54], are targeted by RCMs. While we identified fewer mRNA targets of RCMs in both Arabidopsis expression atlases relative to the Sorghum expression atlas (Mergner: 1,324, Klepikova: 2,495 vs 6,845 in Sorghum), the number of modified orthologous mRNAs (658 Mergner vs McCormick; 1,180 Klepikova vs McCormick) was significantly more than expected by chance for all possible mRNA combinations (Fig. S6D, E, F, p = .001 Hypergeometric test). We conclude that this class of RCMs target an ancient conserved repertoire of translation and photosynthetic related mRNAs.
Given the developmental differences in RCM deposition in both Sorghum and Arabidopsis, as well as reports associating RNA modifications with plant stress responses [9, 14, 55–59], we next determined if RCMs are associated with drought stress in Sorghum. We utilized a publicly available field-grown Sorghum transcriptome dataset that sampled well watered and water limited (i.e., drought treatment) Sorghum leaves and roots from drought tolerant and susceptible genotypes across weekly timepoints [60]. We focused on a post-flowering time point (week 10) where one genotype (BTx642) is considered drought tolerant and the other genotype (RTx430) is drought susceptible. Counterintuitively, we observed a shift towards a more negative correlation between RNA and RCM abundance during drought stress relative to the Sorghum tissue atlas (Fig. 2C, bottom). A heatmap comparing transcript and RCM abundance between the two genotypes and treatments clustered into four groups (Fig. 2D). One of these clusters (Cluster 3) showed similar increases in transcript abundance levels between the two genotypes under water limiting conditions but showed an increase in RCMs specifically in the drought tolerant genotype (Fig. 2D). An examination of enriched GO terms revealed that Cluster 3 contained both heat shock proteins as well as water transport proteins (Fig. 2E), suggesting that RCMs may be associated with the drought response in the tolerant genotype.
RCMs accumulate near exon-exon junctions and are associated with splicing events
Thus far we have observed an association between RCMs and plant developmental and environmental transcriptional responses. To gain functional insight into RCMs, we analyzed their accumulation and distribution across mRNA bodies using our seedling RNA-seq data for all three species. RCMs were enriched across the CDS, with a bias towards the 3’ CDS, and in the 3’ UTR of mRNAs (Fig. 3A). Like m6A [20], we also observed that RCMs are biased towards being deposited on abnormally long exons (median length of modified exons = 361 nts, median unmodified = 147 nts, median all exons = 148 nts, Fig. 3B, p < 2.2e-16). Additionally, we observed a significantly higher proportion of expressed transcripts that are multi-exonic being targeted by RCMs, relative to mono-exonic transcripts (~ 11–14% vs ~ 5–8%, Fig. 3C, p < 7.1e-15). This was also the case for long non-coding RNAs (lncRNAs) where 4.8% of mono-exonic lncRNAs are marked by RCMs (17/355) and 8.4% of multi-exonic lncRNAs are targeted by RCMs (59/704, p = .044; Fig. S7A). A closer examination of distinguishing CDS features uncovered a dramatic buildup of RCMs on both 5’ and 3’ edges of exon-exon junctions (EJs) relative to start and stop codons (Fig. 3D). Thus, these data suggest that RCMs likely play a role in the regulation of RNA splicing on diverse transcript types.
This observation of RCMs preferentially occurring at EJs was initially made by Vandivier et al [14] on degrading mRNAs, whereas degrading mRNAs likely make up a small proportion of our dataset. Thus, these findings suggest a steady-state RCM enrichment at mRNA EJs. A transcriptome-wide 3’ bias was observed for these EJ-enriched RCMs (Fig. 3E). This terminal EJ enrichment was not due to 3’ sequencing bias that is often observed in poly-A enriched transcriptome datasets (Fig. S7B, C, D). Interestingly, genes that express multiple isoforms are more likely to be modified than single isoform transcripts with similar numbers of exons (Fig. 3F, p < 2.2e-16). However, we observe no significant difference in the buildup of modified sites at alternatively spliced junctions vs canonical splice sites (Fig. S8, see Methods). These data suggest that the increase in modification frequency at genes with more isoforms is likely due to the presence of more exon junctions. Thus, RCMs appear to be positively associated with splicing in plants.
RCMs are positively associated with stable and translating mRNAs
Because Mergner et al [48] measured RNA and protein abundances from matched Arabidopsis samples they were able to correlate RNA:protein abundances across their samples. Therefore, we examined whether mRNAs marked with RCMs displayed a higher or lower RNA:protein correlation. A difference in RNA:protein correlation could suggest a RCM function in RNA stability and/or translation efficiency. mRNAs that are not marked with RCMs across the Mergner et al atlas showed the lowest median RNA:protein correlation (n = 3,361, r = .68), while RCM marked mRNAs showed a significantly higher RNA:protein correlation (n = 332, r > = .758, p < .05, Fig. S9A). Due to this positive correlation between transcripts harboring base-pair disrupting modifications and their translation products, it is possible that these RCMs are positively influencing translation, either by reducing structural complexity or by stabilizing these transcripts.
To test for a relationship between RCMs and mRNA decay, we first examined publicly available transcriptomic data from Arabidopsis lines deficient in cytoplasmic mRNA decay pathways [61] by Sorenson and co-authors. Cytoplasmic mRNAs usually decay through three pathways: decapping (5’ -> 3’), the RNA exosome (3’ -> 5’), or an exosome independent 3’ -> 5’ decay pathway. Decapping occurs through a multi-subunit complex that is scaffolded by VARICOSE in plants and metazoans (VCS,[62]). Meanwhile, the exosome independent 3’ -> 5’ decay pathways occurs through SUPPRESSOR OF VARICOSE (SOV), which contributes to the decay of decapped RNAs and is also known as DIS3L2 in other Eukaryotes [63–65]. Sorenson et al took an RNA-seq approach to examine mRNA decay dynamics in four Arabidopsis genotypes that vary in their cytoplasmic mRNA decay factors (wild-type, sov knockout, vcs knockout, and sov/vcs double knockout). If RCMs are primarily marking mRNAs for degradation (as initially suggested by Vandivier et al [14]), then a buildup should be observed after transcriptional arrest in genetic backgrounds deficient for mRNA decay.
Despite the well-known relationship between RCMs and housekeeping RNAs (snoRNAs and tRNAs), the majority of modified transcripts at the beginning and end of the time series were mRNAs (Fig. 4A). As expected from arresting transcription, each genotype, with the exception of the sov/vcs double mutant, displayed a ~ 25–50% decrease in the total number of observable protein-coding transcripts eight hours post-treatment (see Methods; Fig. S9B). The number of observed RCMs increased in all genotypes after arresting transcription, as did the proportion of modified mRNAs (Fig. 4B and Fig. S9C), suggesting two possibilities: 1) that modification abundance increases with transcript age, or 2) that non-modified transcripts are degraded more quickly leading to a higher proportion of transcripts detectable over time containing RCMs.
To more directly test the association of RCMs with RNA degradation, we analyzed whether the pool of transcripts that are modified at time point 0 were still detectable and modified at subsequent time points. Surprisingly, mRNAs modified at time point 0 were nearly all detectable 8 hours after transcription arrest while mRNAs that were not modified at time point 0 declined by more than 40% (Fig. 4C). These results indicate that RCMs mark mRNAs which degrade slower than the entire mRNA pool. Indeed, mRNAs that are not targeted by RCMs show a significantly larger magnitude of TPM decrease relative to mRNAs marked with RCMs at time point 0 (Fig. 4D). These results strongly suggest that RCM marked mRNAs degrade at a slower rate relative to mRNAs that are not marked by RCMs.
Sorenson et al also modeled the initial decay rate of all mRNAs (alpha decay rate) based on read abundance after transcription arrest. Based on these values, mRNAs that are modified at time point 0 in wild-type show a significantly lower (slower) alpha decay rate relative to all other mRNAs (Fig. 4E). Sorenson et al used the decay rates across genotypes to assign mRNAs to genotype-dependent RNA degradation pathways (Fig. 4F x-axis groups 1–14, see Fig. 1C from Sorenson et al). For mRNAs that could not be assigned to the VARICOSE or SOV degradation pathways, Sorenson et al hypothesized that they were likely targets of the RNA exosome (x-axis group 15; Fig. 4F). Interestingly, modified mRNAs are significantly biased towards being assigned to this group and thus are likely targets of the RNA exosome (Fig. 4F p < 2.2e-16, Pearson's Chi-squared test). Thus, based on these data, RCMs appear to be marks of Pol-II transcript stability, rather than marks of degradation.
RCMs are not associated with Nonsense-Mediated mRNA Decay
To further investigate whether RCM marked mRNAs have an association with RNA degradation pathways, we next turned our focus to the Nonsense Mediated mRNA Decay (NMD) pathway. NMD is responsible for degrading aberrant mRNAs that have a premature termination codon. This is often recognized as a termination codon upstream of the exon junction complex (EJC, [66]). Given the accumulation of RCMs at exon-exon junctions and the recent report of RNA degradation intermediates accumulating near exon-exon junctions [67], we tested whether modified mRNAs were likely targets of NMD. SUPPRESSOR OF MORPHOLOGICAL DEFECTS ON GENITALIA7 (SMG7) is a critical component of early NMD signaling in most Eukaryotes [68]. Gloggnitzer and co-authors [69] performed RNA-seq in Arabidopsis with a loss-of-function smg7 mutant in a genetic background avoiding the strong autoimmune response of knocking out NMD (pad4; [69]). We re-processed the RNA-seq data generated by Gloggnitzer et al. and identified RCMs from their data. The up-regulated mRNAs in the smg7 genotype represent both direct and indirect targets of NMD silencing. There is strong statistical evidence that NMD targets are actually under-represented (that is, depleted) from mRNAs containing RCMs (5,275 mRNAs containing RCMs, 656 up-regulated mRNAs in smg7, overlap = 103, p < 2.2e-16, Fig. S10A). Furthermore, we identify no significant differences RCM distribution at exon-exon junctions between pad4 and smg7-pad4 RCMs (Fig. S10B, p = .986), or between smg7 RCMs in mRNAs up-regulated in smg7 vs those not differentially abundant in smg7 (Fig. S10C, p = .144). In agreement with the smg7 results which examined a single tissue, there is no significant overlap between predicted NMD targets and the RCMs identified in the Mergner et al. tissue atlas (Fig. S10D, p = .304). Collectively, these results suggest RCMs are not associated with the NMD pathway.