Generating and sequencing EM-seq and WGBS libraries
To systematically compare EM-seq and WGBS, we prepared Illumina sequencing libraries from different amounts of genomic DNA extracted from Arabidopsis flowers - 25ng, 50ng, 150ng, or 400ng (to encompass the range of starting material amounts that are typically used in an Arabidopsis methylome study). This was followed by either enzymatic (TET2 followed by APOBEC3A) or bisulfite treatment, and PCR amplification of the converted products (for 6, 12, or 18 cycles) to complete the library preparations (Fig. S1, see also Methods). Two replicates were performed for each condition. EM-seq libraries consistently showed higher mapping rates and lower duplication rates than WGBS libraries, and the variation between EM-seq libraries was smaller than the variation between WGBS libraries (Fig. 1ab, Table S1a). In addition, within , higher mapping rates and lower duplication rates (manifested by higher effective read rate) were generally associated with moderate input amounts and lower PCR cycle numbers Table S1a. One of the causes of false methylation reporting in bisulfite sequencing is non-conversion, in which the two strands of DNA occasionally fail to fully denature during bisulfite treatment and are thus resistant to bisulfite conversion [21]; this leads to the detection of several adjacent un-converted cytosines. In Arabidopsis, we previously introduced a filter that discards sequencing reads with three or more consecutive methylated cytosines in the CHH context [8]. This non-conversion filter works well in Arabidopsis, since CHHs are rarely methylated above 10% [22, 23], so the chance of observing three consecutive methylated CHHs is below 0.1% and only a small number of real methylation signatures are discarded. However, it is not practical to use this filter in organisms that have high levels of CHH methylation. Very few EM-seq reads (1.56% to 2.01%) eligible for removal by the non-conversion filter were identified, while the filtered rates for WGBS libraries were much higher and showed greater variation (2.62% to 13.41%) (Fig. 1c, Table S1b). This suggests that the EM-seq method is generally free of the notorious non-conversion problem that is frequently observed in WGBS, and is therefore useful for methylation detection in organisms in which use of an arbitrary non-conversion filter is not suitable. Consistent with this, we found that virtually no methylated cytosines were detected from the unmethylated Arabidopsis chloroplast genomes in any EM-seq library, a result that is only achieved or approached by the best WGBS library (400ng input with 12 cycles of PCR) (Fig. S2, Table S1c). This indicates that EM-seq has much lower background noise levels than WGBS.
As expected from these comparisons, EM-seq also displays advantages in terms of genomic coverage (Fig. 1d). Perhaps more importantly, EM-seq is able to cover 22.07%, 22.10%, and 23.47% more CG, CHG, and CHH sites, respectively, than WGBS (Fig. 2a); these numbers are consistent with previous findings of EM-seq in human cells [20] and the manufacturer’s description of the product [24]. All EM-seq libraries exhibit similar cytosine coverage (CG: ranging from 5,556,957 to 5,602,669, CHG: ranging from 6,090,541 to 6,128,646, CHH: ranging from 31,123,001 to 31,315,262), while different preparation conditions clearly affect the performance of WGBS libraries (CG: ranging from 3,922,759 to 5,165,506, CHG: ranging from 4,26,9441 to 5,664,610, CHH: ranging from 22,080,267 to 28,678,168) (Fig. 2a).
Next, we examined the dependence of coverage on nucleotide composition. Dinucleotide profiles suggest that EM-seq has even coverage and is minimally affected by different dinucleotide combinations. In contrast, WGBS libraries show enrichment for dinucleotides containing Gs and depletion for dinucleotides contains Cs (Fig. 2b), consistent with the damaging effect of sodium bisulfite on unmethylated cytosines [13]. These biases are more pronounced in libraries with 18 cycles of PCR (Fig. 2b), indicating that PCR amplification during WGBS library preparation favors unconverted DNA over converted DNA, due to the low melting temperature and thermostability of AT pairs [25, 26]. This can further negatively affect accurate estimation of methylation level (see below). Similar biases were not observed in EM-seq libraries, although the product of TET2/APOBEC3A conversion is the same as that of bisulfite conversion, which suggests that the polymerase used to amplify EM-seq libraries is superior to the one used for WGBS, in terms of avoiding the introduction of base biases. As previously shown for a library preparation kit similar to the one used in our study [27], WGBS libraries show enrichment for dinucleotides containing only A or T (Fig. 2b). We then decided to look directly at the dependence of coverage on GC content. Again, EM-seq libraries show more even profiles over the majority of the GC content range than do WGBS libraries (Fig. 2c). WGBS libraries have clearly higher coverage in AT-rich regions than in GC-rich regions (Fig. 2c), a known issue for bisulfite sequencing [20, 27] that is discussed above (Fig. 2b).
Overall, the quality metrics of our EM-seq libraries encouraged us to further explore whether it measures methylation more accurately than WGBS in Arabidopsis.
Arabidopsis DNA methylation levels measured by EM-seq and WGBS
We compared levels of CG, CHG, and CHH methylation measured by EM-seq and by WGBS and noted that EM-seq measured DNA methylation levels are lower (Fig. 3a-c, Table S2), even if background noise is considered (see Fig. S2). This is consistent with the previous results obtained by the manufacturer for Arabidopsis [24]. Total DNA methylation levels estimated from EM-seq data (Fig. 3d) are close to the levels previously detected by liquid chromatography-mass spectrometry (LCMS) in Arabidopsis [24]. We observed that increasing the number of PCR cycles led to higher methylation levels in the respective WGBS libraries (especially for libraries with 18 cycles of PCR; Fig. 3a-d, Table S2) and reasoned that this likely reflects the above-mentioned preference of PCR amplification during WGBS library preparation for unconverted DNA (see Fig. 2b). To further test this hypothesis, we analyzed the correlation of methylation level with density of methylated cytosines in both EM-seq and WGBS. For this analysis we picked the best WGBS library (400ng input with 12 cycles of PCR) (see Fig. S2, Table S1c) and its EM-seq counterpart. As Fig. 3ef reveals, for CHG and CHH methylations, the differences in methylation levels between EM-seq and WGBS increases with cytosine methylation density, which is the expected result based on our hypothesis. Much less difference between EM-seq and WGBS was observed for CG methylation, possibly because CG methylation is more or less bimodal (either completely unmethylated or completely methylated) [8], and thus the PCR bias in WGBS library preparation toward methylated CG sites would have less influence on CG methylation percentages. Nonetheless, CG methylation levels still appear to be moderately over-estimated by WGBS due to the selective damage of DNA containing unmethylated cytosines by sodium bisulfite [13] (see also Fig. 2bc).
We next plotted both EM-seq and WGBS data across all five Arabidopsis chromosomes (Fig. 4a). In general, methylation levels reported by WGBS are higher than those from EM-seq, especially in the case of CHH methylation, where some WGBS libraries made under suboptimal conditions (e.g. higher number of PCR cycles) suffer from severe over-estimation of methylation in the euchromatic arms of the chromosomes. Interestingly, Fig. 4a also reveals that methylation levels measured by EM-seq are higher in pericentromeres than those measured by WGBS. In the next section, we explore this further using differentially methylated region (DMR) analysis.
Arabidopsis has been shown to have two distinctive DNA methylation patterns: CG methylation in the body of protein-coding genes and all three types of methylation (CG, CHG, and CHH) in repetitive DNAs such as transposable elements (TEs) [8, 9, 28-30]. In the past, metaplot analysis by WGBS has always reported some residual non-CG methylation inside gene bodies that were indistinguishable from noise [8, 9] due to the background non-conversion issues associated with bisulfite conversion (Fig. 1c, Fig. S2). Since we now know that EM-seq has much lower background than WGBS, we ran the same metaplot analysis with EM-seq data and observed much reduced CHG and CHH methylation levels across gene bodies (Fig. 4b). The observed levels were still an order of magnitude higher than the pure background noise that can be inferred from chloroplast methylation, especially in the case of CHG (1.16% CHG methylation on average over gene body compared to 0.22% background) (Fig. S2, Table S1c), consistent with the known presence of low levels of non-CG methylation in gene bodies [31]. In terms of methylation in TEs, EM-seq produces similar metaplot profiles as WGBS, albeit with lower levels (especially for CHG and CHH (Fig. 4c), which is expected since WGBS tends to overestimate CHG and CHH methylations (Fig. 3, Table S2)).
We also compared the methylation patterns and levels in the chromosomal plots and metaplots containing only the best WGBS library (400ng input with 12 cycles of PCR) and its corresponding EM-seq library (Fig. S3). As expected, the methylation differences between this pair of libraries were smaller than the differences when other WGBS libraries are included in the comparison. Nevertheless, the basic patterns are the same as described above (see Fig. 4).
Since CHG and CHH methylations are maintained by RNA-directed DNA methylation (RdDM) [29], we looked at methylation in genomic regions bound by Polymerase V (PolV), which are often used as a proxy for RdDM target loci [32, 33]. CHG and CHH methylations over the PolV ChIP-seq peaks were elevated to various extents in different WGBS libraries, while all the EM-seq libraries show similar levels (Fig. 4d).
As an example of a gene with a well-studied methylation pattern, we looked at methylation in the FLOWERINGWAGENINGEN (FWA) locus, a target of RdDM [34-36]. While methylated cytosines in non-CG contexts were detected at low levels throughout the FWA gene in almost all of the WGBS datasets, EM-seq data clearly shows that non-CG methylation only exists in the promoter/beginning of coding sequence (CDS) region of FWA (Fig. S4a), where the known patch of RdDM is known to occur. Even when using only data from the best WGBS library (400ng input with 12 cycles of PCR), we still see trace amounts of non-CG methylation downstream of the promoter/beginning of CDS region of FWA; the same places show no non-CG methylation in EM-seq data (Fig. S4b).
Differentially methylated region analyses in EM-seq and WGBS
We performed pairwise differential methylation region (DMR) analysis both within the various datasets of EM-seq or WGBS and across EM-seq and WGBS datasets. Orders of magnitude fewer DMRs are called within the EM-seq libraries than within WGBS libraries (Fig. S5ab). A larger number of DMRs arose in comparisons between the EM-seq libraries made from the least input DNA amount and the most input DNA amount and between the EM-seq libraries made with the lowest number of PCR cycles and the highest number of PCR cycles (Fig. S5a). The same trends were observed in comparisons between WGBS libraries, although the WGBS comparisons produce much larger numbers, especially in the case of CHG and CHH methylation (Fig. S5b).
When comparing called DMRs between EM-seq libraries and WGBS libraries made from the same amount of input DNA and with the same number of PCR cycles, we noticed that there are many more WGBS hyper-DMRs (higher methylation in WGBS libraries) than EM-seq hyper-DMRs (Fig. 5a-c). WGBS libraries with 18 cycles of PCR were clearly the outliers, since they tended to have more WGBS hyper-DMRs than other conditions, and the situation is made worse by combining 18 cycles of PCR with 400ng of input DNA (Fig. 5a-c). Therefore, when making WGBS libraries, excess PCR amplification should be avoided, especially if starting with plenty of DNA. There are 7.94 (4123/519) times as many WGBS hyper-CG DMRs as EM-seq hyper-CG DMRs, while the numbers for CHG and CHH are 405.99 (110834/273) and 802.81 (660713/823) times respectively, suggesting that WGBS has more enrichment of CHG and CHH hyper-DMRs than CG hyper-DMRs (Fig. 5d, Table S3). This fits with our previous finding that CHG and CHH methylation are more over-estimated by WGBS than is CG methylation (Fig. 3, 4). For EM-seq hyper-DMRs, we saw some variation in DMR numbers among different library preparation conditions, for example more hyper-DMRs were seen between libraries with 12 cycles of PCR (Fig. 5a-c), however we suspect that this is rather due to the variation in WGBS libraries than the difference between WGBS and EM-seq libraries (since WGBS libraries have much higher variation among themselves than do EM-seq libraries (Fig. S5ab)).
We next plotted the methylation levels in the defined EM-seq and WGBS hyper-DMRs (Fig. 5e). Interestingly, EM-seq hyper-CG and CHH DMRs on average have higher methylation levels than WGBS hyper-CG and CHH DMRs, respectively. The methylation level in EM-seq hyper-CHG DMRs is lower than that in WGBS hyper-CHG DMRs, which is the opposite of what is observed in EM-seq hyper-CG and CHH DMRs (Fig. 5e). There are very few EM-seq hyper-CHG DMRs, and many of them are obtained from comparison of EM-seq and WGBS in two conditions (400ng input, 6 and 12 cycles) (Fig. 5b), which could skew the result. Furthermore, most of the WGBS hyper-CHG DMRs are in pericentromeric heterochromatin regions (Fig. S5c) with high levels of methylation (see Fig. 4a). GC content analyses in DMRs indicate that EM-seq hyper-DMRs have significantly lower GC content than WGBS hyper-DMRs (Fig. 5f). One extreme case is the mitochondria chromosome (chrM), which has a ~10% higher GC content than the five nuclear chromosomes (Fig. S6a). The relative difference in methylation levels between EM-seq and WGBS in chrM is larger than that in other chromosomes (Fig. S6b), and in fact the majority of the chrM is called as WBGS hyper-DMRs (Table S4). The methylation levels in chrM are generally very low as determined by EM-seq (Fig. S6b), which fits with the previous observation that WGBS hyper-DMRs tend to have lower methylation levels (Fig. 5e). We reasoned that in WGBS hyper-DMRs in nuclear chromosomes there are likely many sites that should have no or very low methylation (like chrM); this would make the average methylation levels in WGBS hyper-DMRs low, except for WGBS hyper-CHG DMRs for above-mentioned reasons (see Fig. 5e, Fig. S5c).
Since GC content also greatly affects coverage (Fig. 2c), we wondered if over- (mainly CHG and CHH) and under-estimating (mainly CG) methylation by WGBS could be linked to coverage. First, we generated heatmaps and coverage plots of all the libraries across PolV ChIP-seq peaks, because these are the places with large increases in CHG and CHH methylation in WGBS libraries (Fig. 4d). We found that for EM-seq libraries, although coverage fluctuates, the ranges are typically quite small. On the other hand, there is a significant increase in coverage coinciding with the center of PolV ChIP-seq peaks for all WGBS libraries (Fig. S7). A reasonable explanation for this is that PolV binding sites are targets of RdDM and contain methylated CHGs and CHHs that are not converted by bisulfite treatment; they therefore become better templates for PCR amplification (see data from previous sections) and gain higher coverage than their surrounding regions. Therefore, methylation levels and coverage are positively correlated in this case. The majority of the EM-seq hyper-CG DMRs are located within pericentromere heterochromatins and are highly methylated (Fig. 4a, Fig. 5e, Fig S8a), but occasionally they can be found in euchromatin regions and within genes (Fig. S8b). We chose the best WGBS library (400ng input with 12 cycles of PCR) and its corresponding EM-seq library and analyzed their coverage across EM-seq hyper-DMRs (Fig. S8c-e). Interestingly, WGBS coverage spikes in EM-seq hyper-DMRs as well, and EM-seq coverage again shows only small fluctuations. The low GC content of EM-seq DMRs (Fig. 5f) could be the cause of high coverage in WGBS libraries (see Fig. 2c). Moreover, according to Fig. 2c, in regions where GC content is low (~30% or less, approximately the range of GC content found in EM-seq hyper-DMRS, see Fig. 5f), a further reduction in GC content (e.g. that caused by bisulfite treatment) will induce a sharp increase in WGBS coverage. This effect likely outweighs the PCR preference for unconverted DNA and causes WGBS to under-estimate methylation levels in these regions.
Methylation differences between Arabidopsis leaves and flowers detected by EM-seq
We applied the EM-seq method to analyze methylation differences between Arabidopsis leaves and flowers. We used 150 ng input DNA and 6 cycles of PCR, and generated 4 replicates for each tissue. Overall, CG and CHG methylation levels were slightly higher in flowers than in leaves, and CHH methylation was about the same in both tissues (Fig. 6a). Metaplots in genes reveal that there are very small differences between leaves and flowers in gene body methylation levels – CG plots are almost identical and CHG and CHH are close to noise levels in both tissues (Fig. 6b). For TEs, CG is very slightly increased and CHH is very slightly decreased in flowers compared to leaves (Fig. 6c), and the same trends can be observed in pericentromeric heterochromatins in chromosomal methylation plots (Fig. 6d). The most striking finding from these analyses is that CHG methylation is significantly higher in flowers than in leaves (Fig. 6cd). Consistent with this, a much larger number of flower hyper-DMRs are called in the CHG context than in CG or CHH (Fig. 7a, Table S5). Flower hyper-CHG DMRs are located mainly in pericentromeric heterochromatins and not inside or close to genes (Fig. 6d, Fig. 7bc). DMRs for non-CG methylations are not enriched within genes, as non-CG methylations are usually not found in the genes (Fig. 6b, Fig 7c).
CHG methylation is mainly mediated by CHROMOMETHYLASE2 (CMT2) and CMT3 via a self-reinforcing loop involving histone H3K9 methylation [22, 37, 38]. Indeed, when examining the transposons in heterochromatin regions (that have flower hyper-CHG DMRs) we observed a higher level of H3K9me2 by ChIP-seq in flowers than in leaves, whereas there was very little change in the H3K9me2 levels of euchromatic TEs that did not show flower hyper-CHG DMRs (Fig. 7d). We note that the absolute level of H3K9me2 is much higher in heterochromatin regions than in chromosome arms (Fig. 7d), which is a characteristic of the Arabidopsis epigenome [39]. These results are consistent with a more active CHG and H3K9me2 self-reinforcing loop in flowers affecting heterochromatic TEs.
Although the overall differences in CG methylation within gene bodies are minimal between leaves and flowers (Fig. 6b), hyper-CG DMRs from both flowers and leaves were enriched in gene exons and introns (Fig. 7c; see an example in Fig. S9a). Indeed, when we plot CG methylation on genes with hyper-CG DMRs in either leaves (Fig. S9b, left panel) or flowers (Fig. S9b, right panel), we observe clear differences. Previous studies have shown that gene body-methylated genes in plants are often house-keeping genes, constitutively expressed, and exhibit moderately high expression levels [28, 40, 41]. We found that majority of the genes with hyper-CG methylations in either leaves or flowers are differentially expressed between leaves and flowers (Fig. S9c). However, this does not seem to be specific, as we obtained similar results when looking at randomly selected genes (data not shown). This and the fact that both upregulated and downregulated genes in both tissues can show increased body CG methylation (Fig. S9c) suggests that gene body methylation does not directly regulate the expression of these genes.