Principle of SCA-seq
Recently, there has been an increasing interest in applying methylation labeling and nanopore sequencing for the analysis of chromatin accessibility at the single-molecule level5–10. In this study, we have developed SCA-seq to study chromatin spatial density by updating the 2D chromatin accessibility map to the three-dimensional space. (Fig. <link rid="fig1">1</link>a-1) After cell fixation, we used a methyltransferase enzyme (EcoGII or M.CviPI) to mark accessible chromatin regions artificially. (2–3) After the chromatin accessibility information was preserved as GpC methylation marks, we conducted digestion and ligation steps using chromatin conformation capturing protocols, relying on proximity ligation to suture together multiple linearly distant DNA fragments that happen to be close to each other in three-dimensional space. (4) Then, we performed the designed DNA extraction protocol to get the pure and large DNA fragments. (5) The DNA fragments that carried chromatin accessibility, methylation marks, and three-dimensional conformation information were sequenced using the nanopore method and analyzed in our house pipeline (Fig. 1a). The conventional NGS-based chromatin conformation protocols only captured the interaction between two genomic loci (Fig. 1a − 4 blue square). Unlike the conventional protocol, the proximity ligation in SCA-seq, not limited to the first-order ligation, can occur multiple times in one concatemer (genome fragments fixed together as a cluster), informing about the high-cardinality genome conformation (Fig. 1a,b). Compared with the competing techniques Trac-loop11, Hi-CAR13 and NicE-C15, which also captured the accessible chromatin conformation, the SCA-seq could reserve more multi-omics information, for example, CpG methylation epigenetic marks, chromatin inaccessibility, and high-order chromatin conformation (Fig. 1b).
First, we experimentally determined the feasibility of SCA-seq. In the methylation reactions, the most suitable methyltransferases, EcoGII and M.CviPI, generated the artificial modifications m6A and m5C(GpC), which are rarely present in the native mammalian genomes16 17. Our previous research showed that EcoGII effectively labels accessible chromatin owing to the high density of adenine in the genome9. However, by using EcoGII the labeled high-density m6A modification either blocked or impaired the activity of the restriction enzymes (Sfig 1a, b). To solve this problem, we selected the m6A-dependent restriction enzyme DpnI that preferentially digests highly methylated DNA containing methylated adenine and leaves blunt ends. However, the m6A-dependent digestion generated biased digestion of the highly methylated accessible chromatin, and the blunt ends were not ligated efficiently. We then tried another approach and used M.CviPI that methylates GpCs (m5C) on the accessible chromatin, and these marks occur four times less frequently than adenosine. In the following steps, DpnII and other enzymes (without GC pattern in the recognition sites) efficiently cut both GpC methylated and unmethylated DNAs (Sfig 1c,d,e). It should be noted that the m5C base-calling algorithm has been gradually improved and is now widely used in nanopore sequencing18. Considering the unbiased digestion, M.CviPI might be a better choice in SCA-seq than EcoGII/DpnI. Next, we analyzed the sequencing data and compared them with those obtained using previous technologies.
Sca-seq Accurately Identifies Accessible Chromatin And Methylation Marks At Single-molecule Resolution In Two-dimensional Space
Our work was based on the concepts of nanoNOME-seq, SMAC-seq, and Fiber-seq5–10, which use either M.CviPI or EcoGII methyltransferases to label chromatin-accessible regions with methylation sites. Our previous experiments9 and validations of the results against published data confirmed the effectiveness of the methyltransferase-mediated labeling, showing technological advantages of the complex genome alignment and single-molecule resolution. As the SCA-seq generated the discontinuous genomic segments by ligating fragments (Fig. 1), which might affect the data processing, we validated the accuracy of the methylation calling and methyltransferase labeling.
First, in the validation of the methylation calling, we performed initial quality control of the sequencing data for the HEK293 cell line. We generated 129.94 Gb (36.9× coverage) of mapped sequencing data with an N50 read length of 4,446 bp. To get the methylation information from the nanopore data, we adopted the modified Nanopolish approach7 as a methylation caller with considerable success (AUC CpG = 0.908, GpC = 0.984), as others had successfully used it for GpC/CpG calling. In the further validation of the methylation calling, we parallelly performed the gold standard whole-genome bisulfite sequencing, which also correlated well with the Nanopolish results (Sfig 2), supporting the accuracy of the methylation caller. In addition to the methylation calling accuracy, there might also be some ambiguity existing between the native cytidine methylations and artificially labeled cytidine methylations. We first checked the native or false-positive GpC regions in the unlabeled genome, which were also very rare and only accounted for 1.8% (Sfig 7a,b). Secondly, the GCG pattern in the genome might cause the ambiguity of native methylation CpG or accessibility representing GpC, so we also excluded both CpG and GpC methylations from the GCG context (5.6% of GpCs and 24.2% of CpGs) to get the unbiased methylation information. In conclusion, our bioinformatic pipeline was feasible for detecting the native CpG methylations and artificially labeled GpC methylations.
We next assessed the potential of SCA-seq to reveal simultaneously endogenous methylation and chromatin accessibility. As the ATAC-seq and DNase-seq were gold standards for detecting chromatin accessibility, we compared the SCA-seq labeling accuracy with ATAC-seq/DNase-seq globally and locally. Of the 55429 accessibility peaks called from SCA-seq data on the whole genome, 74.6% overlapped with those observed in ATAC-seq and DNase-seq (Fig. 2a). The signal correlation in common peaks was approximately 0.5. The result is close to the previous publication 7. The overlapped regions showed a higher peak calling confidence than the non-overlapped regions (Fig. 2b). In the local comparison, SCA-seq also showed peak patterns around the ATAC-seq-identified peaks (Fig. 2c). Moreover, we used computationally predicted binding sites of the CCCTC-binding factor (CTCF), which are a well-documented open chromatin indicator19. As expected, CpG methylation level decreased, and GpC accessibility increased around the CTCF-binding sites (Fig. 2d). At active human transcription start sites with high expression (TSSs), “open” chromatin regions hypersensitive to transposon attack were observed in ATAC-seq/DNase-seq. SCA-seq showed similar nucleosome depletion patterns around TSSs (Fig. 2e). Inactive TSSs were less accessible than active TSSs (Fig. 2e). In the detailed examination of the genome region, the SCA-seq showed the expected nucleosome pattern around the various epigenetic marks, for example H3K4me3(active), H3K27ac(active) (Fig. 2f and Sfig 3). To further explore the labeling efficiency of the accessible regions, we also compared the signal enrichment fold around TSS among different methods (Sfig 4a). All these methods and the competitor Hi-CAR showed the expected nucleosome depletion pattern around TSSs. SCA-seq had a lower enrichment fold than the other two methods because the other methods enriched the accessible chromatin by PCR with the loss of the inaccessible chromatin. We then explored the influence of various factors on the labeling results, for example dose and sequence depth. The relationship between the dose and M.CviPI treatment effect demonstrated superior efficiency of the 3h treatment, comparing with 15min, 30min treatment (Sfig 4b,c). The sequence depth 8x is the minimal requirement to resolve the chromatin accessibility in SCA-seq (Sfig 4d,e). Overall, SCA-seq reliably estimated chromatin accessibility at the genome level.
Sca-seq Reveals High-order Chromatin Organization
The SCA-seq also reserved the genome spatial structure besides the methylation information; therefore, we validated the genome spatial structure in this section. First, we analyzed the basic statics, for example, contact distance and cardinality of SCA-seq. As the SCA-seq ligated the multiple fragments together, revealing the multiplex-nature chromatin conformation (Fig. 1a), we processed non-singleton chimeric reads into genomic segments and assembled in silico paired-end tags (PETs) in order to compare with Hi-C carrying paired loci. The segment median length was approximately 700 bp (Sfig 5a). Among the informative intra-chromosome PETs, 0.1% of the PETs (contact distance) were < 150 bp; 0.3% of them ranged from 150 to 1,000 bp; 24.5% were 1,000–200,000 bp; and 75.1% were > 200,000 bp (Sfig 6b). Unlike Hi-C, SCA-seq, derived from pore-C, revealed the multiplex nature of chromatin interactions. As for the intra-chromosome interactions, 14.7% of the reads contained two segments (cardinality = 2); approximately 14.5% of the reads contained 3–5 segments (cardinality = 3 ~ 5); and 5.4% of the reads had more than five segments (cardinality > 5) (Sfig 6a). As expected, most of the contacts from the reads with fewer segments appeared to have closer contact distance. The contacts from the reads with more segments appeared to have more distal interaction (Sfig 6c,d). Compared with the competitor method Trac-loop, HiCAR, and other methods, the SCA-seq could also resolve more high-cardinality chromatin conformation and distal interactions (Sfig 6e,f,g and Fig. 3b).
We then compared the SCA-seq with the gold standard Hi-C in the false positive rate, reproducibility, and ability to figure out the genome spatial structure. False positivity rates of SCA-seq and HiC, inferred from hybrid PETs that consisted of mitochondrial DNA and genomic DNA, were similar (Sfig 5b). The compartment score correlation between SCA-seq replicates and pore-C replicates was approximately 0.94 (Sfig 5d). Thirty million reads were enough to resolve the A/B compartment and topologically associating domain (TAD) structures (Sfig 5c). In further analysis of the genome structure, SCA-seq with our improved algorithm revealed genome organization similar to the one detected using in situ Hi-C. Side-by-side visualization of interaction heatmaps, loops, TAD boundaries, and A/B compartments obtained using SCA-seq and Hi-C showed equivalent genome organizations (Fig. 3a,c,d,e,f,g,h). The correlation of the eigenvector and insulation scores were 0.91 and 0.84. We had an interesting finding here that sixty-six percent of the concatemers were compartment-specific (all the fragments in one concatemer belonged to A/B compartments), and 34% were non-specific. Overall, these results suggested that SCA-seq successfully resolved the multiplex nature of chromatin interactions.
We also could observe the chromatin conformation with the specific binding pattern from the SCA-seq, for example, the CTCF binding pattern. As previous publications mentioned, the CTCF occupation could lead to the short inaccessible region (~ 50bp) with the methyltransferase labeling 20,21, helping us to determine the CTCF binding status on the CTCF motif loci. As expected, the SCA-seq also could resolve the transcription factor-specific footprint and nucleosome footprint similar to the previous publication (Fig. 3i). Based on specific accessibility patterns, we classified the chromatin interaction concatemers containing CTCF motifs into two classes, the concatemer with a CTCF footprint and without a CTCF footprint (Fig. 3j). Considering the relationship between CTCF binding and chromatin structure formation 22, we plotted the concatemer cardinality and interaction distance (Fig. 3k,l). We found that the CTCF binding resulted in higher cardinality and further interactions than the non-CTCF binding, suggesting that CTCF binding help form the more complex structure. As a recent publication 21, the methyltransferase accessibility pattern also could indicate other transcription factors’ footprints, enlightening the further exploration of chromatin conformation with other transcription factors by SCA-seq. Therefore, SCA-seq could help us to subgroup the chromatin interaction concatemers and study the effects between the chromatin conformation and protein binding.
Sca-seq Reconstructs Chromatin Accessibility In Three-dimensional Space
Given the high cellular heterogeneity in the genome space, our spatial chromatin status analysis mainly relied on the single-molecule pattern, which needs high sensitivity and specificity. Single-molecule base modification calling was performed as described previously7. Moreover, we also determined the enzyme labeling efficiency, which was 79–88%, based on the CTCF motifs and spike-in control measurements (Sfig 7a,b,c). Then, we filtered the fragments using the binomial test further to minimize the false positive accessible/inaccessible status (see Methods). Beyond our expectation, the accessible and inaccessible DNAs were ligated together in SCA-seq (Fig. 4a), suggesting the high heterogeneity of the spatial neighboring DNAs. As our observation, the overall genome concatemer calculations showed that 29% of the genome concatemers maintained inaccessibility on all enclosed segments (accessible/inaccessible segment ratio < 0.1). Furthermore, 62.2% of genome concatemers had parts of accessible segments (hybrid concatemers), and only 8.8% maintained all segments as accessible (accessible/inaccessible segment ratio > 0.9) (Fig. 4b). An example region was shown on side with 1D genome feature tracks to demonstrate the promoter-enhancer spatial interactions, accessibility and CpG methylation at single molecule resolution (Fig. 4a). nanoNOME-seq that labels chromatin accessibility in single molecules also confirmed the existence of hybrid concatemers (Sfig 7d). To explore if the concatemer accessibility status was related to spatial location, we plotted the inaccessible concatemers and hybrid concatemers on the 2D contact heat map. We found that the hybrid concatemers tended to gather around the TAD boundary and contain more distant connections (Fig. 4b-heatmap). By plotting the accessible/inaccessible ratio with concatemer cardinality or interaction distance, the hybrid concatemers had more fragments than the inaccessible concatemers, also implying their distal and high-cardinality interaction preference (Fig. 4c and Sfig 8e). The A/B compartments are usually related to chromatin accessibility and regions of gene expression 1. In our study, we found that the B compartment (negative eigenvector) had significantly more inaccessible concatemers (accessible/inaccessible segment ratio 0.37 vs. 0.4, P < 2.2 × 10− 16) than the A compartment (positive eigenvector) (Fig. 4d and Sfig 8a,b). Due to the linkage between chromatin accessibility and transcription factor binding23, we further investigated the enhancer and promoter contacts on chromosome 7, 30.3% of which had accessible–accessible status; 18.5%, accessible–inaccessible status; and 51.2%, inaccessible-inaccessible status (Fig. 4e). The frequency of contacts with accessible enhancer/promoter highly correlates with gene expression levels (Fig. 4f,g and Sfig 8c,d), supporting the transcription model in which active enhancer initiated promoters by contaction24. However, it is worth noting that 51.2% of enhancer-promoter interactions were independent of the chromatin-accessible status, suggesting that enhancer-promoter spatial interaction was not the only factor to initiate the active transcription. Due to the close relationship between spatial accessibility and transcription activity, we developed an algorithm to calculate the activation power of the genome loci (Sfig 12), revealing their potential to activate transcription. The locus with higher activation power might enhance the genome accessibility of more contacting loci. Overall, spatial contacts and contact accessibility might coordinately regulate gene expression.
Spatial chromatin accessibility is dramatically changed in CTCF +/− HEK293T cells
In mammals, the highly conserved zinc finger protein CTCF is thought to serve as an insulator protein that prevents communication between enhancers and promoters and thereby, regulates chromosome folding25 26 27. We suspected that CTCF might also insulate the accessibility interference in the spatial contacts. We examined the impact of permanent CTCF reduction by CTCF allele-specific knock out (CTCF +/−) in HEK293T cells and observed the loop losses and TAD boundary changes, which is similar to previous publications 22,28 (Sfig 9). We and others also found that the CTCF reduction changed genome accessibility, with 3457 upregulated and 932 downregulated accessibility regions (Sfig 10a,b)29. Among the changed regions, only 2% overlapped with the CTCF motifs (CTCF regions) and 98% did not contain CTCF motifs (non-CTCF regions). For example, the chromatin accessibilities around CTCF motifs (CTCF regions) and TSSs (non-CTCF regions) were increased (Sfig 10c,d). We further studied whether non-CTCF regions' chromatin accessibility was altered by the spatial contacts with CTCF regions. Overall, the number of accessible concatemers increased substantially, and most of them had more than one accessible chromatin segment (Fig. 5a). In the contact maps with accessibility plots, the chromatin spatial accessibility had a line pattern in wild-type cells (Fig. 5b). The accessible-accessible/accessible-inaccessible (red/yellow dots) pairwise contacts were arranged in lines, indicating that the loci contacted specific accessible chromatin regions, such as CTCF motifs and ATAC peaks (Fig. 5b and Sfig 11). In the CTCF +/− cells, chromatin was activated globally on the pairwise contact, and the spatially activated chromatin changed the line patterns (Fig. 5b and Sfig 11). Then we further investigated which spots on the contact map were significantly changed by dividing the contact map into 100kb squares. In the static analysis of the contact map, we found that these significantly changed locations were distributed with the CTCF ChIP signal (Fig. 5c, cut-off p < 0.1 for bin, Pearson correlation p < 2.6*e− 16). Then we increased the resolution to 5kb and summarized the total counts of the accessible-accessible pairwise contacts. We found that the increased accessible contacts in CTCF +/− cells also significantly correlated with the CTCF ChIP signals (Pearson correlation, p < 2.6*e-16). By taking a closer look into the CTCF-specific contacts, we overlayed the CTCF loci together and plotted the interaction contacts with their accessibility information. The accessibilities of these CTCF-contacted loci were significantly increased in the CTCF +/− cells. All these results suggested that the CTCF loss might increase the chromatin accessibility on the loci spatially contacting with CTCF loci. These findings were similar to the recently proposed theory “activity by contact” model about CTCF 30. On the other hand, we used the CTCF siRNA know down to cross-validate our findings, producing similar results (Sfig 13).
We also found another interesting finding of the increased accessibility of long-range interaction in CTCF +/− cells. The activation folds were much larger in the long rang interaction than in the short-range interaction (P < 2.6 × 10− 16, t-test) (Sfig 10e). The sparser data of long-range interactions may bias this finding. However, this observation might provide a clue to the CTCF insulator function conundrum in that CTCF possibly prevents impropriate chromatin activation. This finding further confirmed the close relationship between CTCF loss and spatial activation. Overall, with SCA-seq, we found that CTCF loss led to contact-based chromatin activation, clarifying the spatial insulation function of CTCF, which needs to be explored further 31.
Cpg Methylation On Orphan Cpg Island
CpG islands (CGI), the widespread features of vertebrate genome, were associated with ~ 50% of gene promoters (pCGI). pCGI control the gene transcription by affecting the neighboring promoters with methylation-related chromatin properties. Some CGIs are located close to the enhancers (eCGI). In addition, the other thousands of orphan CGIs (oCGI), distal(1kb) to the promoters and enhancers, were barely unknown (Fig. 6a) 32,33. In the SCA-seq data indicating the high-order interaction and CpG methylation, we found 76418 reads overlapped with CGIs on Chr7, and the most oCGIs were usually close to the CTCF binding motifs and active histone markers, such as H3K27ac and H3K4me3, suggesting their active regulatory functions (Fig. 6b). By examining the methylation status on reads, as expected, these read segments demonstrated lower CpG methylation and higher chromatin accessibility (GpC methylation), which further supports their roles in activating the genes (Fig. 6b). In the in vitro assay of previous research, oCGIs act as tethering elements that promote topological interaction between enhancers and distally located genes to regulate gene expression. In the SCA-seq, we observed the 60% oGCI tethered at least one type of regulatory elements, such as enhancers, CTCFs, and promoters (Fig. 6c). After normalizing by the total number of regulatory elements, we found that the oCGI preferably interacted with CTCF and promoters, comparing with non-CGI events (Binomial test p < 2.6e-16 control as background frequency) (Fig. 6d). Further explored analysis of each concatemer showed that the 39% oCGI-enhancer concatemers and oCGI-CTCF concatemers included more than two enhancers or CTCF motifs. In contrast, most of oCGIs tethered to one promoter (Fig. 6e). These data supported the previous research that the oCGI tethered enhancer regulates the gene expression 32. However, we found the CpG methylation on oCGI weakly correlated with the promoter’s accessibility and CpG methylation in regression analysis, whose mechanism of regulation need to be further studied. Overall, the oCGI could tether the enhancers and CTCF motifs to communicate with promoters, promoting a further understanding of oCGI regulatory functions.