Spatial chromatin accessibility sequencing resolves next-generation genome architecture

doi:10.21203/rs.3.rs-2314753/v1


Article


This work is licensed under a CC BY 4.0 License

Version 1


As the genome has a three-dimensional structure in intracellular space, epigenomic information also has a complex spatial arrangement. However, the majority of epigenetic studies describe locations of methylation marks, chromatin accessibility regions, and histone modifications in the linear dimension. Proper spatial epigenomic information has rarely been obtained. In this study, we designed spatial chromatin accessibility sequencing (SCA-seq) to reveal the three-dimensional map of chromatin accessibility and simultaneously capture the genome conformation. Using SCA-seq, we simultaneously disclosed spatial regulation of chromatin accessibility (e.g. enhancer-promoter contacts), CpG island methylation and spatial insulating functions of the CCCTC-binding factor. We demonstrate that SCA-seq paves the way to explore epigenomic information in the three-dimensional space and extends our knowledge in next-generation genome architecture.

Biological sciences/Biotechnology/Genomics/Epigenomics

Biological sciences/Biotechnology/Sequencing

Pore-C

HiC

chromatin accessibility

accessible chromatin

inaccessible chromatin

interaction

methylation

m6A

GpC

CpG

The linear arrangement of DNA sequences usually gives an illusion of a one-dimensional genome. However, the DNA helix is folded hierarchically into several layers of higher-order structures that undergo complex spatial biological regulation. The link between gene transcription activity and genome structure was established following an observation that active gene expression proceeds in the decondensed euchromatin, and silenced genes are localized in the condensed heterochromatin. Accessibility of chromatin acts as a potent gene expression regulatory mechanism by preventing access of regulatory factors to condensed chromatin domains. Although this model is attractive, it is simplified in that the genome accessibility is considered only in the linear dimension¹. However, the genome has a three-dimensional structure inside cells, so the accessibility of genome regions also has similar spatial complexity. For example, promoter accessibility could be regulated by contact with enhancers or silencers. Therefore, sophisticated tools are necessary to obtain information about genome accessibility in three dimensions to resolve the relationship between chromatin activation and genome structure.

Most of the tools designed to study chromatin accessibility in the linear form are based on the vulnerability of open/decondensed chromatin to treatment with enzymes such as DNase, micrococcal nuclease (MNase), and transposase. In a pioneering study, Song and Crawford used DNase-seq to establish the relationship between DNase-hypersensitive regions and open chromatin². MNase-seq relies on a similar concept³. Subsequent studies simplified experiments on chromatin accessibility by taking advantage of the ability of mutant transposase to insert sequencing adapters into open chromatin domains⁴. All these methods rely on statistical calculations of chromatin domain accessibility based on the frequency with which the enzymes attack the accessible genome. To understand the heterogeneity of chromatin accessibility in vivo, scientists have in recent years become interested in chromatin structure at single-molecule resolution. They developed approaches such as methyltransferase treatment followed by single-molecule long-read sequencing⁵, single-molecule adenine methylated oligonucleosome sequencing assay⁶, nanopore sequencing of nucleosome occupancy and methylome⁷, single-molecule long-read accessible chromatin mapping sequencing⁸,⁹, and Fiber-seq¹⁰. In these approaches, decondensed genomes are methylated using methyltransferases and then directly sequenced on third-generation sequencing platforms (Nanopore, PacBio). These advanced methods offer a single-molecule view of the two-dimensional 2–15 kb long chromatin structures. However, chromatin has a higher-order organization, and the linearized two-dimensional map of chromatin accessibility does not fully reflect reality. Some advanced approaches, such as Trac-looping¹¹, OCEAN-C¹², and HiCAR¹³, can capture open chromatin and Hi-C¹⁴ information simultaneously by enriching accessible chromatin regions through a combination of transposons and proximity ligation.
However, these enrichment-based approaches lose single-molecule resolution and discard condensed chromatin regions, which limits their ability to capture dynamic changes in chromatin structure in three-dimensional space. Therefore, reconstructing the spatial accessibility of chromatin could advance our understanding of the interactive regulation of transcription and enable more spatially realistic studies of the genome.

Here, we developed a novel tool, spatial chromatin accessibility sequencing (SCA-seq), based on methylation labeling and proximity ligation. The long-range fragments carrying the chromatin accessibility, CpG methylation and chromatin conformation information were sequenced using nanopore technology. We mapped chromatin accessibility and CpG methylation to genome spatial contacts, and the heterogeneous chromatin accessibility in proximal interactions suggested complex genome regulation in addition to direct contacts between genome loci. We believe that SCA-seq may facilitate multi-omics studies of next-generation genome spatial structure.

Principle of SCA-seq

Recently, there has been increasing interest in applying methylation labeling and nanopore sequencing to the analysis of chromatin accessibility at the single-molecule level⁵⁻¹⁰. In this study, we developed SCA-seq to study chromatin spatial density by extending the 2D chromatin accessibility map into three-dimensional space. (Fig. 1a, step 1) After cell fixation, we used a methyltransferase (EcoGII or M.CviPI) to artificially mark accessible chromatin regions. (2–3) After the chromatin accessibility information was preserved as GpC methylation marks, we conducted digestion and ligation steps using chromatin conformation capture protocols, relying on proximity ligation to join multiple linearly distant DNA fragments that happen to be close to each other in three-dimensional space. (4) Then, we performed a tailored DNA extraction protocol to obtain pure, large DNA fragments. (5) The DNA fragments that carried chromatin accessibility, methylation marks, and three-dimensional conformation information were sequenced using the nanopore method and analyzed with our in-house pipeline (Fig. 1a). Conventional NGS-based chromatin conformation protocols capture only the interaction between two genomic loci (Fig. 1a, step 4, blue square). Unlike the conventional protocol, proximity ligation in SCA-seq is not limited to first-order ligation and can occur multiple times in one concatemer (genome fragments fixed together as a cluster), providing information about high-cardinality genome conformation (Fig. 1a,b). Compared with the competing techniques Trac-looping¹¹, HiCAR¹³ and NicE-C¹⁵, which also capture accessible chromatin conformation, SCA-seq preserves more multi-omics information, for example, CpG methylation marks, chromatin inaccessibility, and high-order chromatin conformation (Fig. 1b).

First, we experimentally determined the feasibility of SCA-seq. In the methylation reactions, the most suitable methyltransferases, EcoGII and M.CviPI, generate the artificial modifications m6A and m5C (GpC), which are rarely present in native mammalian genomes¹⁶,¹⁷. Our previous research showed that EcoGII effectively labels accessible chromatin owing to the high density of adenine in the genome⁹. However, the high-density m6A modification introduced by EcoGII either blocked or impaired the activity of the restriction enzymes (Sfig 1a,b). To solve this problem, we selected the m6A-dependent restriction enzyme DpnI, which preferentially digests highly methylated DNA containing methylated adenine and leaves blunt ends. However, the m6A-dependent digestion was biased toward the highly methylated accessible chromatin, and the blunt ends were not ligated efficiently. We then tried another approach and used M.CviPI, which methylates GpCs (m5C) on accessible chromatin; GpC sites occur four times less frequently than adenines. In the following steps, DpnII and other enzymes (without a GC pattern in their recognition sites) efficiently cut both GpC-methylated and unmethylated DNA (Sfig 1c,d,e). It should be noted that the m5C base-calling algorithm has been gradually improved and is now widely used in nanopore sequencing¹⁸. Considering the unbiased digestion, M.CviPI might be a better choice for SCA-seq than EcoGII/DpnI. Next, we analyzed the sequencing data and compared them with those obtained using previous technologies.

SCA-seq accurately identifies accessible chromatin and methylation marks at single-molecule resolution in two-dimensional space

Our work was based on the concepts of nanoNOME-seq, SMAC-seq, and Fiber-seq^5–10, which use either M.CviPI or EcoGII methyltransferases to label chromatin-accessible regions with methylation sites. Our previous experiments⁹ and validations of the results against published data confirmed the effectiveness of the methyltransferase-mediated labeling, showing technological advantages of the complex genome alignment and single-molecule resolution. As the SCA-seq generated the discontinuous genomic segments by ligating fragments (Fig. 1), which might affect the data processing, we validated the accuracy of the methylation calling and methyltransferase labeling.

First, in the validation of the methylation calling, we performed initial quality control of the sequencing data for the HEK293 cell line. We generated 129.94 Gb (36.9× coverage) of mapped sequencing data with an N50 read length of 4,446 bp. To get the methylation information from the nanopore data, we adopted the modified Nanopolish approach⁷ as a methylation caller with considerable success (AUC CpG = 0.908, GpC = 0.984), as others had successfully used it for GpC/CpG calling. In the further validation of the methylation calling, we parallelly performed the gold standard whole-genome bisulfite sequencing, which also correlated well with the Nanopolish results (Sfig 2), supporting the accuracy of the methylation caller. In addition to the methylation calling accuracy, there might also be some ambiguity existing between the native cytidine methylations and artificially labeled cytidine methylations. We first checked the native or false-positive GpC regions in the unlabeled genome, which were also very rare and only accounted for 1.8% (Sfig 7a,b). Secondly, the GCG pattern in the genome might cause the ambiguity of native methylation CpG or accessibility representing GpC, so we also excluded both CpG and GpC methylations from the GCG context (5.6% of GpCs and 24.2% of CpGs) to get the unbiased methylation information. In conclusion, our bioinformatic pipeline was feasible for detecting the native CpG methylations and artificially labeled GpC methylations.
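The GCG exclusion described above can be illustrated with a minimal sketch. It classifies each cytosine on the forward strand of a sequence as an unambiguous CpG site, an unambiguous GpC site, or an ambiguous GCG site that would be dropped. The function name `classify_cytosines` and the forward-strand-only simplification are assumptions made for illustration; this is not the pipeline used in the study.

```python
def classify_cytosines(seq):
    """Sort forward-strand cytosines by dinucleotide context.

    Hypothetical helper for illustration only.
    - "CpG": C followed by G, not preceded by G (native methylation signal)
    - "GpC": C preceded by G, not followed by G (accessibility signal)
    - "GCG": C flanked by G on both sides (ambiguous, excluded downstream)
    """
    seq = seq.upper()
    contexts = {"CpG": [], "GpC": [], "GCG": []}
    for i, base in enumerate(seq):
        if base != "C":
            continue
        prev_is_g = i > 0 and seq[i - 1] == "G"
        next_is_g = i + 1 < len(seq) and seq[i + 1] == "G"
        if prev_is_g and next_is_g:
            contexts["GCG"].append(i)
        elif next_is_g:
            contexts["CpG"].append(i)
        elif prev_is_g:
            contexts["GpC"].append(i)
    return contexts


# Positions are 0-based: C1 is a CpG, C5 is a GpC, C9 sits in a GCG.
example_contexts = classify_cytosines("ACGTGCATGCGA")
```

Only the positions listed under "CpG" and "GpC" would then be carried forward as methylation and accessibility signals, respectively.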

We next assessed the potential of SCA-seq to simultaneously reveal endogenous methylation and chromatin accessibility. As ATAC-seq and DNase-seq are gold standards for detecting chromatin accessibility, we compared the SCA-seq labeling accuracy with ATAC-seq/DNase-seq globally and locally. Of the 55,429 accessibility peaks called from SCA-seq data across the whole genome, 74.6% overlapped with those observed in ATAC-seq and DNase-seq (Fig. 2a). The signal correlation in common peaks was approximately 0.5, close to that in a previous publication⁷. The overlapping regions showed a higher peak-calling confidence than the non-overlapping regions (Fig. 2b). In the local comparison, SCA-seq also showed peak patterns around the ATAC-seq-identified peaks (Fig. 2c). Moreover, we used computationally predicted binding sites of the CCCTC-binding factor (CTCF), which are a well-documented indicator of open chromatin¹⁹. As expected, the CpG methylation level decreased and GpC accessibility increased around the CTCF-binding sites (Fig. 2d). At active human transcription start sites (TSSs) with high expression, “open” chromatin regions hypersensitive to transposon attack were observed in ATAC-seq/DNase-seq. SCA-seq showed similar nucleosome depletion patterns around TSSs (Fig. 2e). Inactive TSSs were less accessible than active TSSs (Fig. 2e). In a detailed examination of genome regions, SCA-seq showed the expected nucleosome pattern around various epigenetic marks, for example H3K4me3 (active) and H3K27ac (active) (Fig. 2f and Sfig 3). To further explore the labeling efficiency of the accessible regions, we also compared the fold enrichment of the signal around TSSs among different methods (Sfig 4a). All these methods, including the competitor HiCAR, showed the expected nucleosome depletion pattern around TSSs. SCA-seq had a lower fold enrichment than the other two methods because those methods enrich accessible chromatin by PCR at the cost of losing inaccessible chromatin.
We then explored the influence of various factors on the labeling results, for example dose (treatment duration) and sequencing depth. Comparing M.CviPI treatment durations showed that the 3 h treatment labeled more efficiently than the 15 min and 30 min treatments (Sfig 4b,c). A sequencing depth of 8× is the minimum required to resolve chromatin accessibility in SCA-seq (Sfig 4d,e). Overall, SCA-seq reliably estimated chromatin accessibility at the genome level.
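The global peak comparison reported above (the share of SCA-seq peaks that overlap ATAC-seq/DNase-seq peaks) amounts to an interval-overlap count. The sketch below shows one simple way to compute such a fraction, assuming half-open `(chrom, start, end)` intervals; the function name `fraction_overlapping` and the toy coordinates are hypothetical and are not taken from the study.

```python
def fraction_overlapping(query_peaks, reference_peaks):
    """Fraction of query peaks overlapping at least one reference peak.

    Peaks are (chrom, start, end) tuples with half-open coordinates.
    Hypothetical helper; a real analysis would typically use an
    interval tree or a dedicated genomics toolkit for speed.
    """
    reference_by_chrom = {}
    for chrom, start, end in reference_peaks:
        reference_by_chrom.setdefault(chrom, []).append((start, end))

    if not query_peaks:
        return 0.0

    hits = 0
    for chrom, start, end in query_peaks:
        candidates = reference_by_chrom.get(chrom, [])
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < ref_end and ref_start < end for ref_start, ref_end in candidates):
            hits += 1
    return hits / len(query_peaks)


# Toy data: two of the four query peaks overlap a reference peak.
toy_query = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200), ("chr2", 900, 950)]
toy_reference = [("chr1", 150, 250), ("chr2", 0, 120)]
toy_fraction = fraction_overlapping(toy_query, toy_reference)
```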

SCA-seq reveals high-order chromatin organization

SCA-seq also preserves the genome spatial structure in addition to the methylation information; therefore, we validated the genome spatial structure in this section. First, we analyzed the basic statistics of SCA-seq, for example, contact distance and cardinality. Because SCA-seq ligates multiple fragments together, revealing the multiplex nature of chromatin conformation (Fig. 1a), we split non-singleton chimeric reads into genomic segments and assembled in silico paired-end tags (PETs) for comparison with Hi-C, which captures paired loci. The median segment length was approximately 700 bp (Sfig 5a). Among the informative intra-chromosome PETs, 0.1% had a contact distance < 150 bp; 0.3% ranged from 150 to 1,000 bp; 24.5% were 1,000–200,000 bp; and 75.1% were > 200,000 bp (Sfig 6b). Unlike Hi-C, SCA-seq, derived from Pore-C, revealed the multiplex nature of chromatin interactions. For the intra-chromosome interactions, 14.7% of the reads contained two segments (cardinality = 2); approximately 14.5% of the reads contained 3–5 segments (cardinality = 3–5); and 5.4% of the reads had more than five segments (cardinality > 5) (Sfig 6a). As expected, most of the contacts from reads with fewer segments had shorter contact distances, whereas contacts from reads with more segments tended to be more distal (Sfig 6c,d). Compared with the competing methods Trac-looping, HiCAR, and others, SCA-seq could also resolve more high-cardinality chromatin conformations and distal interactions (Sfig 6e,f,g and Fig. 3b).
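The conversion of a multi-segment read into in silico PETs can be sketched as taking every pair of segments on the read. In the minimal example below, cardinality is simply the number of segments, and the contact distance of an intra-chromosome pair is measured between segment midpoints. The midpoint convention and the name `concatemer_to_pets` are assumptions for illustration; the study's own pipeline may define distance differently.

```python
from itertools import combinations


def concatemer_to_pets(segments):
    """Expand one concatemer into all pairwise in silico PETs.

    segments: list of (chrom, start, end) tuples found on a single read.
    Returns a list of (segment_a, segment_b, distance), where distance is
    the midpoint-to-midpoint distance for intra-chromosome pairs and None
    for inter-chromosome pairs (assumed convention).
    """
    pets = []
    for seg_a, seg_b in combinations(segments, 2):
        if seg_a[0] == seg_b[0]:
            mid_a = (seg_a[1] + seg_a[2]) // 2
            mid_b = (seg_b[1] + seg_b[2]) // 2
            distance = abs(mid_a - mid_b)
        else:
            distance = None
        pets.append((seg_a, seg_b, distance))
    return pets


# A toy read with three segments (cardinality = 3) yields three PETs.
toy_read = [("chr7", 1000, 1700), ("chr7", 250000, 250600), ("chr8", 500, 1200)]
toy_cardinality = len(toy_read)
toy_pets = concatemer_to_pets(toy_read)
```

A read of cardinality n yields n(n-1)/2 pairs, which is why high-cardinality reads can be overrepresented when multi-way contacts are flattened into pairwise contacts.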

We then compared SCA-seq with the gold standard Hi-C in terms of false positive rate, reproducibility, and ability to resolve the genome spatial structure. The false positive rates of SCA-seq and Hi-C, inferred from hybrid PETs that consisted of mitochondrial DNA and genomic DNA, were similar (Sfig 5b). The compartment score correlation between SCA-seq replicates and Pore-C replicates was approximately 0.94 (Sfig 5d). Thirty million reads were enough to resolve the A/B compartment and topologically associating domain (TAD) structures (Sfig 5c). In further analysis of the genome structure, SCA-seq with our improved algorithm revealed genome organization similar to that detected using in situ Hi-C. Side-by-side visualization of interaction heatmaps, loops, TAD boundaries, and A/B compartments obtained using SCA-seq and Hi-C showed equivalent genome organizations (Fig. 3a,c,d,e,f,g,h). The correlations of the eigenvector and insulation scores were 0.91 and 0.84, respectively. Interestingly, 66% of the concatemers were compartment-specific (all the fragments in one concatemer belonged to the same compartment, A or B), and 34% were non-specific. Overall, these results suggested that SCA-seq successfully resolved the multiplex nature of chromatin interactions.
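The notion of a compartment-specific concatemer can be made concrete with a short sketch. It assumes each segment has already been assigned the compartment eigenvector value of the bin it falls in, with positive values taken as compartment A and all other values as compartment B. The function names and the handling of zero values are illustrative assumptions.

```python
def is_compartment_specific(segment_eigenvectors):
    """True if every segment of a concatemer lies in the same compartment.

    segment_eigenvectors: eigenvector value of the bin containing each segment.
    Positive values are treated as compartment A, all others as B (assumed).
    """
    compartments = {value > 0 for value in segment_eigenvectors}
    return len(compartments) == 1


def compartment_specific_fraction(concatemers):
    """Fraction of concatemers whose segments all share one compartment."""
    if not concatemers:
        return 0.0
    specific = sum(is_compartment_specific(c) for c in concatemers)
    return specific / len(concatemers)


# Toy data: an all-A concatemer, an all-B concatemer, and a mixed one.
toy_concatemers = [[0.5, 0.2], [-0.3, -0.1, -0.4], [0.2, -0.1]]
toy_specific_fraction = compartment_specific_fraction(toy_concatemers)
```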

We could also observe chromatin conformation associated with specific binding patterns in the SCA-seq data, for example, the CTCF binding pattern. As previous publications mentioned, CTCF occupancy can leave a short inaccessible region (~ 50 bp) after methyltransferase labeling²⁰,²¹, which helps determine the CTCF binding status at CTCF motif loci. As expected, SCA-seq could also resolve transcription factor-specific footprints and nucleosome footprints, similar to the previous publication (Fig. 3i). Based on specific accessibility patterns, we classified the chromatin interaction concatemers containing CTCF motifs into two classes: concatemers with a CTCF footprint and concatemers without a CTCF footprint (Fig. 3j). Considering the relationship between CTCF binding and chromatin structure formation²², we plotted the concatemer cardinality and interaction distance (Fig. 3k,l). We found that CTCF binding was associated with higher cardinality and more distant interactions than the absence of CTCF binding, suggesting that CTCF binding helps form more complex structures. As reported in a recent publication²¹, the methyltransferase accessibility pattern can also indicate the footprints of other transcription factors, which opens the way to further exploration of chromatin conformation with other transcription factors by SCA-seq. Therefore, SCA-seq could help us to subgroup chromatin interaction concatemers and study the relationship between chromatin conformation and protein binding.

SCA-seq reconstructs chromatin accessibility in three-dimensional space

Given the high cellular heterogeneity in the genome space, our spatial chromatin status analysis mainly relied on the single-molecule pattern, which requires high sensitivity and specificity. Single-molecule base modification calling was performed as described previously⁷. Moreover, we determined the enzyme labeling efficiency, which was 79–88%, based on the CTCF motifs and spike-in control measurements (Sfig 7a,b,c). Then, we filtered the fragments using the binomial test to further minimize false-positive accessible/inaccessible calls (see Methods). Unexpectedly, accessible and inaccessible DNAs were ligated together in SCA-seq (Fig. 4a), suggesting high heterogeneity among spatially neighboring DNAs. Across the genome, 29% of the concatemers maintained inaccessibility on all enclosed segments (accessible/inaccessible segment ratio < 0.1). Furthermore, 62.2% of the concatemers had some accessible segments (hybrid concatemers), and only 8.8% maintained all segments as accessible (accessible/inaccessible segment ratio > 0.9) (Fig. 4b). An example region is shown alongside 1D genome feature tracks to demonstrate promoter-enhancer spatial interactions, accessibility and CpG methylation at single-molecule resolution (Fig. 4a). nanoNOME-seq, which labels chromatin accessibility in single molecules, also confirmed the existence of hybrid concatemers (Sfig 7d). To explore whether the concatemer accessibility status was related to spatial location, we plotted the inaccessible concatemers and hybrid concatemers on the 2D contact heat map. We found that the hybrid concatemers tended to gather around TAD boundaries and contain more distant connections (Fig. 4b-heatmap).
Plotting the accessible/inaccessible ratio against concatemer cardinality or interaction distance showed that hybrid concatemers had more fragments than inaccessible concatemers, also implying their preference for distal and high-cardinality interactions (Fig. 4c and Sfig 8e). The A/B compartments are usually related to chromatin accessibility and regions of gene expression¹. In our study, we found that the B compartment (negative eigenvector) had significantly more inaccessible concatemers (accessible/inaccessible segment ratio 0.37 vs. 0.4, P < 2.2 × 10⁻¹⁶) than the A compartment (positive eigenvector) (Fig. 4d and Sfig 8a,b). Given the link between chromatin accessibility and transcription factor binding²³, we further investigated the enhancer and promoter contacts on chromosome 7: 30.3% had accessible–accessible status, 18.5% had accessible–inaccessible status, and 51.2% had inaccessible–inaccessible status (Fig. 4e). The frequency of contacts with accessible enhancers/promoters correlated strongly with gene expression levels (Fig. 4f,g and Sfig 8c,d), supporting the transcription model in which active enhancers activate promoters through contact²⁴. However, it is worth noting that 51.2% of enhancer-promoter interactions were independent of the chromatin-accessible status, suggesting that enhancer-promoter spatial interaction is not the only factor that initiates active transcription. Given the close relationship between spatial accessibility and transcription activity, we developed an algorithm to calculate the activation power of genome loci (Sfig 12), revealing their potential to activate transcription. A locus with higher activation power might enhance the accessibility of more contacting loci. Overall, spatial contacts and contact accessibility might coordinately regulate gene expression.
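The three concatemer classes used in this section (inaccessible, hybrid, accessible) can be sketched as a threshold rule on per-segment accessibility calls. The example below interprets the "accessible/inaccessible segment ratio" as the fraction of accessible segments among all segments of a concatemer, which is an assumption, and applies the 0.1 and 0.9 cut-offs quoted in the text. The function name `classify_concatemer` is hypothetical.

```python
from collections import Counter


def classify_concatemer(segment_is_accessible, low=0.1, high=0.9):
    """Label a concatemer from per-segment accessibility calls.

    segment_is_accessible: list of booleans, one per segment.
    The ratio is taken as accessible segments / all segments (assumed
    interpretation of the thresholds quoted in the text).
    """
    if not segment_is_accessible:
        raise ValueError("a concatemer needs at least one segment")
    ratio = sum(segment_is_accessible) / len(segment_is_accessible)
    if ratio < low:
        return "inaccessible"
    if ratio > high:
        return "accessible"
    return "hybrid"


# Toy concatemers: one fully closed, one fully open, two mixed.
toy_calls = [
    [False, False, False],
    [True, True],
    [True, False, False],
    [True, True, False, True],
]
toy_class_counts = Counter(classify_concatemer(calls) for calls in toy_calls)
```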

Spatial chromatin accessibility is dramatically changed in CTCF ^+/− HEK293T cells

In mammals, the highly conserved zinc finger protein CTCF is thought to serve as an insulator protein that prevents communication between enhancers and promoters and thereby regulates chromosome folding²⁵,²⁶,²⁷. We suspected that CTCF might also insulate against accessibility interference in spatial contacts. We examined the impact of permanent CTCF reduction by CTCF allele-specific knockout (CTCF ^+/−) in HEK293T cells and observed loop losses and TAD boundary changes, similar to previous publications²²,²⁸ (Sfig 9). We and others also found that CTCF reduction changed genome accessibility, with 3,457 upregulated and 932 downregulated accessibility regions (Sfig 10a,b)²⁹. Among the changed regions, only 2% overlapped with CTCF motifs (CTCF regions) and 98% did not contain CTCF motifs (non-CTCF regions). For example, chromatin accessibility around CTCF motifs (CTCF regions) and TSSs (non-CTCF regions) was increased (Sfig 10c,d). We further studied whether the chromatin accessibility of non-CTCF regions was altered by spatial contacts with CTCF regions. Overall, the number of accessible concatemers increased substantially, and most of them had more than one accessible chromatin segment (Fig. 5a). In the contact maps with accessibility plots, the chromatin spatial accessibility had a line pattern in wild-type cells (Fig. 5b). The accessible-accessible/accessible-inaccessible (red/yellow dots) pairwise contacts were arranged in lines, indicating that the loci contacted specific accessible chromatin regions, such as CTCF motifs and ATAC peaks (Fig. 5b and Sfig 11). In the CTCF ^+/− cells, chromatin was activated globally on the pairwise contacts, and the spatially activated chromatin changed the line patterns (Fig. 5b and Sfig 11). We then investigated which spots on the contact map were significantly changed by dividing the contact map into 100 kb squares.
In the statistical analysis of the contact map, we found that these significantly changed locations were co-distributed with the CTCF ChIP signal (Fig. 5c; cut-off p < 0.1 for bins; Pearson correlation, p < 2.6 × 10⁻¹⁶). We then increased the resolution to 5 kb and summed the counts of accessible-accessible pairwise contacts. We found that the increased accessible contacts in CTCF ^+/− cells also significantly correlated with the CTCF ChIP signals (Pearson correlation, p < 2.6 × 10⁻¹⁶). To take a closer look at the CTCF-specific contacts, we overlaid the CTCF loci and plotted the interaction contacts with their accessibility information. The accessibility of these CTCF-contacted loci was significantly increased in the CTCF ^+/− cells. All these results suggested that CTCF loss might increase chromatin accessibility at loci in spatial contact with CTCF loci. These findings are consistent with the recently proposed “activity by contact” model of CTCF³⁰. In addition, we used CTCF siRNA knockdown to cross-validate our findings, which produced similar results (Sfig 13).
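The step of dividing the contact map into fixed-size squares and counting accessible-accessible pairwise contacts per square can be sketched as follows. The example assumes contacts on a single chromosome given as `(position_a, position_b, accessible_a, accessible_b)` tuples; that data layout and the name `bin_accessible_contacts` are illustrative assumptions, and the statistical comparison between wild-type and CTCF ^+/− maps is not shown.

```python
from collections import Counter


def bin_accessible_contacts(contacts, bin_size=100_000):
    """Count accessible-accessible contacts per square of the contact map.

    contacts: iterable of (pos_a, pos_b, accessible_a, accessible_b) for
    one chromosome (assumed layout). Each square is keyed by a sorted
    pair of bin indices so that (a, b) and (b, a) land in the same square.
    """
    counts = Counter()
    for pos_a, pos_b, accessible_a, accessible_b in contacts:
        if accessible_a and accessible_b:
            square = tuple(sorted((pos_a // bin_size, pos_b // bin_size)))
            counts[square] += 1
    return counts


# Toy contacts: two accessible-accessible contacts fall in square (0, 2),
# one in square (1, 1), and one accessible-inaccessible contact is ignored.
toy_contacts = [
    (50_000, 250_000, True, True),
    (260_000, 70_000, True, True),
    (50_000, 250_000, True, False),
    (120_000, 130_000, True, True),
]
toy_square_counts = bin_accessible_contacts(toy_contacts)
```

Passing `bin_size=5_000` would give the finer 5 kb summary mentioned in the text.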

We also made another interesting finding: accessibility increased in long-range interactions in CTCF ^+/− cells. The fold activation was much larger in long-range interactions than in short-range interactions (P < 2.6 × 10⁻¹⁶, t-test) (Sfig 10e). The sparser data of long-range interactions may bias this finding. However, this observation might provide a clue to the CTCF insulator function conundrum, in that CTCF possibly prevents inappropriate chromatin activation. This finding further confirmed the close relationship between CTCF loss and spatial activation. Overall, with SCA-seq, we found that CTCF loss led to contact-based chromatin activation, shedding light on the spatial insulation function of CTCF, which needs to be explored further³¹.

CpG methylation on orphan CpG islands

CpG islands (CGIs), widespread features of the vertebrate genome, are associated with ~ 50% of gene promoters (pCGIs). pCGIs control gene transcription by affecting the neighboring promoters through methylation-related chromatin properties. Some CGIs are located close to enhancers (eCGIs). In addition, thousands of other orphan CGIs (oCGIs), distal (1 kb) to promoters and enhancers, remain poorly understood (Fig. 6a)³²,³³. In the SCA-seq data, which capture both high-order interactions and CpG methylation, we found 76,418 reads that overlapped with CGIs on Chr7, and most oCGIs were close to CTCF binding motifs and active histone marks, such as H3K27ac and H3K4me3, suggesting active regulatory functions (Fig. 6b). As expected, examination of the methylation status on reads showed that these read segments had lower CpG methylation and higher chromatin accessibility (GpC methylation), which further supports their roles in activating genes (Fig. 6b). In in vitro assays in previous research, oCGIs acted as tethering elements that promote topological interactions between enhancers and distally located genes to regulate gene expression. In the SCA-seq data, we observed that 60% of oCGIs were tethered to at least one type of regulatory element, such as enhancers, CTCF motifs, and promoters (Fig. 6c). After normalizing by the total number of regulatory elements, we found that oCGIs preferentially interacted with CTCF motifs and promoters compared with non-CGI events (binomial test with the control as background frequency, p < 2.6 × 10⁻¹⁶) (Fig. 6d). Further analysis of each concatemer showed that 39% of the oCGI-enhancer and oCGI-CTCF concatemers included more than two enhancers or CTCF motifs. In contrast, most oCGIs were tethered to a single promoter (Fig. 6e). These data support the previous finding that oCGI-tethered enhancers regulate gene expression³².
However, in regression analysis, we found that CpG methylation on oCGIs correlated only weakly with promoter accessibility and CpG methylation; the underlying regulatory mechanism needs further study. Overall, oCGIs could tether enhancers and CTCF motifs to communicate with promoters, a finding that advances our understanding of oCGI regulatory functions.

SCA-seq aims to extend traditional chromatin accessibility analysis to higher-dimensional space by simultaneously resolving chromatin accessibility and genome structure. Compared with 1D ATAC-seq, SCA-seq might more closely represent the true structure of the native genome. With SCA-seq, we found that genome spatial contacts maintained non-uniform chromatin accessibility, suggesting complex genome regulation in 3D space. Further study with CTCF ^+/− cells indicated the insulating functions of CTCF on the spatial contacts.

Considering the single molecular resolution, the first thing one needs to consider is the efficiency of methyltransferase labeling. We used lambda DNA in vitro labeling and in vivo CTCF motifs signal to estimate labeling efficiency and the binomial test to correct the labeling accuracy at the single-molecule level. Then, relatively reliable accessible chromatin markers were obtained. However, it is still possible for such a marker to be missed or overridden in the case of insufficient enzyme activity. Because the labeling efficiency may lead to deviation from our conclusions, our analysis in the following experiments was mainly based on the statistics of large numbers of molecules. In the single-molecule analysis of specific locations, more than two similar concatemers could accurately describe the epigenetic status in the exact spatial locations. Given the high heterogeneity of the dynamic genome structure and SCA-seq resolution, a much higher sequencing throughput is required to achieve analysis at a single-molecule level in a specific spatial location.

The eigenvalue and insulation score of SCA-seq generally correlated with those of the gold standard Hi-C (0.91 and 0.84), although imperfectly, and the same was true of all other multiplex-order chromatin conformation methods, such as Pore-C³⁴, SPRITE³⁵, and ChIA-Drop³⁶. After studying this problem more deeply, we found that it may arise from two sources. First, the conversion from multi-way contacts to pairwise contacts overrepresents some interactions. In our algorithm, we substantially mitigated this issue with a weighted transformation, which increased the correlation coefficient to 0.9. Second, we found that the low-correlation regions had lower GC density and low read counts. A previous publication³⁷ also pointed out that PCR amplification could bias eigenvalues and insulation scores. In contrast, Pore-C and SCA-seq are amplification-free methods. After filtering out low-coverage regions, the correlation between the two methods improved significantly.

SCA-seq was created as a multi-omics tool to examine both chromosome conformation and chromatin accessibility. The second point that needs to be discussed is the different levels of resolution of chromosome conformation capture and chromatin accessibility. The resolution of the chromosome conformation capture is approximately 700 bp, whereas that of the conventional chromatin accessibility is approximately 200 bp. The precise accessible–accessible chromatin interactions were underdetermined. The alternative hypothesis is that the interaction loci are located outside the accessible chromatin. Therefore, improvement of the resolution of chromosome conformation capture is needed to determine spatial accessibility interaction accurately.

In this study, the partial loss of CTCF activated spatially neighboring chromatin, which supported the insulating effect of CTCF. However, there were still over 30% of loci that could be activated without physical contact with CTCF motifs or peaks. We suspect that this phenomenon might be due to secondary effects of CTCF deficiency that promote cascade reactions and activation of spatially neighboring regions. However, it is possible that our definition of CTCF binding sites is not unambiguous, as there may be more CTCF binding sites than we expected. In addition, although we have found that the loss of CTCF increased chromatin accessibility on spatial contacts, the reasons for such a clear causal relationship remain uncertain.

Overall, our results demonstrated that SCA-seq could resolve genome accessibility locations in the three-dimensional space, helping observe the subgroup of chromatin conformation with the specific binding pattern, conformation-based chromatin accessibility, and conformation-based native CpG methylation. SCA-seq might pave the way to explore dynamic genome structures in greater detail.

The detailed protocol can be found at https://www.protocols.io/view/sca-seq-b6a6rahe. The bioinformatic scripts can be found at https://github.com/genometube/SCA-seq. The data sources and QC information can be found in the supplemental files.

Cell culture

A derivative human cell line that expresses a mutant version of the SV40 large T antigen (HEK 293T) [ABclonal] and a CTCF allele-specific knockout 293T cell line [ABclonal] were each maintained in DMEM-high glucose [Thermo Fisher 11995065] supplemented with 10% fetal bovine serum (FBS) [Thermo Fisher 1009141]. The CTCF^+/- cell line was purchased from ABclonal and validated by genotyping, western blot, qPCR and RNA-seq (Sfig14).

Cross-linking

Five million cells were washed once in chilled 1X phosphate buffered saline (PBS) in a 15 mL centrifuge tube and pelleted by centrifugation at 500xg for 3 min at 4℃. Cells were resuspended by gentle pipetting in 5 mL 1X PBS with formaldehyde (1% final concentration). After incubating the cells at room temperature for 10 min, add 265 µL of 2.5 M glycine (125 mM final concentration) and incubate at room temperature for 5 min to quench the cross-linking. Centrifuge the mix at 500xg for 3 min at 4℃. Wash the cells 2 times with chilled 1X PBS.
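As a quick arithmetic check of the quenching step, the final glycine concentration follows from C1·V1 = C2·(V1 + V2). The sketch below uses only the stock concentration and volumes stated above; the function name is illustrative and not part of the published protocol.

```python
def final_concentration_mM(stock_M: float, added_uL: float, sample_mL: float) -> float:
    """Final concentration (mM) after adding a stock solution to a sample."""
    added_mL = added_uL / 1000.0
    total_mL = sample_mL + added_mL
    # Moles are conserved: stock_M * added_mL = final_M * total_mL
    return stock_M * 1000.0 * added_mL / total_mL


# 265 µL of 2.5 M glycine added to 5 mL of cross-linking mix
glycine_mM = final_concentration_mM(stock_M=2.5, added_uL=265, sample_mL=5.0)
print(f"final glycine concentration: {glycine_mM:.1f} mM")
```

The result is close to the nominal 125 mM given in the protocol; the small difference comes from the added volume.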

Nuclei isolation and methylation

The cell pellet was resuspended in cold lysis buffer (10 mM HEPES-NaOH pH 7.5, 10 mM NaCl, 3 mM MgCl₂, 1X proteinase inhibitor [Sigma 11873580001], 0.1% Tween-20, 0.1 mg/ml BSA, 0.1 mM EDTA, 0.5% CA-630) and incubated on ice for 5 min. Centrifuge the lysis mixture at 500xg for 5 min at 4℃ to collect the nuclei. Wash the nuclei once with 1X GC buffer [NEB M0227L], then resuspend 2 million nuclei in 500 µL methylation reaction mixture: 1X GC buffer, 200 U M.CviPI [NEB M0227L], 96 µM S-adenosylmethionine (SAM), 300 mM sucrose, 0.1 mg BSA, 1X proteinase inhibitor, 0.1% Tween-20. Incubate the reaction for 3 hours at 37℃, adding 96 µM SAM and 20 U M.CviPI every hour. Centrifuge at 500xg for 10 min at 4℃ to collect the nuclei, wash the nuclei once with chilled HEPES-NaOH pH 7.5, and centrifuge again to collect the nuclei.

Restriction enzyme digest

Resuspend the nuclei in 81 µL cold HEPES-NaOH pH 7.5, add 9 µL 1% SDS, and incubate at 65℃ for 10 min to denature the chromatin; place the tube on ice immediately after the reaction. Add 5 µL 20% Triton X-100 and incubate on ice for 10 min to quench the SDS. Prepare the digestion mixture: 140 U DpnII [NEB R0543L] and 14 µL 10X HEPES-buffer3.1 [50 mM HEPES-NaOH pH 8.0, 100 mM NaCl, 10 mM MgCl₂, 100 µg/mL BSA]; add the nuclei suspension and nuclease-free water to a final volume of 140 µL. Incubate the digestion mixture in a thermomixer at 37℃ for 18 hours with 900 rpm rotation.

Ligation

DpnII digests were heat-inactivated at 65℃ for 20 min with 700 rpm rotation and divided evenly into 70 µL per tube. To each tube, add 14 µL T4 DNA Ligase buffer [NEB M0202L], 14 µL T4 DNA Ligase [NEB M0202L], 1 mM ATP and nuclease-free water to a final volume of 140 µL. The ligation was incubated at 16℃ for 10 hours with 800 rpm rotation.

Reverse cross-linking and DNA purification

Collect all ligation reactions into one 1.5 mL tube, add an equal volume of 2X sera-lysis buffer [2% Polyvinylpyrrolidone 40, 2% Sodium metabisulfite, 1.0 M Sodium Chloride, 0.2 M Tris-HCl pH 8.0, 0.1 M EDTA, 2.5% SDS], add 5 µL RNaseA [QIAGEN 19101], and incubate at 56℃ for 30 min. Add 10 µL Proteinase K [QIAGEN 19131] and incubate at 50℃ overnight with 900 rpm rotation. DNA was purified with a high molecular weight gDNA extraction protocol [Baptiste Mayjonade, 2016].

SCA-seq pipeline

We developed a reproducible bioinformatics pipeline to analyze the M.CviPI footprint and CpG signal on SCA-seq concatemers. Briefly, the workflow starts with the alignment of SCA-seq reads to a reference genome by bwa (v0.7.12) using the parameters ${bwa} bwasw -b 5 -q 2 -r 1 -T 15 -z 10. Alignments were required to have a mapping score ≥ 30, and reads shorter than 50 bp were discarded, to filter out low-quality mapping fragments. To remove non-chimeric pairs arising from ligation of cognate free ends or incomplete digestion, each alignment is assigned to a fragment of an in-silico restriction digest based on the midpoint of the alignment. The locus of each fragment on each concatemer is summarized by converting the filtered alignments to a fragment bed file sorted first by read ID and then by genomic locus. The alignment bam file is also used to call GpC and CpG methylation with Nanopolish (v0.11.1) call-methylation using the cpggpc model (--methylation cpggpc). The default log-likelihood ratio cut-offs are used to determine methylated GpC (> 1) and methylated CpG sites (> 1.5) ⁷. The methylation calls are then counted for each fragment in the fragment bed file to derive the methylated and unmethylated counts of GpC and CpG for each fragment of the concatemers.
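The filtering and midpoint-assignment steps can be illustrated with a minimal Python sketch. It is not the published pipeline code: the `Alignment` record, the function names, and the toy cut sites are hypothetical, and collapsing alignments that fall in the same fragment of the same read is our reading of how non-chimeric pairs are removed.

```python
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Tuple

MIN_MAPQ = 30    # mapping score threshold stated in the text
MIN_LENGTH = 50  # minimum length (bp) stated in the text


class Alignment(NamedTuple):  # hypothetical record type for illustration
    read_id: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    mapq: int


def passes_filter(aln: Alignment) -> bool:
    """Keep alignments with MAPQ >= 30 and length >= 50 bp."""
    return aln.mapq >= MIN_MAPQ and (aln.end - aln.start) >= MIN_LENGTH


def fragment_index(cut_sites: List[int], aln: Alignment) -> int:
    """Index of the in-silico fragment that contains the alignment midpoint.

    cut_sites must be sorted; fragment 0 ends at the first cut site.
    """
    midpoint = (aln.start + aln.end) // 2
    return bisect_right(cut_sites, midpoint)


def fragments_per_read(
    alignments: List[Alignment], cut_sites_by_chrom: Dict[str, List[int]]
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each read to its sorted, de-duplicated restriction fragments."""
    per_read: Dict[str, set] = {}
    for aln in alignments:
        if not passes_filter(aln):
            continue
        idx = fragment_index(cut_sites_by_chrom[aln.chrom], aln)
        # Alignments landing in the same fragment collapse to one entry.
        per_read.setdefault(aln.read_id, set()).add((aln.chrom, idx))
    return {rid: sorted(frags) for rid, frags in per_read.items()}


cut_sites = {"chr1": [1000, 2500, 4000]}  # toy DpnII cut positions
alignments = [
    Alignment("r1", "chr1", 100, 700, 60),    # fragment 0
    Alignment("r1", "chr1", 720, 980, 60),    # fragment 0 again, collapsed
    Alignment("r1", "chr1", 2600, 3200, 60),  # fragment 2
    Alignment("r1", "chr1", 1200, 1230, 60),  # too short, filtered
    Alignment("r2", "chr1", 1100, 1900, 10),  # low MAPQ, filtered
]
result = fragments_per_read(alignments, cut_sites)
print(result)
```

Sorting by read ID and then by locus, as in the fragment bed file, would follow directly from iterating over `sorted(result.items())`.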

SCA-seq and Hi-C comparisons

SCA-seq concatemers were converted into virtual pairwise contacts in order to correlate with the published Hi-C datasets. The decomposed SCA-seq contact matrix was treated as a Hi-C contact matrix and analyzed by Hi-C software. The contact matrix was normalized using cooler balance. The eigenvector scores and TAD insulation scores were then calculated by the cooltools call-compartments and cooltools diamond-insulation tools. The linear correlation between the Pore-C and Hi-C contact matrices was then measured by eigenvector scores and TAD insulation scores. The variation among individual Pore-C runs, individual SCA-seq runs, and downsampled SCA-seq datasets was also examined with the above metrics. Loop anchors were identified by ENCODE CTCF ChIP-seq peaks (ENCSR135CRI). Cooltools pileup was used to compute aggregate contact maps at 10 kb resolution centered at the loop anchors (± 100 kb).
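The decomposition of concatemers into virtual pairwise contacts can be sketched as follows. The exact weighting used in the SCA-seq scripts is not described here, so the 2/n weight below is an assumption chosen for illustration: it makes the total weight of an n-fragment concatemer grow as n − 1 rather than as n(n − 1)/2, which limits the overrepresentation of high-order concatemers.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Hashable, Iterable, List, Tuple


def decompose(
    concatemers: Iterable[List[Hashable]], weighted: bool = True
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Convert multi-way concatemers into pairwise contact weights.

    Each concatemer is a list of genomic bin identifiers. The 2/n
    weighting is a hypothetical choice, not the published scheme.
    """
    contacts: Dict[Tuple[Hashable, Hashable], float] = defaultdict(float)
    for fragments in concatemers:
        bins = sorted(set(fragments))  # unique bins in a stable order
        n = len(bins)
        if n < 2:
            continue  # a single bin carries no contact information
        weight = 2.0 / n if weighted else 1.0
        for a, b in combinations(bins, 2):
            contacts[(a, b)] += weight
    return dict(contacts)


concatemers = [["A", "B"], ["A", "B", "C", "D"]]
unweighted = decompose(concatemers, weighted=False)
weighted = decompose(concatemers, weighted=True)
print(unweighted[("A", "B")], weighted[("A", "B")])
```

In the unweighted case the four-way concatemer contributes six contacts against one from the two-way concatemer; with the weight applied it contributes a total of three.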

SCA-seq, ATAC-seq, and DNase-seq comparisons

For comparison and visualization of bulk accessibility, conventional bulk ATAC-seq and DNase-seq peak signals of HEK293T were obtained from Gene Expression Omnibus (GEO) accessions GSE108513 and GSM1008573. SCA-seq accessibility peak calling was performed in a similar way to nanoNOMe ⁷. Briefly, GpC methylated counts, unmethylated counts, and GpC methylation ratios were computed from the SCA-seq Nanopolish calls in continuous sliding windows (200 bp window, 20 bp step). Regions with a GpC methylation ratio above the 99th percentile of all regions were first selected as candidates. The significance of each candidate region was calculated by a one-tailed binomial test of the raw frequency of accessibility (methylated GpC sites / total GpC sites) against the null probability, defined as the overall GpC methylation ratio across regions. The p-values were corrected for multiple testing by the Benjamini-Hochberg procedure. Regions with adjusted p-values < 0.001 and widths greater than 50 bp were designated SCA-seq accessibility peaks. Overlapping peaks between SCA-seq, ATAC-seq, and DNase-seq were identified by bedtools (v2.26.0) intersect.
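The statistical core of the peak calling, a one-tailed binomial test followed by Benjamini-Hochberg correction, can be sketched in pure Python. This is an illustration under simplifying assumptions: the 99th-percentile candidate selection and the width filter are omitted, the function names are ours, and the window counts are toy values.

```python
from math import comb
from typing import List, Tuple


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """One-tailed p-value P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_peaks(windows: List[Tuple[int, int, int]], alpha: float = 0.001) -> List[int]:
    """Return start positions of windows enriched for methylated GpC.

    windows: (start, methylated_gpc, total_gpc) per sliding window.
    The null probability is the overall methylation ratio of all windows.
    """
    usable = [w for w in windows if w[2] > 0]
    if not usable:
        return []
    null_p = sum(w[1] for w in usable) / sum(w[2] for w in usable)
    pvals = [binom_upper_tail(m, n, null_p) for _, m, n in usable]
    adjusted = benjamini_hochberg(pvals)
    return [w[0] for w, q in zip(usable, adjusted) if q < alpha]


toy_windows = [(0, 1, 20), (20, 0, 20), (40, 18, 20), (60, 1, 20), (80, 2, 20)]
peaks = call_peaks(toy_windows)
print(peaks)
```

Only the window starting at position 40, where 18 of 20 GpC sites are methylated, passes the adjusted threshold in this toy input.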

Estimate the labeling efficiency in vivo

As reported previously, CTCF motifs maintain accessible chromatin in the neighboring 200 bp region. Considering the resolution of Hi-C and the experimental fragmentation, we selected 1000 bp bins with a documented CTCF motif at the center. CpG methylation levels are negatively correlated with chromatin accessibility, so segments with low CpG methylation were expected to maintain an accessible chromatin status with CTCF binding. We therefore assumed that segments with low CpG methylation (CpG ratio < 0.25) and low chromatin accessibility (GpC ratio < 0.1) were not efficiently labeled.
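One way to turn this criterion into a number is sketched below. The text does not state the denominator, so we assume that labeling efficiency is estimated among CTCF-centered segments with low CpG methylation (the segments expected to be accessible); the function name and the toy ratios are illustrative only.

```python
from typing import List, Tuple


def labeling_efficiency(
    segments: List[Tuple[float, float]],
    cpg_max: float = 0.25,
    gpc_min: float = 0.1,
) -> float:
    """Estimate the fraction of expected-accessible segments that were labeled.

    segments: (cpg_ratio, gpc_ratio) per CTCF-centered 1000 bp bin.
    Assumption: segments with CpG ratio < cpg_max form the denominator,
    and those with GpC ratio < gpc_min count as not efficiently labeled.
    """
    expected_open = [(cpg, gpc) for cpg, gpc in segments if cpg < cpg_max]
    if not expected_open:
        raise ValueError("no segments with low CpG methylation")
    unlabeled = sum(1 for _, gpc in expected_open if gpc < gpc_min)
    return 1.0 - unlabeled / len(expected_open)


toy_segments = [
    (0.05, 0.60),  # low CpG, labeled
    (0.10, 0.02),  # low CpG, not efficiently labeled
    (0.20, 0.40),  # low CpG, labeled
    (0.80, 0.01),  # high CpG, excluded from the estimate
    (0.15, 0.30),  # low CpG, labeled
]
efficiency = labeling_efficiency(toy_segments)
print(efficiency)
```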

Filter the fragments by binomial test

The median fragment length is 500 bp, which is close to the general size of accessible chromatin segments. We first calculated the background probabilities of methyl-GpC (accessible) and non-methyl-GpC (inaccessible) on the segments. We used non-treated genomic DNA as the background, in which the average methyl-GpC frequency on segments was 0.03 (GpC background). We then performed a binomial test (base R) for each fragment in the M.CviPI-treated samples to test the null hypothesis that the number of labeled GpCs (GpC ≥ 4) was equal to or smaller than the background. We further investigated the confidence level of inaccessible chromatin based on non-methyl GpCs. The non-methyl-GpC frequency in the M.CviPI-treated spike-in is 0.3. Therefore, we roughly estimated that 21% of GpCs (p) were not efficiently labeled by M.CviPI. We then performed a binomial test (base R) for each segment in the M.CviPI-treated samples to test the null hypothesis that the non-methyl GpCs on heterochromatin were equal to or larger than the enzymatic inefficiency. For both tests, the p-values were corrected for multiple testing using the Benjamini-Hochberg correction, and fragments with an adjusted p-value less than 0.05 were classified as accessible or inaccessible. We determined the accessible fragments first and then determined the inaccessible fragments among the rest. Approximately 2 million segments remained undetermined and were discarded.
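The two-step classification can be sketched as follows. Several points are assumptions made for illustration: the inaccessible call is taken to test whether unmethylated GpCs exceed what labeling inefficiency alone would explain, the "GpC ≥ 4" criterion is read as a minimum count of methylated GpCs, the 21% estimate is used as the inefficiency parameter, and raw p-values are compared to 0.05 without the Benjamini-Hochberg step.

```python
from math import comb


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def classify_fragment(
    methylated: int,
    total: int,
    background: float = 0.03,    # methyl-GpC frequency in untreated DNA
    inefficiency: float = 0.21,  # estimated fraction of unlabeled GpCs
    alpha: float = 0.05,
    min_labeled: int = 4,        # assumed reading of "GpC >= 4"
) -> str:
    """Classify a fragment as accessible, inaccessible, or undetermined."""
    if total == 0:
        return "undetermined"
    # Step 1: more methylated GpCs than the background would produce?
    p_accessible = binom_upper_tail(methylated, total, background)
    if methylated >= min_labeled and p_accessible < alpha:
        return "accessible"
    # Step 2 (only for the rest): more unmethylated GpCs than labeling
    # inefficiency alone would leave on an accessible fragment?
    unmethylated = total - methylated
    p_inaccessible = binom_upper_tail(unmethylated, total, inefficiency)
    if p_inaccessible < alpha:
        return "inaccessible"
    return "undetermined"


calls = [classify_fragment(m, n) for m, n in [(12, 20), (0, 20), (1, 3)]]
print(calls)
```

Fragments with few GpC sites, such as the last toy example, tend to fall into the undetermined class because neither test has enough power.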

High resolution accessibility determination

As described above, we used the binomial test to assess accessibility on each fragment. However, the accessible regions in ATAC-seq are around 200 bp (average peak size). Using sliding windows (200 bp windows, 50 bp step) on each fragment, we can determine the precise accessible regions on the fragments at the cost of computational speed. Similarly, we performed the binomial test (base R) for each window in the M.CviPI-treated samples to test the null hypothesis that the number of labeled GpCs (GpC_methy > 1) was equal to or smaller than the background. We defined accessible fragments as those containing ≥ 1 accessible window. Finally, we found that the sliding-window method identified 8% more accessible fragments, which is not a substantial improvement. Considering typical computational resources, we used the fragment-level method described above in our experiments.
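A minimal sketch of the sliding-window variant is given below. The window generator and counting follow the description above, while the per-window test is passed in as a function; the toy rule used in the example is only a stand-in for the binomial test, and all names are hypothetical.

```python
from typing import Callable, Iterator, List, Tuple


def sliding_windows(
    start: int, end: int, size: int = 200, step: int = 50
) -> Iterator[Tuple[int, int]]:
    """Yield (win_start, win_end) windows over the fragment [start, end).

    A fragment no longer than `size` yields one window covering it.
    A trailing remainder shorter than one step is not covered separately.
    """
    if end - start <= size:
        yield (start, end)
        return
    pos = start
    while pos + size <= end:
        yield (pos, pos + size)
        pos += step


def window_counts(
    gpc_calls: List[Tuple[int, bool]], window: Tuple[int, int]
) -> Tuple[int, int]:
    """Return (methylated, total) GpC counts inside one window."""
    w_start, w_end = window
    in_window = [meth for pos, meth in gpc_calls if w_start <= pos < w_end]
    return sum(in_window), len(in_window)


def fragment_is_accessible(
    start: int,
    end: int,
    gpc_calls: List[Tuple[int, bool]],
    window_test: Callable[[int, int], bool],
) -> bool:
    """A fragment is accessible if at least one window passes the test."""
    return any(
        window_test(*window_counts(gpc_calls, w))
        for w in sliding_windows(start, end)
    )


# Toy stand-in for the per-window binomial test (not the published rule).
def toy_test(methylated: int, total: int) -> bool:
    return methylated > 1 and methylated / total >= 0.5


gpc_calls = [(100, False), (120, False), (310, True), (330, True), (350, True)]
windows = list(sliding_windows(0, 500))
accessible = fragment_is_accessible(0, 500, gpc_calls, toy_test)
print(len(windows), accessible)
```

A 500 bp fragment produces seven overlapping windows, which shows where the extra computational cost relative to one test per fragment comes from.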

Statistics

Most of the parametric data, which followed a normal (or log-normal) distribution, were analyzed with a two-sided t-test. Pearson correlation analysis was also performed on normally distributed data. We used Fisher's exact test for the differential accessibility analysis in SCA-seq. Other non-parametric or non-normally distributed data were analyzed with the Wilcoxon rank test.

Data availability

The data are available at https://db.cngb.org/search/project/CNP0002862/.

Acknowledgment

This research was supported by the Science, Technology, and Innovation Commission of Shenzhen Municipality (grant number JSGG20170824152728492). The supporter had no role in designing the study, data collection, analysis and interpretation, or in writing the manuscript.

Author contributions

CT designed and supervised the experiments. YL, FR, and ML performed the lab experiments; YX and CT performed the bioinformatics data analysis. All authors jointly performed the data analysis. All authors have read and approved the final manuscript draft.

Competing interest

The authors declare no competing interests.

Misteli, T. Beyond the sequence: cellular organization of genome function. Cell 128, 787–800 (2007).
Song, L. & Crawford, G.E. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harbor protocols 2010, pdb.prot5384-pdb.prot5384 (2010).
Voong, L.N., Xi, L., Wang, J.-P. & Wang, X. Genome-wide Mapping of the Nucleosome Landscape by Micrococcal Nuclease and Chemical Mapping. Trends in Genetics 33, 495–507 (2017).
Buenrostro, J.D., Wu, B., Chang, H.Y. & Greenleaf, W.J. ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Curr Protoc Mol Biol 109, 21.29.1-21.29.9 (2015).
Wang, Y. et al. Single-molecule long-read sequencing reveals the chromatin basis of gene expression. Genome Research (2019).
Abdulhay, N.J. et al. Massively multiplex single-molecule oligonucleosome footprinting. eLife 9 (2020).
Lee, I. et al. Simultaneous profiling of chromatin accessibility and methylation on human cell lines with nanopore sequencing. Nature Methods 17, 1191–1199 (2020).
Shipony, Z. et al. Long-range single-molecule mapping of chromatin accessibility in eukaryotes. Nat Methods 17, 319–327 (2020).
Chen, W. et al. Sequencing of methylase-accessible regions in integral circular extrachromosomal DNA reveals differences in chromatin structure. Epigenetics & Chromatin 14, 40 (2021).
Weng, Z. et al. Long-range single-molecule mapping of chromatin modification in eukaryotes. bioRxiv, 2021.07.08.451578 (2021).
Lai, B. et al. Trac-looping measures genome structure and chromatin accessibility. Nature Methods 15, 741–747 (2018).
Li, T., Jia, L., Cao, Y., Chen, Q. & Li, C. OCEAN-C: mapping hubs of open chromatin interactions across the genome reveals gene regulatory networks. Genome Biology 19, 54 (2018).
Wei, X. et al. Multi-omics analysis of chromatin accessibility and interactions with transcriptome by HiCAR. bioRxiv, 2020.11.02.366062 (2020).
Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–93 (2009).
Luo, Z. et al. NicE-C efficiently reveals open chromatin-associated chromosome interactions at high resolution. Genome Research (2022).
McClelland, M. & Ivarie, R. Asymmetrical distribution of CpG in an 'average' mammalian gene. Nucleic acids research 10, 7865–7877 (1982).
O’Brown, Z.K. et al. Sources of artifact in measurements of 6mA and 4mC abundance in eukaryotic genomic DNA. BMC Genomics 20, 445 (2019).
Liu, Y. et al. DNA methylation-calling tools for Oxford Nanopore sequencing: a survey and human epigenome-wide evaluation. Genome Biology 22, 295 (2021).
Ong, C.-T. & Corces, V.G. CTCF: an architectural protein bridging genome topology and function. Nature Reviews Genetics 15, 234–246 (2014).
Stergachis, A.B., Debo, B.M., Haugen, E., Churchman, L.S. & Stamatoyannopoulos, J.A. Single-molecule regulatory architectures captured by chromatin fiber sequencing. Science 368, 1449–1454 (2020).
Battaglia, S. et al. Long-range phasing of dynamic, tissue-specific and allele-specific regulatory elements. Nature Genetics 54, 1504–1513 (2022).
Hyle, J. et al. Acute depletion of CTCF directly affects MYC regulation through loss of enhancer-promoter looping. Nucleic Acids Res 47, 6699–6713 (2019).
Lee, C.K., Shibata, Y., Rao, B., Strahl, B.D. & Lieb, J.D. Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat Genet 36, 900–5 (2004).
Schoenfelder, S. & Fraser, P. Long-range enhancer-promoter contacts in gene expression control. Nat Rev Genet 20, 437–455 (2019).
Özdemir, I. & Gambetta, M.C. The Role of Insulation in Patterning Gene Expression. Genes 10 (2019).
Merkenschlager, M. & Nora, E.P. CTCF and Cohesin in Genome Folding and Transcriptional Gene Regulation. Annu Rev Genomics Hum Genet 17, 17–43 (2016).
Rowley, M.J. & Corces, V.G. Organizational principles of 3D genome architecture. Nature Reviews Genetics 19, 789–800 (2018).
Nora, E.P. et al. Targeted Degradation of CTCF Decouples Local Insulation of Chromosome Domains from Genomic Compartmentalization. Cell 169, 930–944.e22 (2017).
Xu, B. et al. Acute depletion of CTCF rewires genome-wide chromatin accessibility. Genome Biology 22, 244 (2021).
Lee, M. et al. CTCF mediates the Activity-by-contact derived cis-regulatory hubs. bioRxiv, 2022.11.04.515249 (2022).
Nanni, L., Ceri, S. & Logie, C. Spatial patterns of CTCF sites define the anatomy of TADs and their boundaries. Genome Biol 21, 197 (2020).
Pachano, T. et al. Orphan CpG islands amplify poised enhancer regulatory activity and determine target gene responsiveness. Nature Genetics 53, 1036–1049 (2021).
Bell, J.S.K. & Vertino, P.M. Orphan CpG islands define a novel class of highly active enhancers. Epigenetics 12, 449–464 (2017).
Deshpande, A.S. et al. Identifying synergistic high-order 3D chromatin conformations from genome-scale nanopore concatemer sequencing. Nature Biotechnology (2022).
Quinodoz, S.A. et al. Higher-Order Inter-chromosomal Hubs Shape 3D Genome Organization in the Nucleus. Cell 174, 744–757.e24 (2018).
Zheng, M. et al. Multiplex chromatin interactions with single-molecule precision. Nature 566, 558–562 (2019).
Niu, L. et al. Amplification-free library preparation with SAFE Hi-C uses ligation products for deep sequencing to improve traditional Hi-C analysis. Communications Biology 2, 267 (2019).

