Disease-associated SNPs (daSNPs)
We used the version v1.0.2-associations_e94_r2018-09-30 of the GWAS catalog: www.ebi.ac.uk/gwas/ [49]. We extracted all SNP entries associated with a disease EFO term (Experimental Factor Ontology), overall 449 EFO terms, distinguishing 71 cancers and 378 non-cancer diseases. Specifically, diseases were found by selecting all traits which fall into the disease subtree (EFO_0000408) from the EFO ontology, and subsequently cancers are separated from non-cancer diseases by using the cancer subtree (EFO_0000311).
A SNP can be associated with a disease multiple times in the GWAS catalog when it was found in distinct studies; we ignored this multiplicity by dropping duplicates of its identifier snpId.
We classified the resulting 21,183 daSNPs into intergenic (40%), intronic (55%) or exonic (5%) type according to its parent category in The Sequence Ontology database, (http://www.sequenceontology.org), version 2015-11-24 [50].We then identified for each disease the subset of its associated intergenic SNPs (on average 47 daSNPs and 18 intergenic daSNPs for both cancer and non-cancer diseases).
Hi-C data
We used the Cooler Hi-C database [51] at ftp://cooler.csail.mit.edu/coolers, which provides published Hi-C data files in the .cool format, at 10kb-resolution (bin size). Throughout our study, genomic coordinates refer to the hg19 genome version adopted in this database. We present in the main text results obtained with high-resolution data from E. Lieberman-Aiden’s laboratory [30] for the 5 native and non-cancerous cell lines available, namely GM12878 (human lymphoblastoid cell line, data obtained with MboI or DpnII restriction enzyme), IMR90 (fetal lung fibroblasts of Caucasian origin), HMEC (human mammary epithelial cells), NHEK (normal human epidermal keratinocytes) and HUVEC (human umbilical vein endothelial cells). We did not consider cancer cell types as we are interested in the genetic risk present at birth in all cells. We also investigated datasets from B. Ren’s laboratory [19, 36–38], obtained in pioneering experiments using a lower sequencing depth and an enzyme HindIII producing larger restriction fragments (see Supplementary Fig. 8), for several cell types: GM12878 in [37], IMR90 in [19, 36] embryonic stem cells (H1 hESC) in [19, 38], and cell lines derived from H1 hESC in [38], namely mesendoderm (H1_ME), neural progenitors (H1_NP) trophoblast-like cells (H1_TB) and mesenchymal cells (H1_MS). Overall 15 datasets were examined in our study.
TAD determination
We determined TAD coordinates using the TopDom algorithm [31], applied after a transformation of .cool files into count matrices using https://github.com/open2c/cooler. Its principle is to count the number of contacts in a window sliding along the genome and to locate TAD ends at the minima of this count (see Fig. 1, blue diamonds). Genomic regions having established only very few contacts in the experiment, labelled ‘gap’ by TopDom, were filtered out. The choice of using this algorithm is supported by comparative studies [33–35]. Moreover, TopDom is based on quantifying the topological insulation between adjacent TADs, which is the feature that matters for gene (mis)regulation. In particular, recently evidenced long-range associations between TADs and their higher-order organization [52] will not be considered here.
The TAD caller thus involves a tunable parameter k, measuring the half-size (in bins, of length equal to the chosen resolution) of the sliding window. This parameter offers a way to investigate the well-known variability in TAD determination [33–35], as depicted in Supplementary Fig. 2. The general trend is that larger numbers of TADs are observed for lower values of k, at which substructures are also extracted while only large TADs are extracted by the algorithm at large values of k. To overcome this technical variability, we scanned all values of the window size k from k = 3 to k = 20 and adopted two strategies: either (i) to aggregate our observations over these values of k (Fig. 2), or (ii) to use a more stringent majority rule in further analyses (Figs. 3, 4). Both strategies reduce small-number effects and smooth out TAD variability, overall yielding robust results despite the lack of robustness of the TAD landscape (Supplementary Fig. 2 and Supplementary Fig. 3).
A discrepancy between TADs determined with TopDom and the visual impression given by the contact map could appear locally (see e.g. Figure 1), coming from the following difference: TopDom is based on the insulation of TADs, i.e. the presence of low-density (yellow) zones between the triangles delineating the TADs (see Fig. 1), whereas the alternative understanding of a TAD as a region with an increased density of internal contacts would rather focus on dark red triangles emerging from the background in the Hi-C map.
TAD borders
We defined TAD borders as regions of 20kb located inside the TAD at the limits of this TAD. Their size has been chosen smaller than the median size (23kb) of a human gene [53].
This definition agrees with a topological characterization of TAD borders as regions across which the contact frequency displays a marked decrease [54] and is consistent with the use of TopDom for calling TADs. It differs from the notion of TAD boundary, considered e.g. in [19, 24] which is the ---not always existing— linker region between two successive TADs along the genome (not belonging to any TAD), whereas a TAD will always have two borders. TAD borders cover from 8% up to 14% of the genome when TopDom parameter k varies, the smallest fraction being observed for k = 20 (see Supplementary File 1, Supplementary Table 2 and Supplementary Table 3 in Supplementary Material)
Preferential location of daSNPs in TAD borders (TAD border enrichment)
For each disease (EFO term), we have tested whether the associated SNPs are located in TAD borders more often than observed by chance, where chance (i.e. the null model in statistical terms) corresponds to the same number of pointwise loci drawn at random in the entire genome. In a second analysis, chance corresponds to the same number of SNPs drawn at random in the entire set of disease-associated SNPs listed in the GWAS catalog. The statistical significance of a preferential location of daSNPs in TAD borders is then assessed by computing a p-value for the disease according to a hypergeometric test.
In more detail:
Testing a preferential location in TAD borders of the SNPs associated to a given disease involves the hypergeometric distribution H (q | N, n, Q) where Q is the number of SNPs associated to the considered disease and q the part located in TAD borders. In the genome-based null model, N is the total number of base pairs in the genome and n the part located in TAD borders. In practice, given the finite resolution which amounts to discretize the genome in 10kb-bins, N is the total number of bins in the genome and n the number located in TAD borders. In the SNP-based null model, close to that considered e.g. in [55], N is the total number of disease-associated SNPs listed in the GWAS catalog (each counted once, and not including the SNPs associated with non-pathological traits so that the random model is the closest possible to the data) and n the part located in TAD borders. Results presented in the main text were obtained with the more conservative genome-based null model. A comparison with those obtained with the SNP-based null model is presented in Supplementary Fig. 4.
The p-value for the considered disease, assessing the over-representation of its associated SNPs in TAD borders, is computed as the cumulative distribution function (i.e., the fraction of values larger than or equal to q) of this hypergeometric distribution. Given the symmetry H (q | N, n, Q) = H (q | N, Q, n) of the hypergeometric distribution, it is equivalent to state that (i) the SNPs associated to the disease are located in a TAD border more often than expected by chance, or (ii) TAD borders contain a SNP associated to this disease more often than expected by chance. We term such a situation TAD border enrichment.
After computing a raw p-value for each disease, Benjamini-Hochberg procedure (multipletests function with method fdr_bh from the statsmodels package in Python, www.statsmodels.org/dev/index.html#citation) is applied to obtain p-values adjusted for multiple testing, so that the false discovery rate is controlled at level 5% when the adjusted p-value is lower than 0.05 [56]. Given the overwhelming number of non-cancer diseases (378) compared to cancers (71) and their different etiology, we investigated separately these two groups of diseases. These two groups are well-defined on biological criteria independently of our enrichment testing, so that correction for multiple testing has been applied separately in each group. We checked that our main result (the relatively more frequent TAD border enrichment for cancers) is also qualitatively observed when using a global multiple-testing correction, or even no correction (Supplementary Fig. 5 and Supplementary Fig. 6).
Enrichment histograms
Histograms of corrected p-values (Fig. 2) are plotted and normalized separately for cancers and non-cancer diseases. The counts have been first aggregated over the 6 considered datasets from [30] and the values of the window parameter k of the TAD caller. In order to get a better display of the core features of the plots, the range of p-values has been truncated at log10(1/p) = 4. A few EFOs, with corrected p-values smaller than 0.0001, are thus lying outside the displayed plot. It would be possible to choose a larger bin size or even to draw a smooth histogram, but this would rather dilute the information.
Comparison of TAD-border enrichment for cancers and non-cancer diseases
Due to the small number of cancers (71) compared to non-cancer diseases (378), we compared the fraction of cancers and the fraction of non-cancer diseases displaying a significant TAD border enrichment, considering either all their associated SNPs or only a sub-category (exonic, intronic and intergenic). Only diseases displaying a significant enrichment for a majority (more than 50%) of values of the window parameter k have been counted. The significance of a difference between the disease fractions has been assessed using Fisher’s exact test. The comparison has been done for each of the 6 considered datasets (for 5 cell types) from [30], and also after aggregating the disease counts over these datasets.
Workflow
The different steps described above have been gathered in an easy-to-execute pipeline, using Snakemake
(https://snakemake.github.io, [57]) unifying the analysis of different datasets. Its rule graph is presented on Supplementary Fig. 1. Its code, written in Python, is freely available at
https://github.com/kpj/GeneticRiskAndTADs . An order of magnitude of the typical numbers of diseases and SNPs of different categories involved in our analysis is provided in Supplementary Material (Supplementary File 1, Supplementary Table 2, Supplementary Table 3).
SNP-based diseasome network and its analysis
We introduced a network representation where nodes are diseases and an edge is drawn between two diseases when they share an associated SNP. Starting from a bipartite network relating diseases and their associated SNPs, this representation is the projection on disease nodes. It is the analogue for SNPs of the network relating diseases and their associated genes, known as the diseasome, and its projected version [41]. A filter has been applied on shared SNPs: in the network labeled ‘border SNPs’, an edge is drawn when the diseases share a SNP lying in a TAD border for a majority of values of the window parameter k (underlying Hi-C data from [30], IMR90 cell type). The network labeled ‘non-border SNPs’ involves the complementary set of SNPs. Additionally, the two networks have been re-drawn considering only intergenic SNPs. Non-cancer disease and cancer nodes (and among them, those displaying TAD border enrichment) were underlined with different colors. Note that each cancer or each non-cancer disease is associated with an ensemble of border SNPs, of non-border SNPs, of intergenic border SNPs and of intergenic non-border SNPs, and could be present in more than one of the four networks. In Fig. 5, the four networks have been visualized using NetworkX Python package. For each network, only diseases (nodes) having an associated SNP of the prescribed type are drawn.
For a quantitative comparison of the four networks regarding their clustering and assortativity properties, we computed an indicator for any subset of nodes, e.g. a group of nodes with the same color. This indicator, called network coherence, is defined as the z-score of the number of edges within the subset of nodes, compared to a thousand randomly drawn groups of nodes of the same size [42, 43]. Network coherence thus measures whether the induced subgraph is more densely connected (i.e. contains more links) than expected at random in the original network. As a z-score, it provides an absolute quantification, independent of the overall size of the group, which makes cross-comparisons possible. Choosing a threshold larger than 1 on the number of shared SNPs required to draw an edge does not change qualitatively the results but reduces the number of diseases involved, which brings statistics to a limit. All the results presented in the text were obtained with a threshold equal to 1.