Optimization of gene-cell proposals against breast data using GA
To describe the mechanisms of cancer risk based on population genetics of BCa, we acquired 206 lead variants of European ancestry [16, 19–21] (Table 1). For each variant, we identified proxy SNPs in LD plus candidate genes (described in Methods). These SNPs were within 200 kb of 2,292 genes of which 51% (n = 1,175) were protein-coding. To better understand these SNPs in the context of normal breast, we identified matching pairs of samples for snRNA-seq and snATAC-seq [23] (Table 1, Fig. 1A). Within these data, cells were divided into 10 clusters: hormone receptor-positive and -negative luminal cells, basal cells, blood and lymphatic endothelial cells, vascular accessory cells, adipocytes, fibroblasts, myeloid, and lymphoid cells.
The biggest challenge is the large number of combinations of hypotheses for every locus. In this study, there are at least 10206 combinations of plausible solutions when considering only genes. We chose GA to identify the most plausible gene and cell set (“proposal”) based on diverse evidence sources. The evidence sources for gene and cell type prioritization are captured in a set of named objective functions described in the Methods and Table 1.
Table 1
Data sets used in this study.
Data Set
|
Source
|
Description
|
BCa GWAS (integrated references)
|
Zhang et al., 2020 [16], Michailidou et al., 2017 [19], Garcia-Closas et al., 2013 [20], Milne et al., 2017 [21]
|
The following study accessions were used: GCST001930, GCST010098, GCST010099, GCST010100, GCST004988, GCST005076, GCST005077, and GCST005075.
|
snRNA-seq
|
Raths et al., 2023 [23]
|
66,926 nuclei from 9 cis-gender females (GSE168836).
|
snATAC-seq
|
Raths et al., 2023 [23]
|
27,459 nuclei from 9 cis-gender females (GSE168837).
|
Marker gene
|
|
This study, reanalysis of gene expression marker set at cell type resolution from Raths et al., 2023 [23].
|
MAGMA gene
|
Zhang et al., 2022 [14]
|
MAGMA, a set of putative disease genes from the 2013 UK biobank 460k release for BCa, consisting of 1,000 genes.
|
Cancer gene
|
Sondka et al., 2018 [26]
|
COSMIC cancer gene consensus is a collection of gene mutation and fusion implicated in cancer.
|
Protein-protein interactions
|
Szklarczyk et al., 2023 [27]
|
STRING (string-db.org), is a database of protein-protein interaction with experimental evidence.
|
LncRNA and protein interactions
|
Li et al., 2023 [28]
|
LncBook is a comprehensive resource of human lncRNA-protein interaction.
|
Promoter regions
|
10X Genomics
|
Promoter region extracted from prebuilt GRCh38 genome reference version 2020-A [23].
|
Common/marker ATAC peaks set
|
|
This study, reanalysis of open chromatin peak set at cell type resolution from Raths et al., 2023 [23].
|
We optimized for 200 generations and then analyzed the proposals in the last generation (Gen200) to assess the result (Fig. 2A, B). We observed that information was distributed unevenly between OFs: the mean score for isCancerGene, isLPI, isPromoter, and isInterPPI were less than 0.1 (Fig. 2C), whereas isPPI had the highest score (0.941). The remaining OFs had scores ranging from 0.410 to 0.832. When compared to other proposals, the elite proposal did not have top scores in all OFs. We asked whether consensus solutions might have a higher score than the elite proposal. To do this, we identified the top gene and cell type for all loci across 1,000 Gen200 proposals. Surprisingly, we observed a fitness score of 0.433 for the consensus – an improvement over the elite proposal (0.429). We observed no change for isMAGMAgene, isCancerGene, isPPI, and isPromoter between the consensus and elite proposal. However, we did observe higher OF scores for isMarkerGene, isMarkerPPI, isIntraPPI, isInterPPI, isCommonATAC and isMarkerATAC, and lower OF scores for isLPI in the consensus compared to the elite proposal. This result suggests the existence of multiple, mutually exclusive, but equally stable solutions preserved only in the consensus proposal.
GA identifies known targets
We compared genes discovered in the consensus against L2G [11] and a naive nearest gene classifier (distance from TSS). L2G outputs the likelihood a gene is causal for the SNP (L2G score) based on distance, molecular QTL, chromatin interaction and variant pathogenicity. We identified the same SNPs across the dataset and selected the gene with the highest L2G score. Of the 175 common loci, we observed 46.8% (n = 82) with shared prediction between L2G and consensus. Across all three models, 68 loci shared the same gene. In total, 77.7% (136 out of 175 loci) L2G genes were the nearest gene to the SNP, so we did not expect our model to have high concordance with L2G because we did not include a gene distance OF. While gene distance to SNP is worth consideration, it has been reported that the nearest gene to the SNP is affected only 15% of the time [33]. In contrast, in our predictions, 41.7% (86 out of 206 loci) were the nearest gene, an intermediate value between these two figures.
The identification of high confidence gene and cell type calls are essential for downstream analysis. We performed a power calculation to determine the threshold for identifying high confidence calls. To do this, we selected a threshold where 80% of high confidence L2G SNPs with the same gene prediction as the consensus (L2G ≥ 0.7) are detected (949 proposals) (Fig. 3A). We used this same threshold to identify high confidence cell types (Fig. 3B). The number of loci with a high confidence call in gene and cell type are 147 and 118 out of 206 respectively. At lead SNP rs10941679, we found the top gene and cell type was FGF10 and “fibroblast” in Gen200 (Fig. 3C, D). Compared to L2G, MRPS30 (L2G = 0.542) was ranked higher than FGF10 (L2G = 0.145) for the same SNP due to support from the QTL and distance modules [11]. Interestingly, eQTL analysis with rs10941679 revealed changes in gene expression levels for MRPS30 and FGF10 in MCF7 and BT474 BCa cell lines [34]. In our model, we observed shared evidence (isMAGMAgene and isPPI) for both genes. However, FGF10 had isMarkerGene as additional evidence. This result highlights the ability of our model to account for complex interactions and mechanisms.
Contribution of individual OFs to overall fitness
We assessed each OF’s contribution to fitness by comparing information content between Gen0 and Gen200. To do this, we computed the posterior probability of observing an OF score in Gen200 given Gen0. We also computed the effect size (ES) as the median difference between the two distributions. We found that the most informative OFs were isPPI (ES = 0.744) and isIntraPPI (ES = 0.870). We expected isMAGMAgene, which captures gene expression as a function of GWAS, to be the most informative OF. Although informative, isMAGMAgene yielded a lower score (ES = 0.359) than the top OF. In contrast, isCancerGene was not informative (ES = 0.106). IsLPI was also not informative (ES = 0.008), possibly due to a low number of lncRNA in the consensus (n = 5). For cell type prioritization, we observed isCommonATAC (ES = 0.438) and isMarkerATAC (ES = 0.398) to be informative, as expected.
We next investigated the information content on a locus-by-locus basis. To accomplish this, we counted all loci with OF support in Gen0 and Gen200. We used Kolmogorov-Smirnov (KS) to test whether these observations derive from the same theoretical distribution (KS test p = 2.20 x 10− 16). In Gen0, we observed 54.8% loci (n = 113) without OF support. In contrast, every locus had at least one supporting OF in Gen200. Additionally, Gen200 had more supporting OFs per locus (µ = 4.76, SD = 1.64) when compared to Gen0 (µ = 0.8, SD = 1.04). Taken together, five OFs had evidence for a large number of positive loci (isMAGMAgene, isPPI, isCommonATAC, isMarkerATAC, and isIntraPPI).
BCa GWAS loci are enriched in associations with breast-specific assays
We reasoned that, according to our hypothesis, in which GWAS variants interact to link common cell types and pathways, there should be a greater number of associations both with disease relevant data and between loci. To test these predictions, we first evaluated whether the solutions discovered by GA had higher fitness than those from equivalent sets of randomly selected variants. Second, we analyzed the network properties of BCa GWAS relative to these control sets.
We repeated GA as before with our 10 control sets. To capture random variation in stable solutions we ran nine additional models for BCa and each control set, each with a different initial population, (10 BCa and 100 control GA runs) (Fig. 4A). In Gen200, we computed the posterior probability of observing the BCa fitness scores given the control distribution (BCa: µ = 0.415, SD = 1.83 x 10− 3; control: µ = 0.330, SD = 8.77 x 10− 3). We observed the BCa fitness scores were significantly higher compared to the control by assessing the probability that the mean difference is zero or less (p = 0.041). Thus, our model is able to distinguish between BCa and randomly chosen SNPs. Moreover, the higher fitness score reveals the potential for true biological associations between BCa GWAS and breast derived multi-omics data.
If a higher fitness score in BCa is driven by its associations with breast-specific data, we predict that the BCa and control set fitness scores should also be driven by different OFs. To test this prediction, we computed the posterior probability of observing positive OFs in BCa given the control set (Fig. 4B). We observed isMAGMAgene, isCommonATAC, and isMarkerATAC higher in BCa than control (ES greater than zero, p < 0.05). We expected isMAGMAgene to outperform in BCa compared to the control group (ES = 0.314) as it’s derived from breast expression. The enrichment of isCommonATAC and isMarkerATAC relative to control suggests that BCa SNPs are associated with normal breast cell types. In contrast, isPromoter, isLPI, isMarkerGene, isMarkerPPI, and isInterPPI were indistinguishable between the BCa and control set when assessing the frequency of ES greater than zero (p ≥ 0.05) (Fig. 4B). Surprisingly, we observed isIntraPPI (p = 0.073) and isPPI (p = 0.074) had a small ES when comparing the BCa to the control set, 0.057 and 0.039 respectively. The result shows that even randomly selected SNPs have a high PPI score.
Given the enriched OFs in BCa, we asked how individual loci contributed to increased fitness over control. To address this, we measured the information content at all BCa loci. We computed the number of OF support for the 176 BCa loci used to match the control sets. We identified the consensus gene and cell type in Gen200 for the 10 BCa GA runs and the 10 matching SNPs from the 10 control GA runs (total of 10 x 10 = 100 runs) and computed the number OF support for each of the 110 GA runs. We used the Wilcoxon rank-sum test to identify differences between OF support by comparing the two distributions. After multiple hypothesis correction, we observed 61.4% (n = 108) BCa loci with higher OF support than control (p ≤ 0.05, ES > 0). In contrast, we observed 8.5% (n = 15) BCa loci with lower OF support than control (p ≤ 0.05, ES < 0). Our analysis of the result demonstrates a majority of loci in BCa have higher OF support than due to chance alone, and provides critical information about lack of support for other loci. This procedure can be used to measure the benefit of OFs, and to exclude non-informative loci from downstream analysis.
By curating a set of control SNPs, we identified the most informative OFs (isMAGMAgene, isCommonATAC, and isMarkerATAC) that distinguish BCa from control. These OFs corresponded to the breast specific data. We anticipated that isPPI would be an informative OF, but despite its overall importance to the outcome for BCa and control (OF mean = 0.94 vs. 0.90 controls) our analysis revealed no significant difference. It is possible that including all interaction experimental evidence (interaction score > 0) from STRING in our isPPI OF may not be stringent enough. It is also possible the quality of interactions as measured in network size is better in BCa than control, and we explore this next.
GWAS variants are enriched for larger networks
Based on our OF enrichment analysis (Fig. 4B), PPI failed to distinguish between the BCa and control set. This finding did not support our hypothesis that molecular interaction mechanisms are embedded within GWAS. If the control set represents variants without any true associations to breast data, then we predict BCa will have larger PPI network sizes. To test our prediction, we identified all PPI (interaction score ≥ 0.4) for the 10 BCa and 100 control GA run. We observed no significant difference in the number of subgraphs between the two groups (KS test p = 0.633) (Fig. 5A). Next, we computed the number of genes per subgraph and observed the control having fewer genes in their largest subgraph (µ = 7.95, SD = 3.52) when compared to the BCa sets (µ = 28.6, SD = 3.5). We used KS to test whether these observations derive from the same theoretical distribution (KS test p = 2.132 x 10− 14). Additionally, we downsampled the BCa (n = 176) to adjust for the additional 30 SNPs that we excluded in making the control sets. We observed BCa (µ = 21.9, SD = 7.61) still had more genes in their largest subgraph compared to controls (KS = 4.71 x 10− 8) (Fig. 5B, C). The result strongly supports the conclusion that genes selected in BCa GWAS have a larger PPI network than expected due to chance, consistent with our hypothesis that GWAS variants are functionally connected.
Reconstruction of cellular interaction from the consensus proposal
Earlier, we found the surprising result that the consensus proposal scored higher than the highest scoring elite proposal. We speculated that competing subsets of loci in different proposals produce more than one family of stable solutions. To quantify diversity of the Gen200 proposal set, we computed the Gini-Simpson index for the 206 loci in the 10 BCa GA runs. We selected loci with low diversity (Gini-Simpson index ≤ 0.5 and gene count ≤ 2) that produced the same gene predictions across multiple independent runs. Of the 118 high confidence BCa SNPs, we identified 26 loci with PPI. We constructed a projection of the protein interaction network which consisted of 6 subgraphs – the largest having a total of 12 genes (subgraph 1) (Fig. 6).
We constructed a map that links genetic variants to gene and cell type. To accomplish this, we annotated predicted cell type on the PPI network graph from Fig. 6. The largest subgraph included basal, luminal hormone receptor positive and negative, fibroblast, adipocytes and blood endothelial, and lymphatic cell types. This result shows in principle how an interpretable model of GWAS can be constructed from the consensus proposal.