Data Sources and Quality Control
Figure 1 outlines the workflow for our study. Due to the confounding effects of ancestral differences in linkage disequilibrium (LD) structure and the scarcity of sufficiently large multi-ancestry samples, we limited our main analysis to individuals of European ancestry. We sourced genome-wide association study (GWAS) summary statistics from the most comprehensive and recent publicly available datasets of European ancestry. Specifically, GWAS summary statistics for leukocyte telomere length (LTL) were derived from a published GWAS comprising 464,716 individuals of European ancestry from the UK Biobank7. LTL was quantified as the ratio of telomere repeats copy number (T) to a single copy gene (S) in a mixed leukocyte population, measured via a multiplex quantitative polymerase chain reaction (qPCR) assay, and subsequently log-transformed to achieve an approximation to a normal distribution. Our selection criteria for GWAS included studies with sample sizes exceeding 50,000 to ensure adequate statistical power. Accordingly, we included GWAS summary statistics for six major cardiovascular diseases (CVDs): atrial fibrillation (AF)8, coronary artery disease (CAD)9, venous thromboembolism (VTE)10, heart failure (HF)11, peripheral artery disease (PAD)12, and stroke13. AF GWAS summary statistics were sourced from a genome-wide meta-analysis of six studies (The Nord-Trøndelag Health Study [HUNT], deCODE, the Michigan Genomics Initiative [MGI], DiscovEHR, UK Biobank, and the Atrial Fibrillation Genetics [AFGen] Consortium), encompassing 60,620 AF cases and 970,216 controls of European ancestry. For CAD, we utilized GWAS summary statistics from a genome-wide meta-analysis by the CARDIoGRAMplusC4D Consortium and the UK Biobank, which included 181,522 cases and 984,168 controls. VTE GWAS summary statistics were extracted from a meta-analysis of 81,190 cases and 1,419,671 controls of European ancestry across 7 cohorts (the Copenhagen Hospital Biobank Cardiovascular Disease Cohort [CHB-CVDC], Danish Blood Donor Study [DBDS], deCODE, Intermountain Healthcare, UK Biobank, FinnGen, and Million Veterans Program [MVP] Consortium). GWAS summary statistics for HF came from the Heart Failure Molecular Epidemiology for Therapeutic Targets (HERMES) Consortium, including 47,309 cases and 930,014 controls. GWAS summary statistics for PAD were derived from a genome-wide meta-analysis of 11 independent GWASs, totaling 12,086 cases and 499,548 controls. Lastly, GWAS summary statistics for Stroke were obtained from the GIGASTROKE consortium, which comprised 73,652 cases and 1,234,808 controls of European ancestry. Detailed information about these GWAS summary statistics and their original publication sources is available in Supplementary Table 1.
Prior to further analysis, stringent quality control measures were applied to the GWAS summary statistics, encompassing several key steps: (i) alignment with the hg19 genome build, referencing the 1000 Genomes Project Phase 3 Europeans; (ii) restriction of the analysis to autosomal chromosomes; (iii) removal of single nucleotide polymorphisms (SNPs) lacking a rsID or presenting duplicated rsIDs; and (iv) exclusion of rare or low-frequency variants, defined by a minor allele frequency (MAF) less than 1%. To ensure robust and interpretable comparisons between LTL and CVDs, we standardized the summary statistics to include only SNPs present across all analyzed phenotypes, resulting in a cohesive dataset of 6,923,146 SNPs. Additionally, in subsequent analyses, we implemented further data processing techniques tailored to the specific requirements of various statistical tools.
Genetic overlap
To explore the shared genetic foundations between LTL and six major CVDs, we evaluated genetic overlap across genome-wide, polygenic, and local levels.
Genome-wide genetic correlation analysis between LTL and CVDs
At the genome-wide level, we analyzed the genetic correlations (rg) between LTL and six major CVDs using cross-trait linkage disequilibrium (LD) score regression (LDSC)49,50. LDSC facilitates the estimation of the average genetic effect sharing across the entire genome between two traits, leveraging GWAS summary statistics. This includes the contribution of SNPs below the threshold of genome-wide significance and accounts for potential confounding factors such as polygenicity, sample overlap, and population stratification. This analysis utilized pre-computed LD scores from the European reference panel in the 1000 Genomes Project Phase 3, excluding SNPs that did not overlap with the reference panel. Notably, the major histocompatibility complex (MHC) region (chr 6: 25-35 Mb), known for its intricate LD structure, was omitted from the main analysis. LDSC analysis estimated the genetic correlations between LTL and the six major CVDs. This method utilizes a weighted linear model, where it regresses the product of Z-statistics from two traits against the LD score across all genetic variants genome-wide. Genetic correlations with P-values below the Bonferroni-adjusted threshold (P = 0.05 / number of trait pairs = 0.05 / 6 = 8.33×10-3) were deemed statistically significant.
To elucidate the biological underpinnings of the shared genetic predisposition to LTL and six major CVDs, we employed stratified LDSC applied to specifically expressed genes (LDSC-SEG) to identify relevant tissue and cell types. This analysis incorporated tissue and cell type-specific expression data from the Genotype-Tissue Expression (GTEx) project51 and the Franke lab52,53, covering 53 tissues and 152 cell types. Additionally, we utilized chromatin-based annotations associated with six epigenetic marks (DNase hypersensitivity, H3K27ac, H3K4me1, H3K4me3, H3K9ac, and H3K36me3) for validation purposes. These annotations included 93 labels from the Encyclopedia of DNA Elements (ENCODE) project and 396 labels from the Roadmap Epigenomics database. For the identified relevant tissues or cell types, we adjusted the P-values for the significance of the coefficients using the False Discovery Rate (FDR) method. An FDR threshold of < 0.05 was established as the criterion for statistical significance.
Polygenic overlap analysis between LTL and CVDs
To augment the genome-wide genetic correlation analysis, we engaged the causal mixture modeling approach (MiXeR) to quantify the polygenic overlap between LTL and six major CVDs, independent of genetic correlation directions. MiXeR estimates the quantity of shared and phenotype-specific "causal" variants that exert non-zero additive genetic effects, accounting for 90% of SNP-heritability in each trait54. This 90% SNP-heritability threshold minimizes the influence of variants with negligible effects. Importantly, MiXeR's capacity to evaluate polygenic overlap without regard to the directional effects of variants offers a nuanced view of local genetic associations that might be obscured in traditional genome-wide genetic correlation estimations due to opposing variant effects. Initially, univariate MiXeR analyses estimated the "causal" variant count for LTL and six major CVDs, assessing both polygenicity and the average magnitude of additive genetic effects among these variants. The LD structure was determined using the genotype reference panel from the 1000 Genomes Project Phase 3. The MHC region (chr 6: 25-35 Mb), known for its intricate LD structure, was excluded from the main analysis. Subsequently, bivariate MiXeR analysis quantified the polygenic overlap between LTL and the six CVDs. This analysis delineated the additive genetic effects across four categories: (i) SNPs with zero effect on both traits, (ii) trait-specific SNPs with non-zero effects on the first trait, (iii) trait-specific SNPs with non-zero effects on the second trait, and (iv) SNPs affecting both traits. The Dice coefficient (DC), representing the proportion of shared SNPs between two traits relative to the total number of SNPs associated with either trait, was used to estimate polygenic overlap. Additionally, MiXeR calculated the overall genetic correlations (rg), the correlation of effect sizes within the shared genetic component (rgs), and the fraction of variants with concordant effects in the shared component. The model's predictive accuracy was evaluated through comparison of the modeled versus actual data, utilizing conditional quantile-quantile (Q-Q) plots, log-likelihood plots, and the Akaike information criterion (AIC). Positive AIC differences are interpreted as evidence that the best-fitting MiXeR estimates are distinguishable from the reference model. A negative AIC value indicates that the MiXeR model fails to distinguish effectively between maximal and minimal overlap scenarios.
Local genetic correlation between LTL and CVDs
To investigate whether there are any genomic loci with pronounced genetic correlations despite negligible genome-wide rg, we utilized the Local Analysis of [co]Variant Annotation (LAVA) method to estimate the localized genetic correlation between LTL and six major CVDs. LAVA, grounded in a fixed-effects statistical model, enables the estimation of local SNP heritability (loc-h2SNP) and genetic correlations (loc-rgs) across 2,495 semi-independent genetic regions of approximately equal size (~1 Mb). This method is adept at identifying loci with mixed effect directions, offering a nuanced measure of genome-wide genetic overlap, albeit influenced by the statistical power of the underlying GWAS data. For this analysis, we adopted the genomic regions defined by Werme et al. as autosomal LD blocks, characterized by minimal inter-LD block linkage with an average size of 1 million bases, each containing at least 2500 variants. The LD reference panel from the 1000 Genomes Project Phase 3 for European samples was employed, and consistent with LDSC and MiXeR analyses, the MHC region (chr 6: 25-35 Mb) was excluded. Identifying meaningful loc-rgs requires a significant local genetic signal; thus, a stringent P-value threshold (P < 1×10-4) was applied to filter out non-significant loci. Subsequent bivariate testing on selected loci and traits with notable univariate genetic signals was conducted. For these loci, P-values for loc-rgs were adjusted using the Benjamini-Hochberg FDR method, with an FDR < 0.05 establishing statistical significance. LAVA incorporated an estimate of sample overlap (genetic covariance intercept) from bivariate LDSC analyses to account for potential sample overlap effects.
In cases where shared risk loci were identified across multiple phenotypes, the Hypothesis Prioritisation for multi-trait Colocalization (HyPrColoc) method was employed to assess whether association signals across more than a pair of traits were colocalized. HyPrColoc, an efficient deterministic Bayesian clustering algorithm, leverages GWAS summary statistics to identify clusters of colocalized traits and potential causal variants within a genomic locus, providing a posterior probability (PP) of colocalization for each cluster. Loci with a PP > 0.7 were deemed colocalized, enhancing our understanding of shared genetic architectures across traits.
Pleiotropy insights to dissect genetic overlap
Causal inference between LTL and CVDs
To elucidate the potential causal relationships underlying the genetic correlations observed between LTL and six major CVDs, we employed latent causal variable (LCV) analysis. This method posits that the genetic correlation between two traits operates through a latent factor, allowing for the distinction between genetic causality (vertical pleiotropy) and both correlated and uncorrelated horizontal pleiotropy55. It achieves this by estimating the genetic causality proportion (GCP) across all genetic variants, where GCP quantifies the share of each trait's heritability explained by a mutual latent factor. It is essential to emphasize that GCP does not indicate the magnitude of causal effects but only implies a causal relationship between traits. This method offers insights to evaluate whether the impact of one trait on the second exceeds the evidence in the reverse direction. The sign of genetic correlation can be employed to infer the consequence of the partial genetic causality of one trait on another. A GCP greater than zero suggests a partial genetic causal relationship from trait 1 to trait 2 and vice versa, with values closer to |GCP| = 1 indicating stronger evidence of vertical pleiotropy. Conversely, a GCP of zero implies horizontal pleiotropy. We considered an absolute GCP estimate (|GCP|) greater than 0.60 as indicative of substantial genetic causality, applying a Bonferroni-corrected significance threshold of P < 8.33×10-3 to account for multiple comparisons across trait pairs. The LCV analysis, while robust, assumes a singular, unidirectional latent variable driving the genetic correlation, a premise potentially confounded by bidirectional causal effects or multiple latent factors.
Addressing the limitations inherent to LCV, we also conducted an analysis using Latent Heritable Confounder Mendelian Randomization (LHC-MR)56. This advanced Mendelian randomization approach leverages genome-wide variants to explore bi-directional causal associations between complex traits, thus enhancing the capacity to discern bi-directional genetic effects, direct heritabilities, and confounder effects, even amidst sample overlap. By modeling an unobserved heritable confounder affecting both exposure and outcome traits, LHC-MR assures the critical assumption of exchangeability57. Its comprehensive modeling of potential SNP effects-direct, indirect, or null-on the traits offers more accurate causal effect estimates compared to traditional MR methods, such as MR Egger, weighted median, inverse variance weighted (IVW), simple mode, and weighted mode58-60. For statistical significance, we adjusted for multiple testing with a threshold of P < 4.17×10-3, considering both the number of trait pairs and the number of tests conducted.
Pairwise Pleiotropic analysis between LTL and CVDs
To delve into the role of horizontal pleiotropy between LTL and six major CVDs, we utilized the Pleiotropic Analysis under the Composite Null Hypothesis (PLACO) method to conduct a comprehensive genome-wide identification of pleiotropic SNPs that concurrently influence the risk of both traits simultaneously. PLACO operates on the composite null hypothesis that asserts a genetic variant is either associated with just one or neither trait, thereby distinguishing between pleiotropic effects and singular trait associations61. The method evaluates this hypothesis by examining the product of the Z statistics derived from the GWAS summary statistics of both traits, formulating a null distribution of the test statistic as a mixture distribution. This allows for the acknowledgment of SNPs that may be linked to only one or none of the phenotypes under study. The threshold for identifying pleiotropic SNPs with significant evidence of genome-wide pleiotropy was set at PPLACO < 5×10-8. Additionally, to adjust for possible sample overlap, we de-correlated the Z-scores using a correlation matrix directly estimated from the GWAS summary statistics, ensuring a more accurate interpretation of pleiotropic effects.
Characterization of pleiotropic loci and functional annotation
The Functional Mapping and Annotation of Genome-Wide Association Studies (FUMA) platform was employed to identify independent genomic loci and conduct functional annotation for pleiotropic SNPs revealed by PLACO analysis61. FUMA, leveraging data from 18 biological databases and analytical tools, annotates GWAS findings to highlight probable causal genes through positional and eQTL mapping 62. The 1000 Genomes Project Phase 3 European-based LD reference panels were utilized for LD structure correction. Initially, FUMA distinguishes independent significant SNPs (meeting genome-wide significance at P < 5×10-8 and r2 < 0.6), further defining a subset as lead SNPs based on mutual independence (r2 < 0.1). LD blocks within 500 kb of lead SNPs are merged to delineate distinct genomic loci, with the SNP exhibiting the lowest P-value in each locus designated as the top lead SNP. The analysis then assesses directional effects between LTL and six major CVDs by comparing Z-scores of these top lead SNPs. SNPs achieving genome-wide significance (P < 5×10-8) in individual GWAS for each trait were annotated using FUMA for comparative analysis. The identified pleiotropic loci were considered novel if they did not coincide with the loci previously reported in the original GWAS for LTL or any of the six major CVDs. In other words, to be deemed 'novel,' a locus identified through FUMA should not have exhibited statistical significance in the single trait GWAS.
To elucidate the biological underpinnings of the observed statistical associations, lead SNPs were annotated using Annotate Variation (ANNOVAR)63 for their proximity to genes and potential impact on gene function. The Combined Annotation-Dependent Depletion (CADD) score64, which aggregates insights from 67 annotation resources, was used to assess the deleteriousness of variants. Variants with CADD scores greater than 12.37 were deemed likely to exert deleterious effects. Furthermore, the RegulomeDB score provided a categorical assessment of an SNP's regulatory potential based on expression quantitative trait loci (eQTL) and chromatin marks, ranging from 1 (strong evidence of regulatory functionality) to 7 (minimal evidence)65. The highest regulatory potential is indicated by a score of 1a, while a score of 7 suggests the least regulatory significance. Chromatin states, determined by ChromHMM using data from 127 epigenomes and five chromatin marks, revealed the genomic regions' accessibility, categorized into 15 states. For the identification of putative causal genes, SNPs were mapped using two approaches: positional mapping within a 10-kb window around the SNP and eQTL mapping.
Colocalization analysis
For pleiotropic loci identified and annotated by FUMA, we conducted a colocalization analysis using COLOC to pinpoint potential shared causal variants across pairwise traits within each locus. COLOC evaluates five mutually exclusive hypotheses for each pair of traits at a locus66: H0 posits no association with either trait; H1 and H2 suggest an association with only one of the traits; H3 indicates that both traits are associated due to different causal variants; and H4 implies a shared association for both traits stemming from the same causal variant. The analysis was performed using default COLOC prior probabilities: p1 and p2, each set at 1×10-4 for an SNP's association with the first and second trait, respectively, and p12 at 1×10-5 for an SNP associated with both traits. A Posterior Probability for Hypothesis 4 (PP.H4) greater than 0.7 was considered strong evidence for colocalization, suggesting the presence of shared causal variants at the locus. The SNP exhibiting the highest PP.H4 was identified as a candidate causal variant.
Gene level analyses
Building on the insights from PLACO, we delved into the shared biological processes and pathways involving the identified pleiotropic loci. Through gene-level analysis using Multi-marker Analysis of GenoMic Annotation (MAGMA)67, we assessed genes within or intersecting the pleiotropic loci, integrating data from both PLACO and single-trait GWAS. Unlike permutation-based approaches, MAGMA employs a multiple regression model that incorporates principal component analysis to evaluate gene associations. This model calculates a p-value for each gene, aggregating the impact of all SNPs linked to that gene while considering gene size, SNP count per gene, and linkage disequilibrium (LD) among the markers. SNPs were attributed to genes based on their location within the gene body or within a 10 kb range upstream or downstream. The LD calculations leveraged the 1000 Genomes Project Phase 3 European population as the reference panel, with SNP locations determined using the human genome Build 37 (GRCh37/hg19) and focusing on 17,636 autosomal protein-coding genes. A gene was deemed significant if its p-value was below 0.05 after applying a Bonferroni correction for the total number of protein-coding genes and the six trait pairs analyzed (P = 0.05 / 17,636 / 6 = 4.73×10-7). Due to complex LD patterns, the MHC region (chr6: 25-35 Mb) was excluded from MAGMA's gene-based analysis.
To overcome the limitations of MAGMA, which assigns SNPs to their nearest genes based on arbitrary genomic windows potentially missing functional gene associations due to long-range regulatory effects, we employed eQTL-informed MAGMA (e-MAGMA)68 for a more nuanced investigation of tissue-specific gene involvement based on PLACO results. e-MAGMA retains the statistical framework of MAGMA, using a multiple linear principal component regression model, but enhances gene-based association analysis by incorporating tissue-specific cis-eQTL information for SNP assignment to genes, which yields more biologically relevant and interpretable findings. For our analysis, we utilized eQTL data from 47 tissues provided by the GTEx v8 reference panel, as available on the e-MAGMA website. Guided by the principle that analyses focused on disease-relevant tissues yield more pertinent insights, we selected ten relevant tissues for our study. These included three artery tissues, two adipose tissues, two heart tissues, and three additional tissues (whole blood, liver, and EBV-transformed lymphocytes), chosen based on their significant enrichment in the LDSC-SEG analysis. The LD reference data for our analysis came from the 1000 Genomes Phase 3 European panel. We calculated tissue-specific p-values for each gene across the selected tissues, with significance determined post-Bonferroni correction for the number of tissue-specific protein-coding genes and trait pairs examined. For instance, the significance threshold for adipose subcutaneous tissue was set at P = 0.05 / 9,613 / 6 = 8.67×10-7. Similar to MAGMA, e-MAGMA analysis results within the MHC region (chr6: 25-35 Mb) were excluded to avoid confounding due to complex LD patterns. Additionally, we conducted a transcriptome-wide association study (TWAS) based on single-trait GWAS results using the functional summary-based imputation software, FUSION, applying tissue-specific Bonferroni corrections to determine significance. The FUSION approach integrates GWAS summary statistics with pre-computed gene expression weights, referencing the same tissues analyzed in the e-MAGMA study from the GTEx v8 dataset. This integration takes into account the LD structures to identify significant relationships between gene expression levels and specific traits.
Pathway level analyses
To investigate the genetic pathways underlying the comorbidity of LTL and six major CVDs, we utilized MAGMA for gene-set analysis. This analysis employs a competitive approach, where test statistics for all genes within a gene set, such as a biological pathway, are aggregated to derive a joint association statistic. Gene sets were sourced from the Gene Ontology biological processes (GO_BP) via the Molecular Signatures Database (MsigDB), with gene definitions and association signals derived from MAGMA gene-based analysis. We adjusted for multiple testing using a Bonferroni correction, setting the threshold at P = 0.05 / 7,744 / 6 = 1.08×10-6. We then conducted functional enrichment analysis on the genes overlapping across more than one trait pair, which were significantly identified by both MAGMA and e-MAGMA analyses. For this purpose, the ToppGene Functional Annotation tool (ToppFun) was employed to identify significantly represented biological processes and enriched signaling pathways, considering the entire genome as the background. ToppFun performs Functional Enrichment Analysis (FEA) on the specified gene list, leveraging a broad spectrum of data sources, including transcriptomics, proteomics, regulomics, ontologies, phenotypes, pharmacogenomics, and bibliographic data. The list of candidate genes was submitted to the ToppFun tool within the ToppGene Suite, with an FDR < 0.05 established as the threshold for statistical significance.
Proteome-wide Mendelian Randomization study analysis
To explore potential common causal factors at the proteomic level, we utilized Summary data-based Mendelian Randomization (SMR)69 to examine associations between protein abundance and disease phenotype. This analysis leveraged index cis-acting variants (cis-pQTLs) identified in the UK Biobank Pharma Proteomics Project (UKB-PPP), which includes plasma samples from 34,557 European individuals with the measurement of 2,940 plasma proteins using the Olink Explore platform. Cis-pQTLs were defined as SNPs located within a 1Mb radius from the transcription start site (TSS) of the gene encoding the protein. Only index cis-pQTLs associated with plasma protein levels at a genome-wide significance threshold (P < 5×10-8) were considered for inclusion in the SMR analysis. Summary data-based Mendelian Randomization (SMR) is designed to prioritize genes for which expression levels are potentially causally linked to an outcome trait, utilizing summary statistics within a Mendelian Randomization framework. To differentiate between pleiotropy and linkage69,70 (where protein abundance and a phenotype manifestation could be influenced by two separate causal variants in strong linkage disequilibrium with one another), the Heterogeneity in Dependent Instrument (HEIDI) test was employed. A HEIDI test p-value below 0.01 signifies the presence of two distinct genetic variants in high linkage disequilibrium, explaining the observed associations. Additionally, to address potential biases from analyzing single SNPs, a multi-SNP approach (multi-SNPs-SMR) was utilized as a sensitivity analysis, enhancing the reliability of the statistical evidence. A p-value less than 0.05 in the multi-SNPs-SMR analysis was deemed significant. Furthermore, HyPrColoc analysis was employed to ascertain whether the associations identified between proteins and various diseases stemmed from the same causal variant or were due to linkage disequilibrium. A posterior probability of a shared causal variant (PP.H4) greater than 0.7 signifies strong evidence of colocalization between proteins and multiple diseases.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.