Sample information
This study used data from participants with European (N = 3514), Latin (N = 4309), or Persian (N = 1332) ancestry in the Population Urban Rural Epidemiology (PURE) biomarker sub-study [32]. Participants of non-European ancestry were excluded from the present analyses due to the need to align the PURE genetic data with external genetic datasets, which are predominantly from European participants. Participants of Latin and Persian ancestry were included due to their genetic overlap with European participants [32].
The PURE biomarker study is a nested case-cohort study of the original PURE study [33] with protein biomarkers and genotyping data [32]. Cases were selected if they had experienced at least one of the following adverse health events: myocardial infarction, stroke, heart failure, type II diabetes, or death from any cause. Cohort members were selected by random sampling to obtain a group of participants who were frequency-matched by major country-specific ethnicity to the cases. The PURE biomarker study also included European participants enrolled in PURE-MIND (N = 1198) [9]. Participants from selected countries in PURE [33] were invited to participate in PURE-MIND if they were aged ≥ 39 years, had no history of stroke, dementia, or other neurological disease; had no contraindication to MRI; and could complete cognitive assessments [9].
The European, Latin, and Persian PURE biomarker cohort participants were used to identify protein biomarker pQTLs [32], for use in Mendelian randomisation analyses, whilst data from the European PURE-MIND biomarker participants were used for observational association analyses. We sought replication of our observational findings in GS (max. N = 1053) [34, 35], which was recruited through re-contact of the Generation Scotland: Scottish Family Health Study (GS:SFHS ) [36, 37].
Ethical approval
All centres contributing to PURE were required to obtain approval from their respective ethics committees (Institutional Review Boards). Participant data is confidential and only authorized individuals can access study-related documents. The participants’ identities are protected in documents transmitted to the Coordinating Office, as well as biomarker and genetic data. Participants provided informed consent to obtain baseline information, and to collect and store genetic and other biological specimens.
The GS:SFHS obtained ethical approval from the NHS Tayside Committee on Medical Research Ethics, on behalf of the National Health Service (reference: 05/S1401/89). All participants provided broad and enduring written informed consent for biomedical research. GS:SFHS has Research Tissue Bank Status (reference: 15/ES/0040), providing generic ethical approval for a wide range of uses within medical research. The imaging subsample of GS:SFHS (referred to as “GS” herein) received ethical approval from the NHS Tayside committee on research ethics (reference 14/SS/0039). All experimental methods were in accordance with the Helsinki declaration.
Brain imaging
PURE-MIND participants enrolled in the PURE biomarker cohort were scanned at four sites in Canada (three at 1.5T (two on General Electric (GE) scanners, one on a Phillips scanner), one at 3T (GE)). The brain imaging phenotypes assessed in this study were total brain volume (excluding ventricles), total white matter volume, hippocampal volume, average cortical thickness, a multi-region composite thickness measure designed to differentiate Alzheimer’s disease patients from clinically normal participants [38], silent brain infarcts (SBI), cerebral microbleeds (CMB), and WMH volumes. These will henceforth be referred to as the “structural brain phenotypes”. SBI and CMB were defined as described previously [9]. WMH volumes were estimated using Lesion Segmentation Tool (LST) in Statistical Parametric Mapping 12 (SPM12) [39]. For analyses, WMH volumes were natural log-transformed, after adding one to account for values of zero, to reduce positive skew [40]. Brain volume measurements, intracranial volumes (ICV), and average cortical thicknesses were derived from T1-weighted images using FreeSurfer v5.3 (http://surfer.nmr.mgh.harvard.edu/) [41, 42]. GS participants [34] were scanned at two sites in Scotland (both 3T) [34]. Brain volumes and ICVs were derived from T1-weighted images using Freesurfer version 5.3 [41, 42]. As in PURE-MIND, WMH volumes in GS were obtained using LST [8].
Assessment of cognitive function
General cognitive ability was measured in PURE-MIND and GS by trained assessors using the DSST (Wechsler Adult Intelligence Scale, 3rd Edition) [43]. Participants were scored according to the number of correct matches made within two minutes (maximum score: 133). PURE-MIND participants completed the Montreal Cognitive Assessment (MoCA) [10], a questionnaire-based test with scores 0 to 30.
Measurement of plasma protein expression
Proteomic and genetic analysis were conducted in the Clinical Research Laboratory & Biobank – Genetic & Molecular Epidemiology Laboratory (CRLB-GMEL), Hamilton, Canada. In the PURE biomarker cohort, plasma protein levels were measured by proximity extension assay using the Olink Proseek Target 96 reagent kit (Olink, Uppsala, Sweden). Thirteen panels (Cardiometabolic; Cardiovascular Disease II and III; Cell Regulation; Development; Immune Response; Inflammation, Metabolism; Neuro Exploratory; Neurology; Oncology I and III; and Organ Damage) were used to measure a total of 1196 biomarkers in 12066 participants, of which 3735 European, 4695 were Latin, and 1436 were Persian. The analytical performance of these panels has been validated previously and further information can be found elsewhere (https://www.olink.com/products-services/target/). Quality control and pre-processing were performed as described previously for the Cardiovascular Disease II panel [32], with the exception that the data were quantile normalised within three, rather than two, reagent lots. Missing biomarker values were imputed by the mean, separately for each reagent lot, and all values were rank-based inverse normalised by reagent lot, sex and ethnic group. Where multiple biomarker measurements from different proximity extension assays were available for a single protein, the mean value was taken. Following quality control, measurements were available for 1160 biomarkers in between 8369 and 9154 European, Latin, or Persian participants (depending on biomarker-specific missingness).
In GS, plasma protein levels were measured with the SOMAscan assay platform (SomaLogic Inc.), as described previously [44]. Following initial data processing and quality control steps, measures of 4058 proteins were available in 1095 participants. Prior to analysis, protein abundance measurements were log-transformed and rank-based inverse normalised.
Genotyping and imputation of the PURE-MIND discovery sample
PURE participant genotypes (Thermofisher Axiom Precision Medicine Research Array r.3) were called using Axiom Power Tools and in-house scripts. Samples were removed if: they had a low signal-to-noise contrast (Dish Quality Control < 0.82); low quality control rate (QCCR < 0.97); <95% call rate; disagreement between self-reported sex and/or ethnicity and genetically determined sex and/or ethnicity; were duplicated; or had excess heterozygosity. We removed variants with: a call rate <98.5%; Hardy-Weinberg equilibrium p < 1 x 10-5; plate or batch effects; non-Mendelian segregation within families; and/or a minor allele frequency <0.005%. Following quality control, 749,783 variants remained [32].
Imputation was performed on the 749,783 variants following the TOPMed Imputation server pipeline (https://imputation.biodatacatalyst.nhlbi.nih.gov/), using the TOPMed release 2 reference panel [45]. EAGLE v2.4 [46] and Minimac4 programs were applied for phasing and imputation, respectively. Imputed variants with an info score ≥ 0.3 and MAF ≥ 0.01, which did not deviate from Hardy-Weinberg equilibrium (p ≥ 1 x 10-5) were retained.
Assessment of the association between protein biomarkers and cognitive function and structural imaging phenotypes
We assessed the association between standardised protein levels and cognitive and structural brain phenotypes using linear (DSST, MoCA, total brain volume, white matter volume, hippocampal volume, WMH volume, cortical thickness) or logistic (CMB, SBI) regression. The cognitive or structural brain phenotype-of-interest was modelled as the dependent variable with the standardised protein expression level, age, age2, sex, ethnicity, and first ten genetic principal components included as independent variables. A sensitivity analysis was performed for DSST-associated proteins in which we further adjusted for education (a categorical variable with levels: (i) no education; (ii) high school or less; (iii) trade school; and (iv) college or university). We calculated Pearson’s correlation coefficient to assess the pairwise correlations between DSST-associated proteins. Within each analysis, we applied a Bonferroni correction to determine statistical significance, yielding the following significance thresholds: p < 4.31 x 10-5 when assessing associations with 1160 proteins; p < 2.5 x 10-3 when assessing associations with the five DSST-associated proteins across four DSST-associated structural brain phenotypes; and p < 5 x 10-3 when assessing 10 pairwise correlations between proteins.
We performed replication analyses in GS for the significant proteins identified in PURE-MIND. Mixed effects models were fitted using the lmekin function from the R package coxme v.2.2.17 [47] to assess the association of the outcome variable (DSST performance, total brain volume (excluding ventricles), cerebral white matter volume, hippocampal volume, and WMH volume) with standardised protein expression, covarying for age, age2, sex, study site (Dundee or Aberdeen), the delay between blood sampling and protein extraction, depression (a binary variable representing lifetime depression status), and a kinship matrix. When a brain volume phenotype was the outcome variable, additional covariates were included to account for ICV, the interaction between ICV and study site (to account for a site-associated batch effect on ICV measurement), and whether there was manual intervention using tools within Freesurfer during the quality control process. Replication was defined as a concordant direction of effect, meeting a Bonferroni-corrected threshold of p < 1.67 x 10-2 (accounting for the assessment of three DSST-associated proteins) or p < 7.14 x 10-3 (accounting for the assessment of seven structural brain phenotype-protein combinations).
Assessment of the association of DSST performance with MRI-derived structural brain phenotypes
To identify mediators of the association between protein expression and DSST performance, we first established the structural brain phenotypes that satisfied the requirements of potential mediators (i.e. associated with both DSST performance and at least one DSST-associated protein), and then formally tested the meditation relationship by bootstrap mediation analyses.
We estimated the association between DSST performance and structural brain phenotypes in PURE-MIND using linear models. All brain volume measurements were normalised to ICV and the models included covariates for age, age2, sex, and the first ten genetic principal components. We defined statistical significance as p < 0.00625 (Bonferroni correction for eight phenotypes) and sought replication of significant associations (N = 4) in GS. In GS, brain volumes were residualised for ICV, scanner location, the interaction between ICV and scanner location, and whether there was manual intervention during the quality control process. The resultant residuals were included as the dependent variable in a mixed effects model with DSST score, age, age2, sex, depression, and a kinship matrix as independent variables. Statistical significance was defined as p < 0.0125.
The DSST performance-associated brain MRI phenotypes (N = 4) were assessed as potential mediators of the protein level-DSST associations (N = 3, yielding a total of N = 9 mediations to assess) using bootstrap mediation analysis in PURE-MIND. Analyses were performed using the R package “mediation” [48] with 1000 bootstraps. We corrected for the nine potential mediation relationships assessed using a Bonferroni-corrected threshold of p < 5.56 x 10-3.
Functional and tissue-specific expression enrichment analyses
Proteins associated with DSST performance at p < 0.05 in PURE-MIND were included in functional and tissue-specific expression analyses in three groups: (i) all proteins; (ii) positively associated proteins; and (iii) negatively associated proteins. Enrichment was assessed relative to all measured proteins (n=1160). Functional enrichment analyses were performed using WebGestalt (http://www.webgestalt.org/) [49] using default parameter settings for the over-representation analysis method to assess enrichment for: (i) gene ontology categories (biological processes, molecular functions, and cellular compartments); (ii) Reactome pathways; and (iii) disease-associated genes (Disgenet). Tissue-specific enrichment analyses were performed using the “GTEx v8: 54 tissue types” and “GTEx v8: 30 general tissue types” gene expression datasets in FUnctional Mapping and Annotation (FUMA) [50]. For both the functional enrichment and tissue expression analyses, enrichment was assessed using a hypergeometric test and significant enrichment was defined as a Benjamini-Hochberg-adjusted p < 0.05, correcting for the number of tests performed within each analysis platform. Analyses were performed using web interfaces accessed on 18/04/2022 (WebGestalt and FUMA) and 14/01/2023 (FUMA).
Two-sample forward MR analyses
We performed two-sample forward MR analyses to identify potentially causal associations between genetically predicted plasma protein levels and: (i) cognitive function; (ii) structural brain phenotypes (total brain volume, cerebral white matter volume, hippocampal volume, WMH volume, and CMB); and (iii) disease outcomes (Alzheimer’s disease, all stroke, stroke subtypes (ischaemic, cardioembolic, large artery, and small vessel), and intracranial aneurysm).
Associations between single nucleotide polymorphisms (SNPs) and plasma protein expression levels were calculated in PURE. SNPs located within 200 kilobases up- or downstream of the RefSeq transcript corresponding to a protein-of-interest were assessed as potential pQTLs through separate GWASs of the European (N = 3514), Latin (N = 4309), and Persian (N = 1332) participants, with significant association defined as p < 5 x 10-6. Missense variants and SNPs affecting splice sites were excluded. The GWAS model has been described previously [32]. Effect estimates were then combined by inverse variance-weighted fixed effects meta-analysis using METAL [51], and an independent set of pQTLs obtained by pruning (r2 < 0.1 within the European, Latin, and Persian subgroups from the PURE cohort). Sensitivity analyses were performed in which the pruning threshold was adjusted to r2 < 0.01.
The independent set of pQTLs were assessed for their associations with cognitive function, structural brain phenotypes, and disease outcomes using summary statistics from published studies [52-59].
MR analyses were performed using the R packages MRBase for TwoSample MR v.0.5.6 [60], mr.raps v.0.4.1 [61], and MRPRESSO v.1.0 [62]. We employed several complementary MR approaches: IVW [63], weighted median [64], robust adjusted profile scores (RAPS) [61], MR-Egger [65], and MR-PRESSO [62]. We adopted the IVW approach as our primary methodology and defined statistical significance using a liberal within-outcome variable Bonferroni correction for the two proteins (CA14 and CDCP1) that could be assessed, yielding a significance threshold of p < 0.025 (or p < 0.05 when an outcome could only be assessed for one protein). The IVW approach has the greatest statistical power but also makes the most assumptions. Hence, we reported IVW findings only where there was: (i) no evidence of pleiotropy; and (ii) corroboration of the direction of effect from at least two other MR approaches. An MR-Egger intercept p < 0.05 was deemed to indicate directional pleiotropy. Heterogeneity amongst instrumental variables, suggestive of horizontal pleiotropy, was indicated by a significant Cochran’s Q (p < 0.05). If Cochran’s Q was significant, MR-PRESSO [62] was performed, and, if the MR-PRESSO global test was significant (p < 0.05), MR-PRESSO with outlier removal was performed. In addition to the above conditions, we only reported results where there were at least three IVs, there was no evidence of weak instrument bias (F-statistic > 10) [66], and when the correction causal direction had been assessed (indicated by the instrumental variables explaining a greater proportion of the variance in the exposure than in the outcome, and a Steiger test p < 0.05). For the sensitivity analyses, in which a more stringent r2 threshold was used to select independent pQTLs, only one or two IVs were available for each analysis. When two IVs were available, results from the IVW approach are reported, and when one IV was available, results from the Wald ratio test were reported.
Pairwise Conditional Analysis and Co-localisation Analysis (PWCoCo)
PWCoCo [30, 67] was performed to assess the existence of a shared causal variant between (i) pQTLs for each of the five proteins-of-interest and (ii) variants associated with the outcomes assessed in the two-sample MR analyses. PWCoCo analyses were performed for all conditionally distinct pQTLs and all conditionally distinct association signals in the outcome data. Analyses were performed using SNP-protein associations calculated in the European PURE-MIND participants (N = 3514). As for the MR analyses, SNPs had to be located within 200 kilobases up- or downstream of the RefSeq transcript corresponding to a protein-of-interest.
Analyses were performed using a C++ implementation of the PWCoCo algorithm, which utilises methods from the GCTA-COJO [68] and coloc [69] R packages. PWCoCo calculates the posterior probabilities (PP) for: the existence of no causal variant(s) for either trait (PP0); the existence of causal variant(s) for trait one or trait two (PP1 and 2, respectively); both traits being associated with the same region, with different causal variants (PP3); and both traits being associated with the same region, with a shared causal variant(s) (PP4). Failure to find evidence in support of colocalisation can be due to lack of power [69]; therefore, we limited our analyses to those where PP3 + PP4 ≥ 0.8. Colocalisation was defined as PP4/PP3 > 5, as suggested previously [70].
Software
Statistical analyses and plot generation were performed in R (versions 3.6.0, 4.1.1, 4.1.2, 4.2.0, 4.2.1)