Overview of the Study Design
The main analysis focused on the causal relationship between blood lipid levels and breast cancer, including its molecular subtypes. Specifically, we investigated the causal effect of each blood lipid trait (TC, HDL, LDL, and TG) individually on the development of breast cancer. This comprised 16 MR analyses (four lipid traits by four breast cancer outcomes) using summary-level data from large-scale meta-analyses of genome-wide association studies (GWAS). In addition, a reverse MR analysis was performed to assess whether breast cancer has a causal influence on lipid levels. Subsequently, a multivariable analysis was performed to estimate the independent effect of each lipid trait on breast cancer outcomes. The three core assumptions underlying MR, illustrated in Fig. 1, are: first, the instrumental variables (IVs) must be strongly associated with the exposure being studied; second, the IVs must not be associated with any confounding factors that could influence the relationship between the exposure and the outcome; and third, the IVs must influence the outcome only through the exposure. The data sources used in this manuscript are summarized in Table S1. To reduce bias from population stratification, all analyses were restricted to populations of African ancestry.
GWAS data sources
Instrumental variables for this study were derived from large-scale GWAS conducted on individuals of African ancestry. Genetic instruments for lipid traits were obtained from two key studies: the African Partnership for Chronic Disease Research (APCDR) and the Africa Wits-IN-DEPTH Partnership for Genomics Studies (AWI-Gen), which together included up to 24,215 participants. For breast cancer outcomes, we used a recent GWAS dataset comprising 18,034 cases and 22,104 controls. The outcomes examined were overall breast cancer (n = 18,034), estrogen receptor (ER)-positive breast cancer (n = 9,304), ER-negative breast cancer (n = 4,924), and triple-negative breast cancer (TNBC; n = 2,860). These outcome variables were defined by the African Ancestry Breast Cancer Genetic (AABCG) Consortium. Genotyping in these studies was performed using Illumina arrays or the Multi-Ethnic Genotyping Array (MEGA). Rigorous quality control (QC) procedures were applied to both datasets, including the removal of SNPs with missingness greater than 0.05, minor allele frequency (MAF) less than 0.01, or Hardy-Weinberg equilibrium (HWE) P-value less than 0.0001. Additional steps included imputation to the 1000 Genomes Project reference panel and adjustment for age, study design, and the first five genetic principal components. Ethical approval and participant consent were obtained in the original studies. All lipid datasets are available for download from the IEU OpenGWAS database (https://gwas.mrcieu.ac.uk/datasets/), and the breast cancer datasets can be accessed from https://www.ebi.ac.uk/gwas/studies. Additional details on the GWAS are available in the original studies [37, 38].
Extraction of SNPs associated with lipid traits
We identified SNPs associated with each lipid trait from the MRC IEU OpenGWAS database at the genome-wide significance threshold (p < 5 × 10⁻⁸). To ensure the independence of IVs, SNPs in linkage disequilibrium (LD) with each other were removed using an LD pruning threshold of r² < 0.001 within a window of 1,000 kilobases (kb). The exposure and outcome data were then harmonized so that the effect estimates for each SNP referred to the same effect allele in both datasets. SNPs with incompatible alleles between datasets (e.g., A/G vs. A/C) were excluded from the analysis. Palindromic SNPs with intermediate allele frequencies (between 0.45 and 0.55) were also removed, because their strand orientation cannot be inferred reliably and could bias the harmonized estimates [39]. To assess the strength of the selected SNPs, we calculated the F-statistic (F = β²/SE², where β is the SNP-exposure effect estimate and SE is its standard error) for each IV. IVs with an F-statistic below 10 were considered weak instruments and were excluded from the analysis [40].
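The instrument-strength and palindromic-SNP filters described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the field names (`beta`, `se`, `ea`, `oa`, `eaf`), the function names, and the example values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; field names and example values are hypothetical.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def f_statistic(beta, se):
    """Per-SNP instrument strength, F = beta^2 / se^2."""
    return (beta / se) ** 2


def is_palindromic(a1, a2):
    """True for A/T or C/G SNPs, whose alleles look the same on both strands."""
    return COMPLEMENT.get(a1.upper()) == a2.upper()


def keep_instrument(snp, f_min=10.0, freq_low=0.45, freq_high=0.55):
    """Apply the weak-instrument filter and the palindromic-SNP filter."""
    if f_statistic(snp["beta"], snp["se"]) < f_min:
        return False  # weak instrument (F < 10)
    if is_palindromic(snp["ea"], snp["oa"]) and freq_low <= snp["eaf"] <= freq_high:
        return False  # strand cannot be inferred from the allele frequency
    return True


snps = [
    {"id": "snp1", "beta": 0.10, "se": 0.02, "ea": "A", "oa": "G", "eaf": 0.30},
    {"id": "snp2", "beta": 0.05, "se": 0.02, "ea": "A", "oa": "G", "eaf": 0.30},
    {"id": "snp3", "beta": 0.10, "se": 0.02, "ea": "A", "oa": "T", "eaf": 0.50},
    {"id": "snp4", "beta": 0.10, "se": 0.02, "ea": "C", "oa": "G", "eaf": 0.20},
]
kept = [s["id"] for s in snps if keep_instrument(s)]
```

In this toy input, `snp2` is dropped as a weak instrument and `snp3` as a palindromic SNP with an intermediate allele frequency, while the palindromic `snp4` is retained because its allele frequency is far enough from 0.5 for the strand to be resolved.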
Quality Control and Data Standardization
Quality control of the breast cancer summary statistics was carried out following the guidelines outlined by Murphy et al. [41]. We used MungeSumstats, a Bioconductor R package, to standardize and process the summary statistics. MungeSumstats applies a series of automated quality control procedures to ensure data consistency and accuracy. First, we standardized the column headers and checked the consistency of the dataset, confirming that the alleles were correctly represented and aligned with the reference genome. Non-biallelic SNPs were removed, and missing SNP IDs were imputed from chromosome and base pair position. Indels, as well as duplicated RSIDs and base pair positions, were excluded from the dataset. Next, we verified that the direction of the effect alleles matched the reference genome, and any discrepancies in allele alignment were corrected by flipping the effect columns as needed. We also lifted over the genomic coordinates from hg38 to hg19, as the exposure dataset was stored in hg19. Further processing included renaming columns to accurately reflect minor or major allele frequencies and converting the summary statistics into the GenomicRanges format for easier integration.
Univariable Mendelian Randomization analysis
We conducted two-sample MR analyses using inverse variance weighted (IVW), weighted median (WM), and MR-Egger models to estimate the causal relationship between lipid traits (TG, TC, HDL, and LDL) and the risk of breast cancer [42–44]. We used the random-effects IVW method as the primary estimator [45]. The IVW regression model combines the variant-specific causal estimates, weighting each by the inverse of its variance [42]. This approach assumes that all genetic instruments are valid; in the absence of heterogeneity and horizontal pleiotropy, it provides reliable estimates of the causal effect. When heterogeneity was present without horizontal pleiotropy, the weighted median method was applied [44]. This method provides reliable estimates of the causal effect if at least half of the weight comes from valid variants. When both heterogeneity (variation in causal estimates across genetic variants) and horizontal pleiotropy (genetic variants influencing the outcome through pathways other than the exposure) were detected, we used MR-Egger regression. MR-Egger regression includes an intercept term, which estimates the average pleiotropic effect across all variants used as instruments, and thereby addresses bias induced by pleiotropy. Other MR methods, such as the simple median and simple mode described by Bowden, were also applied to further assess the robustness of the findings.
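The three main estimators can be sketched in a few lines of Python operating on summary statistics. This is a simplified illustration, not the implementation in the R packages used here: the function names are hypothetical, first-order weights are assumed, the random-effects IVW standard error is assumed to be inflated only when over-dispersion is present, and standard errors for the weighted median and MR-Egger estimates are omitted.

```python
import math


def wald_ratios(bx, by, se_y):
    """Per-SNP ratio estimates and first-order inverse-variance weights."""
    ratios = [y / x for x, y in zip(bx, by)]
    weights = [(x / s) ** 2 for x, s in zip(bx, se_y)]
    return ratios, weights


def ivw(bx, by, se_y):
    """IVW estimate with a multiplicative random-effects standard error."""
    ratios, weights = wald_ratios(bx, by, se_y)
    total = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total
    q = sum(w * (r - beta) ** 2 for w, r in zip(weights, ratios))
    # Assumption: the SE is inflated only if the variants are over-dispersed.
    scale = max(1.0, math.sqrt(q / (len(bx) - 1)))
    return beta, scale / math.sqrt(total)


def weighted_median(bx, by, se_y):
    """Weighted median of the ratio estimates (needs at least two SNPs)."""
    ratios, weights = wald_ratios(bx, by, se_y)
    pairs = sorted(zip(ratios, weights))
    total = sum(weights)
    cum, running = [], 0.0
    for _, w in pairs:
        running += w
        cum.append((running - 0.5 * w) / total)
    below = max(i for i, c in enumerate(cum) if c < 0.5)
    r0, r1 = pairs[below][0], pairs[below + 1][0]
    # Interpolate linearly to the point where the cumulative weight is 50%.
    return r0 + (r1 - r0) * (0.5 - cum[below]) / (cum[below + 1] - cum[below])


def mr_egger(bx, by, se_y):
    """Weighted regression of by on bx with an intercept; returns both terms."""
    # Orient every SNP so that its exposure association is positive.
    signs = [1.0 if x >= 0 else -1.0 for x in bx]
    x = [s * v for s, v in zip(signs, bx)]
    y = [s * v for s, v in zip(signs, by)]
    w = [1.0 / s ** 2 for s in se_y]
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    intercept = (swy - slope * swx) / sw
    return intercept, slope
```

Here `bx` and `by` are the SNP-exposure and SNP-outcome effect estimates and `se_y` the standard errors of the SNP-outcome estimates. An MR-Egger intercept close to zero is consistent with no directional pleiotropy, in which case the MR-Egger slope and the IVW estimate should broadly agree.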
Multivariable Mendelian Randomization analysis
Because lipid traits are genetically correlated, we used multivariable MR (MVMR) to assess the direct effect of each lipid trait on breast cancer outcomes, following the method described by Sanderson et al. [46, 47]. We retrieved the variants genetically associated with each exposure from the respective summary datasets. The SNPs for all traits (TC, HDL, LDL, and TG) were combined and then filtered for genome-wide significance (P < 5 × 10⁻⁸) and linkage disequilibrium (r² < 0.001). To test for the presence of weak instruments, we evaluated the strength and validity of the IVs. We also tested for horizontal pleiotropy using conventional Q-statistic estimation.
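As a rough illustration of the idea behind MVMR, the sketch below fits a multivariable IVW model: a weighted regression, without an intercept, of the SNP-outcome effects on the SNP-exposure effects for several exposures at once. It is a simplified stand-in rather than the MVMR package implementation; the function names and input layout are hypothetical, and instrument-strength and pleiotropy statistics are not computed.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mv_ivw(bx, by, se_y):
    """Direct-effect estimates, one per exposure.

    bx   : list of rows, one per SNP, each holding the SNP's effect on
           every exposure (e.g. [TC, HDL, LDL, TG])
    by   : SNP-outcome effect estimates
    se_y : standard errors of the SNP-outcome estimates
    """
    k = len(bx[0])
    w = [1.0 / s ** 2 for s in se_y]
    # Weighted normal equations: (X' W X) beta = X' W y
    xtwx = [
        [sum(wi * row[a] * row[b] for wi, row in zip(w, bx)) for b in range(k)]
        for a in range(k)
    ]
    xtwy = [sum(wi * row[a] * y for wi, row, y in zip(w, bx, by)) for a in range(k)]
    return solve_linear(xtwx, xtwy)
```

Each returned coefficient is interpreted as the effect of one exposure on the outcome with the other exposures held fixed, which is why correlated lipid traits are modeled jointly rather than one at a time.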
Sensitivity Analyses
Heterogeneity among genetic variants was assessed using Cochran's Q statistic from the IVW method, with p < 0.05 indicating significant heterogeneity [48]. We used the MR-Egger intercept to check whether horizontal pleiotropy influenced our results [49]; a p-value greater than 0.05 suggested that pleiotropy was not a significant factor. Furthermore, MR-PRESSO was used to detect and correct for outliers in the analysis [50]. The MR-PRESSO approach is derived from the IVW method but removes genetic variants whose variant-specific causal estimates deviate from those of the other variants [50]. In addition, a leave-one-out analysis was used to evaluate the influence of each exposure SNP on the MR estimate [51]. This was done by removing each variant in turn and re-estimating the causal effect. Results are presented as odds ratios (OR) with 95% confidence intervals (CI), providing an estimate of how lipid traits influence the probability of developing breast cancer [52]. A P-value < 0.05 was taken to indicate that the observed association between a lipid trait (risk factor) and the likelihood of developing breast cancer (outcome) was unlikely to be due to chance alone. To account for multiple testing, we applied the false discovery rate (FDR) correction based on the Benjamini and Hochberg method (P-value < 0.05/16 = 0.003125) [53]. All statistical analyses were performed in R version 4.2.2, using packages including TwoSampleMR, MVMR, MR-PRESSO, and MendelianRandomization [54–56].
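Several of these sensitivity steps reduce to short calculations, sketched below in Python. The function names are hypothetical, the Q statistic is shown without its p-value (which would come from a chi-square distribution with one fewer degrees of freedom than the number of SNPs), and the Benjamini-Hochberg function follows the general step-up definition; whether it matches the exact correction applied in this study is an assumption.

```python
import math


def ivw_estimate(bx, by, se_y):
    """IVW point estimate plus the per-SNP ratios and weights."""
    weights = [(x / s) ** 2 for x, s in zip(bx, se_y)]
    ratios = [y / x for x, y in zip(bx, by)]
    beta = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    return beta, ratios, weights


def cochran_q(bx, by, se_y):
    """Cochran's Q: weighted squared deviations of ratios from the IVW estimate."""
    beta, ratios, weights = ivw_estimate(bx, by, se_y)
    return sum(w * (r - beta) ** 2 for w, r in zip(weights, ratios))


def leave_one_out(bx, by, se_y):
    """IVW estimate recomputed with each SNP removed in turn."""
    estimates = []
    for i in range(len(bx)):
        keep = [j for j in range(len(bx)) if j != i]
        beta, _, _ = ivw_estimate(
            [bx[j] for j in keep], [by[j] for j in keep], [se_y[j] for j in keep]
        )
        estimates.append(beta)
    return estimates


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a log-odds estimate into an odds ratio with a 95% CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A leave-one-out result in which every estimate stays close to the full-sample estimate suggests that no single variant is driving the association.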