Genetic variations not only underlie individual polymorphisms but have also been confirmed as risk factors for various disease phenotypes[42]. With the growth of public GWAS summary data, an increasing number of relationships between disease phenotypes and genetic variations are being revealed. MR is an effective method for inferring causality between phenotypes. However, while GWAS data provide analytical opportunities, their analysis poses significant challenges for researchers, primarily in three respects.
First, public GWAS data stem from diverse sources, such as OpenGWAS, the UK Biobank, the GWAS Catalog, and FinnGen, with substantial differences in data formats. Although OpenGWAS allows users to fetch some data through APIs, network congestion or large data requests may lead to data loss. Preprocessing GWAS data demands high proficiency in statistical programming to format text data. Additionally, because of the immense volume of GWAS data, considerable computational and storage resources are needed. Second, the MR analysis process involves numerous parameters and processing steps, which makes it difficult to ensure the reliability and reproducibility of causal inference. Filtering SNPs based on allele frequency or the F statistic lacks standardized criteria, potentially leading to inconsistent conclusions even when employing the same parameters. Third, uncompressed text-format GWAS data files typically range from hundreds of megabytes to several gigabytes. MR analyses usually require importing at least two GWAS datasets (exposure and outcome). Some analyses, such as MVMR, require multiple GWAS files simultaneously and demand more computational resources than a regular laptop can provide. Apart from computational resources, MR also places significant demands on storage. For instance, a bidirectional analysis of 731 immune cells requires 131 GB of storage.
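To make the F-statistic filter mentioned above concrete, the following is a minimal sketch in Python. It assumes the common per-SNP approximation F ≈ (beta / se)^2 and the threshold of 10 cited later in this section; the source does not state which formula the platform uses, and the `Snp` class, function names, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    rsid: str    # hypothetical identifier, for illustration only
    beta: float  # SNP-exposure effect estimate
    se: float    # standard error of beta


def f_statistic(snp: Snp) -> float:
    """Approximate per-SNP F statistic as (beta / se) squared (an assumption)."""
    return (snp.beta / snp.se) ** 2


def filter_instruments(snps, f_min=10.0):
    """Keep SNPs whose approximate F statistic exceeds f_min."""
    return [s for s in snps if f_statistic(s) > f_min]


# Illustrative values, not taken from any real GWAS.
snps = [
    Snp("rs_a", beta=0.10, se=0.02),
    Snp("rs_b", beta=0.03, se=0.02),
    Snp("rs_c", beta=-0.08, se=0.02),
]
kept = filter_instruments(snps)
print([s.rsid for s in kept])
```

Because the retained instrument set depends directly on `f_min`, two analyses of the same data with different thresholds can yield different instruments, which is one reason an explicit, recorded threshold matters for reproducibility.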
We developed BioWinfordMR, a platform containing over 6000 preprocessed GWAS summary datasets and numerous statistical modules to facilitate systematic causal inference. This platform enables users to perform causal inference based on GWAS summary data effectively in the following ways. First, BioWinfordMR incorporates extensive preprocessed GWAS data that users can access directly. The platform also offers a data cleaning module that automatically preprocesses data from various sources. Second, BioWinfordMR automates the application of cutting-edge methodologies based on customized parameters, enhancing the reliability and reproducibility of causal inference. Third, BioWinfordMR runs on large server infrastructure and is currently deployed on a server with 16 cores, 64 GB of memory, and 8 TB of storage to meet the substantial computational and storage demands of GWAS data.
In an applied example, we used the BioWinfordMR platform to explore potential pathways through which immune cells mediate the effect of gut microbiota on sepsis. Through batch analysis using the MRomics module, we identified six gut microbiota taxa and 38 immune cell exposures that are significantly associated with sepsis. Subsequently, using a two-step analysis, we identified two completely mediating pathways. Our analysis revealed that Enterobacteriaceae positively regulate the abundance of CD62L- CD86+ myeloid DCs, which in turn increases the risk of sepsis. Previous studies have suggested a potential connection between Enterobacteriaceae, myeloid DCs, and sepsis, yet no research has decisively determined the genetic interaction pathway among them through MR mediator analysis[43, 44]. In our study, MR mediator analysis provided strong evidence that Enterobacteriaceae activate myeloid DCs, consequently increasing the risk of sepsis through a mediating pathway.
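The two-step analysis described above can be illustrated with a product-of-coefficients calculation. This is a minimal sketch under stated assumptions: the source does not specify how the platform computes the mediated effect, so the delta-method standard error and the function names are illustrative choices, and the numbers are made up rather than taken from the sepsis analysis.

```python
import math


def indirect_effect(a, se_a, b, se_b):
    """Product-of-coefficients indirect effect with a first-order delta-method SE.

    a: effect of exposure on mediator (step 1)
    b: effect of mediator on outcome (step 2)
    """
    effect = a * b
    se = math.sqrt(a**2 * se_b**2 + b**2 * se_a**2)
    return effect, se


def two_sided_p(estimate, se):
    """Two-sided p-value from a normal approximation."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative inputs only (not results from the study).
a, se_a = 0.30, 0.08   # exposure -> mediator
b, se_b = 0.25, 0.07   # mediator -> outcome
effect, se = indirect_effect(a, se_a, b, se_b)
p = two_sided_p(effect, se)
print(f"indirect effect = {effect:.3f}, SE = {se:.3f}, p = {p:.4f}")
```

Under this framing, a pathway would be described as completely mediating when the indirect effect is significant while the direct effect of the exposure on the outcome, adjusted for the mediator, is not.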
Furthermore, we selected candidate genes significantly associated with sepsis at the eQTL and pQTL levels using the SMR algorithm. We then validated these candidate genes through colocalization analysis, ultimately identifying causal variants shared between sepsis and two candidate genes (ENTPD5 and MANEA). ENTPD5 encodes an enzyme involved in purinergic signaling and metabolism that hydrolyzes nucleoside triphosphates and diphosphates, affecting cellular processes such as proliferation, differentiation, and survival. Mutations in ENTPD5 are linked to certain cancers and infectious diseases[45, 46]. Recent studies have verified that ENTPD5 promotes renal injury in both human patients and mouse models. Xu et al. reported that ENTPD5 was mainly expressed in the renal tubules and that its expression level was altered in mice and patients with kidney injury[47]. MANEA, in turn, encodes the enzyme mannosidase endo-alpha, which plays a crucial role in N-glycan processing within the endoplasmic reticulum. The structural characteristics of MANEA have inspired the development of new inhibitors that disrupt pathogen protein N-glycan processing and reduce pathogen infectivity in cellular models[48].
With the advancement of Mendelian randomization methods, many packages for MR analysis are now available. However, these publicly available MR-related R packages mostly focus on addressing specific issues. Online platforms such as BioWinfordMR, which integrate rich MR functionality and data, are scarce. The most widely used MR-related platform currently is the MR-Base platform (https://app.mrbase.org/)[25]. Hence, we compared our platform with MR-Base. First, MR-Base only supports data from the OpenGWAS database. In contrast, the BioWinfordMR platform not only supports data retrieval from the OpenGWAS database through an API but also includes over 7000 locally processed omics GWAS datasets. Moreover, as the BioWinfordMR platform evolves, this number continues to grow. Second, while MR-Base allows users to upload local files, it only supports preprocessed plain text files. In addition to regular text files, BioWinfordMR also supports direct analysis using VCF format input files. MR-Base cannot analyze GWAS data with missing rsIDs, whereas BioWinfordMR can convert genomic coordinates into rsIDs. This function expands the types of input files supported, making it easier for users to conduct MR analysis with local files. Lastly, MR-Base can only conduct TwoSampleMR analysis and lacks more complex analyses such as multivariable MR, mediator analysis, SMR, and colocalization. Users must instead perform these analyses step by step using the corresponding R packages. BioWinfordMR incorporates the major functions of MR-Base and builds on them to offer more diverse functionality, and can thus better meet users' analysis needs.
The BioWinfordMR platform boasts a number of strengths, as outlined below.
1. Extensive Repository of Preprocessed GWAS Data
Currently housing nearly 7000 locally stored, preprocessed GWAS datasets, BioWinfordMR enables users to access data directly via IDs without the need to download them again from the original source. Moreover, for data from various sources that have not been preprocessed, the platform offers an automated formatting module that standardizes GWAS data into a unified format.
2. Efficient Execution of MR Analysis
Leveraging significant computational resources on large servers, BioWinfordMR can process data in parallel across multiple threads, greatly enhancing operational efficiency and reducing processing times. Under default settings with four threads, Mendelian randomization analysis involving 731 immune cells was completed in approximately 17 minutes, whereas single-threaded analysis on a laptop took approximately 1.5 hours. The platform interactively presents graphical and tabular postanalysis results, allowing users to adjust parameters with real-time result updates within the interface.
3. Generation of Reliable MR Estimates
BioWinfordMR focuses on enhancing reliability through multiple approaches for assessing pleiotropy, heterogeneity, and confounding factors. The platform offers tools such as MR-PRESSO to assess uncorrelated horizontal pleiotropy, while the CAUSE module evaluates correlated horizontal pleiotropy. Additionally, the PhenoScanner module is available for evaluating SNP confounding factors.
4. Reproducibility of MR Findings
By consolidating data and analytical modules within a unified platform, BioWinfordMR facilitates the reproduction of results by other analysts when utilizing identical parameters. Furthermore, the platform provides R code to aid users in reproducing their results on different devices.
5. Diverse MR Analysis Modules
BioWinfordMR offers comprehensive MR analysis functionalities tailored to meet user requirements for systematic and in-depth analysis. In addition to common TwoSampleMR capabilities, the platform includes functional modules such as MVMR, Mediator MR, LDSC, SMR, Coloc, MR Meta-Analysis, and various interactive visualization analysis modules.
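As an illustration of the kind of estimate and heterogeneity check referred to in the list above, the following minimal Python sketch computes a fixed-effect inverse-variance weighted (IVW) estimate and Cochran's Q from harmonized summary statistics. IVW and Cochran's Q are standard MR tools, but the source does not state which estimator or heterogeneity test the platform applies, so this is an assumption; the function names and input values are hypothetical.

```python
import math


def ivw_estimate(beta_x, beta_y, se_y):
    """Fixed-effect IVW estimate and standard error (first-order weights)."""
    weights = [bx**2 / sy**2 for bx, sy in zip(beta_x, se_y)]
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    return estimate, math.sqrt(1.0 / total)


def cochran_q(beta_x, beta_y, se_y):
    """Cochran's Q: weighted squared deviation of per-SNP ratios from the IVW estimate."""
    estimate, _ = ivw_estimate(beta_x, beta_y, se_y)
    q = 0.0
    for bx, by, sy in zip(beta_x, beta_y, se_y):
        q += (bx**2 / sy**2) * (by / bx - estimate) ** 2
    return q


# Illustrative summary statistics (not real data).
beta_x = [0.10, 0.20, 0.15]       # SNP-exposure effects
se_y = [0.01, 0.02, 0.015]        # SEs of SNP-outcome effects
beta_y_consistent = [0.05, 0.10, 0.075]  # every ratio equals 0.5
beta_y_mixed = [0.05, 0.10, 0.12]        # third SNP deviates

est, se = ivw_estimate(beta_x, beta_y_consistent, se_y)
q_consistent = cochran_q(beta_x, beta_y_consistent, se_y)
q_mixed = cochran_q(beta_x, beta_y_mixed, se_y)
print(est, se, q_consistent, q_mixed)
```

Q is compared against a chi-square distribution with (number of SNPs - 1) degrees of freedom; a large Q relative to that reference suggests that the instruments do not all point to the same causal effect.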
Our platform, while offering numerous advantages, also has certain limitations. First, some GWAS datasets pertaining to different traits may stem from the same cohort or from distinct cohorts that overlap. Such cohort overlap among traits can bias effect estimates toward confounded observational associations[49]. Although we made efforts to mitigate cohort overlap bias by selecting SNPs with F statistics exceeding 10, complete elimination of this bias remains challenging. Second, our platform features multiomics MR analysis modules encompassing areas such as the gut microbiota, cytokines, and immune cells. Typically, to control false positives in multiple testing, correction by the false discovery rate (FDR) is recommended[50]. However, given that positive causal relationships in MR analysis are often sparse, applying FDR correction risks discarding genuine positive findings.
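The trade-off described above can be seen in a small worked example. The sketch below implements the Benjamini-Hochberg procedure, one widely used FDR method; the source does not say which FDR algorithm is meant, so that choice is an assumption, and the p-values are illustrative rather than drawn from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values from five hypothetical exposure-outcome tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
adjusted = benjamini_hochberg(p_values)

nominal_hits = sum(p < 0.05 for p in p_values)
fdr_hits = sum(q < 0.05 for q in adjusted)
print(nominal_hits, fdr_hits)
```

In this toy example, some associations that are nominally significant at 0.05 no longer pass after adjustment, which is the loss of potentially genuine findings that the paragraph above describes.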