EnSCAN: ENsemble Scoring for prioritizing CAusative variaNts across multi-platform GWAS for Late-Onset Alzheimer's Disease

doi:10.21203/rs.3.rs-4031105/v1


Research Article


This work is licensed under a CC BY 4.0 License


Late-Onset Alzheimer's Disease (LOAD) is a progressive and complex neurodegenerative disorder in the aging population. LOAD is characterized by cognitive decline, such as deterioration of memory, loss of intellectual abilities, and impairment of other cognitive domains depending on traumatic brain injuries. Alzheimer's Disease (AD) presents a complex genetic landscape that remains elusive, which hinders the early and differential diagnosis of LOAD. While Genome-Wide Association Studies (GWAS) enable the examination of statistical interactions among individual variants within specific loci, traditional univariate analysis may overlook intricate relationships between these genetic elements. Machine learning (ML) algorithms, on the other hand, prove invaluable in unraveling concealed, novel, and significant patterns by considering nonlinear interactions among variants. This approach enhances our comprehension of the genetic predisposition underlying complex genetic disorders. When working on different platforms, majority voting cannot be applied because the attributes differ. Hence, a new post-ML ensemble approach is developed to select significant single nucleotide variations (SNVs) across multiple genotyping platforms. We propose the EnSCAN framework, which uses a new algorithm to ensemble selected variants, even from different platforms, to prioritize candidate causative loci. This in turn helps improve ML results by combining the prior information captured from each multi-model of each dataset. The proposed ensemble algorithm utilizes chromosomal locations of SNVs by mapping them to cytogenetic bands, along with the proximities between pairs and multi-model validation via Random Forest, to prioritize SNVs and candidate causative genes for Alzheimer's Disease. The scoring method is scalable and can be applied to any multi-platform genotyping study. We present how the proposed EnSCAN scoring algorithm prioritizes the candidate causative variants related to LOAD among three GWAS datasets.

Late-onset Alzheimer's Disease

Single Nucleotide Variation (SNV)

Machine Learning

Random Forest

Scoring

Ensemble

Causative Genes

Enrichment

Identifying the epistatic interactions between multiple genetic variations leading to complex human diseases is essential in developing diagnostics and new treatments. Among genetic variations, single nucleotide variations (SNVs) have received much attention regarding disease prediction. Models that narrow down the prioritized SNVs enriching the causative biomarkers at different loci are promising to discover the molecular etiology and enlighten the phenotypes of complex diseases like Alzheimer's disease (AD).

The late-onset form of AD [1] develops after age 65, and its causes are not yet completely understood. However, many single-nucleotide variations (SNVs) have been identified and confirmed to be associated with AD [2], [3], [4].

Today, late-onset Alzheimer's disease can be diagnosed in part through clinical evaluation and imaging; still, many patients cannot be diagnosed at the early stages of AD. Discovering SNVs as biological markers of susceptibility to complex diseases can potentially improve the accuracy of early diagnosis and support prevention through clinical decision-making. In addition, identifying causative SNVs among the set of biomarkers can shed light on the molecular etiology, pathogenetic changes, and related phenotypes of complex diseases like Alzheimer's disease (AD).

Straightforward approaches such as Genome-Wide Association Studies (GWAS) are based on testing univariate hypotheses. GWAS does not assess the potential interactions of each genetic marker but calculates statistical significance based on assumptions about the statistical distribution of the data. Although this standard method gives information about novel loci for complex diseases, it has some limitations. GWAS does not consider each biomarker's genetic interactions, which are the most influential. Although it reveals the top-ranked significant correlations associated with a particular disease, the method does not provide a prediction model that takes the statistically significant biomarkers into account [5]. At this point, intelligent computational techniques are needed to uncover the patterns behind the etiology of a complex disease. Predictive models that integrate and assess genetic variations and clinical findings have recently been proposed for diagnosing Alzheimer's disease and monitoring its progress. Although data mining methods for knowledge extraction are widely used in some of these studies, multi-model or hybrid approaches have not been implemented, and gathering information from different datasets has not been considered.

Machine learning (ML) techniques can find significantly associated biomarkers by assigning importance to variables and prioritizing them, which leads to the development of prognostic models and decision support systems. Because biological data are large-scale and complex, modern biology has adopted advanced machine learning techniques [6], [7]. Machine learning methodologies have also been exploited and adapted for GWAS as preprocessing steps, since GWAS data can be treated as a classification problem described by thousands of individual genetic variations.

The Random Forest (RF), a supervised data mining method for classification learning, improves classification accuracy by aggregating the votes of individual trees. Hence, significant attributes can be selected from among the best-split attributes. In addition to building an accurate model, this method can be used as a preprocessing step that incorporates feature selection and interactions in the training process. Moreover, Random Forest analysis produces measures of variable importance [8] that can be used to rank and select the predictor variables. Recent studies have also reviewed the use of Random Forest for genomic data classification, feature selection, pathway analysis, and the prediction, association, or detection of the underlying causes of complex diseases.

In this study, a meta-analysis of three controlled-access LOAD datasets is performed. A multi-platform comparison is carried out between two common microarray platforms (Affymetrix and Illumina). Multi-model and hybrid data mining approaches are applied for in-silico Alzheimer's diagnosis model construction. The model construction process has significant novelty since multiple data mining classification approaches are handled separately for each dataset, providing the highest classification performance independently. Following the ML modeling, the independent models representing each multi-platform dataset must be assembled to integrate information whose metadata differs between datasets. Since attributes differ across platforms, which rules out majority voting, a new ensemble approach is developed to select significant SNVs over multiple microarray platforms.

The core problem we focused on in this study was integrating (ensembling) multi-step data mining classifiers to find a general prediction model from prior knowledge when there are no common representative SNPs across the platforms. We mapped the SNVs identified from each dataset to a higher genomic dimension to find commonalities: SNVs are mapped to chromosomal locations as a new informational dimension. While building the scores of prioritized SNVs, we took into account chromosomal locations, linkage disequilibrium (proximities between pairs), occurrence across the multi-platform databases, and the statistical significance captured by the RF-RF modeling results, which adds value to the prioritization because those variants are selected repeatedly in the model.

The significant variants proposed by the ensemble model will accelerate new studies focusing on early diagnosis and prediction of LOAD and on developing new therapeutics. Diagnosing and treating such a high-risk disease in the early phases will eventually bring significant social and economic benefits.

1. Dataset

Data used in this study is acquired from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, accessible at adni.loni.usc.edu and NCBI dbGAP, accessible at https://www.ncbi.nlm.nih.gov/gap. dbGAP (Database of Genotypes and Phenotypes), developed for storing and sharing genotype and phenotype data, family history, and other clinical findings of the association studies regarding a particular disease, is supported and published by NCBI. On the other hand, ADNI (The Alzheimer's Disease Neuroimaging Initiative) constitutes an ongoing, longitudinal, multicenter research initiative aimed at establishing clinical, imaging, genetic, and biochemical biomarkers for the early detection and monitoring of Alzheimer's disease (AD). ADNI collects, validates, stores, and utilizes data such as MRI and PET images, genetics, cognitive tests, and blood biomarkers to predict Alzheimer's disease.

With controlled access granted, the GWAS datasets provided by ADNI (210 controls and 344 cases), GenADA (777 controls and 798 cases), and NCRAD through dbGAP initiatives (1310 controls and 1289 cases) are analyzed to construct the multi-models. The initiatives utilize Affymetrix Mapping250K_Nsp and Mapping250K_Sty, as well as Illumina Human610_Quadv1_B 500K and Illumina Human610-Quad BeadChip platforms, as illustrated in Figure 1. This study incorporates a total of 620,901, 410,907, and 585,295 QC-passed SNVs for GenADA, NCRAD, and ADNI, respectively.

As the platforms and the single nucleotide variations focused on AD are different, the datasets need to be processed individually to identify the causative variations. As a result, each individual prediction model identifies unique and notable prior information before the ensemble process of the multi-platform dataset.

2. Development of the Workflow

First, GWAS analysis is employed to discover statistically meaningful SNVs using the chi-square test for each dataset. As the GWAS only identifies the independent statistical significances of variations, we used this information to filter the non-informative SNVs that are not directly related to the disease.

After the initial dimension reduction analysis for filtering statistically insignificant variations based on GWAS, the SNVs significantly associated with LOAD are used for further analysis. We have reported that the RF-RF approach outperformed the hybrid models for the GenADA dataset (unpublished data). As the top-performing modeling approach was GWAS filtering followed by two-step RF, we developed LOAD-RF-RF models for all three datasets.

The second phase of this study integrates information from three data mining models for each dataset with the proposed ensemble method. For the first time, we suggest using a higher space (SNVs to genomic locations on cytogenetic bands), as seen in Figure 2, while ensembling multi-models that offer significant information after GWAS and RF analysis. Since RF-selected variants from different datasets differ due to the sequencing platforms, SNVs are mapped to chromosomal positions to locate and analyze the neighboring variants. An ensemble scoring algorithm is proposed to calculate the information provided by cytogenetic bands that emphasize disease-related and causative variations, as seen in detail in Figure 1. While scoring, a three-row frame is slid down within a band containing SNVs. The frame size is set to three because the information comes from three datasets. Linkage Disequilibrium (LD) for each SNV, the occurrence of variations across the databases, and validation with the Random Forest are the key indicators used to assign scores that prioritize and select the most critical variations of AD.

a. Preprocessing

The data preprocessing step describes any operation on data for eliminating noise, cleaning redundancy, and determining the significant subset by feature extraction or dimension reduction methods within the data before modeling. Preprocessing takes a large part of the total effort.

As allele-calling algorithms continue to improve, quality control strategies must ensure that only reliable markers are used for analysis. Quality control of the genotype data is conducted first, before the authorized-access datasets are put to use. Autosomal SNVs with a Hardy-Weinberg Equilibrium value < 5e-07 in controls, and those with MAF < 0.05 and call rate < 0.99, are excluded. Remaining SNVs with call rate < 0.95, and those without a valid map position, are also excluded from the genotype data.
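The marker-level filters above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names and the genotype-count input layout are hypothetical, the Hardy-Weinberg check is assumed to be a 1-df chi-square goodness-of-fit test (the text does not say which test was used), and the default call-rate cut-off is simply one of the two values quoted above.

```python
import math

def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1 - freq_a)

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Assumed HWE test: chi-square goodness of fit with 1 degree of freedom."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(control_counts, all_counts, n_samples,
              hwe_min=5e-07, maf_min=0.05, call_min=0.95):
    """Keep an SNV only if it clears the HWE (controls), MAF, and call-rate cut-offs."""
    call_rate = sum(all_counts) / n_samples  # genotyped samples / all samples
    return (hwe_chi2_pvalue(*control_counts) >= hwe_min
            and maf(*all_counts) >= maf_min
            and call_rate >= call_min)
```

For instance, `passes_qc((25, 50, 25), (25, 50, 25), 100)` keeps a marker that sits exactly at Hardy-Weinberg proportions, whereas a marker with no heterozygotes, such as `(50, 0, 50)`, is rejected by the HWE check.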

GWAS analysis is implemented for each dataset to eliminate redundant variations by comparing unadjusted p-values against a significance threshold of 0.05. GWAS is used only for pre-filtering; hence, no correction for multiple hypothesis testing is applied.
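As an illustration of this pre-filtering step, the sketch below applies a 1-df allelic chi-square test to a 2x2 table of allele counts and keeps SNVs whose unadjusted p-value falls below 0.05. The function names and the input dictionary are hypothetical, and the choice of the allelic test is an assumption; in the study the association testing itself is done with PLINK.

```python
import math

def allelic_chi2_pvalue(case_a, case_b, ctrl_a, ctrl_b):
    """Unadjusted p-value of a 1-df chi-square test on a 2x2 allele-count table.

    Assumes both alleles are observed (monomorphic SNVs are removed by QC).
    """
    table = ((case_a, case_b), (ctrl_a, ctrl_b))
    row_totals = [sum(row) for row in table]
    col_totals = [case_a + ctrl_a, case_b + ctrl_b]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))

def prefilter(allele_counts, alpha=0.05):
    """Return the SNV ids whose unadjusted p-value is below alpha."""
    return [snv for snv, counts in allele_counts.items()
            if allelic_chi2_pvalue(*counts) < alpha]

# Toy allele counts (made up for illustration): (case_A, case_B, control_A, control_B)
toy_counts = {"snv_a": (60, 40, 40, 60), "snv_b": (50, 50, 50, 50)}
kept = prefilter(toy_counts)
```

Because no multiple-testing correction is applied, this step is deliberately permissive: it only discards variants with no marginal signal before the Random Forest stages.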

b. Model Construction

Dimension reduction can be beneficial for computational efficiency and can improve the accuracy of the data mining model. After GWAS with PLINK, Random Forest (RF) is employed in tandem for dimension reduction and supervised learning model implementation [9]. Since nonlinear patterns are considered during the dimension reduction by applying data mining methods, false positives are better eliminated.

The dataset is partitioned by bagging (bootstrap) techniques, which are used to reduce the variance of predictions by assembling the results of multiple classifiers obtained on different sub-samples of the same dataset. Sampling is performed on the initial data, leading to the creation of new datasets for every tree construction. Data that is out of training is considered out of the bag (OOB) for assessing the accuracy of trees.
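The in-bag and out-of-bag split described above can be sketched in a few lines. This is a generic illustration of bootstrap sampling, not the internals of any particular Random Forest package; the function name is hypothetical.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag (OOB)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    oob = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, oob

rng = random.Random(0)              # fixed seed so the sketch is reproducible
in_bag, oob = bootstrap_split(100, rng)
# Each tree is grown on its own in_bag sample and evaluated on its own oob samples.
```

On average roughly a third of the samples (about 1/e) end up out-of-bag for any given tree, which is what makes OOB accuracy usable as an internal validation estimate.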

First, the model-building parameters are estimated: the number of trees to be generated and the number of features randomly made available for each new split. The "number of trees" is evaluated at different sizes, and the "number of randomly selected variables" is also tuned. To find the optimal random forest parameters, the "RANGER" package is used in R (including the "randomForest" package), as the RANGER library has functionality for tuning RF. It automates fitting multiple versions of the model by changing its crucial parameters and tree construction samples. Increasing the "number of trees" adds computational effort, but the model's accuracy may increase with it. Sequential values are given for the tuning parameters, and the model is trained for each.

A confusion matrix is formed for general accuracy, specificity, and sensitivity information based on the general majority voting on OOB samples. Then, the best model whose parameters produce higher accuracy in terms of OOB samples is set as the final model.
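The three reported measures follow directly from the confusion matrix. The sketch below is a generic illustration; the function name and the 0/1 label coding (1 = case, 0 = control) are assumptions.

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Toy OOB predictions (made up for illustration).
metrics = confusion_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
```

Selecting the final model then amounts to taking the parameter setting whose OOB predictions give the highest `accuracy`.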

Finally, variable importance values are calculated for the most accurate model using the RANGER package in R, by computing t-test statistics for each variable. This measures how important each variable's presence is to the model.

c. Ensemble Scoring Algorithm

In the literature, a few ensemble methods that work within the same dataset [5], [10], [11], [12], [13] have been proposed. So far, there has been no ensemble methodology for analyzing multiple platforms whose attributes differ. Here, the output of the first-step prediction models is used as prior knowledge to merge all information and calculate an importance score for each SNV that gives insight into LOAD susceptibility. As the novelty of the Ensemble Scoring Algorithm, we utilize genomic location as a higher dimension: all RF-selected SNVs are mapped to their chromosomal positions in the genomic space, and cytogenetic bands are set as slots to be surveyed through a sliding window approach for scoring (Figure 2).

Three factors are used in the Ensemble Scoring Algorithm: (1) the number of platforms represented in the scoring window, which shows whether variants from different datasets are concentrated in the same region, (2) linkage disequilibrium (LD) with neighbors, and (3) whether the SNV is selected as significant after multi-step RF. Based on these three variables, each variant is assessed by exploiting the neighboring SNVs in the sliding window. A higher score for an SNV first indicates that the locus is associated with the condition in multiple studies and that multiple unlinked SNVs are identified as associated with that locus. A further selection of an SNV through a multi-step RF-RF model marks the candidate causative variants in that window.

The number of SNVs in each window is set to three to analyze multiple datasets. Figure 3 expresses the flowchart of the Ensemble Scoring Algorithm for calculating the importance of SNVs. The algorithm's steps are given in Algorithm 1.
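A simplified sketch of how the three factors could be read off one three-slot window is given below. It is an illustration under stated assumptions, not a reproduction of Algorithm 1: the window expansion for LD-linked variants is not modeled, LD is supplied as a precomputed set of linked pairs, and the names `window_factors`, `ld_pairs`, and `rf_selected` are hypothetical. The factors are returned as d (datasets in the window), l (LD-independent groups), and v (RF-RF selection flag).

```python
def window_factors(window, ld_pairs, rf_selected, target):
    """Return (d, l, v) for `target` inside one window.

    window      : list of (snv_id, dataset) tuples, ordered by position
    ld_pairs    : set of frozenset({snv_a, snv_b}) for pairs considered in LD
    rf_selected : set of SNV ids selected by the RF-RF model
    """
    # d: how many different datasets contribute SNVs to this window
    d = len({dataset for _, dataset in window})

    # l: number of LD-independent groups, found with a small union-find
    parent = {snv: snv for snv, _ in window}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(parent)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if frozenset((a, b)) in ld_pairs:
                parent[find(a)] = find(b)
    l = len({find(snv) for snv in ids})

    # v: 1 if the target SNV was selected by RF-RF, else 0
    v = 1 if target in rf_selected else 0
    return d, l, v

# Toy window with placeholder ids (not real variants).
toy_window = [("snv_1", "ADNI"), ("snv_2", "GenADA"), ("snv_3", "NCRAD")]
toy_ld = {frozenset(("snv_1", "snv_2"))}
factors = window_factors(toy_window, toy_ld, {"snv_2"}, "snv_2")
```

With one linked pair among three SNVs from three datasets, the toy window yields d = 3, l = 2, v = 1; if the target's neighbors on both sides were linked to it, l would drop to 1.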

The scoring formula for each SNV has three components: one from database listing, one from LD linkage, and a final one indicating whether RF-RF modeling selects the SNV. The ensemble scoring coefficients and formula are defined as follows:

As the number of datasets is three, "d" can be 1, 2, or 3. Based on the predefined LD correlation coefficient and considering the neighborhoods in the window, "l" depends on the number of independent SNVs in the defined sliding window; hence "l" can be 1, 2, or 3. If both the preceding and the following neighbors are in LD, "l" is set to the lowest value, 1. "v" indicates whether an SNV is selected by RF-RF modeling: "v" equals 1 if the SNV is selected via RF-RF and 0 otherwise. While analyzing three different multi-platform datasets, the maximum score for an SNV can be 3.31, while the minimum score can be 1.10. In summary, the score can be an element of S ∈ {1.10, 1.11, 1.20, 1.21, 1.30, 1.31, 2.10, 2.11, 2.20, 2.21, 2.30, 2.31, 3.10, 3.11, 3.20, 3.21, 3.30, 3.31}.
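The formula itself is not reproduced in this text. One form that generates exactly the eighteen values listed above, offered here as an assumption consistent with those values rather than as the authors' Formula 1, places the three components in the units, tenths, and hundredths digits:

```latex
S(d, l, v) = d + \frac{l}{10} + \frac{v}{100},
\qquad d \in \{1, 2, 3\},\quad l \in \{1, 2, 3\},\quad v \in \{0, 1\}
```

Under this reading, the maximum S(3, 3, 1) = 3.31 and the minimum S(1, 1, 0) = 1.10 match the bounds stated above.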

All SNVs are scored using the Ensemble Scoring Algorithm. As given in Step 6, each SNV is scored three times as it is listed in different windows. If "rs1333190" is considered an example, the scoring procedure is presented below for each location of the SNV in the windows.

Case 1: When SNV is in the middle of a sliding window.

Figure 4 shows the calculation for the SNV "rs1333190" when it occurs in the middle of a three-slot window. According to Step 6 of Algorithm 1, the sliding window is created so that "rs1333190" remains in the middle. The window is expanded because there is an LD linkage for "rs1333190". Within the extended window, information comes from three different databases, so a "tandem" is created and "d" is set to 3. Since an LD linkage exists for "rs1333190", "l" is set to 2 in the frame. As the RF-RF model has chosen "rs1333190", "v" is set to 1. The score is therefore 3.21, as given by Formula 1.

Case 2: When SNV is at the bottom of a sliding window.

According to Step 6 of Algorithm 1, the sliding window is now created so that "rs1333190" remains at the bottom. Figure 5 shows how the score is calculated when the SNV occurs at the bottom of the three-slot window. Since "rs1333190" and "rs4548489" are known to be linked, the sliding window expands to contain both as one informative variant. Here, "d" is set to 2 since two different datasets are contained within the extended window. Since there is no LD linkage within the frame, "l" is set to 3. As the RF-RF model has chosen "rs1333190", "v" is set to 1. The score is therefore 2.31, as given by Formula 1.

Case 3: When SNV is at the top in the sliding window.

Figure 6 shows the calculation for the SNV "rs1333190" at the top of a three-slot window. Again, following Step 6 of Algorithm 1, the sliding window is created so that "rs1333190" remains at the top. The frame is expanded because there is an LD linkage for "rs1333190". In this case, "d" is set to 2. Since an LD linkage exists within the window for "rs1333190", "l" is set to 2. Since the RF-RF model has chosen "rs1333190", "v" is set to 1. The score is therefore 2.21, as given by Formula 1.

All three possible scores for rs1333190 are calculated. It scores 3.21 in the middle of the window, 2.31 at the bottom, and 2.21 at the top of the window. In this example, the ensemble score is set to 3.21, the maximum of three EnSCAN scores.
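The worked example can be retraced in code. The sketch assumes that d, l, and v occupy the units, tenths, and hundredths digits of the score, which is consistent with the three values reported for rs1333190 but is an assumption about Formula 1; the function name is hypothetical.

```python
def enscan_score(d, l, v):
    """Combine the three components into one score, e.g. (3, 2, 1) -> 3.21."""
    if d not in (1, 2, 3) or l not in (1, 2, 3) or v not in (0, 1):
        raise ValueError("expected d, l in {1, 2, 3} and v in {0, 1}")
    # Integer arithmetic first, so the result is the plain two-decimal value.
    return (100 * d + 10 * l + v) / 100

# (d, l, v) for rs1333190 in each window placement, taken from Cases 1-3.
placements = {"middle": (3, 2, 1), "bottom": (2, 3, 1), "top": (2, 2, 1)}
scores = {name: enscan_score(*dlv) for name, dlv in placements.items()}
final_score = max(scores.values())   # the ensemble score is the maximum of the three
```

Taking the maximum means an SNV is credited with the most informative neighborhood in which it appears, here the middle placement with a score of 3.21.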

d. Functional Enrichment

After the GWAS algorithm is run for the ensemble and the scores for individual SNVs are calculated, pathway enrichment and visualization of the data are performed following the protocol in [14]. Pathway enrichment analyses are performed with g:Profiler, a publicly accessible web server for characterizing and manipulating lists of genes. Functional enrichment analysis (over-representation analysis (ORA) or gene set enrichment analysis on AD-related SNVs) is performed to map genes to known functional information sources and detect statistically significantly enriched terms.

The g:GOSt component of the g:Profiler tool is employed for analyzing the variants identified by the ensemble model, specifically focusing on GO Molecular Function, GO Cellular Component, GO Biological Process, and Reactome pathways. Default settings are used for all analyses, with a significance threshold of 0.05 applied to identify pathways associated with SNVs based on ensemble scores. The enrichment p-values for pathways are computed using Fisher's exact test, and multiple testing is corrected for using the Bonferroni method.
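To make the statistics concrete, the sketch below computes an over-representation p-value as the upper tail of the hypergeometric distribution, which is what a one-sided Fisher's exact test evaluates, and then applies a Bonferroni correction. It illustrates the test only; it is not g:Profiler's implementation, and the function names and toy numbers are hypothetical.

```python
from math import comb

def ora_pvalue(overlap, query_size, term_size, background_size):
    """P(X >= overlap) when query_size genes are drawn from background_size genes,
    of which term_size are annotated to the term (hypergeometric upper tail)."""
    upper = min(query_size, term_size)
    tail = sum(
        comb(term_size, i) * comb(background_size - term_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(background_size, query_size)

def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# Toy example: 2 of 4 query genes fall in a 5-gene term, background of 20 genes.
p_term = ora_pvalue(2, 4, 5, 20)
adjusted = bonferroni([0.01, 0.5, 0.04])
```

A term would be reported as enriched when its Bonferroni-adjusted p-value stays below the 0.05 threshold.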

Then, EnrichmentMap, a Cytoscape plugin, is used for functional enrichment visualization by creating networks from Gene Ontology annotations and Reactome pathways. All analyses are completed with a p-value of 0.05, an FDR q-value cut-off of 0.01, and an edge similarity cut-off of 03 (Jaccard metric).
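The edge similarity between two enriched terms is a Jaccard index over their gene sets. The sketch below shows the metric and how a cut-off turns pairwise similarities into network edges; it is a generic illustration rather than EnrichmentMap's own code, and the function names, gene labels, and the cut-off used in the example are hypothetical.

```python
def jaccard(genes_a, genes_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_edges(gene_sets, cutoff):
    """Connect every pair of terms whose Jaccard index reaches the cut-off."""
    names = sorted(gene_sets)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(gene_sets[first], gene_sets[second])
            if score >= cutoff:
                edges.append((first, second, score))
    return edges

# Toy terms with placeholder gene labels.
toy_terms = {
    "term_1": {"g1", "g2", "g3"},
    "term_2": {"g2", "g3", "g4"},
    "term_3": {"g5"},
}
edges = similarity_edges(toy_terms, cutoff=0.4)   # cut-off chosen only for the demo
```

Terms that share most of their genes are drawn close together, which is why related ontology terms form clusters in the enrichment maps.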

IMPLEMENTATION

3. Independent Model Construction for Multi Platforms by PLINK and Random Forest

We developed in-silico models for Late-Onset Alzheimer's Disease (LOAD) using genotyping data obtained from three datasets sourced from ADNI and dbGAP initiatives. The analysis involves examining GWAS datasets from ADNI (210 controls and 344 cases), GenADA (777 controls and 798 cases), and NCRAD through dbGAP (1310 controls and 1289 cases).

The initial step involves conducting GWAS using PLINK, followed by filtering based on p-values for dimension reduction. When only one SNP is investigated, a type I error of α=0.05 is considered. After GWAS, 3,768 SNVs from GenADA, 16,404 SNVs from NCRAD, and 7,639 SNVs from ADNI are determined to be statistically significant for LOAD [9]. PLINK results filter redundant SNVs that are common or uninformative between cohorts.

For each dataset, a Multi-step Random Forest (RF) is executed with 5-fold cross-validation (CV) using the RANGER R package, following GWAS with PLINK. The test performances of the Late-Onset Alzheimer's Disease (LOAD) RF models for ADNI, NCRAD, and GenADA datasets are computed as 72.9%, 68.8%, and 92.4%, respectively. The individual LOAD RF models select 390 SNVs from the ADNI dataset, 1740 from NCRAD, and 434 from GenADA, considering the permutation importance of variants at a 95% confidence level. No consensus variants were identified from three different datasets [9], as seen in Figure 7.

The evaluation of the test performances for the Late-Onset Alzheimer's Disease (LOAD) RF-RF models on ADNI, NCRAD, and GenADA datasets yielded results of 74.0%, 72.1%, and 85.1%, respectively, as detailed in Table 1. Individual LOAD RF-RF models selected 32 SNVs from ADNI, 581 from NCRAD, and 107 from GenADA datasets, taking into account the permutation importance of variants with a 95% confidence level [9]. The RF-RF analysis identified the most significant SNVs. These RF-RF-identified SNVs are utilized during ensemble scoring to enrich the prioritized SNV list for causative variants.
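The idea behind permutation importance can be sketched independently of any Random Forest package: shuffle one variant's genotypes, re-score the model, and record the drop in accuracy. The code below is a model-agnostic illustration, not the procedure implemented in RANGER (which works on out-of-bag samples per tree), and it does not reproduce the 95% confidence-level selection. The function names, the 0/1/2 genotype coding, and the toy classifier are assumptions.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, col, rng, n_repeats=20):
    """Mean drop in accuracy after shuffling column `col` of X."""
    baseline = accuracy(predict, X, y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:col] + [value] + row[col + 1:]
                  for row, value in zip(X, shuffled)]
        drops.append(baseline - accuracy(predict, X_perm, y))
    return sum(drops) / n_repeats

# Toy data: column 0 drives the label, column 1 is ignored by the classifier.
X = [[0, 1] for _ in range(10)] + [[2, 0] for _ in range(10)]
y = [0] * 10 + [1] * 10
toy_model = lambda row: int(row[0] >= 1)   # stand-in for a trained classifier

rng = random.Random(0)
importance_informative = permutation_importance(toy_model, X, y, 0, rng)
importance_ignored = permutation_importance(toy_model, X, y, 1, rng)
```

A variant whose shuffling leaves accuracy unchanged carries no usable signal for the model, which is the rationale for discarding it before the second RF step.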

Table 1 Individual Results from Three Datasets

4. Multi-platform Variant Prioritization by EnSCAN Algorithm

Since tagged SNVs do not overlap between different platforms, we developed a framework that maps individual SNVs to a higher dimension, chromosomal location on genes, to observe neighboring variants from different datasets.
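A minimal sketch of this mapping step follows: each SNV is assigned to the band whose interval contains its position, and SNVs from all datasets are then grouped per band and ordered by position so that neighbors can be examined together. The band table in the example uses made-up names and coordinates purely for illustration, and all function names are hypothetical.

```python
from bisect import bisect_right

def build_band_index(bands):
    """bands: iterable of (chrom, start, end, name), half-open [start, end)."""
    index = {}
    for chrom, start, end, name in sorted(bands):
        starts, info = index.setdefault(chrom, ([], []))
        starts.append(start)
        info.append((end, name))
    return index

def band_of(index, chrom, pos):
    """Return the band containing pos, or None if the position is not covered."""
    if chrom not in index:
        return None
    starts, info = index[chrom]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    end, name = info[i]
    return name if pos < end else None

def group_by_band(index, snvs):
    """snvs: iterable of (snv_id, dataset, chrom, pos); unmapped SNVs go under band None."""
    groups = {}
    for snv_id, dataset, chrom, pos in snvs:
        band = band_of(index, chrom, pos)
        groups.setdefault((chrom, band), []).append((pos, snv_id, dataset))
    for members in groups.values():
        members.sort()          # order by genomic position within the band
    return groups

# Illustrative band table and SNVs (coordinates are NOT real cytoband positions).
toy_index = build_band_index([("1", 0, 1000, "band_x"), ("1", 1000, 2500, "band_y")])
toy_snvs = [("snv_1", "ADNI", "1", 120), ("snv_2", "GenADA", "1", 90),
            ("snv_3", "NCRAD", "1", 1800)]
toy_groups = group_by_band(toy_index, toy_snvs)
```

Once grouped this way, variants that never co-occur on a single array can still be compared, because they are placed side by side within the same band.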

SNP Nexus [15] is used to determine the annotation of SNVs. SNiPA [16] is used to determine pairwise Linkage Disequilibrium (LD). The correlation between the phenotype and the marker allele should indicate the presence of the causal SNV in LD. Capturing all LD blocks in the genomic region provides additional information for detecting variants predisposing to the disease [17]. The LD correlation coefficient is determined during the scoring process. The LDs of each SNV are defined and labeled. Finally, all SNVs identified in the LOAD-RF-RF model from three platforms with multi-step models are scored for prioritization following this approach, named the Sliding Window Algorithm (SWA).

The ensemble algorithm scores each SNV in an iterated scoring window. Hence, each SNV is scored three times, with the SNV in the middle, at the top, and at the bottom of the window, reflecting the three datasets (the number of SNVs in the window is three).
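The three placements of an SNV within a three-slot window can be enumerated as below. This is a sketch of the indexing only: the function name is hypothetical, LD-driven window expansion is not modeled, and truncating the window at the ends of a band is an assumption.

```python
def window_placements(band_snvs, i, size=3):
    """Windows over a position-ordered band in which band_snvs[i] sits
    at the top, in the middle, and at the bottom (for the default size of 3)."""
    n = len(band_snvs)
    offsets = {"top": 0, "middle": 1, "bottom": 2}
    windows = {}
    for name, offset in offsets.items():
        start = i - offset
        lo, hi = max(0, start), min(n, start + size)   # clip at band edges
        windows[name] = band_snvs[lo:hi]
    return windows

# Toy band with placeholder ids, ordered by genomic position.
toy_band = ["snv_0", "snv_1", "snv_2", "snv_3", "snv_4"]
inner = window_placements(toy_band, 2)
edge = window_placements(toy_band, 0)
```

For an SNV in the interior of a band all three windows are full, while an SNV at the edge of a band sees shorter windows under the truncation assumption made here.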

All variants are scored via the Ensemble Scoring Algorithm (EnSCAN) [18], and the highest-scoring variants are identified. The distribution of calculated scores is presented in Figure 8.

A total of 83 SNVs receive the highest EnSCAN score of "3.31", and 43 of these 83 SNVs are identified as protein-coding (Table 3). The genes carrying these LOAD-associated variants with the highest EnSCAN score are CSMD2, NR5A2, KIF26B, CCDC3, CSGALNACT2, SLC18A2, PPFIBP2, TSPAN18, ALDH3B1, CNTN5, RNFT2, FBXW8, RBFOX1, NDRG4, PITPNC1, NPLOC4, PHLPP1, RUNX1, OSBP2, CYB5R3, LSAMP, TPRG1, FGF12, TENM3, ENPP6, ADCY2, ANKRD33B, IQGAP2, SPOCK1, ARHGAP26, CYFIP2, SYCP2L, SLC17A2, MDFI, RUNX2, CRISP1, PLAGL1, GNGT1, DLC1, C9orf3, SNX30, HMCN2 (Supplementary Table 1). Among the 43 protein-coding variants with a score of 3.31, twelve map to chromosomes 5 and 6, located on the ADCY2, ANKRD33B, IQGAP2, SPOCK1, ARHGAP26, CYFIP2, SYCP2L, SLC17A2, MDFI, RUNX2, CRISP1 and PLAGL1 genes.

Also, 29 out of 43 variants on protein-coding genes were associated with known disease phenotypes. Among these, 6.5% were directly linked to Alzheimer Disease (AD), while 35.5% were associated with other AD-related conditions, including cholesterol metabolism, Type 2 Diabetes, cardiovascular disorders, and immunological issues. Furthermore, 30.5% were tied to addiction-related phenotypes such as Tobacco Use Disorder. Additionally, 13% were associated with various neuropsychiatric disorders, and the remaining 14.5% were linked to non-neurological or cancer-related phenotypes.

5. Evaluation of Ensemble Scores

We perform functional enrichment analysis, over-representation analysis (ORA), or gene set enrichment analysis for EnSCAN variants with various scoring categories as a validation method. Here, we map SNVs over genes to access known functional information sources and detect statistically significantly enriched terms for identifying the efficacious enrichments related to Alzheimer disease.

The overall distribution of EnSCAN scores and their overlap between datasets are summarized in Figure 9. SNVs with scores greater than 2 (observed in at least two different datasets from ADNI, NCRAD, or GenADA) are analyzed in different combinations. The pathways from the following EnSCAN score categories are benchmarked against all calculated scores:

1) All variants with EnSCAN scores as a benchmark

2) Variants with EnSCAN score equal to 2.31

3) Variants with EnSCAN scores 2.31 or 3.31

4) Variants with EnSCAN scores 3.11, 3.21, or 3.31 (3s)

5) Variants with EnSCAN scores equal to 3.31
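The five categories above amount to simple filters on the score table. A small sketch follows, with hypothetical names and toy scores, where `None` stands for the benchmark category that keeps every scored variant.

```python
# Allowed scores per category; None means "all scored variants" (the benchmark).
SCORE_CATEGORIES = {
    "all": None,
    "2.31": {2.31},
    "2.31_or_3.31": {2.31, 3.31},
    "3s": {3.11, 3.21, 3.31},
    "3.31": {3.31},
}

def select_variants(scores, category):
    """Return the sorted SNV ids whose score belongs to the chosen category."""
    allowed = SCORE_CATEGORIES[category]
    return sorted(snv for snv, score in scores.items()
                  if allowed is None or score in allowed)

# Toy score table with placeholder ids.
toy_scores = {"snv_a": 3.31, "snv_b": 2.31, "snv_c": 3.11, "snv_d": 1.20}
threes = select_variants(toy_scores, "3s")
```

Each resulting list is what would be mapped to genes and submitted to the enrichment analysis for that category.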

Evaluation of All LOAD-RF-RF Variants:

When SNVs with EnSCAN scores are evaluated, the enrichment analysis reveals a general view of the biological process possibly underlying the molecular etiology of AD (Figure 10).

Tissue and cell-type identity lie at the core of human physiology and disease, considering "multicellular organismal process" [19]. Gaining insights into the genetic foundations of intricate tissues and specific cell lineages is pivotal in advancing diagnostics and therapeutics. The exploration of tissue-specific networks offers a novel approach for generating hypotheses pertaining to the molecular origins of human diseases.

"Neuronal system development" is also critical for AD. Strategies to improve the symptoms of aging and age-related diseases such as Alzheimer's Disease have included different means to stimulate neurogenesis, both pharmacologically and naturally. The regulatory mechanisms of stem cell neurogenesis or a functional integration of newborn neurons have been explored to provide the basis for grafted stem cell therapy. A review [20] aims to provide an overview of AD pathology of different neural and glial cell types and summarizes current strategies of experimental stem cell treatments and their putative future use in clinical settings.

A previous study [21] shows that abnormal "ion channels" (especially those showing selectivity for cations like Ca2+) might be a cause of the toxic mechanism behind AD pathology. According to a recent study [22], "peptides or phosphopeptides" of common plasma proteins show increased observation frequency by chi-square and/or precursor intensity in AD. Increases in mean precursor intensity of peptides from common plasma proteins such as DISC1, EXOSC5, UBE2G1, SMIM19, NXNL1, PANO, EIF4G1, KIR3DP1, MED25, MGRN1, OR8B3, MGC24039, POLR1A, SYTL4, RNF111, IREB2, ANKMY2, SGKL, SLC25A5, CHMP3 among others were observed in association with AD.

Impairments of the "extracellular matrix" [23] have been reported to have a role in Alzheimer Disease, focusing on synaptic transmission, amyloid-β-plaque generation and degradation, Tau-protein production, oxidative stress response, and inflammatory response. The extracellular matrix comprises various macromolecules secreted by cells, namely collagen, elastin, fibronectin, laminin, and glycoproteins; regulating the expression of individual components is an essential step in stabilizing or improving the course of the disease. When collagen forms as fibrils around the vulnerable neurons, it blocks the Aβ proteins from binding to the cells. Therefore, its components are a potential target and biomarker for developing and treating AD.

Evaluation of score 2.31:

Enrichment analysis of only the SNVs with an EnSCAN score of 2.31 showed that the AD-related processes of "adhesion molecules/biological cell adhesion", "cation ion binding/calcium ion binding," and "external encapsulating structure/plasma membrane/periphery membrane component" were enriched within this category. Additionally, hsa mir181a and hsa mir221 are observed, which were not enriched in the previous benchmark analysis (Figure 11).

Evaluation of union of score 2.31 and 3.31:

When SNVs with EnSCAN scores of 2.31 or 3.31 are evaluated together under one category, the top enrichment classes do not change compared to the 2.31-only category and the benchmark (Figure 12). Considering ontology, development multicellular process, external encapsulating structure, hsa mir 221, dlc1 dlgap1 myo5a, metal ion transmembrane, plasma membrane periphery, calcium regulation cardiac, component plasma integral, biological adhesion cell, synapse presynapse, adhesion molecules cell, hsa mir181a, calcium ion binding and cation binding ion are significant terms that affect the susceptibility to Alzheimer Disease in humans.

All three categories analyzed above shared the enriched biological-process keywords "adhesion molecules/biological cell adhesion", "ion cation binding/calcium ion binding", and "plasma membrane/peripheral component/plasma integral". One noteworthy annotation category that emerges in the union set, but not in the enrichment analysis of score 2.31 alone, is "synapses and presynapses". The enrichment analysis allowed us to observe the biological processes involving LOAD-associated variants. Comparing EnSCAN scoring categories also revealed that ontological terms known to have roles in the core of AD pathophysiology are consistently enriched across categories.
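To make the notion of "enrichment" concrete, the following is a minimal sketch of a generic over-representation test based on the hypergeometric distribution. It is not the pipeline used in this study; the function name `hypergeom_enrichment_p` and all counts in the usage example are hypothetical and serve only as an illustration.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           selected: int, overlap: int) -> float:
    """Return P(X >= overlap) for a hypergeometric draw.

    population: number of background genes (assumed universe)
    annotated:  background genes carrying the ontology term
    selected:   genes in the prioritized list
    overlap:    prioritized genes carrying the term
    """
    if not (0 <= annotated <= population and 0 <= selected <= population):
        raise ValueError("annotated and selected must lie in [0, population]")
    start = max(overlap, 0)
    stop = min(annotated, selected)
    total = comb(population, selected)
    # Sum the upper tail; comb() returns 0 for impossible draws.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(start, stop + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only (not taken from the study).
p_value = hypergeom_enrichment_p(population=20000, annotated=200,
                                 selected=85, overlap=5)
print(f"one-sided enrichment p-value: {p_value:.3g}")
```

In practice such p-values are computed for every ontology term and corrected for multiple testing before terms are reported as enriched.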

Evaluation of score 3s (3.11, 3.21, 3.31):

When we focus on neighboring LOAD-RF-RF variants observed in all three datasets, with EnSCAN scores of 3.11, 3.21, and 3.31, the enrichment analysis narrows down to “phospholipid binding” (Figure 13), which is a recent topic of interest in AD pathophysiology. It has been reported that abnormal phospholipid binding contributes to the development of AD, and mutations in genes related to phospholipid metabolism have been linked to an increased risk of AD [24]. While more research is needed to understand the role of phospholipid interactions in AD pathogenesis, recent reports suggest that certain types of phospholipids might promote the formation of alpha-synuclein aggregates [25], which can disrupt normal phospholipid interactions in cell membranes.

Evaluation of score 3.31:

The enrichment analysis of variants with the top EnSCAN score reveals "nicotinic activity on dopaminergic neurons" as the only ontology common to these variants (Figure 14). In general, activating Nicotinic Acetylcholine Receptors (nAChRs) on dopaminergic neurons leads to increased dopamine release, which regulates various brain functions, including memory, attention, and reward processing. Dopamine is known to have a role in cognitive functions, and decreased dopamine levels are reported as a feature of AD. The relationship between nicotinic activity on dopaminergic neurons and Alzheimer's Disease (AD) is complex, and the specific effects of nicotine on dopaminergic neurons in the context of cognitive impairment, memory loss, and the development of AD are still being investigated [26].

The advancement of high-throughput genotyping and next-generation sequencing technologies facilitates the genetic epidemiological analysis of extensive datasets. These advances have led to the identification of single nucleotide variation (SNV) profiles associated with various complex diseases. Many research centers collect GWAS data and share it for advanced analysis to obtain further meaningful information that sheds light on the genetic etiology of a disease. Integrated analyses or meta-analyses of multiple GWAS datasets from different centers have identified strongly associated variants, but they have not been able to reveal causative variants.

Our main goal was to develop a methodology for ensemble analysis of GWAS data from different centers and genotyping platforms to reveal associated cytogenetic loci and to focus on variants with a higher probability of being causative. To integrate multi-platform data, we investigated data mining methods for knowledge extraction. RF-RF models were trained for each Alzheimer disease dataset. An assembly of prioritized variants was then formed by mapping the variants selected in each RF-RF model to their cytogenetic loci. The EnSCAN scoring system is proposed to survey all selected variants and to categorize them depending on whether they lie on a locus independently selected in all GWAS datasets analyzed. The core of the presented study is the ensemble of prior information derived from three different models containing different representative variants, combined through the newly proposed EnSCAN algorithm. The enrichment analysis of variants grouped by EnSCAN score showed how biological processes and annotations narrow down as the EnSCAN score increases, allowing researchers to focus their studies on enriched biological processes through a smaller list of associated variants.
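The locus-level assembly step described above can be sketched as follows. This is a simplified illustration only: it counts how many datasets independently selected a variant on each cytogenetic band and does not reproduce the EnSCAN score itself, which also takes proximity between variants and Random Forest validation into account. The dataset names, variant identifiers, band assignments, and function names are all hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical selections: dataset -> {variant id: cytogenetic band}.
selected = {
    "dataset_A": {"rs_a1": "1p36.1", "rs_a2": "19q13.3"},
    "dataset_B": {"rs_b1": "19q13.3", "rs_b2": "7q21.1"},
    "dataset_C": {"rs_c1": "19q13.3", "rs_c2": "1p36.1"},
}


def band_support(selected_by_dataset):
    """Map each band to the set of datasets that selected a variant on it."""
    support = defaultdict(set)
    for dataset, variants in selected_by_dataset.items():
        for band in variants.values():
            support[band].add(dataset)
    return dict(support)


def variants_by_support(selected_by_dataset):
    """Label every selected variant with the number of datasets sharing its band."""
    support = band_support(selected_by_dataset)
    return {
        (dataset, variant): len(support[band])
        for dataset, variants in selected_by_dataset.items()
        for variant, band in variants.items()
    }


for (dataset, variant), n in sorted(variants_by_support(selected).items()):
    print(dataset, variant, f"band selected in {n} dataset(s)")
```

Because the grouping key is the cytogenetic band rather than the variant identifier, variants that differ between genotyping platforms can still support the same locus, which is why majority voting over identical attributes is not required.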

Here, we present a unique study aiming to reveal hidden patterns by combining the results of different multi-platform genotyping studies. Single nucleotide variant (SNV) biomarkers associated with the early or differential diagnosis of Alzheimer disease were determined with the proposed Ensemble Scoring algorithm, after machine learning models were developed to identify significant patterns that account for complex interactions between variants.

EnSCAN scoring is proposed to prioritize LOAD-RF-RF variants that neighbor LD-independent variants from two or three datasets within the same cytogenetic locus (EnSCAN score of 2.31 or 3.31). It allowed us to shorten the list of LOAD-associated variants without losing representation of the enriched biological processes. Hence, in future ensemble studies of multi-platform GWAS, EnSCAN scores of 2.31 and 3.31 can be selected to further investigate the causative variants among the associated SNVs.
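Selecting the 2.31 and 3.31 categories from a scored variant list amounts to a simple filter, sketched below. The record layout, the field name `enscan_score`, and the variant identifiers are assumptions made for illustration; scores are kept as strings here so that category labels are matched exactly rather than compared as floating-point numbers.

```python
# Score categories suggested for follow-up in the text above.
KEEP_SCORES = frozenset({"2.31", "3.31"})

# Hypothetical scored variants; identifiers are placeholders.
scored_variants = [
    {"variant": "rs_example_1", "enscan_score": "3.31"},
    {"variant": "rs_example_2", "enscan_score": "2.31"},
    {"variant": "rs_example_3", "enscan_score": "3.11"},
    {"variant": "rs_example_4", "enscan_score": "1.11"},
]


def prioritize(variants, keep=KEEP_SCORES):
    """Return only the variants whose score label is in the keep set."""
    return [v for v in variants if v["enscan_score"] in keep]


shortlist = prioritize(scored_variants)
print([v["variant"] for v in shortlist])
```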

A total of 85 variants are identified with the LOAD-RF-RF model, with the top score of 3.31, and 29 of the variants on protein-coding genes were shown to be associated with known disease phenotypes, including Alzheimer Disease (AD) and other AD-related conditions such as cholesterol metabolism, Type 2 Diabetes, cardiovascular disorders and immunological issues. Besides various neuropsychiatric disorders and other disease phenotypes, interestingly, there was a strong representation of Tobacco Use Disorder.

APOE gene variants are known to increase the predictive power of genetic models of AD. Results from [27] suggest that the dose of the APOE ε4 or ε2 allele contributes independently to the differential diagnosis of AD, especially the early-onset and familial forms. As a limitation of our study, the genotyping of the APOE gene was not reported in all three datasets (NCRAD, ADNI, GenADA). Since the information from APOE alleles is not incorporated into the models, we cannot further discuss their effect on LOAD or their interactions with the variants selected in this study.

While we have demonstrated associations with Alzheimer's Disease through enrichment analysis and disease phenotypes, it is essential to investigate the SNVs identified by the LOAD-RF-RF model within a clinical framework. Such an examination would assess the predictive efficacy of EnSCAN scores for early or differential diagnosis of LOAD. The discovery of new associated and causative variants for AD through LOAD-RF-RF modeling and EnSCAN scoring can lead to the development of cost- and time-effective diagnoses and to an understanding of the pathophysiology leading to neurodegeneration. Such work will accelerate new studies focusing on the early and differential diagnosis of AD and on the development of new therapeutics. Diagnosing and treating such a high-burden disease in its early phases will eventually bring significant social and economic benefits.

Overall, the novelty of our approach is based on the assembly of multi-platform GWAS data along with the proposed EnSCAN scoring of variants. Proposed EnSCAN scoring, applied to variants identified through machine learning modeling, has the capability to unveil concealed, novel, and meaningful patterns by considering non-linear interactions between variants. This approach enhances our comprehension of the genetic predisposition in complex genetic disorders, where the cumulative impact of multiple variants determines the risk.

Analyzing independent and large data sets using data mining methods will enable us to derive more detailed information about the complex genetic background and molecular etiology of Alzheimer disease. This framework holds promise as an effective approach for conducting post-GWAS analysis in the context of various complex genetic disorders.

Acknowledgements

The investigators within the ADNI, GenADA, and NCRAD contributed to the design and implementation of and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

Data collection and sharing for ADNI data for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Funding support for the dbGaP datasets: National Institute on Aging - Late Onset Alzheimer's Disease Family Study: Genome-Wide Association Study for Susceptibility Loci. dbGaP Study Accession: phs000168.v1.p1. and Multi-Site Collaborative Study for Genotype-Phenotype Associations in Alzheimer's disease and Longitudinal follow-up of Genotype-Phenotype Associations in Alzheimer's disease and Neuroimaging component of Genotype-Phenotype Associations in Alzheimer's disease. dbGaP Study Accession: phs000219.v1.p1

Authors' contributions

OE conducted the study under the supervision of YAS and CI. OE ran the machine learning analysis and built the RF-RF models. YAS formulated, and OE coded and implemented, the EnSCAN scoring. OE drafted the manuscript. YAS and CI edited and reviewed the manuscript. The authors read and approved the final manuscript.

Funding

This work is supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) ARDEB 1003, Grant No: SBAG-216S468.

Availability of data and materials: https://github.com/onurer007/Ensembl

a. Ethics approval and consent to participate

Not applicable.

b. Consent for publication

Not applicable.

c. Competing interests

The authors declare that they have no competing interests.

[1] C. Reitz, E. Rogaeva, and G. W. Beecham, “Late-onset vs nonmendelian early-onset Alzheimer disease,” Neurol. Genet., vol. 6, no. 5, p. e512, 2020, doi: 10.1212/nxg.0000000000000512.
[2] E. Bagyinszky, Y. C. Youn, S. S. A. An, and S. Kim, “The genetics of Alzheimer’s disease,” Clin. Interv. Aging, vol. 9, pp. 535–551, 2014, doi: 10.2147/CIA.S51571.
[3] T. P. V. Huynh, A. A. Davis, J. D. Ulrich, and D. M. Holtzman, “Apolipoprotein E and Alzheimer’s disease: The influence of apolipoprotein E on amyloid-β and other amyloidogenic proteins,” J. Lipid Res., vol. 58, no. 5, pp. 824–836, 2017, doi: 10.1194/jlr.R075481.
[4] S. S. Muñoz, B. Garner, and L. Ooi, “Understanding the Role of ApoE Fragments in Alzheimer’s Disease,” Neurochem. Res., vol. 44, no. 6, pp. 1297–1305, 2019, doi: 10.1007/s11064-018-2629-1.
[5] V. Botta, G. Louppe, P. Geurts, and L. Wehenkel, “Exploiting SNP correlations within random forest for genome-wide association studies,” PLoS One, vol. 9, no. 4, 2014, doi: 10.1371/journal.pone.0093379.
[6] A. L. Tarca, V. J. Carey, X. Chen, R. Romero, and S. Drăghici, “Machine learning and its applications to biology,” PLoS Comput. Biol., vol. 3, no. 6, p. e116, 2007, doi: 10.1371/journal.pcbi.0030116.
[7] B. A. Goldstein, E. C. Polley, and F. B. S. Briggs, “Random forests for genetic association studies,” Stat. Appl. Genet. Mol. Biol., vol. 10, no. 1, p. 32, 2011, doi: 10.2202/1544-6115.1691.
[8] M. N. Wright and A. Ziegler, “Ranger: A fast implementation of random forests for high dimensional data in C++ and R,” J. Stat. Softw., vol. 77, no. 1, pp. 1–17, 2017, doi: 10.18637/jss.v077.i01.
[9] B. Yaldız, O. Erdoğan, S. Rafatov, C. Iyigün, and Y. Aydın Son, “Revealing third-order interactions through the integration of machine learning and entropy methods in genomic studies,” BioData Min., vol. 17, no. 1, pp. 1–17, 2024, doi: 10.1186/s13040-024-00355-3.
[10] K. L. Lunetta, L. B. Hayward, J. Segal, and P. Van Eerdewegh, “Screening large-scale association study data: exploiting interactions using random forests,” BMC Genet., vol. 5, no. 1, p. 32, 2004, doi: 10.1186/1471-2156-5-32.
[11] M. Çolak, T. Tümer Sivri, N. Pervan Akman, A. Berkol, and Y. Ekici, “Disease prognosis using machine learning algorithms based on new clinical dataset,” Commun. Fac. Sci. Univ. Ankara Ser. A2-A3 Phys. Sci. Eng., vol. 65, no. 1, pp. 52–68, 2023, doi: 10.33769/aupse.1215962.
[12] H. Byeon, “Is the random forest algorithm suitable for predicting parkinson’s disease with mild cognitive impairment out of parkinson’s disease with normal cognition?,” Int. J. Environ. Res. Public Health, vol. 17, no. 7, 2020, doi: 10.3390/ijerph17072594.
[13] M. Pal and S. Parija, “Prediction of Heart Diseases using Random Forest,” J. Phys. Conf. Ser., vol. 1817, no. 1, p. 012009, 2021, doi: 10.1088/1742-6596/1817/1/012009.
[14] J. Reimand et al., “Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap,” Nat. Protoc., vol. 14, no. 2, pp. 482–517, 2019, doi: 10.1038/s41596-018-0103-9.
[15] “SNP Nexus,” 2022. https://www.snp-nexus.org/v4/.
[16] “SNiPA,” 2022. https://snipa.helmholtz-muenchen.de/snipa3/.
[17] S. Padmanabhan, C. Menni, D. Prabhakaran, and A. F. Dominiczak, “Discovering the genetic determinants of complex diseases,” Curr. Sci., vol. 97, no. 3, pp. 385–391, 2009.
[18] O. Erdoğan, “Ensemble Scoring Algorithm,” 2021. https://github.com/onurer007/Ensembl.git.
[19] C. S. Greene et al., “Understanding multicellular function and disease with human tissue-specific networks,” Nat. Genet., vol. 47, no. 6, pp. 569–576, 2015, doi: 10.1038/ng.3259.
[20] V. Vasic, K. Barth, and M. H. H. Schmidt, “Neurodegeneration and neuro-regeneration—Alzheimer’s disease and stem cell therapy,” Int. J. Mol. Sci., vol. 20, no. 17, 2019, doi: 10.3390/ijms20174272.
[21] N. A. Shirwany, D. Payette, J. Xie, and Q. Guo, “The amyloid beta ion channel hypothesis of Alzheimer’s disease,” Neuropsychiatr. Dis. Treat., vol. 3, no. 5, pp. 597–612, 2007.
[22] A. Florentinus-Mefailoski, P. Bowden, P. Scheltens, J. Killestein, C. Teunissen, and J. G. Marshall, “The plasma peptides of Alzheimer’s disease,” Clin. Proteomics, vol. 18, no. 1, pp. 1–26, 2021, doi: 10.1186/s12014-021-09320-2.
[23] Y. Sun et al., “Role of the Extracellular Matrix in Alzheimer’s Disease,” Front. Aging Neurosci., vol. 13, pp. 1–11, 2021, doi: 10.3389/fnagi.2021.707466.
[24] F. Sáez-Orellana, J.-N. Octave, and N. Pierrot, “Alzheimer’s Disease, a Lipid Story: Involvement of Peroxisome Proliferator-Activated Receptor α,” Cells, vol. 9, no. 5, May 2020, doi: 10.3390/cells9051215.
[25] Z. Lv, M. Hashemi, S. Banerjee, K. Zagorski, J.-C. Rochet, and Y. L. Lyubchenko, “Assembly of α-synuclein aggregates on phospholipid bilayers,” Biochim. Biophys. Acta Proteins Proteom., vol. 1867, no. 9, pp. 802–812, Sep. 2019, doi: 10.1016/j.bbapap.2019.06.006.
[26] Z.-R. Chen, J.-B. Huang, S.-L. Yang, and F.-F. Hong, “Role of Cholinergic Signaling in Alzheimer’s Disease,” Molecules, vol. 27, no. 6, Mar. 2022, doi: 10.3390/molecules27061816.
[27] M. Kikuchi et al., “Polygenic effects on the risk of Alzheimer’s disease in the Japanese population,” medRxiv, p. 2023.10.06.23296656, 2023, [Online]. Available: https://www.medrxiv.org/content/10.1101/2023.10.06.23296656v1.


Editorial decision: Revision requested
27 Sep, 2024
Reviews received at journal
26 Sep, 2024
Reviewers agreed at journal
25 Sep, 2024
Reviews received at journal
10 Aug, 2024
Reviewers agreed at journal
03 Aug, 2024
Reviewers agreed at journal
29 Jul, 2024
Reviewers agreed at journal
29 Jul, 2024
Reviewers invited by journal
29 Jul, 2024
Submission checks completed at journal
16 May, 2024
Editor assigned by journal
16 May, 2024
First submitted to journal
07 Mar, 2024
