1. Dataset
Data used in this study is acquired from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, accessible at adni.loni.usc.edu and NCBI dbGAP, accessible at https://www.ncbi.nlm.nih.gov/gap. dbGAP (Database of Genotypes and Phenotypes), developed for storing and sharing genotype and phenotype data, family history, and other clinical findings of the association studies regarding a particular disease, is supported and published by NCBI. On the other hand, ADNI (The Alzheimer's Disease Neuroimaging Initiative) constitutes an ongoing, longitudinal, multicenter research initiative aimed at establishing clinical, imaging, genetic, and biochemical biomarkers for the early detection and monitoring of Alzheimer's disease (AD). ADNI collects, validates, stores, and utilizes data such as MRI and PET images, genetics, cognitive tests, and blood biomarkers to predict Alzheimer's disease.
With the controlled accessed, GWAS datasets provided by ADNI (210 controls and 344 cases), GenADA (777 controls and 798 cases), and NCRAD through dbGAP initiatives (1310 controls and 1289 cases) are subjected to analysis for construction of multi-models. The initiatives utilize Affymetrix Mapping250K_Nsp and Mapping250K_Sty, as well as Illumina Human610_Quadv1_B 500K and Illumina Human610-Quad BeadChip platforms, as illustrated in Figure 1. This study incorporates a total of 620,901, 410,907, and 585,295 QC-passed SNVs for GenADA, NCRAD, and ADNI, respectively.
As platforms and single nucleotide variations focused on AD are different, there is a need to process datasets individually to identify the causing variations. As a result, the individual prediction model identifies unique and notable prior information before the ensemble process of the multi-platform dataset.
2. Development of the Workflow
First, GWAS analysis is employed to discover statistically meaningful SNVs using the chi-square test for each dataset. As the GWAS only identifies the independent statistical significances of variations, we used this information to filter the non-informative SNVs that are not directly related to the disease.
After the initial dimension reduction analysis for filtering statistically insignificant variations based on GWAS, the SNVs significantly associated with LOAD are used for further analysis. We have reported that the RF-RF approach outperformed the hybrid models for the GenADA dataset (unpublished data). As the top-performing modeling approach was GWAS filtering followed by two-step RF, we developed LOAD-RF-RF models for all three datasets.
The second phase of this study integrates information from three data mining models for each dataset with the proposed ensemble method. For the first time, we suggested using a higher space (SNVs to genomic locations on cytogenetic bands), as seen in Figure 2, while ensembling multi-models that offer significant information after GWAS and RF analysis. Since RF-selected variants from different datasets differ due to the sequencing platforms, SNVs are mapped to chromosomal positions to locate and analyze the neighboring variants. An ensemble scoring algorithm is proposed to calculate information provided regarding cytogenetic bands that emphasize disease-related and causing variations, as seen in detail in Figure 1. While scoring, a three-rowed frame is slipped down within a band containing SNVs. The frame size is selected three as the information comes from three datasets. Linkage Disequilibrium (LD) for each SNV, the incident of variations in all databases, and validation with the Random Forest were vital indicators to assign scores to prioritize and select the most critical variations of AD.
a. Preprocessing
The data preprocessing step describes any operation on data for eliminating noise, cleaning redundancy, and determining the significant subset by feature extraction or dimension reduction methods within the data before modeling. Preprocessing takes a large part of the total effort.
As allele-calling algorithms continue to improve, quality improvement strategies must ensure that only reliable markers are used for analysis. In data, quality control of genotype data is conducted initially before the authorized access dataset is realized. Autosomal SNVs with Hardy Weinberg Equilibrium value<5e-07 are excluded in controls and MAF<0.05 and call rate<0.99. Remaining SNVs with call rate<0.95 and without a valid map position are also excluded from genotype data.
GWAS analysis is implemented for each dataset to eliminate redundant variations by comparing the results between unadjusted p-values with a significant value of 0.05. GWAS is used only for pre-filtering; hence, multiple hypotheses are not considered.
b. Model Construction
Dimension reduction can be beneficial for computational efficiency and improve the accuracy of the data mining model. After GWAS with PLINK, Random Forest (RF) is employed tandemly for dimension reduction and supervised learning model implementation [9]. Since we are considering nonlinear patterns during the dimension reduction by applying data mining methods, false positives are better eliminated.
The dataset is partitioned by bagging (bootstrap) techniques, which are used to reduce the variance of predictions by assembling the results of multiple classifiers obtained on different sub-samples of the same dataset. Sampling is performed on the initial data, leading to the creation of new datasets for every tree construction. Data that is out of training is considered out of the bag (OOB) for assessing the accuracy of trees.
First, parameter estimations of model building for the number of trees to be generated and how many features are randomly available to be considered for each new split are calculated. Hence, the value of "the number of trees" is determined with different sizes. In addition to this, the "number of randomly selected variables" is also tuned. In order to detect the optimized random forest parameter, the "RANGER" package is used in R (including the "randomForest" package) as the RANGER library has functionality for tuning RF. It automatizes fitting multiple versions of an Ensemble Scoring Model by changing its crucial parameters and tree construction samples. An increase in the "Number of trees" brings computational effort, but the model's accuracy may increase as this increases. Sequential values are given to estimate the tuning parameters, and the model is trained.
A confusion matrix is formed for general accuracy, specificity, and sensitivity information based on the general majority voting on OOB samples. Then, the best model whose parameters produce higher accuracy in terms of OOB samples is set as the final model.
Finally, variable importance values are calculated using the RANGER package in R by calculating t-test statistics for each variable for the most accurate model. It evaluates how the importance of all variables' existence in the model.
c. Ensemble Scoring Algorithm
In the literature, a few ensemble methods that work within the same dataset [5], [10], [11], [12], [13] are proposed. So far, there has been no ensemble methodology for analyzing multi-platforms whose attributes differ. Here, the output of first-step prediction models is used as prior knowledge to merge all information to calculate an importance score for each SNV that gives insight into LOAD susceptibility. As the novelty of the Ensemble Scoring Algorithm, we have utilized genomic location as a higher dimension where all RF-selected SNVs are mapped to their chromosomal positions in the genomic space, and cytogenetic bands are set as slots to survey through a sliding window approach for scoring (Figure 2).
Three factors are used in the Ensemble Scoring Algorithm: (1) the number of platforms in the scoring window showing variants from different data sets are concentrated in the same region, (2) linkage disequilibrium (LD) with neighbors, and (3) whether the SNV is selected after multi-step RF as significant. Based on these three variables, each variant is assessed by exploiting the neighboring SNVs in the sliding window. A higher score for an SNV first indicates that the loci are associated with the condition in multiple studies, and multiple unlinked SNVs are identified as associated with that loci. A further selection of an SNV through a multi-step RF-RF model marks the candidate causative variants in that window.
The number of SNVs in each window is set to three to analyze multiple datasets. Figure 3 expresses the flowchart of the Ensemble Scoring Algorithm for calculating the importance of SNVs. The algorithm's steps are given in Algorithm 1.
The scoring formula for each SNV has three components: one from database listing, one from LD linkage, and the final component refers to whether RF-RF Modeling selects SNV. The Ensemble Scoring Coefficients and Formula is calculated as follows:
As the number of datasets is three, "d" can be 1, 2, or 3. Based on the predefined LD correlation coefficient, considering the neighborhoods in the window, "l" depends on the number of independent SNVs in the defined sliding window; hence "l" can be 1, 2, or 3. If post and pre neighbors are in LD, “l” is set to the lowest score as 1. "v" shows that an SNV is selected by RF-RF modeling, so corresponding "v" equals 1 if SNV is selected via RF-RF. While analyzing three different multi-platform datasets, the maximum score for an SNV can be 3.31, while the minimum score can be 1.10. In summary, the score can be an element of S ∈ {1.10, 1.11, 1.20, 1.21, 1.30, 1.31, 2.10, 2.11, 2.20, 2.21, 2.30, 2.31, 3.10, 3.11, 3.20, 3.21, 3.30, 3.31}.
All SNVs are scored using the Ensemble Scoring Algorithm. As given in Step 6, each SNV is scored three times as it is listed in different windows. If "rs1333190" is considered an example, the scoring procedure is presented below for each location of the SNV in the windows.
Case 1: When SNV is in the middle of a sliding window.
Figure 4 represents calculations for the SNV called "rs1333190," which occurs in the middle of three-slot window. According to Step 6 of Algorithm 1, the sliding window is created so that "rs1333190" remains in the middle. The window has been expanded since there is an LD linkage for" rs1333190". In this case, it has been seen that information comes from 3 different databases within the extended window. "tandem" was created due to the information from 3 different databases. In this case, "d" is defined as 3. Since an LD linkage exists for" rs1333190", "l" has been determined as 2 in the frame. As the RF-RF model has chosen "rs1333190", "v" has been determined as 1. In this case, the score is calculated as 3.21, formulated in Formula 1.
Case 2: When SNV is at the bottom of a sliding window.
In the first step, according to Step 6 of Algorithm 1, the sliding window is created so that "rs1333190" remains at the bottom. Figure 5 shows how to calculate the score when the SNV occurs at the bottom of the three-slot window. In this case, "rs1333190" is now at the bottom of the window. Since "rs1333190" and "rs4548489" known to be linked, the sliding window expands to contain both as one informative variant. Here, "d" is defined as 2 since two different datasets are contained within the extended window. Since there is no LD linkage within the frame, "l" has been determined as 3 in the frame. As the RF-RF model has chosen "rs1333190", "v" has been determined as 1. In this case, the score is calculated as 2.31 on, which is formulated in Formula 1.
Case 3: When SNV is at the top in the sliding window.
Figure 6 shows the calculations for an SNV called "rs1333190" at the top of three-slot window. Again, following Step 6 of Algorithm 1, the sliding window is created so that "rs1333190" remains at the top. The frame has been expanded since there is an LD linkage for" rs1333190". In this case, "d" is defined as 2. Since an LD linkage exists within the window for" rs1333190", "l" has been determined as 2. Since the RF-RF model has chosen "rs1333190", "v" has been determined as 1. In this case, the score is calculated as 2.21 on which it is formulated in Formula 1.
All three possible scores for rs1333190 are calculated. It scores 3.21 in the middle of the window, 2.31 at the bottom, and 2.21 at the top of the window. In this example, the ensemble score is set to 3.21, the maximum of three EnSCAN scores.
d. Functional Enrichment
After the GWAS algorithm is run for the ensemble and scores for individual SNVs are calculated, the pathway enrichment and visualization of the data are done by following protocol [14]. Pathway enrichment analyses are done by g:Profiler, a web server accessible to the public for characterizing and manipulating lists of genes. Functional enrichment analysis (over-representation analysis (ORA) or gene set enrichment analysis on AD-related SNVs) is performed to map genes to known functional information sources and detect statistically significantly enriched terms.
The g: Profiler tool's g: GOSt component is employed for analyzing variants identified by the ensemble model, specifically focusing on GO Molecular Function, GO Cellular Component, GO Biological Process, and Reactome pathways. Default attributions are used for all analyses, with a significance threshold of 0.05 applied to identify pathways associated with SNVs based on ensemble scores. The enrichment p-values for pathways are computed using Fisher's exact test, and multiple testing corrections are applied using the Bonferroni correction method.
Then, EnrichmentMap, the Cytoscape plugin, is used for functional enrichment visualization by creating networks from Gene Ontology annotations and Reactome pathways. All analyses are completed with p-value 0.05, FDR q-value cut-off 0.01, and edge similarity cut-off 03(Jaccard metric).
IMPLEMENTATION
3. Independent Model Construction for Multi Platforms by PLINK and Random Forest
We developed in-silico models for Late-Onset Alzheimer's Disease (LOAD) using genotyping data obtained from three datasets sourced from ADNI and dbGAP initiatives. The analysis involves examining GWAS datasets from ADNI (210 controls and 344 cases), GenADA (777 controls and 798 cases), and NCRAD through dbGAP (1310 controls and 1289 cases).
The initial step involves conducting GWAS using PLINK, followed by filtering based on p-values for dimension reduction. When only one SNP is investigated, type one error α=0.05 is considered. After GWAS, 3.768 SNVs from GenADA, 16.404 SNVs from NCRAD, and 7.639 SNVs from ADNI are determined as statistically significant [9] to LOAD. PLINK results filter redundant SNVs that are common or uninformative between cohorts.
For each dataset, a Multi-step Random Forest (RF) is executed with 5-fold cross-validation (CV) using the RANGER R package, following GWAS with PLINK. The test performances of the Late-Onset Alzheimer's Disease (LOAD) RF models for ADNI, NCRAD, and GenADA datasets are computed as 72.9%, 68.8%, and 92.4%, respectively. The individual LOAD RF models select 390 SNVs from the ADNI dataset, 1740 from NCRAD, and 434 from GenADA, considering the permutation importance of variants at a 95% confidence level. No consensus variants were identified from three different datasets [9], as seen in Figure 7.
The evaluation of the test performances for the Late-Onset Alzheimer's Disease (LOAD) RF-RF models on ADNI, NCRAD, and GenADA datasets yielded results of 74.0%, 72.1%, and 85.1%, respectively, as detailed in Table 1. Individual LOAD RF-RF models selected 32 SNVs from ADNI, 581 from NCRAD, and 107 from GenADA datasets, taking into account the permutation importance of variants with a 95% confidence level [9]. The RF-RF analysis identified the most significant SNVs. These RF-RF-identified SNVs are utilized during ensemble scoring to enrich the prioritized SNV list for causative variants.
Table 1 Individual Results from Three Datasets
4. Multi-platform Variant Prioritization by EnSCAN Algorithm
Since tagged SNVs do not overlap between different platforms, we developed a framework that maps individual SNVs to a higher dimension, chromosomal location on genes, to observe neighboring variants from different datasets.
SNP Nexus [15] is used to determine the annotation of SNVs. SNiPA [16] determines pairwise Linkage Disequilibrium (LD). The correlation between the phenotype and the marker allele should indicate the presence of the causal SNV in LD. Capturing all LD blocks in the genomic region provides additional information in detecting variants predisposing to the disease. [17] The LD correlation coefficient is determined during the scoring process. LDs of each SNV are defined and labeled. Finally, all SNVs identified in the LOAD-RF-RF model from three platforms with multi-step models are scored following this approach named Sliding Window Algorithm (SWA) for prioritization.
The ensemble algorithm will score each SNV in an iterated scoring window. Hence, for each SNV, three different scorings will be handled where the SNV is in the middle, at the top and bottom of the window, considering three different datasets (the number of SNVs in the window is three).
All variants are scored via the Ensemble Scoring Algorithm (EnSCAN) [18], and the highest-scoring variants are identified. The distribution of calculated scores is presented in Figure 8.
83 SNVs are scored with the highest EnSCAN score of "3.31". 43 out of 83 SNVs are identified as protein-coding (Table 3). Genes carrying these LOAD associated variants with highest EnSCAN score were CSMD2, NR5A2, KIF26B, CCDC3, CSGALNACT2, SLC18A2, PPFIBP2, TSPAN18, ALDH3B1, CNTN5, RNFT2, FBXW8, RBFOX1, NDRG4, PITPNC1, NPLOC4, PHLPP1, RUNX1, OSBP2, CYB5R3, LSAMP, TPRG1, FGF12, TENM3, ENPP6, ADCY2, ANKRD33B, IQGAP2, SPOCK1, ARHGAP26, CYFIP2, SYCP2L, SLC17A2, MDFI, RUNX2, CRISP1, PLAGL1, GNGT1, DLC1, C9orf3, SNX30, HMCN2 (Supplementary Table 1). Among the 43 protein-coding variants with a score of 3.31, twelve mapped on chromosomes 5 and 6, located on ADCY2, ANKRD33B, IQGAP2, SPOCK1, ARHGAP26, CYFIP2, SYCP2L, SLC17A2, MDFI, RUNX2, CRISP1 and PLAGL1 genes.
Also, 29 out of 43 variants on protein-coding genes were associated with known disease phenotypes. Among these, 6.5% were directly linked to Alzheimer Disease (AD), while 35.5% were associated with other AD-related conditions, including cholesterol metabolism, Type 2 Diabetes, cardiovascular disorders, and immunological issues. Furthermore, 30.5% were tied to addiction-related phenotypes such as Tobacco Use Disorder. Additionally, 13% were associated with various neuropsychiatric disorders, and the remaining 14.5% were linked to non-neurological or cancer-related phenotypes.
5. Evaluation of Ensemble Scores
We perform functional enrichment analysis, over-representation analysis (ORA), or gene set enrichment analysis for EnSCAN variants with various scoring categories as a validation method. Here, we map SNVs over genes to access known functional information sources and detect statistically significantly enriched terms for identifying the efficacious enrichments related to Alzheimer disease.
The overall distribution of ENSCAN scores and their overlap between datasets are summarized in Figure 9. SNVs with scores greater than 2 (observed in at least two different datasets from ADNI, NCRAD, or GENADA) are analyzed in different combinations. The pathways from the following EnSCAN score categories are benchmarked against all calculated scores:
1) All variants with EnSCAN scores as a benchmark
2) Variants with EnSCAN score equal to 2.31
3) Variants with EnSCAN scores 2.31 or 3.31
4) Variants with EnSCAN scores 3.11 or 3.21, or 3.31 (3s)
5) Variants with EnSCAN scores equal 3.31
Evaluation of All LOAD-RF-RF Variants:
When SNVs with EnSCAN scores are evaluated, the enrichment analysis reveals a general view of the biological process possibly underlying the molecular etiology of AD (Figure 10).
Tissue and cell-type identity lie at the core of human physiology and disease, considering "multicellular organismal process" [19]. Gaining insights into the genetic foundations of intricate tissues and specific cell lineages is pivotal in advancing diagnostics and therapeutics. The exploration of tissue-specific networks offers a novel approach for generating hypotheses pertaining to the molecular origins of human diseases.
"Neuronal system development" is also critical for AD. Strategies to improve the symptoms of aging and age-related diseases such as Alzheimer's Disease have included different means to stimulate neurogenesis, both pharmacologically and naturally. The regulatory mechanisms of stem cell neurogenesis or a functional integration of newborn neurons have been explored to provide the basis for grafted stem cell therapy. A review [20] aims to provide an overview of AD pathology of different neural and glial cell types and summarizes current strategies of experimental stem cell treatments and their putative future use in clinical settings.
[21] shows that abnormal "ion channels" (especially those showing selectivity for cations like Ca2+) might be a reason of the toxic mechanism behind AD pathology. "Peptides or phosphopeptides" of common plasma proteins show increased observation frequency by Chi Square and/or precursor intensity in AD, according to a recent study [22]. Increases in mean precursor intensity of peptides from common plasma proteins such as DISC1, EXOSC5, UBE2G1, SMIM19, NXNL1, PANO, EIF4G1, KIR3DP1, MED25, MGRN1, OR8B3, MGC24039, POLR1A, SYTL4, RNF111, IREB2, ANKMY2, SGKL, SLC25A5, CHMP3 among others were observed in association with AD.
Impairments of the "extracellular matrix" [23] have been reported to have a role in Alzheimer Disease, focusing on synaptic transmission, amyloid-β-plaque generation and degradation, Tau-protein production, oxidative stress response, and inflammatory response. The extracellular matrix comprises various macromolecules secreted by cells, namely collagen, elastin, fibronectin, laminin, and glycoproteins; regulating the expression of individual components is an essential step in stabilizing or improving the course of the disease. When collagen forms as fibrils around the vulnerable neurons, it blocks the Aβ proteins from binding to the cells. Therefore, its components are a potential target and biomarker for developing and treating AD.
Evaluation of score 2.31:
Enrichment analysis of only SNVs with EnSCAN score of 2.31 is showed that AD-related processes of "adhesion molecules/biological cell adhesion", "cation ion binding/calcium ion binding," and "external encapsulating structure/plasma membrane/periphery membrane component" were enriched within this category. Additionally, hsa mir181a and has mir221 are observed, which was not enriched in the previous benchmark analysis (Figure 11).
Evaluation of union of score 2.31 and 3.31:
SNVs with EnSCAN score of 2.31 or 3.31 are evaluated together under one category; the top enrichment classes are not changed compared to only 2.31 and the benchmark (Figure 12). Considering ontology, development multicellular process, external encapsulating structure, has mir 221, dlc1 dlgap1 myo5a, metal ion transmembrane, plasma membrane periphery, calcium regulation cardiac, component plasma integral, biological adhesion cell, synapse presynapse, adhesion molecules cell, has mir181a, calcium ion binding and cation binding ion are significant terms that affect the susceptibility to Alzheimer Disease in humans.
All three categories analyzed above for the enriched biological process shared common keywords of "adhesion molecules/biological cell adhesion", "ion cation binding/calcium ion binding", and "plasma membrane/ peripherial component/ plasma integral". One biological annotation category that emerges in this set different from 2.31's enrichment analysis that is noteworthy was "synapses and presynapses". The enrichment analysis allowed us to observe biological processes involving LOAD-associated variants. Comparing EnSCAN scoring categories also revealed the consistent enrichment of ontological terms between categories that are known to have roles in the core of AD pathophysiology.
Evaluation of score 3s (3.11, 3.21, 3.31):
As we have focused on neighboring LOAD-RF-RF variants only observed in all three datasets with EnSCAN scores of 3.11, 3.21 and 3.31, we notice that the enrichment analysis is narrowed down onto the “phospholipid binding” (Figure 13), which is a recent topic of interest in AD pathophysiology. It has been reported that abnormal phospholipid binding contributes to the development of AD, and mutations in genes related to phospholipid metabolism have been linked to an increased risk of AD [24]. While more research is definitively needed to understand the role of phospholipid interactions in AD pathogenesis, recent reports are suggesting certain types of phospholipids might promote the formation of alpha-synuclein aggregates [25], which can disrupt normal phospholipid interactions in cell membranes.
Evaluation of score 3.31:
The enrichment analysis of variants with top EnSCAN score reveals the "nicotinic activity on dopaminergic neurons" as the only ontology common between these variants (Figure 14).. In general, activating Nicotinic Acetylcholine Receptors (nAChRs) on dopaminergic neurons leads to increased dopamine release, which regulates various brain functions, including memory, attention, and reward processing. Dopamine is known to have a role in cognitive functions, and decreased dopamine levels are reported as a feature of AD. Interestingly, the relationship between nicotinic activity on dopaminergic neurons and Alzheimer's Disease (AD) is complex and the specific effects of nicotine on dopaminergic neurons in the context of cognitive impairment, memory loss and development of AD are still being investigated [26].