Random forest and its properties
RF is a popular ML algorithm because it often demonstrates good performance while remaining relatively easy to tune and interpret. RF belongs to the bagging (bootstrap aggregation) family of ensemble ML methods, in which a group of weak learners in the form of decision trees (DT) classifies the outcome by majority vote. Decision trees are sensitive to the data they are trained on and often suffer from high variance, especially when the depth of the tree is large. To address this, the RF algorithm trains each tree on a subset of samples drawn from the complete dataset with the bootstrap procedure. An additional source of randomness is introduced during the construction of each tree, when a split node is selected from a random sample of features (in place of the greedy search through the full feature set used in a standalone DT). These two sources of randomness aim to decorrelate the weak learners and thereby decrease the variance of the estimator by combining the predictions of diverse trees via majority vote.
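As an illustrative sketch (not the implementation used in this study), the two aggregation steps of bagging, bootstrap sampling and majority voting, can be expressed as follows; the toy dataset and function names are hypothetical:

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw n samples with replacement (the bootstrap step of bagging)."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

def majority_vote(predictions):
    """Aggregate the per-tree predictions for one sample."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
X = [[0], [1], [2], [1], [0]]
y = [0, 1, 1, 1, 0]
Xb, yb = bootstrap_sample(X, y, rng)
print(len(Xb))                   # same size as the original dataset
print(majority_vote([1, 0, 1]))  # -> 1
```

Each tree in the ensemble would be trained on its own bootstrap sample, and the per-tree predictions combined by `majority_vote`.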
During the construction of a DT, the decrease in the error function can be calculated for the feature used at each split node. In a classification task, this is often done by estimating the Gini score or the entropy score. These decreases can be averaged across all trees and returned as feature importance scores (the greater the decrease, the higher the feature importance). Feature importance scores are conveniently exposed by most RF implementations, which makes this method more interpretable.
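For illustration, the impurity decrease at a single split can be computed as below (a minimal sketch with hypothetical helper names, not a specific RF library's implementation):

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in probs)

def impurity_decrease(parent, left, right):
    """Weighted decrease in Gini impurity produced by one split; averaging
    such decreases per feature over all trees gives the impurity-based
    feature importance."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = [0, 0, 1, 1]
print(impurity_decrease(parent, [0, 0], [1, 1]))  # perfect split -> 0.5
```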
Permutation Feature Importances
Permutation feature importance metrics were first introduced by Breiman in his Random Forest manuscript [37] and further extended by Altmann [38] to correct for the bias of the RF’s Gini importance and entropy criterion for feature selection. We utilize a custom implementation of PFI that can be applied to any machine learning classification or regression algorithm (Fig. 1). Here, the PFI metric is calculated with the following steps: 1) the dataset is shuffled and split into training and testing datasets; 2) the model is fitted on the training dataset and the balanced accuracy is estimated on the testing dataset; 3) feature 1 out of N is permuted in the testing dataset; 4) the balanced accuracy is estimated on the permuted testing dataset; 5) the relative decrease between the permuted and non-permuted balanced accuracies is calculated and stored as the relative decrease in accuracy; 6) steps 3 through 5 are repeated for the remaining N-1 features; 7) steps 1 through 6 are replicated N-1 times with new seeds for the shuffle-and-split procedure; 8) the mean relative decrease in accuracy per feature is calculated across the splits and used as the feature’s PFI value. Retrieved PFI values were normalized to sum to 1.
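The permute/score/normalize loop described above can be sketched generically as follows. This is a simplified illustration of a single train/test split with a toy model, and all names are hypothetical rather than the study's actual implementation:

```python
import random

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls (robust to class imbalance)."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def permutation_importance(predict, X_test, y_test, n_repeats=10, seed=0):
    """Permute each feature column in turn, measure the relative decrease
    in balanced accuracy, and normalize the per-feature means to sum to 1."""
    rng = random.Random(seed)
    base = balanced_accuracy(y_test, [predict(row) for row in X_test])
    scores = []
    for j in range(len(X_test[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X_test]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X_test, col)]
            perm = balanced_accuracy(y_test, [predict(row) for row in X_perm])
            drops.append((base - perm) / base if base else 0.0)
        scores.append(sum(drops) / n_repeats)
    total = sum(scores)
    return [s / total for s in scores] if total else scores

# toy model: the class is fully determined by feature 0; feature 1 is noise
predict = lambda row: row[0]
X_test = [[0, 1], [1, 0], [0, 0], [1, 1]] * 5
y_test = [row[0] for row in X_test]
print(permutation_importance(predict, X_test, y_test))  # -> [1.0, 0.0]
```

Because the toy model ignores feature 1, permuting it never changes the predictions, so all of the normalized importance accrues to feature 0.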
Evaluation Of The Feature Importance Metrics
Data simulation using HIBACHI
We used the Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions (HIBACHI) software to simulate genetic datasets with non-additive epistatic interactions of varying complexity. The HIBACHI method employs biological and mathematical frameworks to connect genotype and phenotype [39, 40]. At the bottom of the biological framework is the concept of information transfer from the DNA sequence to a clinical phenotype through complex interactions at multiple levels: gene expression, pathway, and cell. Within this framework, a population of samples is expressed as a collection of genes (genotype), each of which takes one of three variant values: 0, 1, or 2. The mathematical framework’s goal is to specify a relationship between genotype and phenotype in terms of logical, arithmetic, and other functions. HIBACHI merges the frameworks by evolving a mathematical expression tree which, when applied to a genotype, generates a binary clinical phenotype.
HIBACHI employs Genetic Programming (GP) as an optimization engine. One of the key characteristics of GP is a fitness function, represented through a mathematical expression, that satisfies a specific objective of interest defined by the user. This objective could be the performance of a machine learning pipeline, the complexity of genetic interactions, the odds ratio of genetic effect sizes, the length of the expression tree, etc., or a combination of these in the form of a multi-objective fitness function. Additionally, the user is allowed to specify the length of the optimization (the number of generations), the population size (number of samples), and the genotype size (number of genes) of the dataset.
At the beginning of the GP optimization process, a population of individuals with random mathematical expression trees is initialized and then subjected to mutation and recombination. These processes serve as a source of variation for the expression trees and occur at pre-defined rates. After that, the user-defined fitness function is calculated and the best-fitted individuals are selected for the next round. At the last optimization round, the overall best-fitted individual is used to produce the output dataset.
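A minimal evolutionary loop of this kind can be sketched as below. For brevity the sketch uses bit vectors as a stand-in for expression trees and a trivial fitness function, so all names, sizes, and rates here are hypothetical illustrations rather than HIBACHI's actual settings:

```python
import random

rng = random.Random(42)

def random_individual():
    # stand-in for a random expression tree
    return [rng.randint(0, 1) for _ in range(8)]

def mutate(ind, rate=0.1):
    # flip each element with a pre-defined mutation rate
    return [1 - b if rng.random() < rate else b for b in ind]

def crossover(a, b):
    # one-point recombination of two parents
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def fitness(ind):
    # stand-in for a user-defined objective (e.g. information gain)
    return sum(ind)

# initialize a random population, then mutate/recombine/select each generation
pop = [random_individual() for _ in range(20)]
for generation in range(30):
    offspring = [mutate(crossover(*rng.sample(pop, 2))) for _ in range(20)]
    pop = sorted(pop + offspring, key=fitness, reverse=True)[:20]

best = max(pop, key=fitness)
print(fitness(best))  # best fitness found (optimum for this toy objective is 8)
```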
For the aims established within this study, we wanted to generate datasets with non-additive epistatic interactions involving two and three genetic variants and, therefore, defined a fitness function that maximizes the two-way and three-way information gain (IG) terms. The entropy-based IG approach to detecting epistatic interactions was introduced by Moore et al. [41] and extended by Fan et al. [42]. The two-way IG is defined as:
IG(X;Y;Z) = I(X,Y;Z) - I(X;Z) - I(Y;Z),
where I is the mutual information that describes the dependency between the variables X, Y, and Z. It measures the reduction in uncertainty of one variable (Z) given knowledge of the others (X and Y). It is expressed with entropy terms:
I(X,Y;Z) = H(X,Y) + H(Z) - H(X,Y,Z),
where the entropy H is defined with the probability mass function p(x):

H(X) = -Σ_x p(x) log2 p(x).
Definitions of the three-way IG can be found in Hu et al., 2013 [43]. We used the following version of this term:

IG(W;X;Y;Z) = I(W,X,Y;Z) - IG(W;X;Z) - IG(W;Y;Z) - IG(X;Y;Z) - I(W;Z) - I(X;Z) - I(Y;Z),

where W, X, and Y are genotype variables and Z is the class.
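These terms can be computed directly from empirical frequencies. The following sketch (hypothetical helper names, not the HIBACHI or ViSEN code) illustrates the two-way case on a pure XOR interaction, where both main effects are zero but the pairwise synergy is maximal:

```python
import math
from collections import Counter

def entropy(*columns):
    """Shannon entropy of one variable, or joint entropy of several."""
    joint = list(zip(*columns))
    n = len(joint)
    counts = Counter(joint)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(x, z):
    """I(X;Z) = H(X) + H(Z) - H(X,Z)."""
    return entropy(x) + entropy(z) - entropy(x, z)

def two_way_ig(x, y, z):
    """IG(X;Y;Z) = I(X,Y;Z) - I(X;Z) - I(Y;Z)."""
    joint_xy = list(zip(x, y))
    i_xy_z = entropy(joint_xy) + entropy(z) - entropy(joint_xy, z)
    return i_xy_z - mutual_information(x, z) - mutual_information(y, z)

# pure XOR epistasis: neither x nor y alone predicts z, but together they do
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [a ^ b for a, b in zip(x, y)]
print(round(mutual_information(x, z), 6))  # -> 0.0 (no main effect)
print(round(two_way_ig(x, y, z), 6))       # -> 1.0 (pure synergy)
```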
Additionally, a second fitness objective was set up as the maximization of expression tree length, to encourage multiple combinations of genetic interactions. In addition to the variability in interaction complexity, the following factors were considered in the HIBACHI experimental schemes: the percentage of cases (25% and 50%), to address an imbalanced dataset structure, and the sample size (1000 and 10000). Each experimental setup was reproduced 100 times using a random seed generator, and the whole population of replicates was considered in the subsequent analysis. All simulated data used here are available upon request.
Sensitivity Analysis
We implemented a HIBACHI-based sensitivity analysis to determine the true feature importance ranks and effect sizes. For this, a permutation-based framework was implemented with the following steps (Fig. 2): 1) split the HIBACHI-generated dataset into the outcome vector and the feature set; 2) permute feature vector 1 out of N and re-evaluate the outcome by applying the HIBACHI-generated expression tree; 3) calculate the dissimilarity of the outcome as the count of mismatches between the outcomes of the perturbed and unperturbed feature sets; 4) repeat steps 2–3 for the remaining features and normalize the counts by the total sum; 5) replicate steps 2–4 100 times and calculate the average of the replicates per feature as the final true feature importance score. Sensitivity scores were further normalized to sum to 1.
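The steps above can be sketched as follows, with a toy XOR "expression tree" standing in for a HIBACHI-generated one (all names here are hypothetical illustrations, not the study's code):

```python
import random

def sensitivity_scores(model, X, n_replicates=100, seed=0):
    """Permute each feature column, re-apply the generating model, and score
    each feature by the average fraction of outcome mismatches it causes,
    normalized across features to sum to 1."""
    rng = random.Random(seed)
    y = [model(row) for row in X]
    totals = [0.0] * len(X[0])
    for _ in range(n_replicates):
        for j in range(len(X[0])):
            col = [row[j] for row in X]
            rng.shuffle(col)
            y_perm = [model(row[:j] + [v] + row[j + 1:]) for row, v in zip(X, col)]
            totals[j] += sum(a != b for a, b in zip(y, y_perm)) / len(y)
    means = [t / n_replicates for t in totals]
    s = sum(means)
    return [m / s for m in means] if s else means

# toy "expression tree": the outcome is the XOR of features 0 and 1
model = lambda row: row[0] ^ row[1]
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
scores = sensitivity_scores(model, X)
print(scores)  # feature 2 never changes the outcome, so its score is exactly 0
```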
Real World Data Analysis
To examine the convergence of the RF’s feature importance metrics, we used two real-world datasets with evidence of non-additive interactions. The first includes preselected SNPs from a genome-wide association study of Alzheimer’s Disease, while the second includes preselected SNPs from a genome-wide association study of Primary Open Angle Glaucoma (POAG). The Alzheimer’s dataset came from the Alzheimer’s Disease Neuroimaging Initiative, during which functional MRI was performed every six to twelve months on patients in three health conditions: neuro-typical (identified here as controls), mild cognitive impairment, and Alzheimer’s disease (the latter two identified here as cases). A computational evolution system [44] identified a model of seven SNPs with evidence of non-additive interactions and a classification accuracy of 0.738. Among these SNPs are one SNP with a large main effect located in the APOE gene (rs429358), a known risk factor for Alzheimer’s disease; four SNPs located within or near genes with known functionality or disease associations (rs1931073, in an intergenic region near the PPAP2B gene, which participates in cell-cell interactions; rs7782571, near the ISPD gene, which is associated with Walker-Warburg syndrome; rs4955208, in the OSBPL10 gene, which encodes an intracellular lipid receptor; and rs12209418, in the PKIB gene, which encodes a protein kinase inhibitor); the two remaining SNPs (rs2414325, rs12785149) are located within genes of unknown functionality.
The glaucoma dataset came from the Glaucoma Gene Environment Initiative study and contained individuals with POAG identified as cases and healthy individuals as controls. This dataset was previously analyzed by Moore et al. [45] with the EMERGENT algorithm, which resulted in the identification of a model of six SNPs with evidence of non-additive interactions and a classification accuracy of 0.615. Two of these SNPs (rs2157719 and rs1266924) are located within genes previously associated with glaucoma, two SNPs (rs10915315 and rs1489169) are located within genes associated with diseases unrelated to glaucoma but with relevant pathology, and the two remaining SNPs (rs936498 and rs7738052) are located within genes not previously associated with any disease but with known functionality relevant to visual cortex and retina development.
We used the Visualization of Statistical Epistasis Networks (ViSEN) method [46] to analyze and visualize SNP main effects and two-way and three-way gene-gene interactions among SNPs for the real-world datasets. The ViSEN method calculates the mutual information (MI) between each individual SNP (genotype) and the phenotype, the pairwise interaction between every pair of SNPs and the phenotype, and the three-way interaction between every combination of three SNPs and the phenotype via the IG term. A positive IG indicates synergistic (i.e., non-additive) effects of SNPs on the phenotype. The IG metrics used in the ViSEN method were designed to detect pure epistatic interactions and exclude all lower-order effects (by subtracting all main effects and, in the case of the three-way term, pairwise synergies).