Modern technologies have enabled the use of genomic data to predict and customize strategies for preventing and treating diseases. Millions of single-nucleotide polymorphisms (SNPs) exist in the human genome, and genome-wide association studies (GWAS) help identify associations between SNPs and various diseases [1]. Frequently, polymorphisms with weak individual effects may collectively exhibit a strong correlation with a disease [2]. The polygenic risk score (PRS), a linear regression model that uses individual SNPs with weights derived from GWAS, has traditionally been used to assess the risk of multifactorial disease manifestation. Although PRS has rightfully become the most popular tool due to its simplicity and good predictive ability, it has significant limitations, such as the inability to account for the non-linear effects of epistasis. Although this term has historically been used to describe various genetic events, the most suitable definition was proposed by Fisher [3]: statistical epistasis, a phenomenon in which the effect of genetic variants on disease is non-additive. Epistasis is a field of active study and has already been shown to have a significant effect in a number of diseases [4]. It is a challenging aspect of building a reliable polygenic risk model, as linear approaches are often insufficient to capture non-linear relationships between genetic variants and disease.
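To make the contrast between an additive score and a non-additive effect concrete, the following minimal sketch computes a PRS as a weighted sum of risk-allele counts and compares it with a toy two-locus interaction. The weights, the genotypes, and the XOR-like rule in `epistatic_risk` are hypothetical illustrations and are not taken from any of the cited studies.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[int], weights: Sequence[float]) -> float:
    """Additive score: sum over SNPs of weight times risk-allele count (0, 1 or 2)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))


def epistatic_risk(dosage_a: int, dosage_b: int) -> int:
    """Hypothetical non-additive rule (assumption): risk is present only when
    exactly one of the two loci carries a risk allele (XOR-like interaction)."""
    return int((dosage_a > 0) != (dosage_b > 0))


# Hypothetical GWAS-derived weights and one individual's genotype.
weights = [0.5, -0.25, 1.0]
genotype = [2, 1, 0]
score = polygenic_risk_score(genotype, weights)

# Compare an equal-weight additive score with the epistatic rule
# on the four carrier combinations of two loci.
pairs = [(0, 0), (1, 0), (0, 1), (1, 1)]
table = [
    (a, b, polygenic_risk_score([a, b], [1.0, 1.0]), epistatic_risk(a, b))
    for a, b in pairs
]
for a, b, additive, risk in table:
    print(f"locus A={a} locus B={b} additive score={additive} epistatic risk={risk}")
```

In this toy setting the double carrier receives the highest additive score although the interaction rule assigns it no risk, which is the kind of pattern a purely linear model cannot represent.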
Machine learning techniques may help to overcome some of the limitations of PRS. For instance, deep neural networks (DNN) have improved PRS for predicting breast cancer [5]. The DNN demonstrated better results (AUC ROC 0.674) than any other approach, including best linear unbiased estimator (AUC ROC 0.642), BayesA (AUC ROC 0.645), LDpred (AUC ROC 0.624), random forest (AUC ROC 0.636) and gradient boosting (AUC ROC 0.651). The same conclusion was reached for breast cancer and its subtypes in a Chinese population [6], although the difference in performance was smaller (AUC ROC of 0.601 for DNN and 0.598 for logistic ridge regression). Neural network-based approaches have also proven effective in predicting other risks, including some heart conditions (myocardial infarction, stroke and others) [7], Alzheimer’s disease [8, 9] and 10 phenotypes from the UK Biobank [10]. In this study, we evaluated the potential of various machine-learning methods on simulated data with epistasis. After that, we tested the performance of these models on multifactorial diseases: obesity, type 1 diabetes, and psoriasis.
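Since the comparisons above are all expressed as AUC ROC, a brief sketch of the metric may help. It uses the rank interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The function name `roc_auc` and the example labels and scores are illustrative assumptions; the cited studies may have computed the metric with other tools.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (case, control) pairs in which the case scores higher.

    Ties contribute 0.5. The pairwise loop is quadratic in the sample size,
    which is acceptable for a sketch but not for large cohorts.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical labels (1 = case, 0 = control) and model scores.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

A value of 0.5 corresponds to a score that carries no ranking information, and 1.0 to perfect separation of cases from controls, which puts the reported range of roughly 0.60 to 0.67 in context.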
Obesity is a global health problem that has raised major concerns in recent decades. According to the World Health Organization (WHO), obesity rates have nearly tripled worldwide since 1975, with over 650 million adults categorized as obese [11]. Obesity is associated with numerous health risks and chronic conditions, including type 2 diabetes, cardiovascular disease, high blood pressure, some types of cancer, and respiratory problems [12]. In addition, obesity has a significant impact on a person's mental well-being, leading to anxiety and depression [13]. The causes of obesity are commonly associated with various environmental factors, including demographic, socioeconomic, and behavioral contributions [14]. Nevertheless, variation in body weight is largely modulated by a strong genetic component that determines an individual's susceptibility to these factors. Twin and family studies have estimated the heritability of obesity at approximately 40–70% [15]. Obesity risk prediction is currently a subject of thorough research, with machine learning methods being actively used. Commonly used models include logistic regression, naïve Bayes, gradient boosting, random forest, support vector machines, k-nearest neighbors, and various neural network architectures, mainly multilayer perceptrons (MLP) and convolutional neural networks (CNN). The majority of published research relies on non-genetic information, such as social and clinical factors [16–19]. This strategy typically proves fruitful, demonstrating high predictive power. However, the best results are typically achieved when environmental factors and genetic information are considered together. There are fewer publications on polygenic risk prediction for obesity, possibly because of the difficulty of constructing a sufficient dataset containing both genetic and phenotypic information.
Nevertheless, machine-learning algorithms have been shown to be accurate and reliable, with an average AUC ROC of 0.7 [20, 21]. This approach is often used to identify the SNPs that have the most significant impact on obesity [22, 23]. It was also demonstrated that age and gender might be among the most important cofactors [22].
The second tested phenotype, type 1 diabetes, is an autoimmune disease in which the immune system attacks the insulin-producing cells of the pancreas. Its adverse effects may include high blood sugar levels, heart disease, stroke, kidney disease, nerve damage, and eye problems. Although type 1 diabetes cannot currently be prevented, knowledge of genetic predisposition is important, as early diagnosis and proactive management are key to minimizing the negative effects [24]. Moreover, type 1 diabetes is commonly misdiagnosed as type 2 based on clinical indicators [25]. Because these diseases require different treatment strategies, genetic information becomes of great importance in classifying and predicting type 1 diabetes. While there is an abundance of research on type 2 diabetes classification using machine learning approaches on genetic [26, 27] and non-genetic data [28], only a limited number of publications focus on type 1 diabetes. Using clinical and socio-economic factors, researchers were able to reach AUC ROC values of up to 0.83 [29, 30]. Even more impressive results, with AUC ROC of 0.96–0.99, were achieved using a metagenomics approach in infants [31, 32]. Unfortunately, metagenomics is a rather complex and expensive analysis, and non-genetic classifiers rely on medical history and personal information. Therefore, there is a need for a reliable type 1 diabetes prediction model based on genetic data.
Finally, psoriasis is a chronic autoimmune skin condition that has a strong genetic component. Family and twin studies show strong hereditary patterns, with a higher risk if parents have the condition [33]. Because GWAS show that psoriasis is highly dependent on genetics, polygenic approaches can help to estimate the risks associated with the disease and to design better treatment strategies [34]. Heterogeneous types of data are currently being used for psoriasis risk prediction (see [35] for a review). The best results of machine learning classification algorithms (accuracy up to 98%) were achieved using gene expression data from affected and healthy cells [36]. Unfortunately, such an approach is not suited to early prediction, since it analyzes the affected cells. Using genetic information, it is possible to predict psoriasis before the disease manifests itself in any way.
In this paper, we study how epistasis complicates disease classification. For this purpose, we trained machine learning models, including deep learning architectures, on simulated data containing phenotypes with epistasis of varying complexity. We then verified our machine learning models on real genetic data collected for three phenotypes: obesity, type 1 diabetes and psoriasis.
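As a rough illustration of what simulated data with epistasis of varying complexity could look like, the sketch below generates random genotypes and a binary phenotype determined by the parity of carrier loci among the first `order` SNPs. This is a toy scheme assumed purely for illustration; the function `simulate_cohort`, its parameters, and the parity rule are hypothetical and do not describe the simulation procedure actually used in this work.

```python
import random
from typing import List, Tuple


def simulate_cohort(
    n_samples: int, n_snps: int, order: int, maf: float = 0.3, seed: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Return (genotypes, labels) under a toy parity-based epistasis rule.

    order = 1 gives a single-locus effect with no interaction; order = 2
    gives an XOR-like pairwise interaction; larger values give higher-order
    interactions that are harder for an additive model to capture.
    """
    if not 1 <= order <= n_snps:
        raise ValueError("order must be between 1 and n_snps")
    rng = random.Random(seed)
    genotypes: List[List[int]] = []
    labels: List[int] = []
    for _ in range(n_samples):
        # Dosage is the number of minor alleles drawn on two chromosomes.
        row = [int(rng.random() < maf) + int(rng.random() < maf) for _ in range(n_snps)]
        # Assumed rule: case if an odd number of the interacting loci carry a minor allele.
        carriers = sum(1 for d in row[:order] if d > 0)
        genotypes.append(row)
        labels.append(carriers % 2)
    return genotypes, labels


genotypes, labels = simulate_cohort(n_samples=200, n_snps=5, order=2, seed=1)
print(len(genotypes), sum(labels))
```

The phenotype here is deterministic and noise-free; a more realistic simulation would add environmental noise and incomplete penetrance, which this sketch deliberately leaves out.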