Classification of patients with lithium-treated bipolar disorder based on gene expression: Dirichlet Bayesian network model

doi:10.21203/rs.3.rs-2267196/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Abstract

Background: The Dirichlet Bayesian network (DBN) model uses score-based structure learning, which leads to more accurate learning of the structure of the Bayesian network. This study therefore used the DBN to classify gene expression data from patients with bipolar disorder (BD) treated with lithium.

Methods: Gene expression data from patients with BD, covering 47323 genes, were used; 30 patients received standard treatment and 30 received lithium treatment. First, essential variables were selected using partial least squares (PLS) regression. The plaid algorithm was then used to discover biclusters of genes with similar expression patterns, and principal component analysis (PCA) provided one representative component for each bicluster. We then built the DBN model to classify the correlation network. Finally, the accuracy of the prediction model was evaluated using receiver operating characteristic (ROC) curve analysis. R 3.6.2 was used to analyze the data.

Results: PLS regression identified 10788 essential and significant genes. The plaid algorithm discovered nine homogeneous biclusters. For each bicluster, a representative component explaining at least 75% of the variance in the data was selected using PCA. Classification with the DBN gave a model accuracy of 0.86 and a precision of 0.91.

Conclusions: This study demonstrates the potential of an ensemble approach, which can be developed for network analysis for thousands of genes. Combining models produces more robust and accurate models than single models. Also, network analysis is a desirable approach to detect subtle but coordinated changes in the mutual and related expression of a set of genes. This method can help study other diseases using existing datasets.

Keywords: Bipolar Disorder, Dirichlet Bayesian network, Lithium Therapy, Classification

Background

BD is a complex chronic disease characterized by recurring episodes of depression and mania or hypomania (1). The disorder is common: the lifetime prevalence of bipolar spectrum disorders is between 2.8 and 6.5% (2). BD is the sixth leading cause of disability worldwide in young adults (3). The risk of suicide in patients with this disorder is high, around 15% (4, 5). Various studies have shown the high social cost that the disease imposes on patients, in the form of lost efficiency and productivity, and on their caregivers (6–9). The main symptom of BD is severe mood swings. Mood stabilizers are drugs prescribed to treat both phases of BD; they support the patient's mental health and are effective in both the depressive and the manic phase. Lithium is one of these drugs. It is an established treatment in the management of BD, has been used for about six decades, and also reduces suicidal thoughts or intentions in bipolar patients (10). Because the cause of BD is still unknown, researchers continue to search for it. In recent years, many efforts have been made to learn more about the biological responses associated with lithium treatment and about its effects on genes and the mechanisms involved. In recent years, the use of the Bayesian network model to estimate gene networks from microarray gene expression data has also received a lot of attention. The term Bayesian network, in the form used today, was introduced by Judea Pearl in 1988. Today, more than 30 years later, these networks are used as a powerful tool in various sciences (11).

The reasons for the attention paid to these networks can be summarized as follows. First, gene expression levels are random, and the data contain many errors; because experiments are costly, they often cannot be repeated, so a suitable method must have a probabilistic and statistical basis, which Bayesian networks provide. Second, since gene expression is an inherently random phenomenon, randomness is a fundamental property of the model. In addition, even if the system under investigation is deterministic, it may appear random because not all variables can be measured fully. It is therefore essential that the learning algorithm can work with data that contain errors (12). Bayesian networks have the following features that make them suitable for analyzing gene expression data:

1- Because they are statistical models, they are suitable for gene expression data even when complete information is unavailable, and data with missing values can still be used.

2- Relationships between genes are modeled through conditional probabilities.

3- Because they are probabilistic models, Bayesian networks are also helpful for understanding non-linear relationships (13).

Bayesian network modeling has two stages: a) structure learning and b) parameter learning. Structure learning seeks the network structure that best matches the available data; its methods fall into constraint-based, score-based, and hybrid algorithms. Parameter learning estimates the parameters, that is, the conditional probability of each network vertex given the states of its parents (14). A 2018 study, "Dirichlet Bayesian network scores and the maximum relative entropy principle," showed that the Bayesian Dirichlet model is a score-based structure-learning method that leads to more accurate learning of the structure (15). In this study, gene expression data from patients with BD treated with lithium were classified using a Dirichlet Bayesian network.

Methods

Data resources

The gene expression dataset includes 47323 genes and 60 patients, 32 of whom received standard treatment and 28 of whom received standard treatment plus lithium. The data were obtained by Affymetrix analysis and can be downloaded from the GEO website under accession GSE45484. Of these patients, four in the control group and nine in the treatment group responded to treatment. These data were first analyzed by Beech et al. (16). The present study is designed to classify patients based on genes whose expression in peripheral blood can be an early marker of response to lithium treatment in patients with BD. Although changes in peripheral blood gene expression may not be directly related to mood symptoms, differences in treatment response at the biochemical level may mask some heterogeneity in clinical response to lithium (16).

Feature Selection

A drawback of data generation technologies is the inclusion of irrelevant variables (17, 18). These extraneous variables decrease the model's performance, increase its complexity, and obscure the relevant relationships, so removing them is very important. High-dimensional data sets are often prone to the "large number of variables, small sample size" problem, which PLS regression addresses explicitly. PLS is a supervised method developed to solve prediction problems with many variables. This study used PLS regression and the variable importance in projection (VIP) index to select essential variables (19). After selecting the essential variables, we used biclustering to find homogeneous patterns among the selected genes, and a representative of each bicluster was obtained by principal component analysis. Classification was then performed with the Bayesian Dirichlet network model with uninformative priors, using these representatives and the treatment response status of the patients. Finally, the accuracy of the classification was assessed using ROC curve analysis. The general steps of feature extraction and data classification in this study are as follows (Fig. 1):

• Gene expression data preprocessing: examining the normality and dispersion of the data.

• Selection of an informative set of genes: removing neutral and unchanged genes.

• Biclustering of the data: discovering gene sets with the same patterns using the plaid algorithm.

• Cluster representative selection: using principal component analysis to select a representative.

• Classification of the data: using the DBN model.

• Evaluation and validation: using ROC curve analysis.
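For the gene-selection step, the VIP index is commonly defined as shown here. The notation is an assumption introduced for illustration and is not taken from the study: $p$ is the number of genes, $A$ the number of PLS components, $w_{aj}$ the weight of gene $j$ in component $a$, and $SS_a$ the variance of the response explained by component $a$.

$$\mathrm{VIP}_{j}=\sqrt{\frac{p\sum_{a=1}^{A}SS_{a}\left(w_{aj}/\lVert w_{a}\rVert\right)^{2}}{\sum_{a=1}^{A}SS_{a}}}$$

Because the squared VIP values average to one across the $p$ genes, a cutoff of $\mathrm{VIP}_{j}>1$ is a common convention; the cutoff used in this study is not stated.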

Statistical Analysis

We used R 3.6.2 for data analysis, together with the Bioconductor, biclust, bnlearn, ropls, pROC, and Bayesian Network software packages.

A Bayesian network is a directed acyclic graph (DAG) $\varpi$ together with parameters, in the form of conditional probabilities, that turn this qualitative structure into a quantitative model. The vertices of the graph are random variables, and its directed edges represent dependence between the vertices. The parameters determine the strength of this dependence. The Bayesian network helps to learn causal relationships and is intuitively understandable because of its graphical structure. A Bayesian network thus consists of two parts: the structure and the probabilities. Because the structure contains no directed cycle, the variables can be ordered so that the ancestors of $X_i$ are in the set $\{X_1, X_2, \dots, X_{i-1}\}$ and its descendants are in the set $\{X_{i+1}, \dots, X_n\}$. By the chain rule of probability, and because each variable is independent of its other predecessors given its parents $\Pi_{X_i}$, the joint probability function can be written as follows:

$$P\left(X \mid \varpi\right)=\prod_{i=1}^{n}P\left(X_{i} \mid X_{1},\dots,X_{i-1}\right)=\prod_{i=1}^{n}P\left(X_{i} \mid \Pi_{X_{i}}\right)$$

Learning the Bayesian network from data is usually done with a Bayesian approach. In score-based structure learning, the goal is to find the network $\varpi$ with the highest posterior probability $P\left(\varpi \mid D\right) \propto P\left(\varpi\right)P\left(D \mid \varpi\right)$, where $D$ is a sample from $X$. The marginal likelihood is

$$P\left(D \mid \varpi\right)=\prod_{i=1}^{n}P\left(X_{i} \mid \Pi_{X_{i}}\right)=\prod_{i=1}^{n}\left[\int P\left(X_{i} \mid \Pi_{X_{i}},\theta\right)P\left(\theta \mid \Pi_{X_{i}}\right)d\theta\right]$$

If the Dirichlet conjugate prior is used, the posterior distribution is also Dirichlet, and the marginal likelihood has the closed form known as the Bayesian Dirichlet (BD) score:

$$BD\left(\varpi, D; a\right)=\prod_{i=1}^{n}BD\left(X_{i} \mid \Pi_{X_{i}}; a_{i}\right)=\prod_{i=1}^{n}\prod_{j=1}^{q_{i}}\left[\frac{\Gamma\left(a_{ij}\right)}{\Gamma\left(a_{ij}+n_{ij}\right)}\prod_{k=1}^{r_{i}}\frac{\Gamma\left(a_{ijk}+n_{ijk}\right)}{\Gamma\left(a_{ijk}\right)}\right]$$

where

$r_{i}$ is the number of states of $X_{i}$;

$q_{i}$ is the number of possible configurations of $\Pi_{X_{i}}$, taken to be 1 if $X_{i}$ has no parents;

$n_{ijk}$ is the number of samples in which $X_{i}$ is in state $k$ and $\Pi_{X_{i}}$ is in configuration $j$, with $n_{ij}=\sum_{k}n_{ijk}$;

and $a_{ijk}$ are the Dirichlet hyperparameters, with $a_{ij}=\sum_{k}a_{ijk}$ (15, 20).
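A minimal Python sketch of the BD score follows, computed on the log scale to avoid overflow of the Gamma function. It assumes discrete data stored as a list of dictionaries and, as an illustrative choice not taken from the study, spreads an imaginary sample size `iss` uniformly over the hyperparameters of each node. The function names, argument names, and toy data are hypothetical; the study itself used R packages.

```python
from collections import Counter
from math import lgamma


def log_bd_local(child, parents, data, cards, iss=1.0):
    """Log BD score of one node given its parent set.

    data  : list of dicts mapping variable name -> discrete state
    cards : dict mapping variable name -> number of states
    iss   : imaginary sample size (assumption: uniform hyperparameters,
            iss / (r_i * q_i) for every state and parent configuration)
    """
    r_i = cards[child]
    q_i = 1
    for p in parents:
        q_i *= cards[p]  # q_i stays 1 when the node has no parents
    a_ijk = iss / (r_i * q_i)
    a_ij = iss / q_i

    # n_ij: samples per parent configuration j; n_ijk: split by child state k.
    n_ij = Counter(tuple(row[p] for p in parents) for row in data)
    n_ijk = Counter((tuple(row[p] for p in parents), row[child]) for row in data)

    # Unobserved configurations and states contribute log(1) = 0,
    # so only the observed counts need to be visited.
    score = sum(lgamma(a_ij) - lgamma(a_ij + n) for n in n_ij.values())
    score += sum(lgamma(a_ijk + n) - lgamma(a_ijk) for n in n_ijk.values())
    return score


def log_bd(structure, data, cards, iss=1.0):
    """Log BD score of a network; structure maps each node to its parent list."""
    return sum(log_bd_local(x, ps, data, cards, iss) for x, ps in structure.items())


# Hypothetical toy data: two perfectly associated binary variables.
data = [{"X": 0, "Y": 0}] * 10 + [{"X": 1, "Y": 1}] * 10
cards = {"X": 2, "Y": 2}
with_edge = log_bd({"X": [], "Y": ["X"]}, data, cards)
without_edge = log_bd({"X": [], "Y": []}, data, cards)
print(with_edge > without_edge)  # compares the structure with and without X -> Y
```

A score-based search would evaluate many candidate structures in this way and keep the one with the highest score.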

Results

The data, which are in Affymetrix format, were loaded into R 3.6.2. Normalized data were available in the GEO database, so no preprocessing for normalization was needed. Data with outliers were removed, and the gene expression values range from 6.42 to 15.02. As an example of the normality check, Fig. 2 shows the Cullen-Frey graph for one of the genes.

The distribution is selected based on the skewness and kurtosis of the observations. The blue point marks the kurtosis and skewness coefficients of the observed data. Among candidate distributions such as the logistic, uniform, and exponential, the closest to the observed data is the normal distribution. PLS variable selection, which is widely used for microarray data, was combined with the VIP index to identify essential and significant genes; 10788 genes were selected (Fig. 3).

In this study, the plaid algorithm was used to create homogeneous biclusters, and nine biclusters were obtained. The results show that the discovered clusters are significant (p < 0.001). To evaluate the ability of the algorithm to find biclusters, and whether the gene sets in the clusters are biologically meaningful, we used gene ontology interpretation. This analysis shows that the discovered clusters are highly significant on average (more than 70%), so the clustering results can be considered reliable. The average biclustering performance was 86.32% for biological processes, 83.47% for molecular function, and 81.26% for cellular components. Based on the scree plot and the criterion of an eigenvalue greater than one for each PCA, one component (the first) is sufficient to represent each cluster, and the analysis can be based on it. The members were divided into two categories, and accuracy was evaluated by ROC analysis. The results are shown in Table 1.
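As a minimal sketch of how a bicluster can be reduced to one representative, the following Python function computes the first principal component of a samples-by-genes matrix by power iteration and reports the share of variance it explains. The function name, the toy matrix, and the use of the covariance matrix of centered but unscaled data are assumptions made for illustration. The study itself used R, and the eigenvalue-greater-than-one rule is usually applied to standardized variables.

```python
def first_component(rows, n_iter=1000, tol=1e-12):
    """Return (loadings, scores, explained_ratio) of the first principal component.

    rows: list of samples, each a list of expression values for the genes
          of one bicluster (hypothetical input layout).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    x = [[r[j] - means[j] for j in range(p)] for r in rows]  # center each gene
    cov = [[sum(xi[a] * xi[b] for xi in x) / (n - 1) for b in range(p)]
           for a in range(p)]

    def matvec(v):
        return [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]

    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0 / p ** 0.5] * p
    for _ in range(n_iter):
        w = matvec(v)
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0.0:
            break
        w = [c / norm for c in w]
        converged = max(abs(w[a] - v[a]) for a in range(p)) < tol
        v = w
        if converged:
            break

    eigenvalue = sum(v[a] * c for a, c in enumerate(matvec(v)))
    total_variance = sum(cov[a][a] for a in range(p))
    scores = [sum(xi[a] * v[a] for a in range(p)) for xi in x]
    return v, scores, eigenvalue / total_variance


# Hypothetical toy bicluster: 4 samples, 2 perfectly correlated genes.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, scores, ratio = first_component(toy)
print(round(ratio, 6))
```

The `scores` vector holds one value per sample and plays the role of the cluster representative passed on to the classifier.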

Table 1

Accuracy and cut point of the PCA representative of each bicluster.
Gene representative	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9
Accuracy	0.811	0.813	0.781	0.788	0.794	0.795	0.846	0.885	0.817
Cut point	51.7	50.06	50.3	48.26	49.8	49.07	49.51	51.46	51.66
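The accuracy of a cut point, as reported in Table 1, can be illustrated with a short Python sketch. The rule used here, predicting the positive class when the representative score is at or above the cut point and choosing the cut point with the highest accuracy, is an assumption; the study does not state how its cut points were chosen. The function names and toy values are hypothetical.

```python
def accuracy_at_cutpoint(scores, labels, cutpoint):
    """Share of samples classified correctly when score >= cutpoint predicts class 1."""
    correct = sum((s >= cutpoint) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def best_cutpoint(scores, labels):
    """Scan the observed scores as candidate cut points and keep the most accurate."""
    candidates = sorted(set(scores))
    return max(((c, accuracy_at_cutpoint(scores, labels, c)) for c in candidates),
               key=lambda pair: pair[1])


# Hypothetical representative scores and class labels for six patients.
toy_scores = [48.0, 49.5, 50.2, 51.7, 52.3, 53.0]
toy_labels = [0, 0, 1, 0, 1, 1]
cut, acc = best_cutpoint(toy_scores, toy_labels)
print(cut, round(acc, 3))
```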

First, the network structure was learned using the BDe (Bayesian Dirichlet equivalent) score to fit the Bayesian network model to the data. The parameters of this structure were then estimated by maximum likelihood (Fig. 4).
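Once a structure is fixed, maximum likelihood estimation of the parameters reduces to relative frequencies within each parent configuration. The Python sketch below illustrates this under the assumption of discretized variables; the variable names `PC8` and `response`, the function name, and the toy records are hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict


def fit_cpt(child, parents, data):
    """Maximum-likelihood conditional probability table.

    Estimates P(child = k | parents = j) as n_ijk / n_ij, where the counts
    are taken over the rows of data (a list of dicts).
    """
    n_ij = Counter(tuple(row[p] for p in parents) for row in data)
    n_ijk = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    cpt = defaultdict(dict)
    for (j, k), n in n_ijk.items():
        cpt[j][k] = n / n_ij[j]
    return dict(cpt)


# Hypothetical toy records: a discretized cluster representative and a class label.
records = (
    [{"PC8": "high", "response": "yes"}] * 3
    + [{"PC8": "high", "response": "no"}] * 1
    + [{"PC8": "low", "response": "no"}] * 4
)
cpt = fit_cpt("response", ["PC8"], records)
print(cpt[("high",)]["yes"])
```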

The data were divided into training and test sets to reduce bias and overfitting. Finally, the model's accuracy was evaluated based on ROC curve analysis and on sensitivity and specificity values (Table 2).

Table 2

DBN model performance in the training and test sets.
Data set	AUC	AUC lower CI	AUC upper CI	Sensitivity	Specificity	Precision
Train (70%)	0.860	0.731	0.988	0.769	0.950	0.910
Test (30%)	0.862	0.730	0.986	0.772	0.934	0.910
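The quantities in Table 2 can be computed from predicted scores and true labels as in the following Python sketch. It uses the rank-based definition of the AUC, the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, and a fixed threshold of 0.5 for the confusion-matrix metrics. The function names, threshold, and toy values are assumptions for illustration; the study used the pROC package in R.

```python
def auc(scores, labels):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 1/2).

    Assumes at least one positive and one negative label.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(predicted, labels):
    """Sensitivity, specificity, and precision from binary predictions."""
    pairs = list(zip(predicted, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical predicted probabilities and true classes for six patients.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_predicted = [1 if s >= 0.5 else 0 for s in toy_scores]
model_auc = auc(toy_scores, toy_labels)
metrics = confusion_metrics(toy_predicted, toy_labels)
print(round(model_auc, 3), metrics)
```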

Discussion

This research consists of several parts. First, essential genes were identified and selected using PLS regression. Next, homogeneous biclusters were formed using the plaid algorithm, and gene ontology interpretation was used to evaluate the algorithm's performance. A representative component was then selected for each bicluster using PCA. This yielded nine gene representatives, one from each bicluster, which are essential variables that can predict the disease based on gene expression alone.

Finally, possible dependencies between genes were modeled using a Bayesian network. Since the study data had already been normalized and noise removed, there was no need to preprocess the data.

In 2011, Ji et al. used the PLS method to select and identify specific genes in acute myeloid leukemia and acute lymphoblastic leukemia datasets and in SRBCT datasets. They showed that PLS can effectively remove collinearity between variables and has strong predictive power and efficiency in small samples with a large number of genes. They used the VEG, IEG, and VIP methods for this purpose. In the BD gene expression dataset, we used the VIP index to identify critical and significant genes, and 10,788 of the 47,323 genes were selected.

As mentioned before, biclustering was proposed to overcome the limitations and to find suitable patterns among genes within a more flexible computational framework (21). The plaid algorithm was developed in 2002 by Lazzeroni and Owen (22) to analyze gene expression data. A study performed in 2016 (23) evaluated the ability of the plaid algorithm to identify biologically important groups of genes through gene ontology (GO). The ontology analysis showed that the biological importance of each bicluster is high, especially when the clusters are large. The present study used the plaid algorithm to identify gene expression patterns and form clusters.
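For reference, the plaid model of Lazzeroni and Owen (22) can be written in the following general form. The symbols are assumptions introduced here for illustration: $Y_{ij}$ is the expression of gene $i$ in sample $j$, $\mu_{0}$ is a background level, $\mu_{k}$, $\alpha_{ik}$, and $\beta_{jk}$ are the layer, gene, and sample effects of bicluster $k$, $\rho_{ik}$ and $\kappa_{jk}$ are 0/1 indicators of gene and sample membership in bicluster $k$, and $\varepsilon_{ij}$ is an error term.

$$Y_{ij}=\mu_{0}+\sum_{k=1}^{K}\left(\mu_{k}+\alpha_{ik}+\beta_{jk}\right)\rho_{ik}\kappa_{jk}+\varepsilon_{ij}$$

In this study, $K=9$ biclusters were found.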

PCA is a statistical method for identifying the variables that explain the variance in a multidimensional data set. It can simplify the analysis and visualization of such data and is a valuable way to summarize them. The method reduces the dimension of the initial data set by finding a new set of variables that are uncorrelated and fewer than the original variables while preserving most of the information in the data. Raychaudhuri et al. used this method to summarize gene expression data for 6118 genes of the yeast Saccharomyces cerevisiae over seven time points. Their study showed that the data could be summarized in two components: the first two principal components accounted for more than 90% of the total variation (24). In this study, PCA yielded nine components as gene representatives.

We combined correlation network analysis and Bayesian networks to model interactions among thousands of genes in a network. Our model specifies the association between gene representatives and treatment type. Biological processes in a cell often require coordination among several genes, and network analysis can identify subtle but coordinated changes in interacting and functionally related genes. Network analysis therefore has advantages over conventional approaches based on a list of differentially expressed genes. Specifically, a Bayesian network models the interaction among a large number of genes based on their correlation pattern. Our method differs from other Bayesian network methods, in which each random variable represents a single gene. Since the number of training samples in real applications is usually limited to a few hundred, those methods are generally unsuitable for modeling thousands of genes in a Bayesian network. One solution is to filter the genes before learning the Bayesian network, which is inefficient because information is lost. Genes in a gene pattern are highly correlated and generally contribute to the same biological processes. Network analysis helps extract informative biomarkers (features) from gene expression profiles. In particular, we showed that gene patterns have greater predictive power than a large number of single genes. A Bayesian network can be fitted to these data to model the relationship between gene sets and biological or clinical conditions.

Conclusions

As applications multiply across many fields, the amount of data worldwide is growing dramatically. These data must be analyzed to extract essential and valuable information, but their sheer volume makes analysis difficult, so methods are needed to reduce their complexity. Dimensionality reduction, or the selection of essential features, is one of the most widely used ways to achieve this. Many algorithms exist for the task, each with advantages and disadvantages. In recent years there has been an explosion of research on ensemble methods for classification and estimation, with encouraging results. Combining models produces more robust and accurate models than single models (25). Network analysis is also a desirable approach for detecting subtle but coordinated changes in the mutual and related expression of a set of genes. Because of the large volume of gene expression data, our model consists of several steps and, as mentioned, ensemble models perform better when the number of variables is large. We therefore suggest that a combined algorithm be considered for classifying gene expression data.

network meta-analysis: NMA

Preferred Reporting Items for Systematic Reviews and Meta Analyses: PRISMA-NMA

Danggui Buxue Tang: DBT

Cochrane Central Register of Controlled Trials: CENTRAL

confidence interval: CI

Just Another Gibbs Sampler: JAGS

Bayesian inference Using Gibbs Sampling to conduct a Network meta-analysis: BUGSnet

Deviance Information Criterion: DIC

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Availability of data and materials

All data generated or analyzed during this study are included in this published article

Funding

No funding was received for this research, including the design of the study; the collection, analysis, and interpretation of data; and the writing of the manuscript.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Shahsavari S and Souri pilangorgi S performed the analysis, and Shahsavari S, Salari N, and Almasi A interpreted the results. Shahsavari S, Salari N, and Almasi A drafted the paper. All authors read and approved the final manuscript.

Acknowledgment

The authors would like to express their appreciation for the financial support of the Research and Technology Department of Kermanshah University of Medical Sciences.

References

1. Sadock BJ, Sadock VA, Ruiz P. Comprehensive textbook of psychiatry. Philadelphia: Lippincott Williams & Wilkins; 2000.
2. Bauer M, Pfennig A. Epidemiology of bipolar disorders. Epilepsia. 2005;46:8–13.
3. Lopez AD, Murray CC. The global burden of disease, 1990–2020. Nat Med. 1998;4(11):1241–3.
4. Tsai S-YM, Kuo C-J, Chen C-C, Lee H-C. Risk factors for completed suicide in bipolar disorder. J Clin Psychiatry. 2002;63(6):469–76.
5. Vieta E, Benabarre A, Colom F, Gastó C, Nieto E, Otero A, et al. Suicidal behavior in bipolar I and bipolar II disorder. J Nerv Ment Dis. 1997;185(6):407–9.
6. Hakkaart-van Roijen L, Hoeijenbos M, Regeer EJ, Ten Have M, Nolen W, Veraart C, et al. The societal costs and quality of life of patients suffering from bipolar disorder in the Netherlands. Acta Psychiatrica Scandinavica. 2004;110(5):383–92.
7. Havermans R, Nicolson NA, Devries MW. Daily hassles, uplifts, and time use in individuals with bipolar disorder in remission. J Nerv Ment Dis. 2007;195(9):745–51.
8. McMorris BJ, Downs KE, Panish JM, Dirani R. Workplace productivity, employment issues, and resource utilization in patients with bipolar I disorder. J Med Econ. 2010;13(1):23–32.
9. Morselli P, Elgie R, Cesana B. GAMIAN-Europe/BEAM survey II: cross-national analysis of unemployment, family history, treatment satisfaction and impact of the bipolar disorder on life style. Bipolar Disord. 2004;6(6):487–97.
10. Zhang H, Wisniewski SR, Bauer MS, Sachs GS, Thase ME, STEP-BD Investigators. Comparisons of perceived quality of life across clinical states in bipolar disorder: data from the first 2000 Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD) participants. Compr Psychiatr. 2006;47(3):161–8.
11. Conrady S, Jouffe L. Introduction to Bayesian networks & BayesiaLab. Bayesia SAS. 2013.
12. Xing L, Guo M, Liu X, Wang C, Wang L, Zhang Y. An improved Bayesian network method for reconstructing gene regulatory network based on candidate auto selection. BMC Genomics. 2017;18(9):17–30.
13. Agrahari R, Foroushani A, Docking TR, Chang L, Duns G, Hudoba M, et al. Applications of Bayesian network models in predicting types of hematological malignancies. Sci Rep. 2018;8(1):1–12.
14. Scutari M, Denis J-B. Bayesian networks: with examples in R. Chapman and Hall/CRC; 2021.
15. Scutari M. Dirichlet Bayesian network scores and the maximum relative entropy principle. Behaviormetrika. 2018;45(2):337–62.
16. Beech R, Leffert J, Lin A, Sylvia L, Umlauf S, Mane S, et al. Gene-expression differences in peripheral blood between lithium responders and non-responders in the Lithium Treatment-Moderate dose Use Study (LiTMUS). Pharmacogenomics J. 2014;14(2):182–91.
17. Frank IE. Intermediate least squares regression method. Chemometr Intell Lab Syst. 1987;1(3):233–42.
18. Chun H, Keleş S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J Royal Stat Society: Ser B (Statistical Methodology). 2010;72(1):3–25.
19. Homayoun S-B, Shrikant IB, Kazem M, Hemmat M, Reza M. Compared application of the new OPLS-DA statistical model versus partial least squares regression to manage large numbers of variables in an injury case-control study. Sci Res Essays. 2011;6(20):4369–77.
20. de Campos C, Ji Q. Properties of Bayesian Dirichlet scores to learn Bayesian network structures. Proceedings of the AAAI Conference on Artificial Intelligence; 2010.
21. Gan X, Liew AW-C, Yan H. Discovering biclusters in gene expression data based on high-dimensional linear geometries. BMC Bioinformatics. 2008;9(1):1–15.
22. Lazzeroni L, Owen A. Plaid models for gene expression data. Statistica Sinica. 2002:61–86.
23. Alavi Majd H, Shahsavari S, Baghestani AR, Tabatabaei SM, Khadem Bashi N, Rezaei Tavirani M, et al. Evaluation of Plaid Models in Biclustering of Gene Expression Data. Scientifica. 2016;2016.
24. Raychaudhuri S, Stuart JM, Altman RB. Principal components analysis to summarize microarray experiments: application to sporulation time series. Biocomputing 2000: World Scientific; 1999. p. 455–66.
25. Sagi O, Rokach L. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2018;8(4):e1249.
