Anticancer drug synergy prediction based on CatBoost

doi:10.21203/rs.3.rs-3652163/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Background

Cancer treatment has long been a major research topic in the medical field. Monotherapy, although common, has well-documented disadvantages such as toxicity and drug resistance. With the development of network pharmacology, multi-target drug combinations have become a promising option for cancer treatment. Because the number of potential drug combinations is enormous, it is not feasible to explore the complete combinatorial space through clinical experience or high-throughput screening alone. Machine learning models offer a way to explore this space efficiently.

Results

In this work, we propose a machine learning method based on CatBoost to predict the synergy scores of anticancer drug combinations on cancer cell lines. The method uses oblivious trees and the Ordered Boosting technique to reduce overfitting and bias. The model was trained and tested on data screened from the NCI-ALMANAC dataset. Drugs were characterized by Morgan fingerprints, drug target information, and monotherapy information, and cell lines were described by gene expression profiles. In stratified five-fold cross-validation, our method outperformed three other models (a deep neural network, XGBoost, and logistic regression) on all evaluation metrics. Additionally, when SHAP was used to interpret the biological significance of the predictions, genes with known associations with cancer occurrence were found to play an important role in the predictions.

Conclusions

The CatBoost-based model predicts drug synergy well and could be considered as an option for anticancer drug combination research.

drug synergy

anticancer

CatBoost

prediction model

Cancer is a major threat to human health. Chemotherapy has long been a common strategy for cancer treatment, but it is associated with many side effects[1]. Reports have shown that cancer monotherapy often suffers from limited efficacy, poor safety, and drug resistance[2, 3]. Drug resistance and side effects are the main reasons for the failure of cancer chemotherapy. At the same time, progress in cancer drug research and development has slowed, and the cost of developing new drugs has risen[4]. Improving the efficiency and reducing the cost of drug research and development is therefore a major challenge.

With the rise of network pharmacology, multi-target drug combinations have become a new research direction[5–7]. A drug combination may have a greater or lesser effect on cancer cells than the additive sum of the individual drug effects, i.e., a synergistic or antagonistic effect[1]. Synergistic drug combinations usually require lower doses than single drugs, with improved efficacy and reduced toxicity. They can also delay the development of drug resistance. Therefore, drug combinations with synergistic effects may be ideal therapeutic regimens for cancer[8, 9]. Finding synergistic drug combinations for specific cancer types is important for improving the efficacy of anticancer therapy[10, 11].

Effective drug combinations can be proposed based on clinical experience, but this approach is slow and costly relative to the benefit it yields. Another strategy to identify synergistic drug combinations is high-throughput screening (HTS)[12]. HTS can yield a large number of experimental results in a reasonable time and at a comparatively low cost, making it the preferred choice for discovering effective drug combinations. However, finding new effective combinations remains a complex task because the number of possible combinations is very large and grows each time a new drug is developed. It is clearly not feasible to cover the complete combination space with HTS.

As data from clinical experience and high-throughput screening accumulate, opportunities arise for the large-scale application of machine learning methods. For example, AstraZeneca, a leading pharmaceutical company, partnered with several organizations to launch a drug combination prediction challenge in the DREAM community, providing participants with 11,576 synergy data points derived from 910 drug combination experiments, involving 118 drugs and 85 cancer cell lines[13]. In 2017, the National Cancer Institute (NCI) released the largest publicly available cancer drug combination dataset, ALMANAC, which contains synergy measurements for combinations of 104 drugs in the 60 cancer cell lines of the NCI-60 panel[14].

Based on these data resources, several machine learning algorithms for anticancer drug combination prediction have been proposed. For instance, Li et al.[15] used the DREAM data to predict drug combination synergy with a random forest model based on drug-target networks and gene expression profiles. Li et al.[16] proposed a network propagation method that simulates molecular features from gene-gene networks and drug-target information, and combined these features with single-drug treatment data to train a random forest classifier for anticancer drug synergy prediction. Janizek et al.[17] proposed a method based on Extreme Gradient Boosting (XGBoost) to predict drug combination synergy. Celebi et al.[18] also proposed an XGBoost-based approach to predict anticancer drug combinations using multi-omics data, adding targeted pathways and monotherapy information to the feature space. Sidorov et al.[19] used XGBoost as well as random forest to build a separate model for each cell line for the prediction of synergistic effects of anticancer drug combinations. Jeon et al.[20] proposed a method based on extremely randomized trees (ERT) for predicting anticancer drug combinations. Li et al.[21] used logistic regression to test the statistical significance of gene and pathway features in predicting the synergy of anticancer drug combinations. Julkunen et al.[22] proposed a prediction method called ComboFM, which models the two drugs, the cell lines, and the auxiliary drug and cell line features as a fifth-order tensor and predicts the response of drug pairs using higher-order factorization machines (HOFM).

With the development of deep learning, a growing number of deep learning models have been built for drug synergy prediction. Preuer et al.[23] proposed DeepSynergy, a three-layer feedforward neural network that takes genomic information and drug chemical features as input. They used a normalization strategy to account for the heterogeneity of the input features, and a conical layer architecture to predict drug synergy. Zhang et al.[24] proposed DeepSignalingSynergy. Instead of considering a large number of chemical and genomic features, the authors used only a small number of cancer signaling pathways in order to investigate the importance of individual signaling pathways for prediction. Zhang et al.[25] proposed AuDNNsynergy, which predicts drug combination synergy by integrating multi-omics and chemical structure data. Kim et al.[26] developed a drug synergy prediction model based on multitask deep neural networks that integrate multimodal inputs and outputs built from multiple cell line features, and used transfer learning to carry knowledge from data-rich tissues to data-poor tissues. Recently, Wang et al.[27] proposed the deep learning model PRODeepSyn, which uses graph convolutional neural networks to integrate protein-protein interaction (PPI) networks and omics data into low-dimensional embeddings of cell lines; these embeddings are fed into deep neural networks together with the drug features to calculate drug synergy scores. Similarly, Hu et al.[28] proposed DTSyn to understand the mechanism of drug synergy from the perspective of chemical-gene-tissue interactions.

In this work, we present a CatBoost-based machine learning approach to predict the synergy scores of anticancer drug combinations. CatBoost is a Gradient Boosted Decision Tree (GBDT) framework built on symmetric (oblivious) decision trees; it has relatively few parameters, supports categorical variables, and achieves high accuracy. CatBoost has been widely used in the biomedical field for various tasks. In a recent study, Pudova et al.[29] used the CatBoost algorithm to identify cancer-related microRNAs. Jinchao et al.[30] proposed a prediction model called CatBoost-SubMito for protein submitochondrial location prediction. In addition, Bouget et al.[31] used the CatBoost algorithm to predict patients' responses to tumor necrosis factor inhibitors. CatBoost is thus already playing a useful role in biomedical research.

In this paper, the performance of CatBoost was evaluated using stratified five-fold cross-validation. We found that CatBoost outperformed models based on Deep Neural Networks (DNN), XGBoost, and Logistic Regression on all metrics. In addition, the interpretation package Shapley additive explanations (SHAP) was used to interpret the biological significance of the prediction results. Most of the top-ranked genes contributing to the CatBoost predictions were found to be associated with known cancer mechanisms.

Datasets

We used the NCI-ALMANAC dataset as the source of anticancer drug synergy data[14]. NCI-ALMANAC contains data for 60 cancer cell lines; we considered only drugs that have at least one target gene (68 drugs). A total of 130,182 samples were used for model training and testing. The characteristics of the NCI-60 cancer cell lines (expression, mutations, copy number variation, etc.) can be obtained from CellMinerCDB[32]. Drug target data were provided by DrugBank[33], and the drug molecular properties can be processed using the RDKit package in Python. Cell line characteristics generally include gene expression profiles, mutation numbers, copy numbers, etc., while drug characteristics usually include Morgan fingerprints, drug target information, monotherapy information, etc. Following the studies of Janizek et al., Celebi et al., and Preuer et al.[17, 18, 23], we used gene expression profiles as cell line features, and Morgan fingerprints, drug target information, and monotherapy information as drug features. Using Python's pandas library, we removed features whose values were identical across all samples. The resulting dataset consisted of 130,182 samples with 2064 feature columns, of which 470 were cell line features and 1594 were drug features.
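The filtering of features that carry no differences between samples can be illustrated with a small sketch. It is written in plain Python on a toy table so that it runs without dependencies; the column names and values are hypothetical and are not taken from the dataset. With pandas, the same idea is commonly written as `df.loc[:, df.nunique() > 1]`.

```python
def drop_constant_columns(rows, names):
    """Keep only the columns whose values differ between at least two samples."""
    keep = [j for j in range(len(names))
            if len({row[j] for row in rows}) > 1]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names

# Hypothetical toy table: 3 samples, 4 features.
names = ["gene_A", "fp_bit_0", "fp_bit_1", "target_X"]
rows = [
    [0.7, 0, 1, 1],
    [1.2, 0, 0, 1],
    [0.3, 0, 1, 1],
]

filtered_rows, filtered_names = drop_constant_columns(rows, names)
print(filtered_names)  # ['gene_A', 'fp_bit_1']
```

Here `fp_bit_0` and `target_X` are constant across the three samples and are removed, while the two varying columns are kept.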

CatBoost

Based on the constructed features, this work used the GBDT framework CatBoost to predict drug synergy (Fig. 1). As an oblivious-tree-based learner, CatBoost has few parameters, supports categorical variables, and achieves high accuracy. Algorithm 1 gives the pseudo-code of the GBDT algorithm.

Algorithm 1

The Gradient Tree Boosting Algorithm
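The body of Algorithm 1 is not reproduced in this text. As a rough illustration of the general gradient tree boosting idea only, the following sketch fits depth-1 regression trees to the residuals of a squared-error loss on a single feature. The function names, the toy data, and the choice of squared error are assumptions made for this example; this is not CatBoost's implementation.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: one threshold and two leaf means."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds (needs 2+ distinct x values)
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Additive model: start from the mean, then repeatedly fit the residuals."""
    f0 = sum(ys) / len(ys)
    trees = []
    preds = [f0] * len(ys)
    for _ in range(n_trees):
        # For squared error, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(tree(x) for tree in trees)

# Hypothetical toy data with a step between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)

mse_const = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
mse_boost = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_const, mse_boost)
```

Each round shrinks the training error a little; the learning rate controls how much of each new tree is added to the ensemble.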

Compared with traditional decision trees, oblivious trees can better handle continuous attributes and classification problems in high-dimensional spaces, which improves the predictive ability of the model. Because an oblivious tree can ignore some attributes entirely, it is less sensitive to noise and missing values in the data. In addition, oblivious trees are symmetric: the same splitting condition is applied to every node at a given depth. Combined with limits on tree depth and the number of branches, this effectively reduces overfitting and improves the model's generalization ability. CatBoost also uses Ordered Boosting, a permutation-driven approach that trains the model on one subset of the data while computing the residuals on another subset, which prevents target leakage and overfitting.
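The symmetric structure of an oblivious tree can be made concrete with a short sketch. Because every level uses a single (feature, threshold) test, a tree of depth d is fully described by d tests and 2^d leaf values, and a prediction is a table lookup. The function name, the split list, and the leaf values here are hypothetical.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Evaluate an oblivious tree.

    splits: one (feature_index, threshold) pair per level of the tree.
    leaf_values: list of 2 ** len(splits) leaf outputs.
    """
    index = 0
    for feature, threshold in splits:
        bit = 1 if x[feature] > threshold else 0
        index = index * 2 + bit  # each level contributes one bit of the leaf index
    return leaf_values[index]

# Hypothetical depth-2 tree over three features; feature 1 is never consulted.
splits = [(0, 0.5), (2, 1.0)]
leaf_values = [-1.0, 0.5, 2.0, 4.0]

print(oblivious_tree_predict([0.9, 7.0, 0.2], splits, leaf_values))  # 2.0
```

In this toy tree the second feature is ignored completely, so noise in that feature cannot change the output.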

In a nutshell, CatBoost has the following characteristics: 1. Performance: it is competitive with, and often better than, other advanced machine learning algorithms; 2. Robustness: it reduces the need for extensive hyperparameter tuning and lowers the chance of overfitting, which makes the model more versatile; 3. Practicality: it can handle both categorical and numerical features; 4. Extensibility: it supports customized loss functions.

Experimental setup

To estimate how well CatBoost generalizes to unseen data, we tested the model with stratified k-fold cross-validation. Stratified k-fold cross-validation is an enhanced version of k-fold cross-validation for unbalanced datasets. The whole dataset is divided into k equal-sized folds, and the ratio of positive to negative labels in each fold is the same as in the whole dataset. This work used stratified five-fold cross-validation to evaluate model performance. The dataset was divided into five equal, non-overlapping parts, of which four were used as training data and the remaining one as test data. Each part served as test data in turn, and the synergy scores of its samples were predicted with the model learned from the other four parts.
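The splitting scheme described above can be sketched in plain Python as follows. The function name, the round-robin assignment, and the toy labels (20% positives) are assumptions made for illustration; library implementations such as scikit-learn's `StratifiedKFold` provide the same behavior.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Return k lists of sample indices with near-identical class ratios."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)  # deal each class round-robin over the folds
    return folds

# Hypothetical binary labels: 10 positives and 40 negatives.
labels = [1] * 10 + [0] * 40
folds = stratified_kfold(labels, k=5)

for fold_id, test_idx in enumerate(folds):
    # Every fold is the test set once; the other four folds form the training set.
    train_idx = [i for j, fold in enumerate(folds) if j != fold_id for i in fold]
    positives = sum(labels[i] for i in test_idx)
    print(fold_id, len(train_idx), len(test_idx), positives)
```

With these toy labels, every test fold holds 10 samples, 2 of them positive, which matches the 20% positive rate of the full set.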

Using stratified five-fold cross-validation, we also tuned the parameters of CatBoost to obtain the best model. The main parameters we tuned were the number of iterations, chosen from [400, 500, 600, 700, 800, 900, 1000], the depth, chosen from [5, 6, 7, 8, 9, 10, 11], and the learning rate, chosen from [0.05, 0.5, 0.1]. By comparing the prediction results under each parameter combination, the optimal combination was set as: iterations = 600, depth = 9, learning rate = 0.1.
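The search space above can be enumerated explicitly. The sketch below lists every parameter combination and picks the one with the highest score from a user-supplied scoring function. The dictionary keys follow CatBoost's parameter names (`iterations`, `depth`, `learning_rate`); `select_best` and its `cv_score` argument are hypothetical helpers, and the scoring itself (training and cross-validating a model) is deliberately left out.

```python
from itertools import product

param_grid = {
    "iterations": [400, 500, 600, 700, 800, 900, 1000],
    "depth": [5, 6, 7, 8, 9, 10, 11],
    "learning_rate": [0.05, 0.5, 0.1],
}

# Cartesian product of all candidate values, one dict per combination.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
print(len(combinations))  # 147

def select_best(combinations, cv_score):
    """Return the combination with the highest cross-validated score.

    cv_score is any callable mapping a parameter dict to a number,
    for example the mean ROC AUC over five stratified folds.
    """
    return max(combinations, key=cv_score)
```

An exhaustive search over this grid therefore requires 147 cross-validation runs, each of which trains five models.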

Drug synergy prediction is usually formulated in one of two ways: as a regression task or as a classification task. In this paper, we considered both settings. When synergy prediction was treated as a classification task, the synergy scores were binarized, i.e., synergy was labeled as 1 and antagonism as 0. Since many samples have synergy scores close to 0, it is crucial to choose an appropriate threshold for binarization. In this work, we optimized the threshold by stratified five-fold cross-validation to achieve the best balance, and the threshold was finally set to 10.
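A minimal sketch of the binarization step is given here. The text does not state whether a score exactly equal to the threshold counts as synergistic, so the use of `>=` is an assumption, and the example scores are invented for illustration.

```python
def binarize(scores, threshold=10.0):
    """Label a sample 1 if its synergy score reaches the threshold, else 0.

    Whether the comparison is >= or > is an assumption of this sketch.
    """
    return [1 if s >= threshold else 0 for s in scores]

# Hypothetical synergy scores, many of them close to 0.
scores = [-25.0, -3.0, 0.5, 4.0, 9.9, 10.0, 18.0, 42.0]
labels = binarize(scores)
positive_fraction = sum(labels) / len(labels)

print(labels)             # [0, 0, 0, 0, 0, 1, 1, 1]
print(positive_fraction)  # 0.375
```

Raising the threshold lowers the fraction of positive labels, which is why the threshold has to be tuned together with the class balance.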

To evaluate the prediction performance of CatBoost, we compared its synergy prediction ability with that of the following models: 1. DNN, a deep learning method based on feedforward neural networks; 2. XGBoost, a classical ensemble learning method based on gradient boosting; 3. Logistic Regression, a generalized linear model based on the sigmoid function. The hyperparameters of these models were determined by stratified five-fold cross-validation.

Performance evaluation

We measured the classification performance of our model in terms of the area under the receiver operating characteristic curve (ROC AUC) and the area under the precision-recall curve (PR AUC). To further characterize the prediction performance of CatBoost, we also report typical metrics for regression tasks: mean square error (MSE) and Pearson correlation coefficient (PCC). After carrying out stratified five-fold cross-validation with the optimal parameter setting mentioned above, we computed the four metrics for our model and for the comparison methods. The results are shown in Table 1: CatBoost outperformed the other three models on all metrics. For example, CatBoost achieved a ROC AUC of 0.9217, which is 0.0099 higher than that of the next best model, DNN; its MSE of 0.1365 is about 5% lower than that of the next best model, XGBoost (0.1439).
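The four metrics can be written down compactly. The sketch below uses plain Python and tiny invented inputs; it is meant to show the definitions, not to reproduce the reported numbers. Two choices are assumptions of this example: PR AUC is estimated by average precision, and ROC AUC is computed by direct pair counting, which is simple but too slow for a dataset with more than 100,000 samples.

```python
import math

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical labels and predicted scores for four samples.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, y_score))                      # 0.75
print(round(average_precision(y_true, y_score), 4))  # 0.8333
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))     # 1.0
```

ROC AUC and PR AUC are computed on the binarized labels, whereas MSE and PCC compare the predicted values with the actual ones directly.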

Table 1

Comparison of the performance results of the four models (New CatBoost denotes CatBoost trained on the top 400 SHAP-ranked features)
Models	ROC AUC	PR AUC	MSE	PCC
CatBoost	0.9217	0.4651	0.1365	0.5335
New CatBoost	0.9208	0.4707	0.1360	0.5379
DNN	0.9118	0.3876	0.1441	0.4761
XGBoost	0.8856	0.3601	0.1439	0.4552
Logistic Regression	0.8505	0.1945	0.1534	0.3101

To further investigate the quality of the CatBoost predictions, we compared the distributions of predicted and actual synergy scores on each cell line (Fig. 2). Panel (a) shows box plots of the actual synergy scores for each cell line, and panel (b) shows box plots of the synergy scores predicted by CatBoost. Each point represents the actual or predicted synergy score for a drug combination. To understand how well our model predicted synergy on individual cell lines, we also computed the average PCC between predicted and actual synergy scores for each cell line (Fig. 2(c)). The order of the cell lines in panel (c) is the same as in panels (a) and (b). Panel (c) shows that the PCC across cell lines ranged from 0.52 to 0.83.

Additionally, we used the PCC to analyze the performance of CatBoost from two viewpoints: anticancer drugs and cancer cell lines. Figure 3 gives the PCC distribution over the 68 drugs, where each bar represents the average PCC over all combinations containing the drug named below the bar. The color of each bar corresponds to the drug target. As shown in Fig. 3, the PCC values for the anticancer drugs ranged from 0.51 to 0.88; 31% of the drugs had PCC values below 0.6 and 39% had values above 0.7. The bar colors are scattered, which means there is no clear association between PCC and drug target. Therefore, the differences in performance between drugs could not be explained by target-based mechanisms. Moreover, the number of drugs acting on the same target did not affect the performance.

The PCC distribution over cell lines is illustrated in Fig. 4. Each bar is the average PCC between actual and predicted synergy scores for a cell line. The cell line names are shown on the x-axis. The color of each bar corresponds to the type of tissue from which the cell line originated. The PCC values for the cell lines vary from 0.52 to 0.83, with only three cell lines below 0.6 and more than 44% of the cell lines above 0.7. Furthermore, the bar colors are scattered and no correlation between tissue type and PCC can be observed. Therefore, the performance differences could not be explained by the tissue type of the cell lines either.
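The per-drug and per-cell-line analysis amounts to grouping samples by a key and computing the PCC within each group. The sketch below does this for one set of predictions; averaging over the five cross-validation folds is omitted. The helper names and all numeric values are hypothetical (MCF7 and A549 are NCI-60 cell lines, used here only as labels).

```python
import math
from collections import defaultdict

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def pcc_per_group(groups, y_true, y_pred):
    """Compute the PCC between actual and predicted scores inside each group."""
    actual, predicted = defaultdict(list), defaultdict(list)
    for g, t, p in zip(groups, y_true, y_pred):
        actual[g].append(t)
        predicted[g].append(p)
    return {g: pearson(actual[g], predicted[g]) for g in actual}

# Invented scores for two cell lines, three drug combinations each.
cell_lines = ["MCF7", "MCF7", "MCF7", "A549", "A549", "A549"]
y_true = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y_pred = [1.5, 2.5, 3.5, 3.0, 2.0, 1.0]

result = pcc_per_group(cell_lines, y_true, y_pred)
print(result)
```

Passing drug names instead of cell line names as `groups` gives the per-drug view; a sample then contributes to the group of each drug in its combination if it is listed once per drug.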

Feature interpretability

While it is important for a model to predict which drug combinations are synergistic, it is also important to explain why the model makes these predictions. Explainable machine learning has become an important research direction in recent years. SHAP is a model interpretation package developed in Python that can interpret the output of machine learning models. Inspired by cooperative game theory, SHAP constructs an additive explanatory model in which all features are considered "contributors". For each sample, the model generates a prediction, and the SHAP value of a feature is the share of that prediction assigned to the feature.
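The additive idea behind SHAP can be illustrated by computing exact Shapley values for a tiny model by brute force. This is a simplified sketch: absent features are replaced by a single baseline value, whereas the SHAP package averages over background data and uses much faster algorithms for tree ensembles. The toy model, the baseline, and the function names are all assumptions of this example.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x, relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in the subset take their real value, the others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    # One additive term and one interaction term.
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # approximately [2.0, 3.0, 3.0]
```

The values add up to the difference between the prediction for `x` and the prediction for the baseline (8.0 here), and the interaction term is split evenly between the two features that produce it.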

In this work, SHAP values were calculated for all features of all samples, and the 100 features with the highest scores were obtained (see Additional file 1). Among the top 100 features, monotherapy information ranked first, suggesting that it was particularly useful for predicting anticancer drug combination synergy. Of these 100 features, 88 were drug features, among which Morgan fingerprint bits made up the largest proportion. The remaining 12 features were genes from the gene expression profiles, including PTK2, CCND1, GNA11, CRKL, ERBB2, WNT2B, and CTBP2. Drug features therefore appear to play a more prominent role in drug synergy prediction than cell line features. This conclusion is consistent with the findings of Janizek et al.[17].

For the top-ranked genes, we conducted a case study to further investigate the significance of our model. For instance, PTK2, also known as FAK, is associated with tumors and functions primarily in the inhibition of apoptosis[34]. Upregulation of FAK, including high protein expression as well as overactivation, is present in almost all tumor tissues, such as lung, gastric, colorectal, uterine, and melanoma cancers. According to previous studies[35, 36], patients with high FAK expression have significantly shorter survival than those with lower FAK expression, making FAK a promising indicator for early diagnosis of cancer and for predicting patient prognosis.

The gene CCND1 encodes cyclin D1, a cell cycle protein that plays a role in regulating CDK kinases[37]. This protein interacts with the tumor suppressor protein Rb, which in turn positively regulates the expression of this gene. Mutations, amplification, and overexpression of this gene alter the course of the cell cycle; they are frequently observed in a variety of tumors and may contribute to tumorigenesis[38].

ERBB2, or erbB-2, is a 185-kDa cell membrane receptor encoded by the proto-oncogene erbB-2[39]. Clinically, erbB-2 expression is strongly correlated with patient prognosis, and patients with high erbB-2 expression are prone to tumor metastasis and have short survival. Because of the significant difference in expression levels between normal and tumor cells, erbB-2 has become an ideal target for tumor immunobiotherapy and is currently an actively studied molecule in tumor therapy research[40].

Feature reduction

To further verify the impact of the top-ranked features with high SHAP values on model performance, we selected the top 400 features as input and trained a new CatBoost model, denoted New CatBoost. The prediction performance of New CatBoost is also given in Table 1. When only these 400 features are used, the performance of CatBoost changes little and still exceeds that of the other three models. Compared with the original CatBoost, New CatBoost is even slightly better in PR AUC, MSE, and PCC.
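The feature reduction step can be sketched as a ranking followed by a column selection. The text does not state how per-sample SHAP values were aggregated into one score per feature; ranking by the mean absolute SHAP value is a common choice and is the assumption made here. The function names and the small matrices are hypothetical.

```python
def mean_abs_shap(shap_matrix):
    """Average the absolute SHAP value of every feature over all samples."""
    n_samples = len(shap_matrix)
    n_features = len(shap_matrix[0])
    return [sum(abs(row[j]) for row in shap_matrix) / n_samples
            for j in range(n_features)]

def top_k_indices(importance, k):
    """Indices of the k features with the largest importance, best first."""
    order = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    return order[:k]

def select_columns(rows, indices):
    """Keep only the chosen feature columns of a feature matrix."""
    return [[row[j] for j in indices] for row in rows]

# Invented SHAP values: 2 samples, 3 features.
shap_matrix = [
    [0.5, -0.1, 0.2],
    [-0.7, 0.05, 0.1],
]
importance = mean_abs_shap(shap_matrix)
top = top_k_indices(importance, k=2)
print(top)  # [0, 2]

# Invented feature matrix reduced to the selected columns.
X = [[1.0, 0, 3.3], [2.0, 1, 4.4]]
X_reduced = select_columns(X, top)
print(X_reduced)  # [[1.0, 3.3], [2.0, 4.4]]
```

With `k=400` and the full feature matrix, the reduced matrix would be the input of the retrained model.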

We have proposed an ensemble learning approach based on CatBoost for drug synergy prediction, characterizing cancer cell lines with gene expression profiles and drugs with Morgan fingerprints, drug targets, and monotherapy information. The performance of this model was evaluated by stratified five-fold cross-validation. The CatBoost-based model obtained 0.9217, 0.4651, 0.1365, and 0.5335 for ROC AUC, PR AUC, MSE, and PCC, respectively, outperforming the other three machine learning methods used for comparison.

In addition, the model interpretation package SHAP was used to explain the input features, which resulted in a ranking of features. Drug features occupied most of the top 100 positions, suggesting that drug information was more important than cell line information for drug synergy prediction. Further study of the genes among the top 100 features showed that almost all of them had strong connections to various cancers.

Feature screening based on SHAP values was also performed in this work. We selected the 400 most important features to investigate the effect of the input features. The comparison showed that the top-ranked features contributed roughly as much as all features together, while reducing the time needed for model training and prediction.

The good performance of our model in predicting drug synergy may be attributed to several advantages. First, our model is a tree-based model. Compared with general DNN-based methods, models of this type are easier to build because they require less hyperparameter tuning and feature preprocessing.

Second, CatBoost uses oblivious trees and Ordered Boosting to reduce overfitting and improve execution efficiency. Compared with traditional decision trees, oblivious trees are compact, efficient to evaluate, and have strong predictive ability. Classical gradient boosting algorithms suffer from prediction bias and tend to overfit on small or noisy datasets: when computing gradient estimates for data instances, they use the same instances on which the model was built, so the estimates are never checked against unseen data. CatBoost instead uses Ordered Boosting, training the model and computing the residuals on different data subsets, which avoids bias in the gradient estimates, prevents target leakage, and reduces overfitting.
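The principle of Ordered Boosting can be conveyed with a deliberately simplified sketch. Here the "model" is just the running mean of the targets, and the residual of each sample is computed from a model that has seen only the samples placed before it in a random permutation. CatBoost itself works with tree ensembles and several permutations, so this is a conceptual illustration under these assumptions, not its actual algorithm; the function name and the `prior` argument are hypothetical.

```python
import random

def ordered_residuals(ys, seed=0, prior=0.0):
    """Residuals where each sample is predicted only from samples preceding it."""
    order = list(range(len(ys)))
    random.Random(seed).shuffle(order)  # random permutation of the samples
    residuals = [0.0] * len(ys)
    running_sum, count = 0.0, 0
    for i in order:
        # Predict sample i with the mean of the targets seen so far;
        # the sample's own target is not part of this prediction.
        prediction = running_sum / count if count else prior
        residuals[i] = ys[i] - prediction
        running_sum += ys[i]
        count += 1
    return residuals, order

# Hypothetical targets.
ys = [4.0, 8.0, 6.0, 2.0]
residuals, order = ordered_residuals(ys)
print(order, residuals)
```

In the classical scheme, the residual of a sample would be computed from a model that already contains that sample's own target, which is the source of the bias described above.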

Third, this work used SHAP to interpret the model output biologically, which made a more in-depth analysis of the features possible. The results indicate that the performance of CatBoost is mainly due to the most important features, and that the speed of CatBoost can be improved considerably by reducing the feature dimension. We expect that, even when faced with very large datasets in the future, the feature space can be reduced through SHAP to obtain a more stable CatBoost model.

Finally, the performance and training time of the CatBoost model are affected by the hyperparameter settings as well as the sample size. We believe that, as the amount of available data increases, CatBoost will become more accurate and stable.

Our model inevitably has some limitations for drug synergy prediction. For example, the synergy score may not be an ideal measure of the therapeutic effect of a drug combination, because it summarizes a wide range of concentrations, whereas in practice treatments at low concentrations perform better in the clinic. Additionally, when we analyzed the prediction performance of the model on different drugs and different cell lines, we found that the differences were not clearly correlated with tissue-specific or target-specific mechanisms. These differences may arise because some biological mechanisms are better modeled than others. The specific mechanism needs to be explored further in the future.

HTS	High-throughput screening
NCI	National Cancer Institute
XGBoost	Extreme gradient boosting
HOFM	Higher-order factorization machines
PPI	Protein-protein interaction
GBDT	Gradient boosted decision tree
DNN	Deep neural networks
SHAP	Shapley additive explanations
ROC AUC	The area under the receiver operating characteristic curve
PR AUC	The area under the precision recall curve
MSE	Mean square error
PCC	Pearson correlation coefficient

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

The datasets and the code are available in this published article and its supplementary information files.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by Science and Technology Planning Project of Guizhou Province of China (No. Qian Ke He Ji Chu -ZK[2021] Yi Ban 315), Science Foundation of Guizhou University of Finance and Economics (No. 2021KYYB21) and the Youth Science and Technology Talent Growth Project of Guizhou Provincial Education Department (No. QJH-KY-Z[2021]132).

Authors' contributions

CL performed the data curation, wrote the code, built the model, and wrote the original draft. NG designed and supervised the project, and reviewed and edited the manuscript. HZ participated in the data curation and model construction.

Additional file 1

Additional file 1 gives the supplementary table of the top 100 features ranked according to the SHAP value.

Madani Tonekaboni SA, Soltan Ghoraie L, Manem VSK, Haibe-Kains B: Predictive approaches for drug combination discovery in cancer. Briefings in bioinformatics 2018, 19(2):263-276.
Lehár J, Krueger AS, Avery W, Heilbut AM, Johansen LM, Price ER, Rickles RJ, Short GF, 3rd, Staunton JE, Jin X et al: Synergistic drug combinations tend to improve therapeutically relevant selectivity. Nat Biotechnol 2009, 27(7):659-666.
Andreuccetti M, Allegrini G, Antonuzzo A, Malvaldi G, Conte PF, Danesi R, Del Tacca M, Falcone A: Azidothymidine in combination with 5-fluorouracil in human colorectal cell lines: in vitro synergistic cytotoxicity and DNA-induced strand-breaks. Eur J Cancer 1996, 32a(7):1219-1226.
He L, Kulesskiy E, Saarela J, Turunen L, Wennerberg K, Aittokallio T, Tang J: Methods for High-throughput Drug Combination Screening and Synergy Scoring. Methods in molecular biology (Clifton, NJ) 2018, 1711:351-398.
Siegel RL, Miller KD, Jemal A: Cancer statistics, 2019. CA-Cancer J Clin 2019, 69(1):7-34.
Holohan C, Schaeybroeck SV, Longley DB, Johnston PG: Cancer drug resistance: an evolving paradigm. Nature Reviews Cancer 2013, 13(10):714-726.
Gottesman MM: Mechanisms of cancer drug resistance. Annual review of medicine 2002, 53(1):615-627.
Chou, T.-C.: Theoretical Basis, Experimental Design, and Computerized Simulation of Synergism and Antagonism in Drug Combination Studies. Pharmacological Reviews 2006, 58(3):621-681.
Csermely P, Korcsmáros T, Kiss HJM, London G, Nussinov R: Structure and dynamics of molecular networks: A novel paradigm of drug discovery. A comprehensive review. Pharmacology & therapeutics 2013, 138(3):333-408.
O'Neil J, Benita Y, Feldman I, Chenard M, Roberts B, Liu Y, Li J, Kral A, Lejnine S, Loboda A: An unbiased oncology compound screen to identify novel combination strategies. Molecular cancer therapeutics 2016, 15(6):1155-1162.
Huang Y, Jiang D, Sui M, Wang X, Fan W: Fulvestrant reverses doxorubicin resistance in multidrug-resistant breast cell lines independent of estrogen receptor expression. Oncology Reports 2016.
Jia J, Zhu F, Ma X, Cao ZW, Li YX, Chen YZ: Mechanisms of drug combinations: interaction and network perspectives. Nature Reviews Drug Discovery 2009, 8(6):111-128.
Menden MP, Wang D, Mason MJ, Szalai B, Bulusu KC, Guan Y, Yu T, Kang J, Jeon M, Wolfinger RJ: Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen. Nature communications 2019, 10(1):2674.
Holbeck SL, Camalier R, Crowell JA, Govindharajulu JP, Hollingshead M, Anderson LW, Polley E, Rubinstein L, Srivastava A, Wilsker D et al: The National Cancer Institute ALMANAC: A Comprehensive Screening Resource for the Detection of Anticancer Drug Pairs with Enhanced Therapeutic Activity. Cancer research 2017, 77(13):3564-3576.
Yousef M, Khalifa W, Acar IE, Allmer J: MicroRNA categorization using sequence motifs and k-mers. BMC Bioinformatics 2017, 18(1):170.
Li H, Li T, Quang D, Guan Y: Network Propagation Predicts Drug Synergy in Cancers. Cancer research 2018, 78(18):5446-5457.
Janizek JD, Celik S, Lee SI: Explainable machine learning prediction of synergistic drug combinations for precision cancer medicine. Cold Spring Harbor Laboratory 2018.
Celebi R, Bear Don’t Walk O, Movva R, Alpsoy S, Dumontier M: In-silico prediction of synergistic anti-cancer drug combinations using multi-omics data. Scientific Reports 2019, 9(1):1-10.
Sidorov P, Naulaerts S, Ariey-Bonnet J, Pasquier E, Ballester PJ: Predicting Synergism of Cancer Drug Combinations Using NCI-ALMANAC Data. Frontiers in chemistry 2019, 7:509.
Jeon M, Kim S, Park S, Lee H, Kang J: In silico drug combination discovery for personalized cancer therapy. BMC systems biology 2018, 12(2):59-67.
Li J, Huo Y, Wu X, Liu E, Zeng Z, Tian Z, Fan K, Stover D, Cheng L, Li L: Essentiality and transcriptome-enriched pathway scores predict drug-combination synergy. Biology 2020, 9(9):278.
Julkunen H, Cichonska A, Gautam P, Szedmak S, Douat J, Pahikkala T, Aittokallio T, Rousu J: Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects. Nature communications 2020, 11(1):6136.
Preuer K, Lewis RP, Hochreiter S, Bender A, Bulusu KC, Klambauer G: DeepSynergy: predicting anti-cancer drug synergy with Deep Learning. Bioinformatics 2018, 34(9):1538-1546.
Zhang H, Feng J, Zeng A, Payne P, Li F: Predicting tumor cell response to synergistic drug combinations using a novel simplified deep learning model. AMIA Annual Symposium Proceedings 2020, 2020:1364.
Zhang T, Zhang L, Payne PR, Li F: Synergistic drug combination prediction by integrating multiomics data in deep learning models. Translational bioinformatics for therapeutic development 2021:223-238.
Kim Y, Zheng S, Tang J, Jim Zheng W, Li Z, Jiang X: Anticancer drug synergy prediction in understudied tissues using transfer learning. Journal of the American Medical Informatics Association 2021, 28(1):42-51.
Wang X, Zhu H, Jiang Y, Li Y, Tang C, Chen X, Li Y, Liu Q, Liu Q: PRODeepSyn: predicting anticancer synergistic drug combinations by embedding cell lines with protein–protein interaction network. Briefings in Bioinformatics 2022, 23(2):bbab587.
Hu J, Gao J, Fang X, Liu Z, Wang F, Huang W, Wu H, Zhao G: DTSyn: a dual-transformer-based neural network to predict synergistic drug combinations. Briefings in Bioinformatics 2022, 23(5):bbac302.
Pudova EA, Kobelyatskaya AA, Katunina IV, Snezhkina AV, Fedorova MS, Pavlov VS, Bakhtogarimov IR, Lantsova MS, Kokin SP, Nyushko KM: Lymphatic Dissemination in Prostate Cancer: Features of the Transcriptomic Profile and Prognostic Models. International Journal of Molecular Sciences 2023, 24(3):2418.
Jinchao Z, Yinping J, Xi L, Xiao W: A Novel Submitochondrial Localization Predictor based on Gradient Boosting Algorithm and Dataset Balancing Treatment. International Journal of Performability Engineering 2020, 16(7):1038.
Bouget V, Duquesne J, Hassler S, Cournède PH, Fautrel B, Guillemin F, Pallardy M, Broët P, Mariette X, Bitoun S: Machine learning predicts response to TNF inhibitors in rheumatoid arthritis: results on the ESPOIR and ABIRISK cohorts. RMD open 2022, 8(2).
Luna A, Elloumi F, Varma S, Wang Y, Rajapakse VN, Aladjem MI, Robert J, Sander C, Pommier Y, Reinhold WC: CellMiner Cross-Database (CellMinerCDB) version 1.2: Exploration of patient-derived cancer cell line pharmacogenomics. Nucleic acids research 2021, 49(D1):D1083-D1093.
Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z et al: DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic acids research 2018, 46(D1):D1074-D1082.
Yi L, Zhou L, Luo J, Yang Q: Circ-PTK2 promotes the proliferation and suppressed the apoptosis of acute myeloid leukemia cells through targeting miR-330-5p/FOXM1 axis. Blood Cells, Molecules, and Diseases 2021, 86:102506.
Xie M, Sun M, Ji X, Li D, Chen X, Zhang B, Huang W, Zhang T, Wang Y, Tian D et al: Overexpression of BACH1 mediated by IGF2 facilitates hepatocellular carcinoma growth and metastasis via IGF1R and PTK2. Theranostics 2022, 12(3):1097-1116.
Chuang HH, Zhen YY, Tsai YC, Chuang CH, Hsiao M, Huang MS, Yang CJ: FAK in Cancer: From Mechanisms to Therapeutic Strategies. International journal of molecular sciences 2022, 23(3).
Akervall J, Bockmühl U, Petersen I, Yang K, Carey TE, Kurnit DM: The gene ratios c-MYC:cyclin-dependent kinase (CDK)N2A and CCND1:CDKN2A correlate with poor prognosis in squamous cell carcinoma of the head and neck. Clinical cancer research : an official journal of the American Association for Cancer Research 2003, 9(5):1750-1755.
Hosokawa Y, Arnold A: Mechanism of cyclin D1 (CCND1, PRAD1) overexpression in human cancer cells: analysis of allele-specific expression. Genes, Chromosomes and Cancer 1998, 22(1):66-71.
Penuel E, Schaefer G, Akita RW, Sliwkowski MX: Structural requirements for ErbB2 transactivation. Seminars in oncology 2001, 28(6 Suppl 18):36-42.
McDonagh CF, Huhalov A, Harms BD, Adams S, Paragas V et al: Antitumor Activity of a Novel Bispecific Antibody That Targets the ErbB2/ErbB3 Oncogenic Unit and Inhibits Heregulin-Induced Activation of ErbB3. Molecular Cancer Therapeutics 2012, 11(3):582.

No competing interests reported.

Additionalfile1.xlsx
Additional file 1 gives the supplementary table of the top 100 features ranked by SHAP value.
Datasetsandcode.zip
