Cancer is a major threat to human health. Chemotherapy has long been a common strategy for cancer treatment, but it is associated with many side effects[1]. Reports have shown that cancer monotherapy often suffers from limited efficacy, poor safety, and drug resistance[2, 3]. Drug resistance and side effects are the main reasons for the failure of cancer chemotherapy. Meanwhile, progress in cancer drug research and development has slowed, and the cost of developing new drugs has risen[4]. Improving the efficiency and reducing the cost of drug research and development is therefore a major challenge.
With the rise of network pharmacology, multi-target combination drugs have become a new research direction[5–7]. A drug combination may have a greater or lesser effect on cancer cells than the additive sum of the individual drug effects, i.e., it may be synergistic or antagonistic[1]. Synergistic drug combinations usually require lower doses than single drugs, with improved efficacy and reduced toxicity. They can also delay the development of drug resistance. Drug combinations with synergistic effects may therefore be ideal therapeutic regimens for cancer[8, 9], and finding synergistic drug combinations for specific cancer types is important for improving the efficacy of anticancer therapy[10, 11].
Effective drug combinations can be proposed based on clinical experience, but this approach is time-consuming and costly relative to the benefits it delivers. Another strategy for identifying synergistic drug combinations is high-throughput screening (HTS) [12]. HTS can yield a large number of experimental results in a reasonable time and at a low cost, making it the preferred choice for discovering effective drug combinations. Nevertheless, finding new effective drug combinations remains a complex task, because the number of possible combinations is very large and grows each time a new drug is developed. Screening the complete combination space with HTS is clearly not feasible.
As data from clinical experience and high-throughput screening accumulate, opportunities for the large-scale application of machine learning methods have emerged. For example, AstraZeneca, a leading pharmaceutical company, partnered with several organizations to launch a drug combination prediction challenge in the DREAM community, providing participants with 11,576 synergy measurements derived from 910 drug combination experiments, involving 118 drugs and 85 cancer cell lines[13]. In 2017, the National Cancer Institute (NCI) released ALMANAC, the largest publicly available cancer drug combination dataset, which contains synergy measurements for combinations of 104 drugs in the 60 cancer cell lines of the NCI-60 panel[14]. Based on these data resources, several machine learning algorithms for anticancer drug combination prediction have been proposed. For instance, Li et al. [15] used the DREAM data to predict drug combination synergy with a random forest model based on drug-target networks and gene expression profiles. Li et al. [16] proposed a network propagation method that simulates molecular features from gene-gene networks and drug-target information, and combined these molecular features with single-drug treatment data to train a random forest classifier for anticancer drug synergy prediction. Janizek et al. [17] proposed a method based on Extreme Gradient Boosting (XGBoost) to predict drug combination synergy. Celebi et al. [18] also proposed an XGBoost-based approach to predict anticancer drug combinations using multi-omics data, adding targeted pathways and monotherapy information to the feature space. Sidorov et al. [19] used both XGBoost and random forest to build a separate model for each cell line to predict the synergistic effects of anticancer drug combinations. Jeon et al. [20] proposed a method based on extremely randomized trees (ERT) for predicting anticancer drug combinations. Li et al. [21] used logistic regression to test the statistical significance of gene and pathway features in predicting the synergy of anticancer drug combinations. Julkunen et al. [22] proposed a prediction method called ComboFM, which models the auxiliary features of two drugs, cell lines, and drug-cell lines as a fifth-order tensor and predicts the response of drug pairs using higher-order factorization machines (HOFM).
With the development of deep learning algorithms, a growing number of deep learning models have been constructed for drug synergy prediction. Preuer et al. [23] proposed DeepSynergy, a three-layer feedforward neural network that takes genomic information and chemical features of the drugs as input. They used a normalization strategy to account for the heterogeneity of the input features, and a conical layer architecture to predict drug synergy. Zhang et al. [24] proposed DeepSignalingSynergy. Instead of considering a large number of chemical and genomic features, the authors used only a small number of cancer signaling pathways, in order to investigate the importance of individual signaling pathways for prediction. Zhang et al. [25] proposed AuDNNsynergy, which predicts drug combination synergy by integrating multi-omics and chemical structure data. Kim et al. [26] developed a drug synergy prediction model based on multitask deep neural networks that integrates multimodal inputs and multimodal outputs using features from multiple cell lines, and applied transfer learning so that data-rich tissues could inform predictions for data-poor tissues. Recently, Wang et al. [27] proposed a deep learning prediction model named PRODeepSyn. The model uses graph convolutional neural networks to integrate protein-protein interaction (PPI) networks with omics data and construct low-dimensional embeddings of cell lines, which are fed into deep neural networks together with the drug features to calculate drug synergy scores. Similarly, Hu et al. [28] proposed DTSyn to understand the mechanism of drug synergy from the perspective of chemical-gene-tissue interactions.
In this work, we present a CatBoost-based machine learning approach to predict the synergy scores of anticancer drug combinations. CatBoost is a Gradient Boosted Decision Tree (GBDT) framework that uses symmetric (oblivious) decision trees as base learners; it has relatively few parameters, supports categorical variables, and achieves high accuracy. CatBoost has been widely used in the biomedical field for various tasks. In a recent study, Pudova et al. [29] used the CatBoost algorithm to identify cancer-related microRNAs. Jinchao et al. [30] proposed a prediction model called CatBoost-SubMito for protein submitochondrial location prediction. In addition, Bouget et al. [31] used the CatBoost algorithm to predict patients' responses to tumor necrosis factor inhibitors. These studies indicate that CatBoost is playing an increasingly important role in the biomedical field.
In this paper, the performance of CatBoost was evaluated using stratified five-fold cross-validation. We found that CatBoost outperformed models based on Deep Neural Networks (DNN), XGBoost, and Logistic Regression on all metrics. In addition, the interpretation package SHapley Additive exPlanations (SHAP) was used to interpret the biological significance of the prediction results. Most of the top-ranked genes contributing to the CatBoost model predictions were found to be associated with known cancer mechanisms.
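To illustrate what stratified five-fold cross-validation involves, the following minimal Python sketch splits sample indices into five folds while keeping the class proportions approximately equal in every fold. It is not the implementation used in this work: the function name `stratified_kfold_indices`, the toy labels, and the assumption that synergy is binarized into synergistic (1) and non-synergistic (0) classes are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Return a list of (train_idx, test_idx) pairs.

    Hypothetical helper: indices of each class are shuffled and dealt
    round-robin into the folds, so every fold keeps roughly the same
    class proportions as the full dataset.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices across the folds.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    splits = []
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != k for idx in fold
        )
        splits.append((train_idx, test_idx))
    return splits


if __name__ == "__main__":
    # Toy, assumed labels: 20 synergistic (1) and 80 non-synergistic (0) pairs.
    toy_labels = [1] * 20 + [0] * 80
    for fold, (train_idx, test_idx) in enumerate(stratified_kfold_indices(toy_labels)):
        positives = sum(toy_labels[i] for i in test_idx)
        print(f"fold {fold}: {len(test_idx)} test samples, {positives} positives")
```

In each iteration a model would be trained on `train_idx` and scored on `test_idx`, and the metrics would then be averaged over the five folds. Stratification matters when synergistic combinations are rare, because a purely random split could leave a fold with too few positive examples to evaluate reliably.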