The implementation of ASFP consists of two parts: model construction and validation, and the development of a web server for ML-based SF construction.
Model construction
Benchmark. The benchmark dataset I (Dataset I), which contains the kinase subset and the diverse subset of the Directory of Useful Decoys-Enhanced (DUD-E) benchmark, was used to train and assess the MLSFs. The kinase subset contains the inhibitors and decoys generated by DUD-E for 26 kinases, and the diverse subset contains the inhibitors and decoys for seven representative targets drawn from the entire DUD-E set. The basic information of Dataset I is summarized in Table S1.
The benchmark dataset II (Dataset II), extracted from the PDBbind database (version 2016) [8], was used to train and evaluate the SVM regression model for binding affinity prediction. PDBbind version 2016 contains 4057 protein-ligand complexes in the "refined set" and 290 complexes in the "core set".
Evaluation criteria. In this study, six evaluation criteria were used to assess model performance. The F1 score, Cohen's kappa, Matthews correlation coefficient (MCC), the area under the receiver operating characteristic curve (ROC AUC) and the enrichment factor (EF) at 1% were used to evaluate the target-specific models, while the Pearson correlation coefficient (Rp) was calculated to assess the SVM regression model. The details of the metrics can be found in the Supplementary material.
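The six metrics can be sketched with scikit-learn and SciPy. The labels, scores and affinity values below are toy data (not from ASFP), and the EF fraction is raised from 1% to 25% so the eight-sample example has a meaningful top bin.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (cohen_kappa_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

def enrichment_factor(y_true, y_score, fraction=0.01):
    """EF: active rate among the top-scored fraction divided by the
    active rate in the whole screened set."""
    y_true = np.asarray(y_true)
    n_top = max(1, int(round(len(y_true) * fraction)))
    top = y_true[np.argsort(y_score)[::-1][:n_top]]  # top-ranked actives
    return (top.sum() / n_top) / (y_true.sum() / len(y_true))

# Toy labels/predictions for a target-specific classifier
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 0, 1, 1]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.1, 0.7, 0.6]
f1    = f1_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
mcc   = matthews_corrcoef(y_true, y_pred)
auc   = roc_auc_score(y_true, y_score)
ef    = enrichment_factor(y_true, y_score, fraction=0.25)
print(f1, kappa, mcc, auc, ef)  # 0.75 0.5 0.5 0.9375 2.0

# Rp for the generic regression model (predicted vs. experimental affinity)
rp, _ = pearsonr([6.2, 7.1, 4.8, 5.5], [5.9, 7.4, 5.1, 5.3])
print(rp)
```

Note that EF depends on the chosen fraction: at 1% on a DUD-E-sized set the "top bin" covers hundreds of molecules, whereas here it would round to a single compound.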
Preparation. The protein targets were prepared with the Structure Preparation wizard in Schrödinger version 2018, which added hydrogen atoms, repaired the side-chains of imperfect residues using Prime, and optimized side-chain steric clashes. The protonation states of the proteins were determined with PROPKA, and the het groups were preprocessed with Epik. The ligands were prepared with the LigPrep module, which added hydrogen atoms, ionized the structures using Epik, removed salts, and generated tautomers and stereoisomers. Default settings were used throughout the preparation process.
Docking. The grids were first generated with the Receptor Grid Generation utility, with the binding box (10 Å × 10 Å × 10 Å) centered on the co-crystallized ligand. Then, the Glide docking program in SP scoring mode was used to dock the prepared ligands into the prepared proteins. For each ligand, only the pose with the highest docking score was retained.
Descriptor generation. After molecular docking, the structural files of Dataset I and Dataset II were retained for descriptor generation. In this study, a total of 15 descriptor calculation tools of various types were used to compute descriptors (Table 1). Because some of the tools are license-restricted, two schemes were employed to generate the descriptors for establishing MLSFs. First, all the SFs (excluding fingerprints and dpocket) supported by the computational tools in Table 1 were used to generate descriptors (ALL descriptors). Second, all the SFs supported by the computational tools in Table 1 without license restrictions (i.e., AffiScore version 3.0, AutoDock version 6.8, DSX version 0.9, GalaxyDockBP2, NNScore version 2.01 and SMoG2016) were used to generate descriptors (FREE descriptors). Both descriptor sets were used for the generic SF construction, while only the FREE descriptors were used to build the target-specific classification models because of the huge computational cost.
Table 1
The basic information of the computational tools supported by the descriptor generation module.
| Computational tools¹ | Type of descriptors | No. | Types |
| --- | --- | --- | --- |
| **AffiScore** | Energy terms | 13 | Empirical |
| ASP | Energy terms | 5 | Knowledge/Empirical |
| **AutoDock** | Energy terms | 6 | Force field |
| ChemPLP | Energy terms | 11 | Empirical |
| ChemScore | Energy terms | 10 | Empirical |
| DPOCKET | Pocket descriptors | 49 | – |
| **DSX** | Energy terms | 1 | Knowledge |
| RDKit | ECFP fingerprint | 2048 | – |
| **GalaxyDockBP2** | Energy terms | 11 | Empirical |
| Glide SP | Energy terms | 17 | Empirical |
| Glide XP | Energy terms | 27 | Empirical |
| GoldScore | Energy terms | 6 | Force field |
| **NNscore** | Energy terms | 348 | ML |
| PaDEL | PubChem fingerprint | 881 | – |
| **SMoG2016** | Energy terms | 5 | Knowledge/Empirical |

¹Computational tools without license restrictions are marked in bold.
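The FREE descriptor set totals 13 + 6 + 1 + 11 + 348 + 5 = 384 features across the six unrestricted tools. A minimal sketch of assembling the per-tool energy terms into one feature vector follows; the per-tool counts are from Table 1, but the dictionary-based input layout is an assumption for illustration.

```python
import numpy as np

# Tool -> number of energy terms for the license-free tools (Table 1)
FREE_TOOLS = {
    "AffiScore": 13, "AutoDock": 6, "DSX": 1,
    "GalaxyDockBP2": 11, "NNscore": 348, "SMoG2016": 5,
}

def build_free_vector(per_tool_terms):
    """Concatenate each tool's energy terms into one feature vector."""
    parts = []
    for tool, n in FREE_TOOLS.items():
        terms = np.asarray(per_tool_terms[tool], dtype=float)
        assert terms.shape == (n,), f"{tool} should yield {n} terms"
        parts.append(terms)
    return np.concatenate(parts)

# A dummy complex with zeroed terms yields a 384-dimensional vector
dummy = {tool: np.zeros(n) for tool, n in FREE_TOOLS.items()}
vec = build_free_vector(dummy)
print(vec.shape)  # (384,)
```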
Modeling. To construct the target-specific MLSFs, the dataset for each target in Dataset I was split into a training set and a test set at a ratio of 3:1, and then preprocessed to scale the data and remove duplicated features. Three ML algorithms, namely Support Vector Machine (SVM), Random Forest (RF) and eXtreme Gradient Boosting (XGBoost), were used to develop the MLSF for each target, and the hyperparameters were optimized with the hyperopt package. The performance of each model was assessed by ten-fold cross-validation on the training set and by prediction on the independent test set. To develop the generic SVM regression model for binding affinity prediction, the PDBbind version 2016 'refined set' (excluding the 'core set') was used as the training set and the 'core set' was used as the test set.
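The per-target pipeline (3:1 split, scaling, duplicate-feature removal, hyperparameter search scored by 10-fold cross-validation) can be sketched as below. The study optimizes hyperparameters with hyperopt; this dependency-light sketch swaps in scikit-learn's GridSearchCV for the search and uses synthetic data rather than DUD-E descriptors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one target's actives/decoys descriptor matrix
X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X = np.hstack([X, X[:, :1]])                 # inject a duplicated feature
X_tr, X_te, y_tr, y_te = train_test_split(   # 3:1 train/test split
    X, y, test_size=0.25, random_state=0)

# Remove exact duplicate columns (determined on the training set)
_, keep = np.unique(X_tr, axis=1, return_index=True)
keep = sorted(keep)
X_tr, X_te = X_tr[:, keep], X_te[:, keep]

# Scale + SVM, with a small hyperparameter grid scored by 10-fold CV
pipe = Pipeline([("scale", StandardScaler()),
                 ("svm", SVC(probability=True, random_state=0))])
search = GridSearchCV(pipe, {"svm__C": [0.1, 1.0, 10.0]},
                      cv=10, scoring="roc_auc")
search.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
print(f"test ROC AUC: {auc:.3f}")
```

Fitting the duplicate filter and scaler on the training set only, as above, avoids leaking test-set statistics into model selection.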
Web API
Descriptor generation. With respect to the characterization of protein-ligand interactions, energy terms and knowledge-based pairwise potentials extracted from existing SFs are popular representations. These energy components, which correlate with the binding affinity of protein-ligand complexes, can serve as input for the development of MLSFs. Therefore, 12 scoring programs were integrated into this module, and the scoring components from the output of their SFs can be generated automatically. In addition, two computational tools, RDKit and PaDEL, were integrated to calculate the extended-connectivity fingerprint (ECFP) and the PubChem fingerprint, respectively, to characterize the structural features of small molecules. Furthermore, fpocket is supported by this module to calculate 49 descriptors characterizing the structural information of protein pockets. It should be noted that protein-ligand complexes should be docked before being submitted to the server, and that the small-molecule descriptors may not be recommended for the development of MLSFs. The information of the 15 computational tools supported by ASFP is listed in Table 1. Because some of the computational tools integrated into ASFP are commercial, their functions are disabled on the server. Based on the descriptors generated by this module, users can further construct a customized SF with an ML algorithm.
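The ligand-side ECFP can be computed directly with RDKit, as sketched below, folded to the 2048 bits listed in Table 1 (PaDEL's PubChem fingerprint is Java-based and not shown). The aspirin SMILES is only an illustrative input, not an ASFP example.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Aspirin as an illustrative ligand (an assumption, not an ASFP input)
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# ECFP with radius 2 (Morgan fingerprint), folded to 2048 bits
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
bits = list(fp.GetOnBits())
print(len(fp), len(bits))  # 2048 bits total; only a small number are set
```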
AI-based SF construction. As one of the modules implemented in the server, AI-based SF construction is designed for building customized target-specific MLSFs; the workflow after submission is summarized in Fig. 1. In this module, the 384 descriptors extracted from the SFs implemented in six freely available computational tools (AffiScore version 3.0, AutoDock version 6.8, DSX version 0.9, GalaxyDockBP2, NNScore version 2.01 and SMoG2016) can be used to train SFs. First, the dataset uploaded by the user is divided into a training set and a test set according to the user's input. Then, the dataset is preprocessed (standardization, removal of low-variance features, and tree-based feature selection) using sklearn. For the sake of computational efficiency, three popular ML algorithms (RF, SVM and XGBoost) are provided. Users can choose an ML algorithm for training and set options for hyperparameter optimization (which hyperparameters to optimize, their ranges, and the number of optimization iterations). Finally, according to the user's input, the server uses hyperopt to find the optimal hyperparameter combination, trains the chosen ML algorithm (with 10-fold cross-validation) and makes predictions, and then outputs the results as a PDF file.
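The preprocessing chain (standardization, low-variance filtering, tree-based feature selection) maps naturally onto sklearn transformers. The specific classes, thresholds, and the ExtraTrees selector below are assumptions for illustration, since the server's exact settings are not stated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an uploaded 384-descriptor training matrix
X, y = make_classification(n_samples=200, n_features=384,
                           n_informative=20, random_state=0)
X = np.hstack([X, np.zeros((200, 1))])        # add a constant column

prep = Pipeline([
    ("variance", VarianceThreshold()),        # drop zero-variance features
    ("scale", StandardScaler()),              # standardize the rest
    ("trees", SelectFromModel(                # tree-based feature selection
        ExtraTreesClassifier(n_estimators=50, random_state=0))),
])
X_sel = prep.fit_transform(X, y)
print(X.shape, "->", X_sel.shape)             # 385 columns shrink sharply
```

Filtering variance before scaling matters: after standardization every surviving feature has unit variance, so a variance threshold applied afterwards would be uninformative.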
Online prediction. Based on model performance, 15 well-performing target-specific SFs for targets of research interest, together with the generic regression SF for binding affinity prediction, were retained to form the third module, Online prediction. The detailed information of these models is provided in Table 2.
Table 2
The information of the 15 targets with well-established classification models.
| Target | Data source | ML algorithm | ROC AUC on test set |
| --- | --- | --- | --- |
| abl1 | DUD-E Kinase subset | SVM | 0.848 |
| akt2 | DUD-E Kinase subset | SVM | 0.859 |
| csf1r | DUD-E Kinase subset | SVM | 0.902 |
| egfr | DUD-E Kinase subset | SVM | 0.894 |
| igf1r | DUD-E Kinase subset | SVM | 0.846 |
| jak2 | DUD-E Kinase subset | SVM | 0.921 |
| kpcb | DUD-E Kinase subset | SVM | 0.890 |
| mapk2 | DUD-E Kinase subset | SVM | 0.876 |
| mk01 | DUD-E Kinase subset | SVM | 0.838 |
| src | DUD-E Kinase subset | SVM | 0.852 |
| tgfr1 | DUD-E Kinase subset | SVM | 0.965 |
| wee1 | DUD-E Kinase subset | SVM | 0.965 |
| akt1 | DUD-E Diverse subset | SVM | 0.850 |
| cxcr4 | DUD-E Diverse subset | SVM | 0.942 |
| hivpr | DUD-E Diverse subset | SVM | 0.947 |
The ASFP server, built on the high-level Python web framework Django, is deployed on a Linux server with an Intel(R) Xeon(R) E5-2630 v4 CPU @ 2.20 GHz (28 cores) and 64 GB of memory. Several SF programs, such as AutoDock [9], were integrated to automate the calculation process. The overall workflow implemented in the ASFP server is shown in Supplementary Figure S1, and the manual of ASFP can be downloaded from the website (http://cadd.zju.edu.cn/asfp/).