The Comparison of the Prediction Accuracies
Table 1 The comparison on the prediction accuracies of BRD4is and non-BRD4is from RF, k-NN and C4.5 DT in the testing set
|
Method
|
Parameter
|
BRD4is
|
non-BRD4is
|
Q (%)
|
MCC
|
TP
|
FN
|
SE (%)
|
TN
|
FP
|
SP (%)
|
RF
|
Mtry=13
|
274
|
6
|
97.86
|
252
|
9
|
96.55
|
97.23
|
0.9445
|
k-NN
|
k = 6 eq
|
271
|
9
|
96.79
|
232
|
29
|
88.89
|
92.98
|
0.8614
|
C4.5 DT
|
/
|
259
|
21
|
92.50
|
226
|
35
|
86.59
|
89.65
|
0.7934
|
Three methods, RF, k-NN and C4.5 DT, were utilized to build classification prediction models for the same training set, and the performance of the models were evaluated by the same testing set. The data are summarized in Table 1.
Among them, Mtry is the parameter of RF method, and its default value is 13, which is approximately equal to the square root of the number of descriptors (189) [51]. k is the parameter of k-NN method, whose value was optimized by the internal parameter selection program. TP (True positive) denotes the number of positive samples predicted correctly, TN (True negative) denotes the number of negative samples predicted correctly, FP (False positive) denotes the number of negative samples mispredicted as positive, and FN (False negatives) denotes the number of positive samples mispredicted as negative. For the predicting outcomes in this article, TP means the correct number of BRD4is, TN means the correct number of non-BRD4is, FN means the wrong number of BRD4is, and FP means the wrong number of non-BRD4is. At the same time, there are several precision functions to measure the prediction performance, including SE (sensitivity), SP (specificity), Q (total prediction accuracy) and MCC (Matthews correlation coefficient) [52]. These functions have the following relationship with the previous variables:
From Table 1, it can be seen that RF model possesses the highest Q value (97.23%) and the highest MCC value (0.9445). That is to say, the prediction accuracy of the RF model is the best of all. Compared with the models established by k-NN and C4.5 DT, the RF model has greater advantages.
Optimization of Parameters for RF method
Different RF models were generated based on different values of parameters Mtry and Ntree. The parameters of the best-performing RF model were chosen by comparing prediction error rates (ERs) of testing set for all the models. The effect of the different values of Mtry (1-189) and Ntree (100-3000) on the prediction ERs of the testing set for different RF models are illustrated in Figure 1(A) and 1(B), respectively.
As can be seen from Figure 1(A), for different prediction models established by RF method with different parameters, when Mtry value is 9 (prediction ER of training set is 6.06%) and Mtry value is 18 (prediction ER of training set is 6.37%), the prediction ER of the testing set reaches the lowest 2.40%. However, when the value is taken as 9, the prediction ER of the corresponding training set is lower than 18, so 9 was selected as the optimal solution of parameter Mtry.
After determining the parameter Mtry, the Mtry value was fixed at 9 and the Ntree value was constantly changed. The results are displayed in Figure 1(B), which illustrate that the prediction ER of the testing set is the lowest 2.22% when Ntree equals to 300 or 400. Nevertheless, when Ntree is 400, the prediction ER of the corresponding training set is lower (300: 6.17% VS 400: 5.96%), therefore, 400 is more suitable as the optimal parameter.
Altogether, when the parameters of RF method are set to Mtry = 9 and Ntree = 400, the associated model achieves the best prediction performance, and, by extension, the Q of the training set is 94.04%, while the Q of the testing set is 97.60%. Compared with the model before the parameters optimization, the prediction accuracy has been improved.
Evaluation of the Optimal RF Model
According to the special feature selection procedure of the RF method, the model with the optimum parameters was processed further. 25 descriptors (listed in Table 2) most relevant to the properties of BRD4is were screened out from the 189 descriptors, which can serve as a theoretical basis for structural modifications. Each of these descriptors has its corresponding contribution rate, arranged in Figure 2(A) on the basis of its relative importance.
Table 2 The most relevant 25 descriptors identified by the best RF model for the prediction of BRD4is
Descriptor
|
Description
|
S(16)
|
Atom-type Estate sum for -CH3
|
Q N, Max
|
Most positive charge on N atoms
|
Q C, Min
|
Most negative charge on C atoms
|
4χPC
|
Simple molecular connectivity Chi indices for path/cluster
|
Hlb
|
Hydrophilic-Hydrophobic balance
|
Q H, Max
|
Most positive charge on H atoms
|
3χC
|
Simple molecular connectivity Chi indices for cluster
|
S(26)
|
Atom-type Estate sum for :C:-
|
S(22)
|
Atom-type Estate sum for >CH-
|
S(39)
|
Atom-type Estate sum for -OH
|
Tcent
|
Centric Index
|
S(34)
|
Atom-type Estate sum for =N-
|
S(12)
|
Atom-type H Estate sum for CHn (Saturated)
|
Q H, Min
|
Most negative charge on H atoms
|
Capty
|
Capacity factor
|
Q O, Max
|
Most positive charge on O atoms
|
3χvC
|
Valence molecular connectivity Chi indices for cluster
|
4χvPC
|
Valence molecular connectivity Chi indices for path/cluster
|
Shpl
|
Hydrophilic region
|
S(10)
|
Atom-type H Estate sum for :CH: (sp2, aromatic)
|
S(1)
|
Atom-type H Estate sum for -OH
|
S(25)
|
Atom-type Estate sum for =C<
|
S(35)
|
Atom-type Estate sum for :N:
|
Shpb
|
Hydrophobic region
|
Q O, SS
|
Sum of squares of charges on O atoms
|
As shown in Table 2 and Figure 2(A), the importance of these 25 descriptors decreases in turn, and the top three are S(16), Q N, Max and Q C, Min. It is quite clear that these three characteristics have very important reference value in predicting the BRD4is, which sequentially stand for the sum of the electric topological states of -CH3 atom type, the largest positive charge on N atom and the smallest negative charge on C atom.
Figure 2(B) displays the distribution of 541 molecules in the testing set from the established RF optimal model. As can be seen in the figure, the classification boundary line of the model can separate BRD4is from non-BRD4is very well.
By plotting the receiver operating characteristic (ROC) curve [53], we can further analyze and evaluate the discriminant effect of binary classification model. ROC curvecombines both SE and SP together. With the change of prediction probability threshold, many pairs of SE and “1-SP” will be produced. If SE is taken as the ordinate and “1-SP” as the abscissa, the ROC curve can be drawn by connecting each point, and the points on the curve represent the compromise between SE and SP when the prediction probability threshold is constantly changed. There is also a very important index to evaluate the prediction ability of classification model: the area under the ROC curve (AUC). The range for the AUC value is from 0.5 to 1, and the larger the value, the better the classification performance of the model is. The ROC curves of the training set and the testing set for RF optimal model in this paper are emerged in Figure 3. The curve fitting exhibits that the AUC value in the training set and the testing set is 0.981 and 0.993, separately, all reflecting excellent prediction performance of the RF model.
Virtual Screening of BRD4is
The above optimal RF model was applied to screen virtually about 100,000 compounds from the "drug-like" subset of ZINC database. Finally, 89 promising drug molecules were obtained, the details of which are given in Table S2 of Supporting Information. It can be found that some compounds share the same skeleton structure, for instance, ZINC00126622 and ZINC00126628, as exemplified in Figure 4.
The results indicated that the optimal RF model has filtered a number of highly useful structures from the database, such as sulfonyl, triazole and isoxazole, which have better potential inhibitory activity.
Molecular docking calculations
Using Autodock Vina software, the 89 molecules were docked to BRD4 protein afterwards. Seven molecules with binding energies less than or equal to -8 kcal/mol were selected in order of ascending energy. At the same time, the binding energy of the original ligand was also calculated, which is -7.4 kcal/mol, higher than that of the seven molecules above-mentioned. Their structures and binding energies scores are detailed in Table S3 from Supporting Information.
The Table S3 shows that the molecules chosen by us have stronger affinity with BRD4 protein than the original ligand and therefore can be used for further structural modification, chemicalsynthesis and biological testing. Among them, compound ZINC59239754 has isoxazole structure, which belongs to one category of known BRD4 inhibitors. The binding energy between ZINC59239754 and protein ranks third, so the complex is relatively stable.
Molecular dynamics simulations
In order to verify the binding stability of the above seven molecules to BRD4, molecular dynamics simulations were conducted respectively at 100 ns, not only on the complexes between the seven molecules and BRD4 after docking, but also on the complex of the original ligand with BRD4. The simulation results (Figure 5) show that the root mean square deviations (RMSD) for backbones of all systems hold steady after 60 ns, and the radius of gyration (Rg) is also stable at about 15 Å. The binding free energies obtained by molecular mechanics/Poisson–Boltzmann surface area (MM/PBSA) calculations (Table 3) indicate that compound ZINC59239754 has a lower binding free energy than the other six compounds, which is close to the original ligand. Therefore, compound ZINC59239754 is more likely to be a potential BRD4 inhibitor.
In addition, the ranking of binding free energies by molecular dynamics simulations is very different from the Molecular docking calculations
Table 3 Average RMSD, Rg and binding free energy of the seven top-ranked molecules-BRD4 complexs and the original ligand 3,5-dimethylisoxazol-BRD4 complex
Complex
|
RMSD(Å)
|
Radius of gyration(Å)
|
Binding free energy (kJ/mol)
|
ZINC67473070-BRD4
|
1.33±0.28
|
15.20±0.08
|
-81.59±0.68
|
ZINC00481768-BRD4
|
1.25±0.27
|
15.24±0.08
|
-67.60±0.76
|
ZINC59239754-BRD4
|
1.05±0.19
|
15.24±0.07
|
-100.54±2.56
|
ZINC04487544-BRD4
|
1.31±0.31
|
15.21±0.08
|
-82.51±0.73
|
ZINC22055514-BRD4
|
1.26±0.19
|
15.19±0.08
|
-90.93±0.60
|
ZINC71783667-BRD4
|
1.38±0.27
|
15.29±0.08
|
-75.53±0.70
|
ZINC71782051-BRD4
|
1.24±0.21
|
15.24±0.07
|
-79.38±0.56
|
The original ligand-BRD4
|
1.21±0.18
|
15.21±0.06
|
-99.75±1.06
|