According to our calculations of the eight molecular physicochemical properties, we first assessed the hydrophobicity-related properties, which included the lipid-water partition coefficient (AlogP), apparent distribution coefficient (logD), and solubility (logS). The AlogP values for the DILI-positive compounds ranged from −18.27 to 13.41, with an average value of 2.24, while those for the DILI-negative compounds ranged from −26.25 to 14.69, with an average value of 1.67. For logD, DILI-positive compounds ranged from −23.45 to 13.41, with an average value of 1.64, whereas DILI-negative compounds ranged from −32.81 to 10.25, with an average value of 0.99. The logS values for the DILI-positive compounds ranged from −14.72 to 2.47, with an average value of −4.28, whereas those for the DILI-negative compounds ranged from −31.28 to 4.58, with an average value of −4.12. On average, compounds demonstrating hepatotoxicity therefore exhibit somewhat greater lipophilicity than those without hepatotoxicity (mean AlogP 2.24 versus 1.67; mean logD 1.64 versus 0.99), although the ranges of the two groups overlap widely.
The maximum molecular weight for the DILI-positive compounds was 2639.13, while it reached 4541.07 for the DILI-negative compounds. However, the average values between these two groups were comparable, suggesting that the molecular weight is not a highly discriminative property for distinguishing DILI. Conversely, in terms of the number of rotatable bonds (NRot), DILI-negative compounds exhibited higher counts than DILI-positive compounds, indicating that greater flexibility is associated with nonhepatotoxicity.
The remaining three molecular descriptors were the polar surface area, hydrogen bond acceptor count, and hydrogen bond donor count. These descriptors are commonly used to characterize the hydrogen-bonding and electrostatic properties of molecules. The p-values for these descriptors were also calculated between the two compound categories and were 0.06, 0.7, and 0.000303 for the polar surface area, hydrogen bond acceptor count, and hydrogen bond donor count, respectively. For constructing a high-performing and reliable predictive model, it is preferable for the p-values to be below the 10⁻²⁰ threshold level; judged against this threshold, these descriptors do not possess sufficient discriminatory capability for predicting DILI. As a result, none of these eight molecular descriptors can be used to effectively discriminate compounds by their hepatotoxicity. Furthermore, for the other 28 descriptors encompassing various physicochemical parameters evaluated in our study, the statistical results indicate that accurate discrimination between DILI-positive and DILI-negative compounds cannot be achieved solely on the basis of these simple physicochemical property-based rules/filters.
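To make the two-group comparison concrete, the following minimal sketch estimates a p-value for the difference in a descriptor's mean between two compound classes. The statistical test behind the reported p-values is not named in this section, so a two-sided permutation test is used here purely as a stand-in; the function name `permutation_p_value` and the descriptor values are hypothetical and are not data from this study.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Reassign class labels at random and recompute the mean difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical descriptor values, for illustration only.
dili_positive = [2.1, 3.4, 1.8, 2.9, 3.1, 2.5, 3.8, 2.2]
dili_negative = [1.9, 2.0, 1.2, 2.4, 1.5, 2.8, 1.1, 1.7]

p_value = permutation_p_value(dili_positive, dili_negative)
print(f"p = {p_value:.4f}")
```

With 2000 permutations the smallest p-value this sketch can return is about 5 × 10⁻⁴, so a permutation test cannot resolve thresholds as strict as 10⁻²⁰; a parametric test would be needed for that purpose.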
3.2. DILI prediction based on scaffold features.
3.2.1. Cumulative scaffold frequencies (CSFs) of the Murcko frameworks and the Levels 1 and 2 scaffolds. We utilized the Scaffold Tree tool to fragment the compound molecules, generating a series of substructures including Murcko frameworks, rings, ring assemblies, and bridge assemblies. These substructures collectively represent the cyclic components present in the molecules. Based on our analysis results (Table 3), a majority of the compounds used in this study exhibited cyclic structures, while only a small proportion lacked such structures. For example, the proportion of DILI-positive compounds containing Level 1 substructures was 79.05%, while for DILI-negative compounds, it was 68.13%. For the Level 2 scaffolds, the percentages of molecules exhibiting them in the DILI-positive and DILI-negative categories were 56.35% and 47.91%, respectively. However, the proportion of molecules possessing scaffolds at Level 3 or higher was less than 30%. Therefore, employing Level 1 or Level 2 scaffolds to characterize the structural features of both DILI-positive and DILI-negative compounds, as well as comparing scaffold diversity between them, is appropriate and justified.
Table 3
The number of scaffolds at different levels of the Scaffold Tree for 1141 DILI-positive and 1004 DILI-negative compounds

| Level | No. of scaffolds (DILI-positive) | No. of scaffolds (DILI-negative) | No. of non-duplicated scaffolds (DILI-positive) | No. of non-duplicated scaffolds (DILI-negative) |
|---|---|---|---|---|
| Level 0 | 1056 (92.55%) | 880 (87.65%) | 211 (18.49%) | 169 (16.83%) |
| Level 1 | 902 (79.05%) | 684 (68.13%) | 421 (36.90%) | 331 (32.97%) |
| Level 2 | 643 (56.35%) | 481 (47.91%) | 415 (36.37%) | 303 (30.18%) |
| Level 3 | 339 (29.71%) | 283 (28.19%) | 229 (20.07%) | 175 (17.43%) |
| Level 4 | 101 (8.85%) | 102 (10.16%) | 85 (7.45%) | 72 (7.17%) |
| Level 5 | 42 (3.68%) | 56 (5.58%) | 35 (3.07%) | 43 (4.28%) |
| Level 6 | 22 (1.93%) | 26 (2.59%) | 19 (1.67%) | 21 (2.09%) |
| Level 7 | 6 (0.53%) | 16 (1.59%) | 6 (0.53%) | 14 (1.39%) |
| Level 8 | 4 (0.35%) | 7 (0.70%) | 4 (0.35%) | 7 (0.70%) |
| Level 9 | 2 (0.18%) | 2 (0.20%) | 2 (0.18%) | 2 (0.20%) |
| Level 10 | 1 (0.09%) | 1 (0.10%) | 1 (0.09%) | 1 (0.10%) |
| Level 11 | 1 (0.09%) | 1 (0.10%) | 1 (0.09%) | 1 (0.10%) |
| Level 12 | 1 (0.09%) | 0 | 1 (0.09%) | 0 |
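The percentages in Table 3 appear to be each scaffold count divided by the total number of compounds in the corresponding class (1141 DILI-positive, 1004 DILI-negative). The short sketch below recomputes the coverage for the first four levels from the tabulated counts; the helper name `coverage` and the variable names are illustrative only.

```python
TOTAL_POSITIVE = 1141  # DILI-positive compounds
TOTAL_NEGATIVE = 1004  # DILI-negative compounds

# (level, DILI-positive count, DILI-negative count), copied from Table 3.
scaffold_counts = [
    ("Level 0", 1056, 880),
    ("Level 1", 902, 684),
    ("Level 2", 643, 481),
    ("Level 3", 339, 283),
]


def coverage(count, total):
    """Percentage of compounds in a class that possess a scaffold at a level."""
    return round(100.0 * count / total, 2)


coverage_table = {
    level: (coverage(pos, TOTAL_POSITIVE), coverage(neg, TOTAL_NEGATIVE))
    for level, pos, neg in scaffold_counts
}

for level, (pos_pct, neg_pct) in coverage_table.items():
    print(f"{level}: {pos_pct:.2f}% DILI-positive, {neg_pct:.2f}% DILI-negative")
```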
As illustrated in Table 4, the presence of Murcko frameworks was observed in 92.55% and 87.65% of the DILI-positive and DILI-negative compounds, respectively, echoing the trends detailed in Table 3 and further substantiating the prevalence of cyclic structures among the analyzed compounds. The Murcko framework refers to the core structural skeleton of molecules remaining after the removal of their side chains. With respect to our datasets comprising DILI-positive and DILI-negative compounds, we identified 680 and 558 unique Murcko frameworks, respectively, indicating a certain level of structural diversity within these two types of molecules. A comparison of other cyclic systems, which include rings, ring assemblies, and bridge assemblies, revealed that the counts associated with DILI-positive compounds tended to be greater than those associated with DILI-negative compounds. A possible reason for this could be the relatively larger number of DILI-positive compounds in our dataset. Notably, chain assemblies exhibit an intriguing pattern; specifically, 668 unique chain assemblies were identified within the dataset of DILI-positive compounds, whereas 693 were identified within the dataset of DILI-negative compounds. This suggests that DILI-negative compounds exhibit greater variation in terms of side chains and linkers connecting major functional groups than the DILI-positive counterparts. Based on the above findings, it can be inferred that the Murcko framework, Level 1 and Level 2 scaffolds serve as representative substructures for effectively capturing the comprehensive structural features of the investigated molecules.
Table 4
The number of fragments with Murcko frameworks, ring assemblies, rings, bridge assemblies and chain assemblies present in 1141 DILI-positive and 1004 DILI-negative compounds

| Scaffold architecture | No. of scaffolds (DILI-positive) | No. of scaffolds (DILI-negative) | No. of non-duplicated scaffolds (DILI-positive) | No. of non-duplicated scaffolds (DILI-negative) |
|---|---|---|---|---|
| Murcko framework | 1056 (92.55%) | 880 (87.65%) | 680 (59.60%) | 558 (55.58%) |
| Rings | 3137 (2.75) | 2569 (2.56) | 206 (18.05%) | 168 (16.73%) |
| Ring assemblies | 2086 (1.83) | 1626 (1.62) | 353 (30.94%) | 301 (29.98%) |
| Bridge assemblies | 52 (4.56%) | 51 (5.08%) | 35 (3.07%) | 32 (3.19%) |
| Chain assemblies | 8406 (7.37) | 7173 (7.14) | 668 (58.55%) | 693 (69.02%) |
Here, we conducted a statistical analysis of the cumulative scaffold frequencies (CSFs) of the Murcko frameworks, Level 1, and Level 2 substructures of DILI-positive/negative compounds. The results, obtained by sorting the representative scaffolds by frequency, are presented in Figs. 3 and 4. As depicted in Fig. 3a, with an increasing number of distinct Murcko frameworks, the cumulative frequency of DILI-negative compounds grew slightly faster than that of DILI-positive compounds. This finding can be explained by the fact that there were fewer DILI-negative compounds in our dataset than DILI-positive compounds, which resulted in a reduced variety of Murcko frameworks. The top ten most frequently occurring Murcko frameworks in the DILI-positive and DILI-negative datasets are displayed in Fig. 3b. The analysis of representative Murcko frameworks from both datasets revealed that certain frameworks, such as benzene (81) and diphenylmethane (10), were more prevalent in DILI-positive compounds but also appeared in DILI-negative compounds. Furthermore, some basic fragments like the pyridine ring (11) and quinoline ring (9) are common across a wide range of compound classes, suggesting their limited ability to distinguish between DILI-positive and DILI-negative compounds. However, there are also specific fragments with high frequencies observed only in DILI-positive compounds, including N-(5-thio-1-diazabicyclo[4.2.0]oct-2-en-7-yl)-2-(thiazol-4-yl)acetamide (10), 4-phenyl-3,4-dihydropyridine (8), and benzophenone (7), indicating their potential as structural warning fragments.
The CSF plots of the Level 1 substructure fragments were also calculated and are depicted in Fig. 4a. The curve for the DILI-positive compounds initially trails that for the DILI-negative compounds but later overtakes it, suggesting a relatively uniform distribution of diverse Level 1 substructures within the dataset of DILI-positive compounds. As depicted in Fig. 4b, we identified several basic fragments, alongside others, that are present in both DILI-negative and DILI-positive compounds. Moreover, several distinctive frameworks captured our attention. The presence of 1,2,3,4,4a,5,6,8a-octahydronaphthalene (12) and 3-methylene-4-phenyl-3,4-dihydropyridine (12) as characteristic fragments strongly suggests their potential hepatotoxicity.
The CSF curves of the Level 2 scaffolds revealed consistent growth trends for both DILI-positive and DILI-negative compounds as the number of Level 2 substructures increased, indicating a uniform dispersion of these substructures in both compound types at the Level 2 scaffold level (Fig. 4c). However, notable disparities exist in the occurrence of certain substructures (Fig. 4d). The scaffolds N-(8-oxo-5-thio-1-diazabicyclo[4.2.0]oct-2-en-7-yl)-2-(thiazol-4-yl)acetamide (15) and 7-(piperazin-1-yl)-1,4-dihydroquinolin-4-one (8) are frequently observed in DILI-positive compounds but rarely found in compounds without DILI risk, suggesting their potential as structural warning fragments. Other fragments are less informative because of their low occurrence in DILI-positive compounds or their multiple appearances in DILI-negative compounds.
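The CSF curves discussed in this subsection can be illustrated with a minimal sketch, assuming the usual construction: scaffolds are ranked by how many molecules contain them, and the cumulative percentage of molecules covered is accumulated down the ranking. The function name `cumulative_scaffold_frequencies` and the toy scaffold labels are hypothetical and do not reproduce the counts of this study.

```python
from collections import Counter


def cumulative_scaffold_frequencies(scaffolds):
    """Cumulative % of molecules covered by the k most frequent scaffolds, k = 1..n.

    `scaffolds` holds one scaffold identifier (e.g. a canonical SMILES) per molecule.
    """
    counts = Counter(scaffolds)
    total = sum(counts.values())
    running = 0
    curve = []
    for _, n_molecules in counts.most_common():  # most frequent scaffold first
        running += n_molecules
        curve.append(100.0 * running / total)
    return curve


# Hypothetical scaffold assignments for ten molecules.
toy_scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["quinoline", "indole"]
csf_curve = cumulative_scaffold_frequencies(toy_scaffolds)
print(csf_curve)  # [50.0, 80.0, 90.0, 100.0]
```

A curve that rises steeply at first indicates that a few scaffolds account for most molecules, whereas a curve close to the diagonal indicates a more even spread of molecules across scaffolds.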
3.2.2. Tree Maps. Based on the aforementioned analysis, we obtained a comprehensive understanding of the distribution pattern of molecular substructures within the DILI-positive/negative compound datasets. However, the distribution of substructure fragment similarities among different compound categories has not been fully elucidated. To bridge this gap in knowledge, we employed Tree Maps to investigate and analyze the similarity of distribution patterns exhibited by substructure fragments, which belong to two distinct compound categories. Specifically, we employed a clustering analysis based on ECFP_6 molecular fingerprint similarity, which incorporated representative structures, including Murcko frameworks and Level 1 scaffolds from DILI-positive/negative categories. Finally, our clustering results were visually represented through the Tree Maps strategy (Figs. 5 and 6).
In the Tree Maps, each distinct structural category is depicted by a gray circle, where the circle’s size reflects the quantity of substructure fragments it represents. The Tree Maps of each compound category highlight the six largest gray circles, signifying the predominant Murcko frameworks or Level 1 scaffolds of the Scaffold Tree. Additionally, the number of gray circles in the Tree Maps for the DILI-positive compounds was slightly greater than that for the DILI-negative compounds, indicating a slightly greater diversity of structures in the DILI-positive compounds. Finally, we observed variations in the most frequently occurring scaffolds within clusters for DILI-positive and DILI-negative compounds. As depicted in Fig. 5, specific substructures of the Murcko framework, such as 6-phenyl-2,3,5,6-tetrahydroimidazo[2,1-b]thiazole and phenyl 2-phenoxyacetate, exhibit distinct characteristics compared to DILI-negative fragments. It is important to note, nevertheless, that certain DILI-negative fragments may be considered modifications of, or additions to, comparable DILI-positive fragments. For example, the presence of a 1-cyclopropylethan-1-one moiety on the methylene group relative to 1-benzylisoquinoline in the DILI-positive fragment 1-cyclopropyl-2-(6,7-dihydrothieno[3,2-c]pyridin-5(4H)-yl)-2-phenylethan-1-one could offer novel insights for mitigating DILI risk. Additionally, as a common thiophene ring was observed in both DILI-positive and DILI-negative compounds, its association with DILI remains inconclusive. In Fig. 6, the three Level 1 scaffold categories of DILI-positive compounds are represented by an octahydro-1H-indole isocyclic ring, a 2H-indazole ring, and a 1,3,3a,7a-tetrahydrobenzo[c][1,2,5]oxadiazole ring. Similarly, the three Level 1 scaffold categories of DILI-negative compounds also include bicyclic rings with similar structural patterns, such as an indole ring, a tetrahydroisoquinoline ring, and an octahydronaphthalene ring. The only notable differences lie in slight modifications to their unilateral rings. Consequently, these Level 1 scaffold architectures of DILI-negative compounds have the potential to function as bioisosteres for reducing DILI risk. Additionally, it was observed that introducing oxygen atoms into 5-cyclopentylpent-4-en-1-ylbenzene attenuated its potential for causing DILI.
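The fingerprint-similarity clustering underlying the Tree Maps can be sketched as follows. This section does not state which clustering algorithm or similarity cutoff was used, so the greedy leader clustering and the 0.5 threshold below are assumptions for illustration; fingerprints are represented simply as sets of integer feature identifiers rather than real ECFP_6 features.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as sets of feature IDs."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def leader_cluster(fingerprints, threshold=0.5):
    """Greedy leader clustering (assumed here; not necessarily the method of the study).

    Each fingerprint joins the first cluster whose leader it resembles with
    similarity >= threshold; otherwise it starts a new cluster as its leader.
    Returns a list of (leader_index, member_indices) tuples.
    """
    clusters = []
    for i, fp in enumerate(fingerprints):
        for leader, members in clusters:
            if tanimoto(fingerprints[leader], fp) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((i, [i]))
    return clusters


# Hypothetical feature sets standing in for scaffold fingerprints.
toy_fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {7, 8, 10}, {20}]
toy_clusters = leader_cluster(toy_fingerprints)
cluster_sizes = [len(members) for _, members in toy_clusters]
print(toy_clusters)   # [(0, [0, 1]), (2, [2, 3]), (4, [4])]
print(cluster_sizes)  # [2, 2, 1]
```

In a Tree Map, each cluster would correspond to one circle whose size scales with the number of members.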
3.3. DILI prediction models based on the naïve Bayes classifier. Previous sections have shown that neither individual physicochemical descriptors nor molecular substructures can serve as effective classification criteria for predicting DILI. Hence, we employed the naïve Bayes classifier (NBC) machine learning approach to develop more robust DILI classification models. We partitioned the entire dataset of 2145 compounds (1141 DILI-positive and 1004 DILI-negative) into two distinct subsets: a training set and a test set. The training set comprised 80% of the original dataset, totaling 1717 compounds (913 DILI-positive and 804 DILI-negative), while the remaining 20%, totaling 428 compounds (228 DILI-positive and 200 DILI-negative), formed the test set. Each compound in the dataset was labeled according to its association with DILI. To identify the most suitable molecular features for accurate prediction of DILI, we constructed naïve Bayes models using 21 physicochemical properties, 76 VolSurf descriptors, and different molecular fingerprints (48 combinations in total). These molecular fingerprint descriptors included ECFC, ECFP, EPFC, EPFP, FCFC, FCFP, FPFC, FPFP, LCFC, LCFP, LPFC, and LPFP, each with a maximum diameter of 4, 6, 8, or 10 bond lengths. The statistical results obtained from the leave-one-out (LOO) cross-validation procedure of NBC are summarized in Table 5 for the training set and in Table 6 for the test set.
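The class-stratified partition described above can be sketched as follows. How the compounds were actually assigned to the two subsets is not stated in this section; the sketch assumes a random split within each class that holds out one fifth of the class, rounded down, which happens to reproduce the reported subset sizes. The function name `stratified_split` and the seed are illustrative.

```python
import random


def stratified_split(labels, test_share=5, seed=42):
    """Hold out 1/test_share of each class (rounded down) as the test set.

    Returns two sorted lists of indices: (train_indices, test_indices).
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = len(indices) // test_share
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# 1 = DILI-positive, 0 = DILI-negative; class sizes taken from the text.
labels = [1] * 1141 + [0] * 1004
train_idx, test_idx = stratified_split(labels)

n_train_pos = sum(labels[i] for i in train_idx)
n_test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), n_train_pos, len(train_idx) - n_train_pos)  # 1717 913 804
print(len(test_idx), n_test_pos, len(test_idx) - n_test_pos)      # 428 228 200
```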
Table 5
Performance of the naïve Bayesian classifiers for the training set based on different combinations of molecular descriptors

| Descriptors | TP | FN | FP | TN | SE | SP | Q+ | Q- | GA | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| MPᵃ | 499 | 414 | 237 | 567 | 0.547 | 0.705 | 0.678 | 0.578 | 0.621 | 0.621 |
| Vᵇ | 527 | 386 | 241 | 563 | 0.577 | 0.700 | 0.686 | 0.593 | 0.635 | 0.629 |
| MP + V | 460 | 453 | 165 | 639 | 0.504 | 0.795 | 0.736 | 0.585 | 0.640 | 0.638 |
| MP + V + ECFC_4 | 699 | 214 | 161 | 643 | 0.766 | 0.800 | 0.813 | 0.750 | 0.782 | 0.714 |
| MP + V + ECFC_6 | 722 | 191 | 98 | 706 | 0.791 | 0.878 | 0.880 | 0.787 | 0.832 | 0.716 |
| MP + V + ECFC_8 | 744 | 169 | 91 | 713 | 0.815 | 0.887 | 0.891 | 0.808 | 0.849 | 0.713 |
| MP + V + ECFC_10 | 735 | 178 | 79 | 725 | 0.805 | 0.902 | 0.903 | 0.803 | 0.850 | 0.710 |
| MP + V + ECFP_4 | 619 | 294 | 87 | 717 | 0.678 | 0.892 | 0.877 | 0.709 | 0.778 | 0.717 |
| MP + V + ECFP_6 | 734 | 179 | 106 | 698 | 0.804 | 0.868 | 0.874 | 0.796 | 0.834 | 0.717 |
| MP + V + ECFP_8 | 712 | 201 | 69 | 735 | 0.780 | 0.914 | 0.912 | 0.785 | 0.843 | 0.712 |
| MP + V + ECFP_10 | 719 | 194 | 68 | 736 | 0.788 | 0.915 | 0.914 | 0.791 | 0.847 | 0.709 |
| MP + V + EPFC_4 | 639 | 274 | 257 | 547 | 0.700 | 0.680 | 0.713 | 0.666 | 0.691 | 0.663 |
| MP + V + EPFC_6 | 648 | 265 | 175 | 629 | 0.710 | 0.782 | 0.787 | 0.704 | 0.744 | 0.671 |
| MP + V + EPFC_8 | 661 | 252 | 131 | 673 | 0.724 | 0.837 | 0.835 | 0.728 | 0.777 | 0.672 |
| MP + V + EPFC_10 | 684 | 229 | 122 | 682 | 0.749 | 0.848 | 0.849 | 0.749 | 0.796 | 0.669 |
| MP + V + EPFP_4 | 623 | 290 | 219 | 585 | 0.682 | 0.728 | 0.740 | 0.669 | 0.704 | 0.684 |
| MP + V + EPFP_6 | 577 | 336 | 109 | 695 | 0.632 | 0.864 | 0.841 | 0.674 | 0.741 | 0.680 |
| MP + V + EPFP_8 | 598 | 315 | 92 | 712 | 0.655 | 0.886 | 0.867 | 0.693 | 0.763 | 0.673 |
| MP + V + EPFP_10 | 615 | 298 | 85 | 719 | 0.674 | 0.894 | 0.879 | 0.707 | 0.777 | 0.670 |
| MP + V + FCFC_4 | 576 | 337 | 153 | 651 | 0.631 | 0.810 | 0.790 | 0.659 | 0.715 | 0.697 |
| MP + V + FCFC_6 | 738 | 175 | 164 | 640 | 0.808 | 0.796 | 0.818 | 0.785 | 0.803 | 0.712 |
| MP + V + FCFC_8 | 641 | 272 | 69 | 735 | 0.702 | 0.914 | 0.903 | 0.730 | 0.801 | 0.712 |
| MP + V + FCFC_10 | 653 | 260 | 58 | 746 | 0.715 | 0.928 | 0.918 | 0.742 | 0.815 | 0.711 |
| MP + V + FCFP_4 | 602 | 311 | 147 | 657 | 0.659 | 0.817 | 0.804 | 0.679 | 0.733 | 0.712 |
| MP + V + FCFP_6 | 636 | 277 | 80 | 724 | 0.697 | 0.900 | 0.888 | 0.723 | 0.792 | 0.722 |
| MP + V + FCFP_8 | 662 | 251 | 64 | 740 | 0.725 | 0.920 | 0.912 | 0.747 | 0.817 | 0.719 |
| MP + V + FCFP_10 | 675 | 238 | 62 | 742 | 0.739 | 0.923 | 0.916 | 0.757 | 0.825 | 0.716 |
| MP + V + FPFC_4 | 525 | 388 | 164 | 640 | 0.575 | 0.796 | 0.762 | 0.623 | 0.679 | 0.661 |
| MP + V + FPFC_6 | 619 | 294 | 166 | 638 | 0.678 | 0.794 | 0.789 | 0.685 | 0.732 | 0.673 |
| MP + V + FPFC_8 | 636 | 277 | 130 | 674 | 0.697 | 0.838 | 0.830 | 0.709 | 0.763 | 0.679 |
| MP + V + FPFC_10 | 650 | 263 | 120 | 684 | 0.712 | 0.851 | 0.844 | 0.722 | 0.777 | 0.679 |
| MP + V + FPFP_4 | 626 | 287 | 204 | 600 | 0.686 | 0.746 | 0.754 | 0.676 | 0.714 | 0.688 |
| MP + V + FPFP_6 | 568 | 345 | 97 | 707 | 0.622 | 0.879 | 0.854 | 0.672 | 0.743 | 0.691 |
| MP + V + FPFP_8 | 580 | 333 | 87 | 717 | 0.635 | 0.892 | 0.870 | 0.683 | 0.755 | 0.688 |
| MP + V + FPFP_10 | 599 | 314 | 74 | 730 | 0.656 | 0.908 | 0.890 | 0.699 | 0.774 | 0.686 |
| MP + V + LCFC_4 | 779 | 134 | 211 | 593 | 0.853 | 0.738 | 0.787 | 0.816 | 0.799 | 0.711 |
| MP + V + LCFC_6 | 720 | 193 | 96 | 708 | 0.789 | 0.881 | 0.882 | 0.786 | 0.832 | 0.710 |
| MP + V + LCFC_8 | 738 | 175 | 80 | 724 | 0.808 | 0.900 | 0.902 | 0.805 | 0.851 | 0.707 |
| MP + V + LCFC_10 | 750 | 163 | 81 | 723 | 0.821 | 0.899 | 0.903 | 0.816 | 0.858 | 0.705 |
| MP + V + LCFP_4 | 647 | 266 | 99 | 705 | 0.709 | 0.877 | 0.867 | 0.726 | 0.787 | 0.715 |
| MP + V + LCFP_6 | 725 | 188 | 93 | 711 | 0.794 | 0.884 | 0.886 | 0.791 | 0.836 | 0.712 |
| MP + V + LCFP_8 | 721 | 192 | 71 | 733 | 0.790 | 0.912 | 0.910 | 0.792 | 0.847 | 0.707 |
| MP + V + LCFP_10 | 731 | 182 | 67 | 737 | 0.801 | 0.917 | 0.916 | 0.802 | 0.855 | 0.704 |
| MP + V + LPFC_4 | 687 | 226 | 92 | 712 | 0.752 | 0.886 | 0.882 | 0.759 | 0.815 | 0.701 |
| MP + V + LPFC_6 | 739 | 174 | 90 | 714 | 0.809 | 0.888 | 0.891 | 0.804 | 0.846 | 0.694 |
| MP + V + LPFC_8 | 765 | 148 | 88 | 716 | 0.838 | 0.891 | 0.897 | 0.829 | 0.863 | 0.682 |
| MP + V + LPFC_10 | 776 | 137 | 101 | 703 | 0.850 | 0.874 | 0.885 | 0.837 | 0.861 | 0.672 |
| MP + V + LPFP_4 | 656 | 257 | 83 | 721 | 0.719 | 0.897 | 0.888 | 0.737 | 0.802 | 0.701 |
| MP + V + LPFP_6 | 712 | 201 | 83 | 721 | 0.780 | 0.897 | 0.896 | 0.782 | 0.835 | 0.696 |
| MP + V + LPFP_8 | 746 | 167 | 91 | 713 | 0.817 | 0.887 | 0.891 | 0.810 | 0.850 | 0.686 |
| MP + V + LPFP_10 | 764 | 149 | 94 | 710 | 0.837 | 0.883 | 0.890 | 0.827 | 0.858 | 0.677 |

ᵃ MP represents 21 molecular physicochemical properties; ᵇ V represents 76 VolSurf descriptors.
Table 6
Performance of the naïve Bayesian classifiers for the test set based on different combinations of molecular descriptors

| Descriptors | TP | FN | FP | TN | SE | SP | Q+ | Q- | GA | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| MPᵃ | 98 | 130 | 73 | 127 | 0.430 | 0.635 | 0.573 | 0.494 | 0.526 | 0.548 |
| Vᵇ | 126 | 102 | 89 | 111 | 0.553 | 0.555 | 0.586 | 0.521 | 0.554 | 0.562 |
| MP + V | 122 | 106 | 86 | 114 | 0.535 | 0.570 | 0.587 | 0.518 | 0.551 | 0.563 |
| MP + V + ECFC_4 | 135 | 93 | 78 | 122 | 0.592 | 0.610 | 0.634 | 0.567 | 0.600 | 0.657 |
| MP + V + ECFC_6 | 135 | 93 | 74 | 126 | 0.592 | 0.630 | 0.646 | 0.575 | 0.610 | 0.666 |
| MP + V + ECFC_8 | 136 | 92 | 74 | 126 | 0.596 | 0.630 | 0.648 | 0.578 | 0.612 | 0.671 |
| MP + V + ECFC_10 | 138 | 90 | 74 | 126 | 0.605 | 0.630 | 0.651 | 0.583 | 0.617 | 0.673 |
| MP + V + ECFP_4 | 144 | 84 | 84 | 116 | 0.632 | 0.580 | 0.632 | 0.580 | 0.607 | 0.653 |
| MP + V + ECFP_6 | 137 | 91 | 77 | 123 | 0.601 | 0.615 | 0.640 | 0.575 | 0.607 | 0.663 |
| MP + V + ECFP_8 | 141 | 87 | 75 | 125 | 0.618 | 0.625 | 0.653 | 0.590 | 0.621 | 0.668 |
| MP + V + ECFP_10 | 141 | 87 | 75 | 125 | 0.618 | 0.625 | 0.653 | 0.590 | 0.621 | 0.670 |
| MP + V + EPFC_4 | 89 | 139 | 50 | 150 | 0.390 | 0.750 | 0.640 | 0.519 | 0.558 | 0.613 |
| MP + V + EPFC_6 | 95 | 133 | 46 | 154 | 0.417 | 0.770 | 0.674 | 0.537 | 0.582 | 0.625 |
| MP + V + EPFC_8 | 95 | 133 | 40 | 160 | 0.417 | 0.800 | 0.704 | 0.546 | 0.596 | 0.635 |
| MP + V + EPFC_10 | 100 | 128 | 41 | 159 | 0.439 | 0.795 | 0.709 | 0.554 | 0.605 | 0.641 |
| MP + V + EPFP_4 | 103 | 125 | 59 | 141 | 0.452 | 0.705 | 0.636 | 0.530 | 0.570 | 0.610 |
| MP + V + EPFP_6 | 90 | 138 | 45 | 155 | 0.395 | 0.775 | 0.667 | 0.529 | 0.572 | 0.624 |
| MP + V + EPFP_8 | 84 | 144 | 38 | 162 | 0.368 | 0.810 | 0.689 | 0.529 | 0.575 | 0.633 |
| MP + V + EPFP_10 | 88 | 140 | 39 | 161 | 0.386 | 0.805 | 0.693 | 0.535 | 0.582 | 0.638 |
| MP + V + FCFC_4 | 113 | 115 | 64 | 136 | 0.496 | 0.680 | 0.638 | 0.542 | 0.582 | 0.631 |
| MP + V + FCFC_6 | 115 | 113 | 60 | 140 | 0.504 | 0.700 | 0.657 | 0.553 | 0.596 | 0.649 |
| MP + V + FCFC_8 | 120 | 108 | 59 | 141 | 0.526 | 0.705 | 0.670 | 0.566 | 0.610 | 0.658 |
| MP + V + FCFC_10 | 118 | 110 | 57 | 143 | 0.518 | 0.715 | 0.674 | 0.565 | 0.610 | 0.663 |
| MP + V + FCFP_4 | 125 | 103 | 71 | 129 | 0.548 | 0.645 | 0.638 | 0.556 | 0.593 | 0.637 |
| MP + V + FCFP_6 | 135 | 93 | 69 | 131 | 0.592 | 0.655 | 0.662 | 0.585 | 0.621 | 0.655 |
| MP + V + FCFP_8 | 111 | 117 | 57 | 143 | 0.487 | 0.715 | 0.661 | 0.550 | 0.593 | 0.664 |
| MP + V + FCFP_10 | 113 | 115 | 55 | 145 | 0.496 | 0.725 | 0.673 | 0.558 | 0.603 | 0.668 |
| MP + V + FPFC_4 | 85 | 143 | 48 | 152 | 0.373 | 0.760 | 0.639 | 0.515 | 0.554 | 0.606 |
| MP + V + FPFC_6 | 90 | 138 | 47 | 153 | 0.395 | 0.765 | 0.657 | 0.526 | 0.568 | 0.615 |
| MP + V + FPFC_8 | 94 | 134 | 50 | 150 | 0.412 | 0.750 | 0.653 | 0.528 | 0.570 | 0.624 |
| MP + V + FPFC_10 | 102 | 126 | 52 | 148 | 0.447 | 0.740 | 0.662 | 0.540 | 0.584 | 0.630 |
| MP + V + FPFP_4 | 101 | 127 | 59 | 141 | 0.443 | 0.705 | 0.631 | 0.526 | 0.565 | 0.615 |
| MP + V + FPFP_6 | 86 | 142 | 50 | 150 | 0.377 | 0.750 | 0.632 | 0.514 | 0.551 | 0.620 |
| MP + V + FPFP_8 | 87 | 141 | 49 | 151 | 0.382 | 0.755 | 0.640 | 0.517 | 0.556 | 0.624 |
| MP + V + FPFP_10 | 87 | 141 | 49 | 151 | 0.382 | 0.755 | 0.640 | 0.517 | 0.556 | 0.627 |
| MP + V + LCFC_4 | 118 | 110 | 56 | 144 | 0.518 | 0.720 | 0.678 | 0.567 | 0.612 | 0.659 |
| MP + V + LCFC_6 | 123 | 105 | 61 | 139 | 0.539 | 0.695 | 0.668 | 0.570 | 0.612 | 0.665 |
| MP + V + LCFC_8 | 122 | 106 | 55 | 145 | 0.535 | 0.725 | 0.689 | 0.578 | 0.624 | 0.669 |
| MP + V + LCFC_10 | 127 | 101 | 60 | 140 | 0.557 | 0.700 | 0.679 | 0.581 | 0.624 | 0.671 |
| MP + V + LCFP_4 | 125 | 103 | 63 | 137 | 0.548 | 0.685 | 0.665 | 0.571 | 0.612 | 0.658 |
| MP + V + LCFP_6 | 133 | 95 | 74 | 126 | 0.583 | 0.630 | 0.643 | 0.570 | 0.605 | 0.666 |
| MP + V + LCFP_8 | 137 | 91 | 76 | 124 | 0.601 | 0.620 | 0.643 | 0.577 | 0.610 | 0.671 |
| MP + V + LCFP_10 | 133 | 95 | 68 | 132 | 0.583 | 0.660 | 0.662 | 0.581 | 0.619 | 0.674 |
| MP + V + LPFC_4 | 121 | 107 | 51 | 149 | 0.531 | 0.745 | 0.703 | 0.582 | 0.631 | 0.668 |
| MP + V + LPFC_6 | 116 | 112 | 48 | 152 | 0.509 | 0.760 | 0.707 | 0.576 | 0.626 | 0.671 |
| MP + V + LPFC_8 | 121 | 107 | 51 | 149 | 0.531 | 0.745 | 0.703 | 0.582 | 0.631 | 0.672 |
| MP + V + LPFC_10 | 125 | 103 | 53 | 147 | 0.548 | 0.735 | 0.702 | 0.588 | 0.636 | 0.672 |
| MP + V + LPFP_4 | 109 | 119 | 46 | 154 | 0.478 | 0.770 | 0.703 | 0.564 | 0.614 | 0.663 |
| MP + V + LPFP_6 | 115 | 113 | 52 | 148 | 0.504 | 0.740 | 0.689 | 0.567 | 0.614 | 0.668 |
| MP + V + LPFP_8 | 126 | 102 | 58 | 142 | 0.553 | 0.710 | 0.685 | 0.582 | 0.626 | 0.671 |
| MP + V + LPFP_10 | 122 | 106 | 58 | 142 | 0.535 | 0.710 | 0.678 | 0.573 | 0.617 | 0.671 |

ᵃ MP represents 21 molecular physicochemical properties; ᵇ V represents 76 VolSurf descriptors.
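The derived columns of Tables 5 and 6 follow from the confusion-matrix counts. The sketch below assumes the standard definitions of sensitivity (SE), specificity (SP), positive and negative prediction accuracy (Q+, Q-), and global accuracy (GA); these definitions are not spelled out in this section, but they reproduce the reported test-set row for MP + V + LCFP_10. The function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive SE, SP, Q+, Q- and GA from confusion-matrix counts."""
    return {
        "SE": tp / (tp + fn),                    # sensitivity: recall on DILI-positive
        "SP": tn / (tn + fp),                    # specificity: recall on DILI-negative
        "Q+": tp / (tp + fp),                    # precision of positive predictions
        "Q-": tn / (tn + fn),                    # precision of negative predictions
        "GA": (tp + tn) / (tp + fn + fp + tn),   # global accuracy
    }


# Counts for MP + V + LCFP_10 on the test set (Table 6).
lcfp10_test = classification_metrics(tp=133, fn=95, fp=68, tn=132)
lcfp10_rounded = {name: round(value, 3) for name, value in lcfp10_test.items()}
print(lcfp10_rounded)
# {'SE': 0.583, 'SP': 0.66, 'Q+': 0.662, 'Q-': 0.581, 'GA': 0.619}
```

The AUC cannot be recovered from these four counts, because it depends on the ranking of the predicted scores across all decision thresholds.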
According to the statistical results for the training set (Table 5), the Bayesian classifiers based solely on the 21 molecular properties, the VolSurf descriptors, or a combination of both failed to achieve satisfactory discrimination, as indicated by their relatively low GA and AUC values. The corresponding global accuracies were only 0.621, 0.635, and 0.640, respectively, while the AUC values were 0.621, 0.629, and 0.638, respectively. The prediction accuracy of the NBC models for DILI improved markedly when any type of molecular fingerprint was added, which highlights the importance of molecular fingerprint descriptors in the construction of the naïve Bayesian classification model.
The performance of the Bayesian classifier was subsequently assessed on an independent test set. Among the various combinations of molecular descriptors, the model that integrated the LCFP_10 molecular fingerprint with the 21 physicochemical properties and VolSurf descriptors achieved the highest AUC. It achieved a sensitivity of 0.583, a specificity of 0.660, a global accuracy of 0.619, and an AUC value of 0.674. Notably, the model's performance on the test set was noticeably inferior to its performance on the training set. Furthermore, our model exhibited a slightly better predictive performance than the model of Kotsampasakou et al. [21], which attained a maximum prediction accuracy of only 0.60 and an AUC value of 0.64 on the external test set. Additionally, when the molecular physicochemical properties and VolSurf descriptors were combined with other molecular fingerprints such as ECFC_8, ECFC_10, ECFP_10, LCFC_10, LCFP_8, LPFC_6, LPFC_8, LPFC_10, LPFP_8, and LPFP_10, the resulting models yielded AUC values similar to that of our best-performing model. This observation implies that using larger fingerprint diameters when constructing NBC models for DILI prediction could offer certain advantages.
In addition to its application in constructing classification models for DILI prediction, the Bayesian classifier can also employ Bayesian scoring to identify molecular substructures that are strongly or weakly associated with DILI. For pharmaceutical chemists, these key substructures can serve as valuable references during drug design and development, aiding in the prevention and mitigation of hepatotoxicity. Figure 7 shows the top 20 molecular substructures most strongly associated with DILI (Fig. 7a) and the top 20 molecular substructures least associated with DILI (Fig. 7b), ranked by their Bayesian scores.
No identical substructures appear in both Fig. 7a and Fig. 7b. Based on the results, compounds with multi-substituted pyridine-like features are frequently associated with DILI-positive warnings, while DILI-negative compounds do not exhibit such structural characteristics. Although the pyridine ring is generally considered a universal functional group in medicinal chemistry, its multiple substitution may increase the risk of DILI. Additionally, this study identified novel fragments related to DILI, such as methyl 2-methyl-2-phenoxypropanoate and its corresponding substructures, which strongly correlate with the occurrence of DILI. The presence of a 4-chlorophenyl ether fragment should be avoided in drug design due to its association with hepatotoxicity. Furthermore, comparing Figs. 7a and 7b reveals that symmetrical alkane fragments containing nitrogen or oxygen atoms are frequently observed in DILI-positive compounds; however, removing these heteroatoms reduces their potential for causing liver injury. The occurrence of nonhepatotoxicity-related fragments (Fig. 7b) also suggests that long-chain alkenes or long-chain acids containing alkenes, certain benzoates, and common heterocycles such as imidazole, pyrrole, and indole can be considered structurally safe for drug design purposes. Nevertheless, it is worth noting that methyl substituents on five-membered heterocycles may increase the risk of DILI. These findings may have significant implications for the identification of more promising drug candidates with a reduced risk of DILI in drug development.
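The ranking of substructures by Bayesian score can be illustrated with a minimal sketch. The exact scoring formula is not given in this section; the sketch assumes the Laplacian-corrected estimator commonly described for naïve Bayes models built on structural fingerprints, in which a feature's weight is the logarithm of (A + 1) / (T × P + 1), where A is the number of DILI-positive training compounds containing the feature, T is the number of training compounds containing it, and P is the overall DILI-positive fraction. The function name `laplacian_score` and the feature counts are hypothetical.

```python
import math


def laplacian_score(n_positive_with_feature, n_with_feature, base_rate):
    """Laplacian-corrected log-weight of one fingerprint feature (assumed formula).

    Positive scores indicate enrichment among DILI-positive compounds, negative
    scores indicate depletion, and rarely seen features are pulled towards zero.
    """
    return math.log((n_positive_with_feature + 1) / (n_with_feature * base_rate + 1))


# DILI-positive fraction of the training set (913 of 1717 compounds).
base_rate = 913 / 1717

# Hypothetical feature counts, for illustration only.
enriched = laplacian_score(18, 20, base_rate)   # mostly seen in DILI-positive compounds
depleted = laplacian_score(2, 20, base_rate)    # mostly seen in DILI-negative compounds
unseen = laplacian_score(0, 0, base_rate)       # never observed: weight of exactly zero

print(f"enriched: {enriched:+.3f}, depleted: {depleted:+.3f}, unseen: {unseen:+.3f}")
```

Under this assumption, a compound's overall score would be the sum of the weights of the features it contains, and sorting features by weight yields rankings of the kind shown in Fig. 7.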
3.4. DILI prediction models based on recursive partitioning. Compared to the naïve Bayesian technique, recursive partitioning (RP) can provide simpler and more comprehensible classification models characterized by a series of hierarchical rules. Consequently, RP was employed to construct decision trees for discriminating DILI-positive compounds from DILI-negative compounds, with the aim of comparing the performance of the Bayesian and RP classifiers on identical molecular descriptors and datasets. The complexity of a decision tree in the RP analysis was regulated by tuning a crucial parameter, the tree depth, based on predictions observed on the test set. The tree depth was varied from 5 to 15, and the corresponding models were assessed for their classification accuracies. A series of RP models was generated using Discovery Studio software, and the model with the best predictive performance was chosen (Fig. 8). The optimal classification model, trained with a tree depth of 12, exhibited a cross-validation AUC of 0.619 on the training set and an AUC of 0.569 on the test set. The NBC therefore clearly outperforms the RP classifier when identical molecular descriptors and datasets are used (test-set AUC of 0.674 versus 0.569).
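The role of the tree depth in recursive partitioning can be illustrated with a small, self-contained sketch. It does not reproduce the Discovery Studio RP implementation; it assumes a generic greedy, CART-style procedure that selects the split minimizing the weighted Gini impurity, and the toy data and the names `build_tree`, `predict`, and `accuracy` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels, max_depth):
    """Recursively partition the data; a leaf stores the majority class label."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    return (feature, threshold,
            build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
            build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1))


def predict(tree, row):
    """Follow the hierarchical rules down to a leaf and return its class label."""
    while isinstance(tree, tuple):
        feature, threshold, low_branch, high_branch = tree
        tree = low_branch if row[feature] <= threshold else high_branch
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


# Toy data: the class is 1 only when both descriptor values are large.
toy_rows = [(1, 1), (1, 3), (3, 1), (3, 3), (4, 4), (1, 4), (4, 1)]
toy_labels = [0, 0, 0, 1, 1, 0, 0]

depth_scores = {depth: accuracy(build_tree(toy_rows, toy_labels, depth), toy_rows, toy_labels)
                for depth in range(4)}
print(depth_scores)
```

On this toy example, a depth of 2 is needed before both descriptors can enter the rules; the same kind of sweep over candidate depths underlies the depth selection described above.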