Target selections
The list of targets constituting ProfhEX has been taken from the study of Bowes et al. [11], where the Authors compiled a list of “minimal panel of targets” that normally go through in-vitro testing for liability profiling by world-leading pharmaceutical companies. Relevant targets have been selected based on probability of a hit at the target compared to the magnitude of the impact of this hit. For instance, hERG and muscarinic receptors are classified as a high rate/high impact targets.
Table 1 reports the list of selected targets and Fig. 2 depicts their protein family classification: most of them are membrane receptors from the GPCR (G protein-coupled receptors) superfamily (25), followed by enzymes (8 members), transcription factors (6) ion channels (4) and transporters (3), for a total of 46 targets. The majority of the selected targets is involved in the prediction of cardiovascular, central nervous system and gastrointestinal side-effect. Additionally, several targets are relevant for more than one liability such as the dopamine receptors (DRD1 and DRD2) whose activation could lead to cardiovascular, nervous, and immune system adverse effects.
Table 1
Target-liability reference. CV = cardiovascular, CNS = central nervous system, GI = gastrointestinal, ED = endocrine disruption, PU = pulmonary, RE = renal, IM = immune. In brackets, the number of targets is reported.
Liability
|
Targets
|
CV (25)
|
ACHE, ADORA2A, ADRA1A, ADRA2A, ADRB2, AVPR1A, CHRM1, DRD1, DRD2, HRH1, HRH2, HTR1B, HTR2A, HTR2B, KCNH2, MAOA, OPRD1, OPRK1, OPRM1, PDE3A, PTGS2, SCN5A, SLC6A2, SLC6A4
|
CNS (19)
|
ADORA2A, ADRA1A, ADRA2A, CHRM1, CNR1, DRD1, DRD2, EDNRA, HTR1A, HTR1B, HTR2A, MAOA, OPRD1, OPRK1, OPRM1, PDE4D, SLC6A2, SLC6A3, SLC6A4
|
GI (13)
|
ACHE, ADRA1A, ADRB1, CCKAR, CHRM1, CHRM3, HRH2, HTR3A, OPRK1, OPRM1, PPARA, PPARD, PTGS1
|
ED (7)
|
AR, DRD2, EDNRA, ESR1, HTR1A, HTR3A, NR3C1
|
PU (5)
|
ACHE, ADRB2, CHRM3, HTR2B, PTGS1
|
RE (2)
|
AVPR1A, PTGS1
|
IM (6)
|
CNR2, HRH1, LCK, NR3C1, PDE4D, PTGS2
|
Datasets size ranges from 819 (HRH2 - GPCR A histamine receptor) up to 18896 (CNR1 - GPRC A cannabinoid receptor). The distribution of activity values and key-physicochemical properties over the entire chemical space is represented in Fig. 3 (see Table S1 for more details). Most of the basic properties present a normal distribution with a certain degree of skewness, such as TPSA and number of rotatable bonds. The right tail extreme of MW, TPSA and number of rotatable bonds distributions are populated by peptides and natural products, which have normally more branching and substituents than small molecules.
The average pK value over the 46 datasets ranges from 5.23 (MAOA) to 7.76 (AVPR1A), with an average of 6.6 log and SD of 0.6. This indicates a shift of individual pK distributions mean values. The reasons for this deviation could be due to (i) an intrinsically higher receptor selectivity (i.e. it tends to be activated only by specific chemical families of small molecules); (ii) a bias of in in the design of experimentally tested chemical libraries. Concerning the latter point ESR1, OPRM1 and ADRB2 data (high average pK) comes from patents and reviews that describe the use of compounds for cancer, inflammation, osteoporosis and other disorders. On the other hand, hERG, COX-1 and MAOA datasets (low average pK) are mainly associated to liability studies, with the aim to demonstrate that compounds are not active on high hazard/impact receptors.
Figure 4-a (Figure S2) illustrates pairwise pK distribution distances calculated by the Kolmogorov-Smirnov test [45]. Targets such as KNCH2 (T4), MAOA (T5) and PTGS1 (T6) have significantly different distributions (p value < 0.05) compared to most of targets, having the lowest average pK values (around 5.5 log units). Figure 4-b (Figure S3) illustrates pairwise average Tanimoto similarity among the dataset’s compounds. Most of the times, datasets of the same protein family are clearly characterized by analogue chemical moieties, as the case of ADRB1 (T1) with ADRB2 (Tanimoto = 0.91) and HTR1A (T2) with HTR1B HTR2A and HTR2B (Tanimoto ~ 0.48). On the other hand, the chemistry of tested compounds on PDE3A (T5) is noticeably different from all the other dataset, except for PDE4A (Tanimoto = 0.37).
Chemical space analysis
Figure 5 depicts the PCA of the chemical space by means of annotated heatmaps. The cumulative variance explained by the first two components is 67% (46 and 21% respectively). Black and white scatterplots in Fig. 5-a depicts the trend of some key properties (see also loadings plot in Figure S1). Molecular weight is one of the most influent features and distributes the scaffolds on the x-axis from lower (left) to higher (right) MW compounds. Other properties such as topological surface area, number of aromatic rings, number of stereocenters and number of H-bond donor/acceptor, follow the same trend, which is quite expected as an increase in the molecular weight generally corresponds to a frequency increase of all chemotypes. The second principal component is mainly driven by the fraction of sp3 hybridized carbons and the number of aliphatic rings, both increasing when moving towards lower y-axis values.
Figure 5-a is annotated by the average pK value. Intuitively, smaller and common chemical moieties, such as indane (Fig. 5-a, structure i), benzene, naphthalene, cyclohexylbenzene and biphenyl are the most frequent scaffolds, as they serve as common building blocks for more complex structures. Therefore, the average activity value over the entire chemical space around 6 log units (Table S1). In addition, these scaffolds show relatively high standard deviation (Fig. 5-b) and are very frequent (Fig. 5-c).
There is a clear gradient of increasing activity when shifting towards bigger and more peculiar scaffolds (e.g. Figure 5-a structures ii – v): this is expected as such structures have been designed to be specifically selective against a given target (low standard deviation and frequency). For instance, the scaffold of the drug haloperidol (structure iv) appears in a total of 435 molecules with an average activity of 9.13 pK (SD of 1.65), spanning over 14 different targets (e.g. adrenoreceptors, dopamine, histamine, serotonin, opioid and solute carrier receptors) [46–48]. The scaffold of the drug fentanyl (structure v) appears in 486 molecules, showing an average pK of 7.1 (SD of 1.5) over 23 unique targets (e.g. ion channels, acetylcholinesterase, opioid, dopamine and cannabinoid receptors) [49–51]. On the other hand, there are some scaffolds which are clearly inactive towards multiple targets, such as dibenzo-oxazapine derivates (structure iii) with an average pK of 2.5 (SD of 1) over 5 unique targets (serotonin, histamine, dopamine and muscarinic receptors) [52].
Models’ internal and external validation performance
Table 2 and Fig. 6 report models performance averaged over the 46 modelled targets (see Table S2 for individual target performance). In both internal and external validation, the most performing algorithm resulted to be GB (R = 0.79–0.84 and RMSE = 0.77–0.69), followed by RF (R = 0.79–0.81 and RMSE = 0.85–0.79) with slightly lower performance. GB has already been reported to outperform RF in modeling biological data [13, 53]. Due to the higher predictive power, GB-based models were designed as champions for all the 46 targets. In 5-fold CV, GB and RF scored R of 0.83 and 0.82; whereas in bootstrap 0.79 and 0.79, respectively. Similar performances in external validation (R of 0.84 and 0.81 for GB and RF, respectively) indicate that the models have good predictive power on unseen data and support the absence of overfitting. Moreover, R and R2 values for both GB and RF models are close to zero in y-scrambling simulations: such drop in performance confirms that models are unlikely to be biased by chance correlations. Finally, significant higher performances than the baseline MLR classifier indicate that GB and RF algorithms proved successful in learning meaningful structure-activity relationship for the considered datasets.
Tree-based algorithms are very proficient in modelling toxicology data, as they are less prone to overfitting, less susceptible to outliers and not as heavily affected by noise as other algorithms [25, 54]. Toxicological in vitro and in vivo data is indeed highly affected by variability due to the high number of factors that contribute to error, such as experimental measurements, and inter-laboratories variability, lower accuracy of HTS methods and heterogeneous datasets composition (e.g. measurements coming from binding and functional assays) [55, 56]. The overall RMSE of trained models is 0.75 (SD = 0.09), which is comparable to the variability of experimental affinity measurements of 0.66 (SD = 0.22). Achieving a prediction error comparable to the experimental data variability support the validity of learned structure-activity relationships, as machine learning models cannot be more accurate than the error of training instances. When dealing with biological data, it has been reported that the variance in experimental measurements could contribute more to prediction error than the error from the model itself [54, 57, 58].
PTGS1 and PTGS2 were the lowest performing models (R = 0.6–0.66 and RMSE = 0.8–0.87), despite their relatively large size of roughly 3000 and 6300 compounds, respectively. One explanation could be related to their different properties compared to the other datasets (Fig. 4) which made the learning process more difficult.
The Office of Economic Cooperation and Development (OECD) principles [26] for building robust quantitative structure-activity relationship models were followed. The five OECD principles are: (i) a defined endpoint; (ii) an unambiguous algorithm; (iii) a defined applicability domain; (iv) appropriate measures for goodness-of-fit, robustness, and predictivity; (v) and a mechanistic interpretation, if possible. In this study, the endpoint for each model is well defined and goodness-of-fit, robustness and predictivity were evaluated using internal (5-fold CV, bootstrap, y-scrambling) and external validation. Model’s applicability domain is evaluated structural similarity comparison to training set’s compounds.
Table 2
Models’ performance averaged over the 46 targets for the given algorithm and validation approach. Standard deviation is reported in brackets.
Validation
|
Algorithm
|
R
|
R2
|
RMSE
|
External
|
MLR
|
0.62 (0.1)
|
0.35 (0.2)
|
1.08 (0.17)
|
GB
|
0.84 (0.05)
|
0.68 (0.1)
|
0.69 (0.08)
|
RF
|
0.81 (0.07)
|
0.64 (0.11)
|
0.79 (0.1)
|
Bootstrap
|
MLR
|
0.66 (0.1)
|
0.43 (0.13)
|
1.02 (0.14)
|
GB
|
0.79 (0.07)
|
0.63 (0.11)
|
0.77 (0.11)
|
RF
|
0.79 (0.07)
|
0.6 (0.11)
|
0.85 (0.1)
|
5-fold CV
|
MLR
|
0.65 (0.11)
|
0.42 (0.15)
|
1.02 (0.14)
|
GB
|
0.83 (0.06)
|
0.67 (0.1)
|
0.71 (0.09)
|
RF
|
0.82 (0.06)
|
0.65 (0.1)
|
0.79 (0.1)
|
Y-scrambling
|
MLR
|
0.0 (0.1)
|
0.46 (0.14)
|
0.99 (0.13)
|
GB
|
0.0 (0.02)
|
0.22 (0.09)
|
1.49 (0.28)
|
RF
|
0.0 (0.02)
|
0.05 (0.01)
|
1.37 (0.26)
|
Models’ enrichment factor performance
Figure 7-a depicts enrichment factor, whereas Fig. 7-b hit-detection performance in terms of AUC (Table S2) grouped by liability group. All liability groups showed good hit-detection power with comparable performance, with the only exception of renal toxicity (RE), which showed relatively lower discriminative power. However, such group is also composed by only two targets (PTGS1 and AVPR1A) which makes the evaluation less statistically robust. Overall, enrichment factor and AUC analysis showed that generated models are able to successfully retrieve active compounds at lower dataset faction levels, supporting their ability to discriminate true actives in large volume virtual screening campaigns.
Perspectives
Ligand-based approaches are generally easier to implement as they do not require knowledge of the crystal structure of the target protein, and thus can be trained by simpler 2D descriptors with good performances. Still, they possess some limitations: i) the absence of target-related information inhibits the model to learn any rules related to protein-ligand interactions; ii) the applicability domain is restricted to compounds which are similar to the chemical space delimited by their training set; and (iii) the distribution between active and inactive compounds is generally unbalanced in favor of the latter, leading to low recall rates and failure to reliably detect potential activity cliffs. To overcome these limitations, ligand-based or structure-based pharmacophore models can be developed to find common chemical features relevant for biological activity [59]. Recent applications of 3D pharmacophores reported their screening power in virtual screening studies and their synergistic combination with docking approaches [60]. Moreover, when the crystal structure is available, the inclusion of descriptors related to the crystal structure, (i.e. proteochemic models), and docking simulations can be employed. All these approaches can be ensembled in consensus. Finally, to provide a comprehensive liability profile, it would be important to evaluate the metabolites of the query compounds as some of them may provoke harmful responses once metabolized in the human body.
ProfhEX webservice implementation
Compounds should be submitted to ProfhEX (Fig. 8) via SMILES format. The prediction process leading to the generation of its liability profile comprises the following steps: (i) a vector of 46 predicted activities on the modelled targets is generated; (ii) predictions are binned into the two classes “concern” (C) and “not concern” (nC) based on a predefined pK cutoff value of 6.5 (300 nM); (iii) classes are grouped into the 7 liability groups according to the liability mapping as described in Table 1; (iv) for each liability group a liability score is computed as the number of C labels out of the total number targets relevant for the given liability (Eq. 4).
\({Ls}_{i}=\frac{ {C}_{i}}{{C}_{i} + {nC}_{i}}\) Eq. 4
Where, \({Hs}_{i}\) is the liability score for the given liability group i, ranging from 0 (no target flagged as C) to 1 (all targets flagged as C); \({C}_{i}\) and \({nC}_{i}\) are the number of targets for the given liability group i flagged as C and nC, respectively.
A predefined threshold of 6.5 log units has been selected to achieve a balance between active and inactive compounds. A more generalizable approach would be to select a variable threshold depending on the distribution of experimental pK measurements for the given target, for instance with the two-sigma-rules. Such approach could help considering response biases present in the training datasets. Furthermore, a weighted approach could also be implemented when calculating the liability score, by putting more importance on key-targets: for instance, voltage-gated channels such as KCNH2 (hERG) and KCNA5 (Kv1.5) are more relevant for cardiotoxicity than for the other liability groups.