ProfhEX: AI-based platform for small molecules liability profiling

doi:10.21203/rs.3.rs-2073134/v1


Research Article

https://doi.org/10.21203/rs.3.rs-2073134/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 09 Jun, 2023

Read the published version in Journal of Cheminformatics →

You are reading this latest preprint version

Off-target drug interactions are one of the main reasons for candidate failure in the drug discovery process. Anticipating a drug’s potential adverse effects in the early stages is necessary to minimize health risks to patients, animal testing, and economic costs. With the constantly increasing size of virtual screening libraries, AI-driven methods can be exploited as first-tier screening tools providing liability estimation for drug candidates.

We present ProfhEX, an AI-driven suite of 46 OECD-compliant machine learning models able to profile small molecules on 7 relevant liability groups, namely: cardiovascular, central nervous system, gastrointestinal, endocrine disruption, renal, pulmonary and immune response toxicities.

Experimental affinity data was collected from public and commercial data sources. The entire chemical space comprised 289’202 activity data points for a total of 210’116 unique compounds, spanning 46 targets with dataset sizes ranging from 819 to 18896. Gradient boosting and random forest algorithms were initially employed and ensembled for the selection of a champion model. Models were validated according to the OECD principles, including robust internal (cross validation, bootstrap, y-scrambling) and external validation.

Champion models achieved an average Pearson correlation coefficient of 0.84 (SD of 0.05), an R² determination coefficient of 0.68 (SD of 0.1) and a root mean squared error of 0.69 (SD of 0.08). All liability groups showed good hit-detection power with an average enrichment factor at 5% of 13.1 (SD of 4.5) and AUC of 0.92 (SD of 0.05).

ProfhEX would be a useful tool for large-scale liability profiling of small molecules. This suite will be further expanded with the inclusion of new targets and by complementary modelling approaches, including structure-based and pharmacophore-based models. The platform is freely accessible at the following address: https://profhex.exscalate.eu/.

virtual screening

liability profiling

polypharmacology

machine learning

webservice

Nowadays, the concept of polypharmacology [1–4] predominates over the “one-target-one-disease” paradigm, thanks to a better understanding of drugs’ modes of action and of pathological processes. Polypharmacology has opened various possibilities in drug discovery, related to repurposing and to detecting potential off-target liabilities that can lead to adverse drug reactions [5]. Indeed, recent studies estimated that small molecule drugs bind on average 6–11 distinct off-targets in addition to their intended pharmacological one [6, 7]. Off-target interaction is one of the main reasons for clinical failure of drug candidates and, eventually, post-market withdrawal [8–10]. It is necessary to anticipate such adverse effects at the very early stages of the drug discovery process, to minimize health risks to patients, experimental animal testing, and economic costs [11]. The main reasons for failure are related to specific organ toxicities, with cardiovascular toxicity being the most common cause (17%), followed by hepatotoxicity (14%), renal toxicity (8%) and central nervous system (CNS) toxicity (7%) [10]. The most notable example is the human voltage-gated potassium channel subfamily H member 2 (KCNH2, or hERG), which is linked to cardiac arrhythmias. Indeed, evaluating activity on hERG is mandatory to meet regulatory requirements [11].

The constantly increasing size of virtual screening libraries limits the possibility of experimentally testing drug candidates against a large panel of liability targets, even when employing in-vitro high-throughput screening (HTS) approaches. For this reason, AI-driven methods, which are already extensively employed in drug discovery for hit identification [12, 13], can also be exploited to provide liability annotations on the desired chemical space, driving the selection of safe candidates. In the past years, several single-target machine learning models have been published, mainly targeting cardiotoxicity and neurotoxicity [14, 15]. However, given the high degree of off-target interactions, the relevance of single-target models is rather limited, and a multi-target approach should be followed to generate a comprehensive drug liability profile. In this direction, a few SAR-based cheminformatics systems [16–18] have been developed to retrieve putative targets of a given compound by querying the widely used ChEMBL or PubChem databases [19, 20]. However, these tools are not based on supervised learning algorithms but on simple searches by chemical similarity. The ToxCast and Tox21 [21, 22] programs contributed a large chemical library of compounds profiled by in-vitro HTS against a broad range of targets, including nuclear receptor and stress response signaling pathways. The COMPARA and CERAPP collaborative projects [23, 24] are two examples of first-tier screening models built on HTS data for androgen and estrogen receptor activity, respectively. In the “big-data” domain, Kyoungyeul et al. [25] applied supervised binary classification models to 1121 targets and 235k compounds collected from ChEMBL. Mayr et al. [26] followed a similar approach, extending the number of compounds to 500k. Arshadi et al. [27] adopted a complementary disease-related modelling task: hundreds of PubChem bioassays were mined with natural language processing techniques to assemble a series of modelling datasets relevant for key diseases (such as acute toxicity, cancer, infections, metabolism, etc.). Such models have the advantage of directly providing the probability that a given compound provokes unwanted effects in humans. On the other hand, building correct associations between targets and a given disease is the main source of uncertainty of this approach. Moreover, target-related models are needed if a target deconvolution analysis is envisaged.

A common limitation of currently published models is their reliance on the same public data sources (mainly ChEMBL and PubChem), which narrows data availability and makes the models redundant in terms of applicability domain. Moreover, the learning approach is simplified to a binary classification task between active and inactive compounds, which brings some issues: (i) the possibility to rank molecules according to their affinity is lost; (ii) the training process becomes more complicated when the binning process yields unbalanced datasets; (iii) the determination of a predefined binning cutoff is difficult [28], as there could be an intrinsic bias in the measured activity which is specific for each protein target. Finally, a comprehensive “compound’s liability profile” is rarely provided, as current systems output numerical predictions for each target in a tabular format, without any mechanistic connection to a given liability hazard.

To the best of our knowledge, a readily available screening platform providing a comprehensive and mechanistically meaningful liability profile does not exist. In this study we aim to fill this gap with ProfhEX (AI-based liability profiler for small molecules in Exscalate), a suite of machine learning models hosted by the Exscalate computing center (https://www.exscalate.eu/en/platform.html) and freely accessible to the scientific community. In its first version, ProfhEX comprises 46 Organization for Economic Co-operation and Development (OECD)-compliant [29] ligand-based machine learning models built on a combined chemical space of 289’202 activity data points for a total of 210’116 unique compounds. It provides a safety index for seven important drug liability profiles: cardiotoxicity, neurotoxicity, gastrointestinal, endocrine disruption, pulmonary, renal, and immune system. We believe that ProfhEX would be a powerful first-tier virtual screening tool, providing researchers in the drug discovery domain with useful information for virtual screening campaigns.

Figure 1 depicts the ProfhEX development workflow: (i) data step: selection of relevant targets for liability profiling [11], data collection from public and commercial data sources, and data preparation; (ii) feature encoding step: compound encoding with physicochemical descriptors coupled with extended connectivity and feature invariant fingerprints, followed by descriptor space reduction by feature selection techniques; (iii) model generation step: hyperparameter optimization of the machine learning approaches and champion model selection; (iv) validation step: internal validation by three complementary approaches and external validation on the test set partition; (v) deployment: webservice implementation.

Data cleaning, feature encoding and dataset creation

Data preparation and feature encoding steps were carried out in a Konstanz Information Miner (KNIME) [30] workflow. Activity data was collected from two sources: the publicly available ChEMBL database [19] and the commercial Excelra’s GOSTAR database (https://www.gostardb.com/), which is the world’s largest manually curated structure-activity-relationship database that collects comprehensive intelligence on bio-active compounds [12]. Activity data was retrieved from both ChEMBL and GOSTAR in an analogous way, as both databases have been designed with similar schemas. For the selected 46 targets, all experimentally measured biological activity data were collected by querying with the UniProt identifier. UniProt identifiers were retrieved from UniProt [31]. A series of sequential cleaning criteria was applied to generate QSAR-ready entries: all measurements from sources other than “homo sapiens” or “human” were excluded; censored values (i.e. > or <) were excluded; only activity values encoded as “IC50”, “EC50”, “Ki” or “Kd” were considered, and these were normalized to the negative logarithm of the molar concentration (hereafter generally denoted as pK), which is the dependent variable of the models. Compound structures originally available as SMILES strings were preprocessed and standardized in a Pipeline Pilot protocol [32] by applying standard chemical cleaning rules [33], such as removal of salts, standardization of functional groups (e.g. nitro groups) and neutralization. Geometric optimization was not performed, as the employed descriptors are not conformer-dependent. De-duplication was based on matching of standardized SMILES. The median pK value was taken as the representative value when multiple pK measurements were available for a given compound.
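The activity cleaning rules described above can be illustrated with a minimal Python sketch. It is not the KNIME workflow used in this study: the record layout, the nanomolar input unit, and the function names `to_pk` and `aggregate_pk` are assumptions made for illustration only.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical raw records: (standardized SMILES, activity type, relation, value in nM)
records = [
    ("CCO", "IC50", "=", 1000.0),
    ("CCO", "Ki", "=", 100.0),
    ("CCO", "IC50", "=", 10.0),
    ("c1ccccc1", "EC50", ">", 10000.0),  # censored value, dropped
    ("c1ccccc1", "Kd", "=", 1.0),
    ("CCN", "Potency", "=", 50.0),       # unsupported activity type, dropped
]

ALLOWED_TYPES = {"IC50", "EC50", "Ki", "Kd"}


def to_pk(value_nm):
    """Negative log10 of the molar concentration (input assumed to be in nM)."""
    return 9.0 - math.log10(value_nm)


def aggregate_pk(rows):
    """Filter records, convert to pK and keep the median pK per compound."""
    by_compound = defaultdict(list)
    for smiles, act_type, relation, value_nm in rows:
        if act_type not in ALLOWED_TYPES or relation != "=" or value_nm <= 0:
            continue
        by_compound[smiles].append(to_pk(value_nm))
    return {smiles: median(values) for smiles, values in by_compound.items()}


pk_table = aggregate_pk(records)
```

In this toy input the first compound has three uncensored measurements (pK 6, 7 and 8), so its representative value is the median, pK 7.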

Feature encoding was carried out using the RDKit framework (https://www.rdkit.org/) available in KNIME [30]: 11 basic physicochemical properties coupled with extended connectivity (EC) and feature invariant (FC) fingerprints (radius of 6 and 1024 bits each) were generated, for a total of 2059 features (11 + 2 × 1024). The value of 1024 was selected as an optimal fingerprint length to encode all datasets (from a few hundred up to several thousand compounds) without causing bit saturation. Finally, each dataset was partitioned into train and test sets with an 80/20 ratio by stratified sampling (based on the dependent variable pK). Hyperparameter optimization, training, and internal validation were carried out on the train partition, whereas the test partition was used in external validation.
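One possible way to stratify an 80/20 split on a continuous variable such as pK is to sample within quantile bins. The sketch below illustrates that idea only: the number of bins, the seed and the function name `stratified_split` are assumptions, and the actual KNIME partitioning settings are not reproduced here.

```python
import random


def stratified_split(pk_values, test_fraction=0.2, n_bins=5, seed=0):
    """Return (train_idx, test_idx), drawing the test set within pK quantile bins."""
    if not pk_values:
        return [], []
    rng = random.Random(seed)
    # Indices ordered by increasing pK, then cut into consecutive bins
    order = sorted(range(len(pk_values)), key=lambda i: pk_values[i])
    bin_size = -(-len(order) // n_bins)  # ceiling division
    train_idx, test_idx = [], []
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


pk = [4.0 + 0.05 * i for i in range(100)]  # hypothetical pK values
train, test = stratified_split(pk)
```

With 100 compounds and five bins, each bin of 20 compounds contributes 4 test compounds, so the pK distribution of the test set follows that of the train set.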

All datapoints collected from ChEMBL and the KNIME data preparation protocol are freely available at the following Zenodo repository: https://doi.org/10.5281/zenodo.6810941.

Chemical space analysis

Principal component analysis (PCA) is a dimensionality reduction method that allows multidimensional datasets to be visualized in 2D plots [34]. For compound characterization, a total of 24 basic physicochemical features were computed, such as molecular weight, topological surface area, fraction of Csp3 and several chemotype counts (e.g. aromatic/aliphatic rings, amide bonds, etc.). PCA was not directly applied to individual compounds but to their “basic framework” scaffolds [35]. Starting from the widely accepted definition of scaffold (which is generated by removing all side chains and terminal atoms), a basic framework brings an even higher level of abstraction, having all the atoms converted to carbons but still maintaining features such as ramifications, double bonds and aromatic bonds. The generated basic frameworks were used as a grouping term for the initial chemical space, reducing the number of items from 210’116 unique compounds down to 39’247 scaffolds. In the group-by process, computed features were averaged over the entries matching a given scaffold.

Employed machine learning algorithms

All tasks related to feature selection, model training and scoring were carried out with SAS Viya 3.5 software [36]. Tree-based gradient boosting (GB) and random forest (RF) algorithms were employed for model generation. Gradient boosting [37] is a boosting approach that resamples the analysis data set several times to generate results that form a weighted average of the re-sampled data set. Tree-boosting creates a series of decision trees which are merged to form a single predictive model. Random forests [38] are a combination of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The performance of random forests is related to the quality of each tree in the forest. Because not all the trees “see” all the variables or observations, the trees of the forest tend to have little or no correlation with each other. In addition, a multiple linear regression (MLR) model was implemented as a baseline.

Autotune

A hybrid approach was used to automate the tuning of the model hyperparameters (SAS “AUTOTUNE procedure”). Briefly, in the first stage Latin Hypercube Sampling [39] is employed to generate a semi-random sample ensuring the uniqueness of the value-hyperparameter pair in all the experiments. The results of the first stage are used to initialize the evolutionary-inspired Genetic Algorithm optimization [40], which efficiently explores the hyperparameter space. The tuning procedure was run exclusively on the train partition of each dataset with a 5-fold cross validation split, and the root mean squared error was set as the objective function to be minimized. This procedure was employed to tune the main parameters of GB and RF, such as: number of trees, tree depth, number of bins, leaf size, learning rate and L1 and L2 regularization. For each model, a time threshold of 100 minutes and a stagnation of the objective function over 5 sequential iterations were set as early stopping criteria.
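The first stage of this tuning procedure can be illustrated with a minimal Latin hypercube sampler. This is a generic sketch, not the SAS implementation: the function name `latin_hypercube` and the two search ranges are hypothetical, and the genetic algorithm stage is not shown.

```python
import random


def latin_hypercube(bounds, n_samples, seed=0):
    """Draw n_samples points: each parameter range is cut into n_samples equal
    strata and every stratum is used exactly once per parameter."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in bounds.items():
        strata = list(range(n_samples))
        rng.shuffle(strata)  # random pairing of strata across parameters
        width = (high - low) / n_samples
        # One uniformly drawn value inside each stratum
        columns[name] = [low + (s + rng.random()) * width for s in strata]
    return [{name: columns[name][i] for name in bounds} for i in range(n_samples)]


# Hypothetical search ranges, not the ones used in this study
bounds = {"learning_rate": (0.01, 0.3), "max_depth": (2.0, 10.0)}
samples = latin_hypercube(bounds, n_samples=8)
```

Because every stratum of every hyperparameter is visited exactly once, no value-hyperparameter pair is repeated across the initial experiments, and the range of each hyperparameter is covered evenly.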

Model training and validation

The best set of hyperparameters from the autotune procedure was used for model training. Model robustness and predictive power were evaluated using complementary approaches. For internal validation, a 5-fold cross validation procedure, 90/10% bootstrap sampling and y-randomization were used [41, 42]. All internal validation procedures were iterated 100 times. In contrast, the test set partition, which participated neither in model training nor in hyperparameter optimization, was employed for external validation. The RF or GB model with the highest Pearson coefficient was chosen as the “champion” model for each considered target. This metric was chosen to avoid any potential bias caused by the markedly higher frequency of several specific values of the dependent variable. As seen in the pK value distribution plot (Fig. 3), there is a noticeable number of experimental values at key pK thresholds, such as 5 and 6 log units (corresponding to 10 and 1 µM, respectively). These values are frequently used in HTS single-concentration assays to discriminate potentially active compounds. This makes RMSE a less robust metric, whereas the R coefficient seems to be a better indicator, being a normalized measurement of the covariance between experimental and predicted pK.

Applicability domain

Each QSAR model has its own applicability domain (AD) [41], which defines the chemical space boundaries inside which the relationship between structure and activity can be considered valid and therefore the model’s prediction reliable. A structure similarity approach, which is analogous to distance-based methods [43], was employed to define the AD. The structural similarity of a given query compound to the training set is estimated by the Tanimoto coefficient (Tc) computed on the 2048 fingerprint variables. Each model’s AD is considered fulfilled with Tc > 0.7. An overall AD score is assigned based on the fraction of models (out of the total 46) that had their individual AD fulfilled: higher values indicate a more reliable liability profile.
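The applicability domain check can be sketched as follows. Fingerprints are represented as sets of on-bit indices, and the sketch assumes that a model’s AD is fulfilled when the most similar training compound exceeds the Tc threshold; this nearest-neighbour criterion, the toy fingerprints and the function names are assumptions for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def in_domain(query_fp, training_fps, threshold=0.7):
    """Assumption: the AD is fulfilled if the nearest training compound has Tc > threshold."""
    best = max((tanimoto(query_fp, fp) for fp in training_fps), default=0.0)
    return best > threshold


def overall_ad_score(query_fp, training_sets):
    """Fraction of models whose individual AD is fulfilled for the query compound."""
    flags = [in_domain(query_fp, fps) for fps in training_sets.values()]
    return sum(flags) / len(flags)


# Toy training fingerprints for two of the 46 models
training_sets = {
    "KCNH2": [frozenset({1, 2, 3, 4}), frozenset({10, 11})],
    "MAOA": [frozenset({20, 21, 22})],
}
query = frozenset({1, 2, 3, 4, 5})
score = overall_ad_score(query, training_sets)
```

Here the query shares 4 of 5 bits with one KCNH2 training compound (Tc = 0.8) and none with the MAOA set, so one of the two models is within its domain and the overall AD score is 0.5.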

Evaluation metrics

Pearson correlation coefficient (R, Eq. 1), determination coefficient (R², Eq. 2) and root mean squared error (RMSE, Eq. 3) were selected as main metrics to monitor model performance. R coefficient has been calculated as:

\({R}_{X,Y}=\frac{cov(X,Y)}{{s}_{X} {s}_{Y}}\) Eq. 1

where X indicates the actual values, Y the predicted values, \(cov\left(X,Y\right)\) is the covariance, and \({s}_{X}\) and \({s}_{Y}\) are the standard deviations of X and Y, respectively.

R² determination coefficient has been calculated as:

\({R}^{2}=1-\frac{{R}_{ss}}{{T}_{ss}}\) Eq. 2

where \({R}_{ss}\) and \({T}_{ss}\) are the residual sum of squares and the total sum of squares, respectively.

RMSE has been calculated as:

\(RMSE= \sqrt{\sum _{i=1}^{n}\frac{{({X}_{i}- {Y}_{i})}^{2}}{n}}\) Eq. 3

where X are the actual values, Y the predicted values and \(n\) is the number of observations.
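For reference, the three metrics of Eqs. 1–3 can be computed with a few lines of Python. This is a plain restatement of the formulas above; the function names and the example values are illustrative only.

```python
import math


def pearson_r(x, y):
    """Eq. 1: covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # The 1/(n-1) factors of covariance and standard deviations cancel out
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def r_squared(x, y):
    """Eq. 2: 1 - residual sum of squares / total sum of squares (x actual, y predicted)."""
    mx = sum(x) / len(x)
    rss = sum((a - b) ** 2 for a, b in zip(x, y))
    tss = sum((a - mx) ** 2 for a in x)
    return 1.0 - rss / tss


def rmse(x, y):
    """Eq. 3: square root of the mean squared difference between actual and predicted."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


actual = [5.0, 6.0, 7.0, 8.0]      # hypothetical experimental pK values
predicted = [5.5, 6.0, 6.5, 8.0]   # hypothetical predicted pK values
r, r2, err = pearson_r(actual, predicted), r_squared(actual, predicted), rmse(actual, predicted)
```

In this example the residual sum of squares is 0.5 and the total sum of squares is 5.0, so R² is 0.9.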

In addition, the enrichment factor (EF) and the related ROC AUC score were also computed. The EF at a given cutoff χ is calculated as the proportion of true active compounds in the selection set relative to the proportion of true active compounds in the entire dataset (Eq. 4). To enable EF calculation, the top 2% of the pK-sorted compounds in each dataset were labelled as true actives, and the remaining compounds as inactives. We chose a variable cutoff rather than a fixed pK value (for such analysis a pK value between 5 and 6 is generally used [28]) for two reasons: (i) the datasets have different sizes and (ii) to have the same probability of randomly picking an active compound (i.e. the denominator of Eq. 4) in every dataset. The value of 2% was chosen as a compromise between the minimum number of actives and dataset sizes. Furthermore, hit rate values between 1 and 5% are normally found in virtual screening benchmarking datasets [44].

The enrichment factor at different levels (1%, 5% and 10%) has been calculated as:

\({EF}_{\chi }=\frac{{A}_{\chi }/{M}_{\chi }}{A/M}\) Eq. 4

where χ is the top percentage in the distribution (assuming the values of 1, 5 and 10%), \({A}_{\chi }\) is the number of active molecules in the top χ% of the distribution, \({M}_{\chi }\) is the number of molecules in the top χ% of the distribution, A is the total number of actives and M is the total number of molecules.
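The enrichment factor calculation described above can be sketched in Python. The top 2% of compounds by experimental pK are labelled as actives and the ranking is taken from the predicted pK; the rounding of set sizes and the function name `enrichment_factor` are assumptions for illustration.

```python
def enrichment_factor(actual_pk, predicted_pk, top_fraction, active_fraction=0.02):
    """EF = (actives in selection / selection size) / (total actives / total molecules)."""
    n = len(actual_pk)
    # Label the top `active_fraction` of compounds by experimental pK as true actives
    n_actives = max(1, round(n * active_fraction))
    by_actual = sorted(range(n), key=lambda i: actual_pk[i], reverse=True)
    actives = set(by_actual[:n_actives])
    # Select the top `top_fraction` of compounds by predicted pK
    n_selected = max(1, round(n * top_fraction))
    by_predicted = sorted(range(n), key=lambda i: predicted_pk[i], reverse=True)
    hits = sum(1 for i in by_predicted[:n_selected] if i in actives)
    return (hits / n_selected) / (n_actives / n)


actual = [4.0 + 0.05 * i for i in range(100)]  # hypothetical experimental pK values
perfect = list(actual)                         # a model that ranks perfectly
inverted = [-v for v in actual]                # a model that ranks in reverse order

ef5_perfect = enrichment_factor(actual, perfect, top_fraction=0.05)
ef5_inverted = enrichment_factor(actual, inverted, top_fraction=0.05)
```

With a 2% active rate, the maximum attainable EF at 5% is 20 (both actives among the 5 selected compounds out of 100), which puts the reported average EF at 5% of 13.1 into perspective.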

Target selections

The list of targets constituting ProfhEX was taken from the study of Bowes et al. [11], where the authors compiled a “minimal panel of targets” that normally undergo in-vitro testing for liability profiling at world-leading pharmaceutical companies. Relevant targets were selected based on the probability of a hit at the target compared to the magnitude of the impact of this hit. For instance, hERG and muscarinic receptors are classified as high hit rate/high impact targets.

Table 1 reports the list of selected targets and Fig. 2 depicts their protein family classification: most of them are membrane receptors from the GPCR (G protein-coupled receptor) superfamily (25), followed by enzymes (8), transcription factors (6), ion channels (4) and transporters (3), for a total of 46 targets. The majority of the selected targets are involved in the prediction of cardiovascular, central nervous system and gastrointestinal side effects. Additionally, several targets are relevant for more than one liability, such as the dopamine receptors (DRD1 and DRD2), whose activation could lead to cardiovascular, nervous, and immune system adverse effects.

Table 1

Target-liability reference. CV = cardiovascular, CNS = central nervous system, GI = gastrointestinal, ED = endocrine disruption, PU = pulmonary, RE = renal, IM = immune. In brackets, the number of targets is reported.
Liability	Targets
CV (25)	ACHE, ADORA2A, ADRA1A, ADRA2A, ADRB2, AVPR1A, CHRM1, DRD1, DRD2, HRH1, HRH2, HTR1B, HTR2A, HTR2B, KCNH2, MAOA, OPRD1, OPRK1, OPRM1, PDE3A, PTGS2, SCN5A, SLC6A2, SLC6A4
CNS (19)	ADORA2A, ADRA1A, ADRA2A, CHRM1, CNR1, DRD1, DRD2, EDNRA, HTR1A, HTR1B, HTR2A, MAOA, OPRD1, OPRK1, OPRM1, PDE4D, SLC6A2, SLC6A3, SLC6A4
GI (13)	ACHE, ADRA1A, ADRB1, CCKAR, CHRM1, CHRM3, HRH2, HTR3A, OPRK1, OPRM1, PPARA, PPARD, PTGS1
ED (7)	AR, DRD2, EDNRA, ESR1, HTR1A, HTR3A, NR3C1
PU (5)	ACHE, ADRB2, CHRM3, HTR2B, PTGS1
RE (2)	AVPR1A, PTGS1
IM (6)	CNR2, HRH1, LCK, NR3C1, PDE4D, PTGS2

Dataset sizes range from 819 (HRH2 - GPCR A histamine receptor) up to 18896 (CNR1 - GPCR A cannabinoid receptor). The distribution of activity values and key physicochemical properties over the entire chemical space is represented in Fig. 3 (see Table S1 for more details). Most of the basic properties, such as TPSA and number of rotatable bonds, present a normal distribution with a certain degree of skewness. The right tails of the MW, TPSA and number of rotatable bonds distributions are populated by peptides and natural products, which normally have more branching and substituents than small molecules.

The average pK value over the 46 datasets ranges from 5.23 (MAOA) to 7.76 (AVPR1A), with an average of 6.6 log units and SD of 0.6. This indicates a shift in the mean values of the individual pK distributions. This deviation could be due to (i) an intrinsically higher receptor selectivity (i.e. the receptor tends to be activated only by specific chemical families of small molecules); (ii) a bias in the design of experimentally tested chemical libraries. Concerning the latter point, ESR1, OPRM1 and ADRB2 data (high average pK) come from patents and reviews that describe the use of compounds for cancer, inflammation, osteoporosis and other disorders. On the other hand, hERG, COX-1 and MAOA datasets (low average pK) are mainly associated with liability studies, with the aim of demonstrating that compounds are not active on high hazard/impact receptors.

Figure 4-a (Figure S2) illustrates pairwise pK distribution distances calculated by the Kolmogorov-Smirnov test [45]. Targets such as KCNH2 (T4), MAOA (T5) and PTGS1 (T6) have significantly different distributions (p value < 0.05) compared to most targets, having the lowest average pK values (around 5.5 log units). Figure 4-b (Figure S3) illustrates the pairwise average Tanimoto similarity among the datasets’ compounds. In most cases, datasets of the same protein family are clearly characterized by analogous chemical moieties, as in the case of ADRB1 (T1) with ADRB2 (Tanimoto = 0.91) and HTR1A (T2) with HTR1B, HTR2A and HTR2B (Tanimoto ~ 0.48). On the other hand, the chemistry of compounds tested on PDE3A (T5) is noticeably different from all the other datasets, except for PDE4A (Tanimoto = 0.37).

Chemical space analysis

Figure 5 depicts the PCA of the chemical space by means of annotated heatmaps. The cumulative variance explained by the first two components is 67% (46 and 21%, respectively). Black and white scatterplots in Fig. 5-a depict the trend of some key properties (see also the loadings plot in Figure S1). Molecular weight is one of the most influential features and distributes the scaffolds on the x-axis from lower (left) to higher (right) MW compounds. Other properties, such as topological surface area, number of aromatic rings, number of stereocenters and number of H-bond donors/acceptors, follow the same trend, which is expected, as an increase in molecular weight generally corresponds to a higher frequency of all chemotypes. The second principal component is mainly driven by the fraction of sp3 hybridized carbons and the number of aliphatic rings, both increasing when moving towards lower y-axis values.

Figure 5-a is annotated by the average pK value. Intuitively, smaller and common chemical moieties, such as indane (Fig. 5-a, structure i), benzene, naphthalene, cyclohexylbenzene and biphenyl are the most frequent scaffolds, as they serve as common building blocks for more complex structures. Therefore, their average activity is close to the average activity value over the entire chemical space, around 6 log units (Table S1). In addition, these scaffolds show relatively high standard deviation (Fig. 5-b) and are very frequent (Fig. 5-c).

There is a clear gradient of increasing activity when shifting towards bigger and more peculiar scaffolds (e.g. Fig. 5-a, structures ii–v): this is expected, as such structures have been designed to be specifically selective against a given target (low standard deviation and frequency). For instance, the scaffold of the drug haloperidol (structure iv) appears in a total of 435 molecules with an average activity of 9.13 pK (SD of 1.65), spanning 14 different targets (e.g. adrenoreceptors, dopamine, histamine, serotonin, opioid and solute carrier receptors) [46–48]. The scaffold of the drug fentanyl (structure v) appears in 486 molecules, showing an average pK of 7.1 (SD of 1.5) over 23 unique targets (e.g. ion channels, acetylcholinesterase, opioid, dopamine and cannabinoid receptors) [49–51]. On the other hand, there are some scaffolds which are clearly inactive towards multiple targets, such as dibenzo-oxazepine derivatives (structure iii) with an average pK of 2.5 (SD of 1) over 5 unique targets (serotonin, histamine, dopamine and muscarinic receptors) [52].

Models’ internal and external validation performance

Table 2 and Fig. 6 report model performance averaged over the 46 modelled targets (see Table S2 for individual target performance). In both internal and external validation, the best-performing algorithm was GB (R = 0.79–0.84 and RMSE = 0.77–0.69), followed by RF (R = 0.79–0.81 and RMSE = 0.85–0.79) with slightly lower performance. GB has already been reported to outperform RF in modeling biological data [13, 53]. Due to their higher predictive power, GB-based models were designated as champions for all 46 targets. In 5-fold CV, GB and RF scored R of 0.83 and 0.82, whereas in bootstrap they scored 0.79 and 0.79, respectively. Similar performances in external validation (R of 0.84 and 0.81 for GB and RF, respectively) indicate that the models have good predictive power on unseen data and support the absence of overfitting. Moreover, R and R² values for both GB and RF models are close to zero in y-scrambling simulations: such a drop in performance confirms that the models are unlikely to be biased by chance correlations. Finally, significantly higher performance than the baseline MLR model indicates that the GB and RF algorithms succeeded in learning meaningful structure-activity relationships for the considered datasets.

Tree-based algorithms are very proficient in modelling toxicology data, as they are less prone to overfitting, less susceptible to outliers and not as heavily affected by noise as other algorithms [25, 54]. Toxicological in vitro and in vivo data is indeed highly affected by variability due to the high number of factors that contribute to error, such as experimental measurement error, inter-laboratory variability, lower accuracy of HTS methods and heterogeneous dataset composition (e.g. measurements coming from binding and functional assays) [55, 56]. The overall RMSE of trained models is 0.75 (SD = 0.09), which is comparable to the variability of experimental affinity measurements of 0.66 (SD = 0.22). Achieving a prediction error comparable to the experimental data variability supports the validity of the learned structure-activity relationships, as machine learning models cannot be more accurate than the error of their training instances. When dealing with biological data, it has been reported that the variance in experimental measurements could contribute more to prediction error than the error from the model itself [54, 57, 58].

PTGS1 and PTGS2 were the lowest performing models (R = 0.6–0.66 and RMSE = 0.8–0.87), despite their relatively large size of roughly 3000 and 6300 compounds, respectively. One explanation could be related to their different properties compared to the other datasets (Fig. 4) which made the learning process more difficult.

The Organization for Economic Co-operation and Development (OECD) principles [29] for building robust quantitative structure-activity relationship models were followed. The five OECD principles are: (i) a defined endpoint; (ii) an unambiguous algorithm; (iii) a defined applicability domain; (iv) appropriate measures for goodness-of-fit, robustness, and predictivity; and (v) a mechanistic interpretation, if possible. In this study, the endpoint for each model is well defined, and goodness-of-fit, robustness and predictivity were evaluated using internal (5-fold CV, bootstrap, y-scrambling) and external validation. Each model’s applicability domain is evaluated by structural similarity comparison with the training set compounds.

Table 2

Models’ performance averaged over the 46 targets for the given algorithm and validation approach. Standard deviation is reported in brackets.
Validation	Algorithm	R	R²	RMSE
External	MLR	0.62 (0.1)	0.35 (0.2)	1.08 (0.17)
	GB	0.84 (0.05)	0.68 (0.1)	0.69 (0.08)
	RF	0.81 (0.07)	0.64 (0.11)	0.79 (0.1)
Bootstrap	MLR	0.66 (0.1)	0.43 (0.13)	1.02 (0.14)
	GB	0.79 (0.07)	0.63 (0.11)	0.77 (0.11)
	RF	0.79 (0.07)	0.6 (0.11)	0.85 (0.1)
5-fold CV	MLR	0.65 (0.11)	0.42 (0.15)	1.02 (0.14)
	GB	0.83 (0.06)	0.67 (0.1)	0.71 (0.09)
	RF	0.82 (0.06)	0.65 (0.1)	0.79 (0.1)
Y-scrambling	MLR	0.0 (0.1)	0.46 (0.14)	0.99 (0.13)
	GB	0.0 (0.02)	0.22 (0.09)	1.49 (0.28)
	RF	0.0 (0.02)	0.05 (0.01)	1.37 (0.26)

Models’ enrichment factor performance

Figure 7-a depicts the enrichment factor, whereas Fig. 7-b depicts hit-detection performance in terms of AUC (Table S2), both grouped by liability group. All liability groups showed good hit-detection power with comparable performance, with the only exception of renal toxicity (RE), which showed relatively lower discriminative power. However, this group is composed of only two targets (PTGS1 and AVPR1A), which makes the evaluation less statistically robust. Overall, enrichment factor and AUC analysis showed that the generated models are able to successfully retrieve active compounds at low dataset fraction levels, supporting their ability to discriminate true actives in large-volume virtual screening campaigns.

Perspectives

Ligand-based approaches are generally easier to implement as they do not require knowledge of the crystal structure of the target protein, and thus can be trained on simpler 2D descriptors with good performance. Still, they have some limitations: (i) the absence of target-related information prevents the model from learning any rules related to protein-ligand interactions; (ii) the applicability domain is restricted to compounds which are similar to the chemical space delimited by the training set; and (iii) the distribution between active and inactive compounds is generally unbalanced in favor of the latter, leading to low recall rates and failure to reliably detect potential activity cliffs. To overcome these limitations, ligand-based or structure-based pharmacophore models can be developed to find common chemical features relevant for biological activity [59]. Recent applications of 3D pharmacophores reported their screening power in virtual screening studies and their synergistic combination with docking approaches [60]. Moreover, when the crystal structure is available, descriptors related to the crystal structure (i.e. proteochemometric models) and docking simulations can be employed. All these approaches can be ensembled in a consensus. Finally, to provide a comprehensive liability profile, it would be important to evaluate the metabolites of the query compounds, as some compounds may provoke harmful responses once metabolized in the human body.

ProfhEX webservice implementation

Compounds should be submitted to ProfhEX (Fig. 8) in SMILES format. The prediction process leading to the generation of a compound's liability profile comprises the following steps: (i) a vector of 46 predicted activities on the modelled targets is generated; (ii) predictions are binned into the two classes “concern” (C) and “not concern” (nC) based on a predefined pK cutoff value of 6.5 (approximately 300 nM); (iii) classes are grouped into the 7 liability groups according to the liability mapping described in Table 1; (iv) for each liability group, a liability score is computed as the number of C labels out of the total number of targets relevant for the given liability (Eq. 4).

\({Ls}_{i} = \frac{{C}_{i}}{{C}_{i} + {nC}_{i}}\) (Eq. 4)

where \({Ls}_{i}\) is the liability score for the given liability group i, ranging from 0 (no target flagged as C) to 1 (all targets flagged as C), and \({C}_{i}\) and \({nC}_{i}\) are the number of targets for the given liability group i flagged as C and nC, respectively.
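Steps (ii) to (iv) and Eq. 4 can be sketched in a few lines of Python. This is an illustrative sketch rather than the ProfhEX implementation: the function names are hypothetical, the predicted pK values are invented, the mapping lists only a partial subset of targets named in the text (the full mapping is in Table 1), and treating a prediction exactly at the cutoff as “concern” is an assumption.

```python
PK_CUTOFF = 6.5  # predefined pK cutoff stated in the text


def pk_to_nanomolar(pk):
    """Convert a pK value (negative log10 of molar affinity) to nanomolar."""
    return 10 ** (9 - pk)


def liability_scores(predicted_pk, liability_map, cutoff=PK_CUTOFF):
    """Eq. 4: fraction of targets in each liability group flagged as concern (C).

    predicted_pk: dict target -> predicted pK.
    liability_map: dict liability group -> list of targets (a target may
    appear in more than one group).
    Assumption: a prediction equal to the cutoff counts as C.
    """
    scores = {}
    for group, targets in liability_map.items():
        if not targets:
            raise ValueError(f"liability group {group!r} has no targets")
        n_concern = sum(1 for t in targets if predicted_pk[t] >= cutoff)
        scores[group] = n_concern / len(targets)  # C_i / (C_i + nC_i)
    return scores


# Illustrative, partial mapping (not the full Table 1).
liability_map = {
    "CV": ["KCNH2", "KCNA5"],
    "RE": ["PTGS1", "AVPR1A"],
}
# Invented predictions for a single query compound.
predicted_pk = {"KCNH2": 7.2, "KCNA5": 5.9, "PTGS1": 4.8, "AVPR1A": 6.1}

profile = liability_scores(predicted_pk, liability_map)
print(profile)
print(f"pK {PK_CUTOFF} corresponds to about {pk_to_nanomolar(PK_CUTOFF):.0f} nM")
```

In this toy case only KCNH2 exceeds the cutoff, so the CV score is 1 out of 2 targets and the RE score is 0 out of 2. The conversion helper also shows that a pK of 6.5 corresponds to roughly 316 nM, which the text rounds to 300 nM.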

A predefined threshold of 6.5 log units was selected to achieve a balance between active and inactive compounds. A more generalizable approach would be to select a variable threshold depending on the distribution of experimental pK measurements for the given target, for instance with the two-sigma rule. Such an approach could help account for response biases present in the training datasets. Furthermore, a weighted approach could also be implemented when calculating the liability score, by giving more importance to key targets: for instance, voltage-gated channels such as KCNH2 (hERG) and KCNA5 (Kv1.5) are more relevant for cardiotoxicity than for the other liability groups.
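The two extensions proposed above are not specified further in the text, so the Python sketch below shows only one possible reading. It assumes that the two-sigma rule places the per-target threshold at the mean of the experimental pK values plus two sample standard deviations, and that the weighted liability score is the weighted fraction of targets flagged as C, which reduces to Eq. 4 when all weights are equal. Function names, weights, pK values, and the target name `TARGET_X` are hypothetical.

```python
from statistics import mean, stdev


def sigma_threshold(experimental_pk, k=2.0):
    """Per-target cutoff as mean + k * sample SD of experimental pK values.

    Assumption: this is one reading of the 'two-sigma rule' (k = 2);
    the text does not define the rule in detail.
    """
    return mean(experimental_pk) + k * stdev(experimental_pk)


def weighted_liability_score(flags, weights):
    """Weighted fraction of targets flagged as concern within one liability group.

    flags: dict target -> True if flagged as C, False if nC.
    weights: dict target -> non-negative importance weight (hypothetical values).
    With equal weights this reduces to Eq. 4.
    """
    total = sum(weights[t] for t in flags)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    flagged = sum(weights[t] for t, is_concern in flags.items() if is_concern)
    return flagged / total


# Invented experimental pK values for a single target.
threshold = sigma_threshold([5.0, 6.0, 7.0])

# Hypothetical cardiovascular group in which ion channels are weighted double.
flags = {"KCNH2": True, "KCNA5": False, "TARGET_X": False}
weights = {"KCNH2": 2.0, "KCNA5": 2.0, "TARGET_X": 1.0}
equal_weights = {target: 1.0 for target in flags}

print(threshold)
print(weighted_liability_score(flags, weights))
print(weighted_liability_score(flags, equal_weights))
```

With these numbers a single flagged key target raises the group score from 1/3 under equal weighting to 2/5 under the illustrative weights, which is the intended effect of prioritizing key targets.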

Conclusion

In this work we presented ProfhEX, an AI-driven web-based platform for small molecules liability profiling. In its first version, ProfhEX is composed of 46 OECD-compliant ligand-based machine learning models trained on binding affinity data, built on a combined dataset of 289’202 activity data points for a total of 210’116 unique compounds. ProfhEX provides estimates for 7 liability profiles relevant for drug discovery, namely: cardiovascular, central nervous system, gastrointestinal, endocrine disruption, renal, pulmonary and immune response toxicities.

Data collected from public and commercial sources was standardized and encoded by physicochemical descriptors as well as extended connectivity fingerprints.

Gradient boosting and random forest algorithms were implemented. Models were validated according to the OECD principles, including robust internal (5-fold cross validation, bootstrap, y-scrambling) and external validation. The best-performing model for each target was designated as champion and implemented in ProfhEX. Champion models achieved an average Pearson correlation coefficient of 0.84 (SD of 0.05), an R² determination coefficient of 0.68 (SD of 0.1) and a root mean squared error of 0.69 (SD of 0.08). All liability groups showed good hit-detection power, with an average enrichment factor at 5% of 13.1 (SD of 4.5) and AUC of 0.92 (SD of 0.05). ProfhEX is intended as a useful tool for large-scale liability profiling of small molecules. This suite will be further expanded with the inclusion of new targets and with complementary modelling approaches, such as docking- and pharmacophore-based models.

All data collected from public sources, together with the KNIME data standardization protocol, is available at the following Zenodo repository: https://doi.org/10.5281/zenodo.6810941. ProfhEX is freely accessible at the following address: https://profhex.exscalate.eu/.

Abbreviations

AD: applicability domain

CNS: central nervous system

CV: cardiovascular

CV: cross validation

ED: endocrine disruption

EF: Enrichment factor

GB: gradient boosting

GI: gastrointestinal

HTS: High-throughput screening

IM: immune

ML: machine learning

MLR: multilinear regression

OECD: Organization for Economic Co-operation and Development

PCA: principal component analysis

PU: pulmonary

R: Pearson correlation coefficient

R²: determination coefficient

RE: renal

RF: random forest

RMSE: Root Mean Squared Error

SD: Standard deviation

Availability of data and materials

The dataset supporting the conclusions of this article is available via Zenodo repository at https://doi.org/10.5281/zenodo.6810941.

Competing interests

The authors declare that they have no competing interests.

Funding

Not applicable.

Authors' contributions

All authors read and approved the final manuscript.

Ethical Approval

Not applicable.

References

Achenbach J, Tiikkainen P, Franke L, Proschak E (2011) Computational tools for polypharmacology and repurposing. Future Med Chem 3:961–968. https://doi.org/10.4155/fmc.11.62
Proschak E, Stark H, Merk D (2019) Polypharmacology by Design: A Medicinal Chemist’s Perspective on Multitargeting Compounds. J Med Chem 62:420–444. https://doi.org/10.1021/acs.jmedchem.8b00760
Rastelli G, Pinzi L (2015) Computational polypharmacology comes of age. Front Pharmacol 6:1–4. https://doi.org/10.3389/fphar.2015.00157
Anighoro A, Bajorath J, Rastelli G (2014) Polypharmacology: Challenges and opportunities in drug discovery. J Med Chem 57:7874–7887
Tan Z, Chaudhai R, Zhang S (2016) Polypharmacology in Drug Development: A Minireview of Current Technologies. ChemMedChem 1211–1218. https://doi.org/10.1002/cmdc.201600067
Rao MS, Gupta R, Liguori MJ et al (2019) Novel Computational Approach to Predict Off-Target Interactions for Small Molecules. Front Big Data 2:1–17. https://doi.org/10.3389/fdata.2019.00025
Vo AH, Van Vleet TR, Gupta RR et al (2020) An Overview of Machine Learning and Big Data for Drug Toxicity Evaluation. Chem Res Toxicol 33:20–37. https://doi.org/10.1021/acs.chemrestox.9b00227
Lounkine E, Keiser MJ, Whitebread S et al (2012) Large-scale prediction and testing of drug activity on side-effect targets. Nature 486:361–367. https://doi.org/10.1038/nature11159
Siramshetty VB, Nickel J, Omieczynski C et al (2016) WITHDRAWN—a resource for withdrawn and discontinued drugs. Nucleic Acids Res 44:D1080–D1086. https://doi.org/10.1093/NAR/GKV1192
Cook D, Brown D, Alexander R et al (2014) Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework. Nat Rev Drug Discov 13:419–431. https://doi.org/10.1038/nrd4309
Bowes J, Brown AJ, Hamon J et al (2012) Reducing safety-related drug attrition: The use of in vitro pharmacological profiling. Nat Rev Drug Discov 11:909–922. https://doi.org/10.1038/nrd3845
Zhao L, Ciallella HL, Aleksunes LM, Zhu H (2020) Advancing computer-aided drug discovery (CADD) by big data and data-driven machine learning modeling. Drug Discov Today 25:1624–1638. https://doi.org/10.1016/j.drudis.2020.07.005
Gupta R, Srivastava D, Sahu M et al (2021) Artificial intelligence to deep learning: machine intelligence approach for drug discovery. Mol Divers 25:1315–1360. https://doi.org/10.1007/s11030-021-10217-3
Vatansever S, Schlessinger A, Wacker D et al (2021) Artificial intelligence and machine learning-aided drug discovery in central nervous system diseases: State‐of‐the‐arts and future directions. Med Res Rev 41:1427. https://doi.org/10.1002/MED.21764
Rácz A, Bajusz D, Miranda-Quintana RA, Héberger K (2021) Machine learning models for classification tasks related to drug safety. Mol Divers 25:1409–1424. https://doi.org/10.1007/s11030-021-10239-x
Wang L, Ma C, Wipf P et al (2013) TargetHunter: An In Silico Target Identification Tool for Predicting Therapeutic Potential of Small Organic Molecules Based on Chemogenomic Database. AAPS J 15:395. https://doi.org/10.1208/S12248-012-9449-Z
Yao ZJ, Dong J, Che YJ et al (2016) TargetNet: a web service for predicting potential drug-target interaction profiling via multi-target SAR models. J Comput Aided Mol Des 30:413–424. https://doi.org/10.1007/S10822-016-9915-2
Awale M, Reymond JL (2019) Polypharmacology Browser PPB2: Target Prediction Combining Nearest Neighbors with Machine Learning. J Chem Inf Model 59:10–17. https://doi.org/10.1021/acs.jcim.8b00524
Mendez D, Gaulton A, Bento AP et al (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47:D930–D940. https://doi.org/10.1093/NAR/GKY1075
Kim S, Chen J, Cheng T et al (2021) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res 49:D1388–D1395. https://doi.org/10.1093/NAR/GKAA971
Dix DJ, Houck KA, Martin MT et al (2007) The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95:5–12. https://doi.org/10.1093/TOXSCI/KFL103
Thomas RS, Paules RS, Simeonov A et al (2018) The US Federal Tox21 Program: A strategic and operational plan for continued leadership. Altex 35:163–168. https://doi.org/10.14573/ALTEX.1803011
Mansouri K, Abdelaziz A, Rybacka A et al (2016) CERAPP: Collaborative Estrogen Receptor Activity Prediction Project. Environ Health Perspect 124:1023–1033. https://doi.org/10.1289/EHP.1510267
Mansouri K, Kleinstreuer N, Abdelaziz AM et al (2020) CoMPARA: Collaborative modeling project for androgen receptor activity. Environ Health Perspect 128:27002. https://doi.org/10.1289/EHP5580
Lee K, Lee M, Kim D (2017) Utilizing random Forest QSAR models with optimized parameters for target identification and its application to target-fishing server. BMC Bioinformatics 18. https://doi.org/10.1186/s12859-017-1960-x
Mayr A, Klambauer G, Unterthiner T et al (2018) Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem Sci 9:5441–5451. https://doi.org/10.1039/c8sc00148k
Arshadi AK (2021) MolData, A Molecular Benchmark for Disease and Target Based Machine Learning. J Cheminform 1–23. https://doi.org/10.1186/s13321-022-00590-y
Lenselink EB, Ten Dijke N, Bongers B et al (2017) Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J Cheminform 9:45. https://doi.org/10.1186/S13321-017-0232-0
OECD (2007) Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models. Tech. Rep. ENV/JM/MONO(2007)2, Paris, FR
Berthold MR, Cebron N, Dill F et al (2006) KNIME: The konstanz information miner. Data Anal Mach Learn Appl 11:319–326. https://doi.org/10.1145/1656274.1656280
Bateman A, Martin MJ, Orchard S et al (2021) UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49:D480–D489. https://doi.org/10.1093/NAR/GKAA1100
BIOVIA, Systèmes D (2011) Pipeline Pilot version 2018. Dassault Systèmes, San Diego
Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: On the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50:1189–1204
Wenderski TA, Stratton CF, Bauer RA et al (2015) Principal Component Analysis as a Tool for Library Design: A Case Study Investigating Natural Products, Brand-Name Drugs, Natural Product-Like Libraries, and Drug-Like Libraries. Methods Mol Biol 1263:225. https://doi.org/10.1007/978-1-4939-2269-7_18
Manelfi C, Gemei M, Talarico C et al (2021) “Molecular Anatomy”: a new multi-dimensional hierarchical scaffold analysis tool. J Cheminform 13:13–54
SAS Institute Inc. SAS/VIYA® 3.5 of the SAS System for Unix. https://www.sas.com/en/software/viya.html
Friedman JH (2001) Greedy function approximation: A gradient boosting machine. Ann Stat 29:1189–1232. https://doi.org/10.1214/aos/1013203451
Breiman L (2001) Random Forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
Iman RL, Helton JC, Campbell JE (1981) An Approach to Sensitivity Analysis of Computer Models: Part I—Introduction, Input Variable Selection and Preliminary Variable Assessment. J Qual Technol 13:174–183. https://doi.org/10.1080/00224065.1981.11978748
Sastry K, Goldberg D, Kendall G (2005) Genetic Algorithms. Search Methodol Introd Tutorials Optim Decis Support Tech. 97–125. https://doi.org/10.1007/0-387-28356-0_4
Tropsha A, Gramatica P, Gombar VK (2003) The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR Comb Sci 22:69–77. https://doi.org/10.1002/QSAR.200390007
Gramatica P (2013) On the development and validation of QSAR models. Methods Mol Biol 930:499–526. https://doi.org/10.1007/978-1-62703-059-5_21
Sahigara F, Mansouri K, Ballabio D et al (2012) Comparison of different approaches to define the applicability domain of QSAR models. Molecules 17:4791–4810. https://doi.org/10.3390/molecules17054791
Mysinger MM, Carchia M, Irwin JJ, Shoichet BK (2012) Directory of useful decoys, enhanced (DUD-E): Better ligands and decoys for better benchmarking. J Med Chem 55:6582–6594. https://doi.org/10.1021/jm300687e
Dodge Y (2008) The Concise Encyclopedia of Statistics. Springer, New York NY
Sampson D, Bricker B, Zhu XY et al (2014) Further evaluation of the tropane analogs of haloperidol. Bioorg Med Chem Lett 24:4294–4297. https://doi.org/10.1016/J.BMCL.2014.07.018
Saito DR, Long DD, Jacobsen JR. Theravance, Inc. Disubstituted alkyl-8-azabicyclo [3.2.1.] octane compounds as mu opioid receptor antagonists. WO2009029257A1, 27 Aug 2007
Jiang L, Beattie DT, Jacobsen JR et al (2017) Discovery of N-substituted-endo-3-(8-aza-bicyclo[3.2.1]oct-3-yl)-phenol and -phenyl carboxamide series of µ-opioid receptor antagonists. Bioorg Med Chem Lett 27:2926–2930. https://doi.org/10.1016/J.BMCL.2017.04.092
Alker A, Binggeli A, Christ AD et al (2010) Piperidinyl-nicotinamides as potent and selective somatostatin receptor subtype 5 antagonists. Bioorg Med Chem Lett 20:4521–4525. https://doi.org/10.1016/J.BMCL.2010.06.026
Dosen-Micovic L, Ivanovic M, Micovic V (2006) Steric interactions and the activity of fentanyl analogs at the µ-opioid receptor. Bioorg Med Chem 14:2887–2895. https://doi.org/10.1016/J.BMC.2005.12.010
McHardy SF, Bohmann JA, Corbett MR et al (2014) Design, synthesis, and characterization of novel, nonquaternary reactivators of GF-inhibited human acetylcholinesterase. Bioorg Med Chem Lett 24:1711–1714. https://doi.org/10.1016/J.BMCL.2014.02.049
Becker C, Rubens C, Adams J et al. ARYx Therapeutics Inc. DIBENZO[b,f][1,4]OXAZAPINE COMPOUNDS. US20080255088A1, 15 March 2007
Zhang J, Mucs D, Norinder U, Svensson F (2019) LightGBM: An Effective and Scalable Algorithm for Prediction of Chemical Toxicity-Application to the Tox21 and Mutagenicity Data Sets. J Chem Inf Model 59:4150–4158. https://doi.org/10.1021/ACS.JCIM.9B00633
Kolmar SS, Grulke CM (2021) The effect of noise on the predictive limit of QSAR models. J Cheminform 13:1–19. https://doi.org/10.1186/s13321-021-00571-7
Claassen V (2013) Neglected factors in pharmacology and neuroscience research: biopharmaceutics, animal characteristics, maintenance, testing conditions, vol 12. Elsevier, Amsterdam
Pham LL, Watford SM, Pradeep P et al (2020) Variability in in vivo studies: Defining the upper limit of performance for predictions of systemic effect levels. Comput Toxicol 15. https://doi.org/10.1016/j.comtox.2020.100126
Mazzatorta P, Estevez MD, Coulet M, Schilter B (2008) Modeling oral rat chronic toxicity. J Chem Inf Model 48:1949–1954. https://doi.org/10.1021/CI8001974
Truong L, Ouedraogo G, Pham LL et al (2018) Predicting in vivo effect levels for repeat-dose systemic toxicity using chemical, biological, kinetic and study covariates. Arch Toxicol 92:587–600. https://doi.org/10.1007/S00204-017-2067-X
Yang SY (2010) Pharmacophore modeling and applications in drug discovery: challenges and recent advances. Drug Discov Today 15:444–450. https://doi.org/10.1016/J.DRUDIS.2010.03.013
Schaller D, Šribar D, Noonan T et al (2020) Next generation 3D pharmacophore modeling. Wiley Interdiscip Rev Comput Mol Sci 10. https://doi.org/10.1002/WCMS.1468


Editorial decision: Major revision
25 Sep, 2022
Editor assigned by journal
24 Sep, 2022
Submission checks completed at journal
24 Sep, 2022
First submitted to journal
16 Sep, 2022
