The aim of this study is to develop a novel quantitative pharmacophore algorithm which can be applied on small datasets usually found in SAR studies of new drug targets. The algorithm allows the prediction of the pharmacophore’s activity without the need for molecules in the alignment process. This allows the prediction of pure pharmacophore, for example when pharmacophores are obtained from aposite pharmacophore modelling.
Model robustness
Data used to test the model robustness with cross-validation (CV) was pulled from ChEMBL(15). A list of popular QSAR targets was obtained from CortésCiriano(16) et al. and was selected as the data source for target selection. The Uniprot-ID of these targets was used as query to get a list of compounds, along with the biological activity of these compounds on the target of interest, from ChEMBL by using the chemblwebresourceclient(15) python package. Retrieved compounds were filtered according to the following parameters:
standard_type: ‘IC50’ or ‘Ki’
standard_units: ‘nM’
standard_relation: ‘=’
assay_type: ‘B’
target_organism: ‘Homo Sapiens’
Compounds were further grouped by ‘target_chembl_id’ and ‘assay_chembl_id’. Molecular structures were generated from the canonical SMILES representation. The field ‘standard_value’, in the following referred to as ‘activity value’ or ‘activity’, was saved and used as an endpoint in the QSAR studies.
After pulling the data from ChEMBL and basic filtering, the datasets, comprised of molecules from a single assay, were further filtered for their applicability for QSAR. The rationale behind this is that datasets should span a specific range of target values, here activity, for being analysed by QSAR. Here a cutoff of at least 3 logunits difference between the minimum and maximum activity value was applied; otherwise, the dataset got dismissed. Furthermore, a certain degree of heterogeneity in the distribution of activity values is necessary for successful QSAR analysis. The heterogeneity of datasets is often interpreted differently by multiple investigators. Therefore, the datasets were required to be as evenly distributed over the range of activity values as possible. Applying this constraint to a clearly defined measurement, the KL divergence of all datasets against a uniform distribution over the same range was calculated. The closer the KL divergence was to one, the more heterogeneous and the less clustered around a particular activity value the datasets were perceived to be. As a cut-off, 0.75 was used. All datasets with a KL divergence above the cutoff value were not considered for our studies.
The datasets were split into training and validation sets for cross-validation. The most and least active compound was hand-selected and put into the training set for all splits. Handselecting these two compounds was done to ensure the trained models only interpolate validation and test data, making sure the data was in the domain of the training set. Similar measures need to be taken during inference since the quality of predictions cannot be guaranteed for out of domain samples. In cross-validation, handpicking these two samples introduces a small bias, following which the individual folds are not 100% unique at each iteration. Apart from this slight bias, all samples were randomly separated into different datasets.
Datasets for cross-validation were generated randomly with a 5fold CVsplit, whereas each training split consisted of 20% of the data and the remaining 80% were used for validation (2080 split). In addition, a second CVsplit was conducted with 80% training and 20% validation data, which is the norm in machine learning (80 − 20 split). The 2080 split was done to mimic and evaluate performance in a typical SAR setting of a medicinal chemist, where the researcher might only have access to a meagre number of data points. Due to the small dataset sizes, training sets in the 2080 split typically consisted of 1015 samples.
Finally, for all molecules in the datasets, conformations were generated. For conformer generation, the commandline tool ‘iconfgen’(17) from included in LigandScout(18) (Inte:Ligand GmbH(19), Vienna, Austria) was used. The maximum number of generated conformers was set to 25, all other settings were kept at their default values.
KullbackLeibler divergence
The KullbackLeibler (KL) divergence for each dataset was calculated according to Eq. 1, whereas P and Q denote discrete uniform distributions over activity values from a given dataset. P represents the estimated uniform distribution of the datasets, Q resembles the reference uniform distribution over activity values, a being the minimum and b the maximum. P is estimated by binning the activity values into N (sample size) bins. Each P(x) is defined by the frequency of activity values in each bin x.
PHASE
Data for performance evaluation alongside PHASE was obtained from a paper published by Dixon(8) et al., initially published by Debnath(20) et al. in 2002. Dixon et al. compared the 3D pharmacophore QSAR method implemented in PHASE against the Hypogen algorithm implemented in Catalyst. These results will be used as a baseline for evaluating the quantitative pharmacophore QSAR in this paper. Conformations were created using iconfgen that is part of LigandScout. Splitting of training and test data was done as reported in the article by Dixon et al(8). No other modifications or filters were applied on the dataset, except for removing molecule number 67, for which no experimental activity value was provided by Dixon et al.
Baselines
Two general baselines were used to estimate the improvement of the quantitative pharmacophore over elementary QSAR models. The first baseline was a QSAR model built on the number of pharmacophore features per sample in the training set. The second model was constructed from a few standard physicochemical properties (number of HBond Donors / Acceptors, number of rotatable bonds, molecular weight, number of heavy atoms, cLogP(21), TPSA(22)). The baseline models were trained and tested on the same data splits as the quantitative pharmacophores. All baselines were trained with the same machine learning algorithm and the same parameters as the quant. pharmacophores to guarantee a fair comparison. Therefore, differences in performance between the quantitative pharmacophores and the baselines originate solely from the representation of molecules or pharmacophores.
Quantitative Pharmacophore Algorithm
The quantitative pharmacophore generation procedure is divided into two parts. At first, the mergedpharmacophore model is generated, which serves as the basis for training a machine learning model for the prediction of new samples’ activity.
Alignment template: As a starting point, a template is required to align all the other training samples. It can be either a pharmacophore or a molecule, although they are treated slightly different during the initialisation:
Pharmacophore: Pharmacophores should be used as a template if the user has additional information about its target or has reason to believe that the templatepharmacophore is of high importance to the user’s project. This might be the case if the researcher has access to a crystal structure of the investigated target. For example, an interactionpharmacophore of a cocrystallised ligand in the protein binding site could yield valuable information. If this information is perceived to be relevant, then such an interactionpharmacophore could be used as the quantitative pharmacophore’s initial template. All training samples are then aligned to the template via pharmacophore alignment. Samples can be either pharmacophores or molecules, in which case conformer ensembles for these molecules are required. Pharmacophores are generated for all conformations of a molecule. The conformation with the best-aligned pharmacophore is chosen.
Molecule: Instead of a pharmacophore, a molecule may be given as the initial template. Pharmacophores will be generated from each conformation of the molecule. Without additional information, none of these pharmacophores can be deemed more important than the others. However, alignment to any other molecule in the training set will result in a single best solution for the given pair of molecules. The conformation with the highest alignment score is chosen as the initial template. All other training samples will be aligned with this template. Depending on the counterpart molecule selected, the quality of the template and following the model will deviate. Therefore, templates and models are built for each sample in the training set. A supplied validation set will be used to select the best performing model. Due to multiple models trained in this step, model building will take considerably longer when choosing molecules over pharmacophores as the initial template. Nevertheless, training of all models is usually finished within a few minutes.
After a template was selected, the remaining samples from the training set are aligned with the template to create the mergedpharmacophore. Instead of sequentially merging aligned pharmacophores and building the mergedpharmacophore, all features of aligned pharmacophores are first added to a single pharmacophore data structure, also referred to as container in the following. Each feature is assigned the activity of its parent pharmacophore and stored alongside the feature. Furthermore, information about the orientation of directed interactions like Hbonding is disregarded, and all features modelling such interactions are represented by spheres. Once the training set got aligned to the template, the pharmacophore features collected in the container are clustered. A minimum distance hierarchical clustering algorithm is applied, whereas the cutoff can be set as a hyperparameter. The default value is equal to the default tolerance radius of pharmacophore features, 1.5 Å. Clusters are formed separately for each feature type: hydrophobic (H), aromatic (AR), positive/negative ionisable (PI, NI), HBond donor/acceptor (HBD, HBA).
Features belonging to a single cluster are then merged into a single feature, if possible, which represents the cluster. If no single feature can be created to represent the cluster, multiple features are placed. Clusters are merged with the priority of placing features as listed below:
Clusters containing a single feature: Clusters containing a single feature are represented by that one feature alone.
Clusters containing multiple features: For clusters containing various features, the goal is to find the smallest number of features required to represent the cluster. Ideally, a single feature is sufficient to represent the entire cluster. Finding multiple features to represent a cluster is not straightforward and has no objective best solution. In practice it was found that in the vast majority of cases a single feature is sufficient to represent a cluster.
A feature is declared to represent other features if the distance between their two centres is smaller or equal to the feature radius of one of the pharmacophore features. If their radii are different sizes, the smaller one is chosen as cutoff. Merged features are assigned the following properties:
List of activity values: The activity values of each feature merged into the representative. These values will be used to determine the importance of each feature.
Number of merged features: Indicates how many features were merged into the representative. This information will be used to determine the confidence in each feature.
Representative features are found in the following order. If at any point a representative was found, the algorithm goes on to the next cluster:
Selecting one of the features as representative: Each feature is checked whether it overlaps with all other features in the cluster. If so, this feature is chosen as the representative of the cluster. If none of the features fulfils the requirements, a new feature will be created instead.
New feature at the centroid: A new feature is created and placed at the centroid of all features. The new feature has no associated activity value and simply serves the purpose of merging the other features. All features are probed whether they are represented by the new feature.
New feature at the centre: If the new feature at the centroid does not satisfy all feature’s requirements, a new feature is placed at the centre of all features. Once again, a check is run whether all features match the new feature.
Placing multiple features: Multiple representative features are only considered when all options to place a single feature were exhausted. As already mentioned, there is no single solution for placing multiple features to represent a cluster of features. Our algorithm iteratively finds multiple representatives, whereas each representative should merge as many features into the cluster as possible. Therefore, the most connected feature, determined by the number of overlapping features, is selected as a representative at each iteration. This process is repeated until all features in the cluster were merged. Each representative is assigned the list of activity values merged and the number of merged features.
A post-processing step is applied to the merged pharmacophore after clustering the features and creating a mergedpharmacophore from all training samples. The post-processing step aims to remove noise and features with a widespread list of activity values from the quantitative pharmacophore model. For example, two molecules with the same scaffold but different residues will have the same pharmacophore features for the scaffold, but different peripheral features. Merging these two pharmacophores will be easy due to the shared features from the scaffold. However, suppose these two pharmacophores have different activity values. In that case, the scaffolds features will not add any useful information since they are included in the low and highly active sample. Only information obtained from peripheral features, which are unambiguous in their activity, are utilised by the model. Therefore, features are probed for their ambiguity in activity values, and only unambiguous features are retained, whereas ambiguous features are removed from the model. The following criterions determine the ambiguity of features:
A minimum number of activity values per feature is required. Thereby, only features the model is confident in are added.
The absolute difference in activity values of merged features must not be greater than the absolute difference in activity values of all features in the mergedpharmacophore.
Following this, a ML model is trained on the aligned samples to predict the pharmacophores’ activity values. In order to reason from the mergedpharmacophore to activity values, appropriate features need to be extracted from the aligned samples. For this, the features of the aligned samples are matched against the mergedpharmacophore, whereas a match is defined by an overlap of the query and reference pharmacophore feature tolerance spheres. For each match, the distance to the corresponding feature in the mergedpharmacophore is calculated. The inverse of the distance value is used as input to the ML model, whereas a maximum value of the inverse distance can be set as a hyperparameter. Using the inverse value of the distance keeps a zerovalue for features without matches. It is important to note that the ML model will not consider features from samples without a matching feature in the mergedpharmacophore.
Given that there are no corresponding features in the mergedpharmacophore, it is acceptable to disregard potentially missing features during prediction since no information about their contribution to activity in that location is known. Therefore, including these features at inference would only increase noise and weaken the model’s confidence in the prediction(23).
Once all distances are calculated, a vector of m features is obtained (m being the total number of features in the mergedpharmacophore). Each sample is represented by such a vector describing its alignment with the mergedpharmacophore. These vectors are used as input to the ML model. In order to comply with tiny dataset sizes, the choice of MLmodels is restricted to simple models like regularised linear regression or slightly more complex models like randomforest trees. For randomforest trees, their parameters are set to train only a few shallow trees to avoid overfitting. Optionally, a PCA might be performed beforehand on the input data, extracting only a few highly relevant features from the vector. This process aims to increase the ratio from the number of samples to the number of features, thereby reducing the possibility of overfitting.
Before training the model, weights may be applied to the input data and the type of weight can be set as a hyperparameter. The following options are available:
No weights at all: The distances are treated as binary information. (least recommended option)
Weighted by distance: Explained above in the feature extraction step. This is the default option.
Weighted by the number of mergedfeatures: The third option emphasises features obtained from a higher number of pharmacophore features that have been merged into the respective feature. However, due to the post-processing step, features which the model is not confident in were already removed. Therefore, this type of feature weighting is optional and should only be considered by the user if particular emphasis wants to be given to that information.
The current version of the algorithm is implemented in Python using the chemical data processing (CDP) toolkit(24) for molecule and pharmacophore representations. Machine learning models are trained using the scikitlearn package(25). The code and all datasets are available at https://github.com/StefanKohlbacher/QuantPharmacophore.