Validation of model-building protocols
Our first goal was to assess the ability of our new software methods and code to reproduce the previous study [14]. We expected the major difference to arise from the implementations of the descriptor calculations, as the distributions of the calculated physico-chemical properties are reasonably well, but not perfectly, correlated (Additional figure S1). To explore the impact of differences in fingerprint implementations on model building and performance, we used the MMV – St. Jude dataset described by FCFP6 fingerprints from Pipeline Pilot and by RDKit Morgan fingerprints with radius 3 and feature invariants. After removing the uninformative bits, we built two Naïve Bayes models and predicted the MMV test set. A pairwise comparison of the two sets of scores gave a Pearson correlation coefficient (R) of 0.88, indicating a good but not perfect correlation (Additional figure S2 A). In a second comparison, we used ECFP6 fingerprints and RDKit Morgan fingerprints with radius 3 but without feature invariants. This gave an R of 0.98, indicating almost perfect agreement between the two model implementations (Additional figure S2 B).
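For illustration, a minimal sketch of this comparison using RDKit is given below; the bit length, example SMILES and the placeholder score vectors are illustrative choices and not the exact settings or data used in the study.

```python
# Sketch: ECFP-like vs FCFP-like Morgan fingerprints and a Pearson R comparison
# of two sets of model scores. Placeholder values are clearly marked.
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def morgan_bits(smiles, radius=3, use_features=False, n_bits=2048):
    """Return a hashed Morgan fingerprint as a numpy array of 0/1 bits."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, useFeatures=use_features)
    return np.array(fp)

# ECFP6-like: atom-based invariants; FCFP6-like: pharmacophoric feature invariants
ecfp_like = morgan_bits("CCOc1ccc2nc(S(N)(=O)=O)sc2c1", use_features=False)
fcfp_like = morgan_bits("CCOc1ccc2nc(S(N)(=O)=O)sc2c1", use_features=True)

# Pearson correlation between two sets of prediction scores (one per test compound)
scores_pp = np.random.rand(100)      # placeholder Pipeline Pilot model scores
scores_rdkit = np.random.rand(100)   # placeholder RDKit model scores
r = np.corrcoef(scores_pp, scores_rdkit)[0, 1]
print(f"Pearson R = {r:.2f}")
```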
These results are not unexpected and are due to differences in the way in which ECFP and FCFP fingerprints are calculated. Both are circular fingerprints, describing the environment in the vicinity of each atom in the molecule. ECFP captures atom-based substructural information while FCFP represents function-based substructural information [26]. Differences in feature definitions between Pipeline Pilot and RDKit explain the lower correspondence between models built using these two approaches, whilst the ECFP implementations are almost identical.
Nevertheless, the results did confirm to our satisfaction that it would be possible to deliver model performance very close to that of the previous study.
Model comparison
Internal validation. The performance for each dataset across the 6 different descriptor combinations is shown in Figure 1. Somewhat surprisingly, the 5-fold cross-validations show that the choice of descriptors has rather limited impact on performance, with little variability across the six descriptor sets. Moreover, combining physico-chemical descriptors with molecular fingerprints has barely any effect (compare models 1, 3 and 4, and models 2, 5 and 6, respectively).
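A sketch of the 5-fold cross-validation protocol is shown below, using scikit-learn's BernoulliNB as a stand-in for the Naïve Bayes implementation described in the Methods; the fingerprint matrix and labels are random placeholders.

```python
# Sketch: 5-fold internal validation on a binary fingerprint matrix X with
# active/inactive labels y. BernoulliNB is an illustrative Naive Bayes model.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, StratifiedKFold

X = np.random.randint(0, 2, size=(500, 1024))   # placeholder fingerprint bits
y = np.random.randint(0, 2, size=500)           # placeholder activity labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(BernoulliNB(), X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC = {aucs.mean():.2f} +/- {aucs.std():.2f}")
```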
Figure 1: Results of the internal validation for the 6 models trained by each partner dataset. The error bars represent the standard deviation returned by the 5-fold cross validation.
External validation. External model performance was evaluated using the MMV test set (Figure 2). Again, we observed relatively little difference between the models and between the different combinations of descriptors, although FCFP alone (model 2) generally performed better than ECFP alone (model 1).
As the various models were so similar in performance, we pragmatically decided to use the same set of descriptors for each dataset and opted for model 2 (FCFP fingerprints only). Using a single set of descriptors makes the prediction of new compounds more straightforward and faster to compute.
Figure 2: Results of the external validation for the 6 models trained by each partner dataset. This validation was performed predicting the MMV test set.
Consensus approaches
Several consensus options were investigated to identify the optimal approach for inclusion in the public model. We describe here in detail two of the approaches explored, MaxScore and metamodel.
MaxScore. For any given combination of models, the MaxScore prediction for a test compound is the maximum score predicted by any one of the component models. With 11 different models there are a total of 2^11 - 1 = 2,047 possible combinations, each containing from 1 to 11 individual models. All 2,047 combinations were generated and evaluated using three criteria: ROC AUC score, EF[1%] and EF[10%]. EF[1%] and EF[10%] represent practical virtual screening scenarios where a small fraction is selected from a pool of compounds. We show in Figure 3 the impact of increasing the number of models in the MaxScore consensus, using these three metrics for the three validation sets.
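The MaxScore consensus and its evaluation could be sketched as follows, assuming a compounds × models array of prediction scores; the enrichment-factor helper and the placeholder data are illustrative rather than the exact implementation used here.

```python
# Sketch: enumerate all non-empty model subsets, compute the MaxScore consensus
# for each subset, and evaluate it with ROC AUC, EF[1%] and EF[10%].
from itertools import combinations
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
scores = rng.random((1000, 11))        # placeholder per-model prediction scores
y = rng.integers(0, 2, 1000)           # placeholder activity labels

def enrichment_factor(y_true, y_score, fraction):
    """EF = hit rate in the top `fraction` of ranked compounds / overall hit rate."""
    n_top = max(1, int(round(fraction * len(y_true))))
    top_idx = np.argsort(y_score)[::-1][:n_top]
    return y_true[top_idx].mean() / y_true.mean()

def evaluate_maxscore(scores, y, model_idx):
    consensus = scores[:, list(model_idx)].max(axis=1)      # MaxScore consensus
    return (roc_auc_score(y, consensus),
            enrichment_factor(y, consensus, 0.01),
            enrichment_factor(y, consensus, 0.10))

results = {}
for k in range(1, scores.shape[1] + 1):                     # 2^11 - 1 = 2,047 subsets
    for combo in combinations(range(scores.shape[1]), k):
        results[combo] = evaluate_maxscore(scores, y, combo)
```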
Figure 3: Median performance of the consensus approaches for the MMV (blue), PubChem (red) and St. Jude Screening set (green) validation sets. The error bars represent the standard deviations. The x-axis represents how many models were combined to calculate the MaxScore. For direct comparison, the performance of the metamodel has been added on the right-hand side of the plots. The performance was assessed by means of (A) the ROC AUC score, (B) EF[1%] and (C) EF[10%].
Across the three validation sets, we observe significant variance in the performance of individual models; in some cases these outperform the all-model consensus. This is an expected result and depends on the validation dataset. Modest improvements in performance are obtained as more models are included in the consensus for the ROC AUC and EF[10%] metrics, whereas small consensus models worked best for the EF[1%] metric on the PubChem and St. Jude Screening sets. The all-model MaxScore consensus delivers a performance at least equivalent to the median performance of consensus models with fewer contributing models. We also note that enrichment factor performance correlates with the number of compounds in the validation sets, the MMV and St. Jude Screening sets being the smallest and largest, respectively. The three validation sets also differ in other respects. The MMV and St. Jude Screening sets are the results of single-dose assays, while the PubChem active compounds were determined in a dose-response assay. Moreover, the MMV test set, which was specifically designed in the previous project, is already more enriched with antimalarial compounds than the other validation sets; this might partly explain its relatively lower enrichment values.
Metamodel. In this approach, we merged all the different partner models into a single model based on a combined set of fingerprint bits, where the weight for any one bit is given by summing the weights for that bit across the different individual models. The resulting metamodel is a Naïve Bayes model that generates scores from 893,855 binary variables. It has the key advantage of providing a consensus score from a single model, giving significant run-time advantages over the all-model MaxScore consensus, which would require each new compound to be scored by 11 different models. The metamodel approach is compared with the all-model MaxScore consensus in Table 3 and is also annotated on Figure 3.
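A minimal sketch of this bit-weight merging is given below, assuming each partner model is represented as a dictionary mapping fingerprint bits to their Naïve Bayes weights; the dictionary representation and the toy weights are illustrative.

```python
# Sketch: merge per-bit weights from several models into a single metamodel,
# then score a compound by summing the merged weights of the bits it sets.
from collections import defaultdict

def merge_models(models):
    """Sum the weight of each bit across the individual models."""
    meta = defaultdict(float)
    for model in models:
        for bit, weight in model.items():
            meta[bit] += weight
    return dict(meta)

def score(metamodel, compound_bits):
    """Score a compound as the sum of metamodel weights over its set bits."""
    return sum(metamodel.get(bit, 0.0) for bit in compound_bits)

# Toy example with two hypothetical partner models
model_a = {101: 0.8, 202: -0.3}
model_b = {101: 0.5, 303: 1.1}
metamodel = merge_models([model_a, model_b])
print(score(metamodel, {101, 303, 404}))   # 0.8 + 0.5 + 1.1 = 2.4
```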
In Table 3 we also present summary results for two other consensus approaches that were investigated. The mean score (MeanScore) is calculated as the arithmetic average of the prediction scores returned by the individual models. The MinRank algorithm scores a compound by the minimum (i.e. best) rank it achieves across all models. Overall, we could not identify a clear winner when comparing the results across the three validation sets.
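For completeness, a short sketch of the MeanScore and MinRank calculations, again assuming a compounds × models score matrix with placeholder values:

```python
# Sketch: MeanScore and MinRank consensus scores from a per-model score matrix.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
scores = rng.random((1000, 11))                  # placeholder per-model scores

# MeanScore: arithmetic mean of the individual model scores
mean_score = scores.mean(axis=1)

# MinRank: rank compounds within each model (rank 1 = best score), then take
# the best (minimum) rank each compound achieves across all models
ranks = np.column_stack([rankdata(-scores[:, j], method="min")
                         for j in range(scores.shape[1])])
min_rank = ranks.min(axis=1)                     # lower is better
```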
Table 3 Performance comparison between all the consensus approaches implemented, i.e. MaxScore, metamodel, MeanScore and MinRank, for the three validation sets
Consensus | Performance metric | MMV test set | PubChem | St. Jude Screening Set
MaxScore | ROC AUC score | 0.70 | 0.77 | 0.79
metamodel | ROC AUC score | 0.67 | 0.69 | 0.82
MeanScore | ROC AUC score | 0.71 | 0.78 | 0.80
MinRank | ROC AUC score | 0.64 | 0.77 | 0.73
MaxScore | EF[1%] | 2.9 | 7.0 | 13.3
metamodel | EF[1%] | 3.3 | 7.3 | 10.9
MeanScore | EF[1%] | 2.8 | 12.8 | 16.8
MinRank | EF[1%] | 2.5 | 7.0 | 11.0
MaxScore | EF[10%] | 2.4 | 3.0 | 4.6
metamodel | EF[10%] | 2.1 | 2.4 | 4.9
MeanScore | EF[10%] | 2.3 | 3.0 | 4.3
MinRank | EF[10%] | 1.7 | 3.0 | 3.6
Chemical space analysis
In machine learning, a standard way to assess the chemical space over which a model can generate reliable predictions is to define its domain of applicability [27–29]. It is usually calculated from the variables used to describe the model training set, and methods such as conformal prediction are furthermore able to quantify the prediction certainty [30–32]. Our consensus approaches were developed with the underlying training sets remaining obscured, leaving us with limited options for further analysis; in particular, it was not possible to directly assess the model's domain of applicability. Nevertheless, to confirm the predictivity of our methods, we comprehensively assessed their performance using the three validation sets as described above. We further investigated the variation in performance across these sets by considering fingerprint coverage. For a given molecule, the fingerprint coverage is calculated as the proportion of its bits shared with the model and is therefore a measure of the similarity of the compound to the model. Taking the MaxScore consensus as an example, for each compound in the three validation sets we calculated the fingerprint coverage as the mean across the 11 models and compared the resulting metric with the corresponding consensus score, distinguishing active from inactive compounds (Figure 4, where active compounds are dark and inactive compounds are light). The MMV and the St. Jude datasets show similar distributions (Figure 4 A and C, respectively), with high compound scores for these sets appearing to require high fingerprint coverage. This might be considered expected behaviour, since the score is the sum of feature log(probabilities) (see Methods). However, high fingerprint coverage does not necessarily lead to a high score: the metric does not consider the weights of the shared bits, and it is possible that they contribute very little to the final score, or even negatively. Conversely, having only a few bits shared with the model would be expected to lead to lower scores, as there are too few contributing bits. The results for the PubChem dataset differ from the other two, showing a much weaker relationship between score and mean fingerprint coverage (Figure 4 B). Interestingly, the lowest fingerprint coverage values in this dataset are higher than those of the other two sets, whilst the scores span a narrower range.
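A sketch of the fingerprint coverage calculation is shown below, assuming each model and each compound is summarised by the set of fingerprint bits it contains; this set-based representation and the toy data are simplifications for illustration.

```python
# Sketch: fingerprint coverage = fraction of a compound's bits present in a
# model, averaged over all models in the consensus.
import numpy as np

def fingerprint_coverage(compound_bits, model_bits):
    """Proportion of the compound's bits that are also present in the model."""
    if not compound_bits:
        return 0.0
    return len(compound_bits & model_bits) / len(compound_bits)

def mean_coverage(compound_bits, models_bits):
    """Mean coverage of one compound across all models."""
    return float(np.mean([fingerprint_coverage(compound_bits, m)
                          for m in models_bits]))

# Toy example with three hypothetical models
models_bits = [{1, 2, 3, 4}, {2, 3, 5}, {1, 5, 6}]
print(mean_coverage({1, 2, 5}, models_bits))   # mean of 2/3, 2/3 and 2/3
```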
Figure 4: Comparison between fingerprint coverage and prediction score for (A) MMV, (B) PubChem and (C) St. Jude Screening sets, where active compounds are dark and inactive compounds are light.
To explore this further, Figure 5 shows a t-distributed Stochastic Neighbor Embedding (t-SNE) visualisation of all validation set compounds. This nonlinear dimensionality reduction technique is particularly useful for preserving local data structures [33, 34]. We observe in Figure 5 that the MMV test set and St. Jude Screening set compounds are more similar to each other than either is to the PubChem dataset compounds. One possible explanation is that the PubChem set is inherently less diverse than the other two sets, being based around a small number of central scaffolds; however, such analysis is beyond the scope of this study. We should also be cautious in over-interpreting these results. In particular, the fingerprint coverage metric relies only on the bits shared with the model, ignoring bits present in the model but not in the molecule. Hence, while it gives an indication of how much of the compound is described by the model, it ignores the information present in the model but not in the compound. This is analogous to the observations of Martin et al. [35] and Hempel’s “raven paradox”.
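A sketch of such a t-SNE projection using scikit-learn is given below; the placeholder fingerprints and parameter choices (perplexity, Jaccard distance) are illustrative assumptions, not necessarily those used to produce Figure 5.

```python
# Sketch: 2D t-SNE embedding of binary fingerprint vectors, coloured by
# validation set membership.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(300, 1024)).astype(bool)   # placeholder fingerprints
labels = rng.integers(0, 3, size=300)                     # 0=MMV, 1=PubChem, 2=St. Jude

embedding = TSNE(n_components=2, metric="jaccard", perplexity=30,
                 init="random", random_state=0).fit_transform(fps)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="viridis")
plt.title("t-SNE of validation set fingerprints")
plt.show()
```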
Figure 5: t-SNE visualisation of the three validation set chemical spaces.
Web application
The above results demonstrate that the metamodel performs comparably to the other consensus approaches in our validation experiments. Furthermore, from a system performance perspective, there are significant computational advantages in dealing with a single model instead of eleven. We have therefore implemented the metamodel in a web application called MAIP (MAlaria Inhibitor Prediction), which is available for the community to make antimalarial activity predictions.
MAIP is accessible through https://www.ebi.ac.uk/chembl/maip/ (Figure 6A). Users upload a file containing their molecules represented by SMILES strings and associated identifiers. Once submitted, a job is queued for execution on the EMBL-EBI compute infrastructure (Figure 6B). Each submitted molecule is standardised as described above, for consistency with the methodology used for model training; FCFP descriptors are then calculated and the molecule is scored using the metamodel. A file containing the individual compound scores and the standardised compound structures can be downloaded when the job finishes. To facilitate analysis of the results, a bar chart showing the score distribution is generated, together with summary information on the three validation sets to assist users in any subsequent compound selections. Documentation on the project and MAIP is available at: https://chembl.gitbook.io/malaria-project/
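The per-molecule scoring steps described above could be sketched as follows; the RDKit standardiser call and the toy metamodel weights are illustrative stand-ins under our own assumptions, not MAIP's actual implementation.

```python
# Sketch: standardise a SMILES, compute an FCFP-like fingerprint and score it
# against a metamodel represented as a dict of bit weights (hypothetical).
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.MolStandardize import rdMolStandardize

def score_smiles(smiles, metamodel, radius=3, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.Cleanup(mol)                     # standardisation step
    fp = AllChem.GetMorganFingerprintAsBitVect(             # FCFP-like fingerprint
        mol, radius, nBits=n_bits, useFeatures=True)
    bits = set(fp.GetOnBits())
    return sum(metamodel.get(b, 0.0) for b in bits)         # metamodel score

toy_metamodel = {10: 0.4, 250: -0.1}                        # hypothetical weights
print(score_smiles("CC(=O)Oc1ccccc1C(=O)O", toy_metamodel))
```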
Figure 6: A. Frontend of the MAIP platform, accessible at https://www.ebi.ac.uk/chembl/maip/; B. MAIP system architecture.
MAIP is intended to be used as a prioritisation tool, offering users a way to identify compounds of interest for further analysis and selection. MAIP does not directly return a prediction flag (e.g. ‘predicted antimalarial compound’); users should scrutinise their results before making any decisions on next steps. To assist with this, documentation is provided with the results, together with relevant statistics for our three validation sets (Figure 7). For each dataset, the higher the model score, the greater the observed enrichment. Further, the score thresholds needed to pick the top 1%, 10% and 50% of the predictions correlate with the dataset size. These data can be used to guide users in their analysis.
Users are also strongly advised to apply additional in silico filters to assess the suitability of any virtual hits from MAIP prior to experimental testing. High-scoring compounds may have physicochemical properties and/or substructural features that are unsuitable as starting points for a malaria (or indeed any) drug discovery programme. In addition, some of the training sets used in MAIP contain examples of known antimalarial compounds (e.g. aminoquinolines), so molecules that score highly in MAIP may already have been worked on extensively in antimalarial programmes. Public bioactivity resources such as ChEMBL can be used to determine whether antimalarial activity is already known for specific structural classes [13]. As part of MMV’s commitment to Open Innovation, screening slots in the MMV Plasmodium falciparum asexual blood stage assay are being made available to test third-party compounds identified using MAIP. Please contact MMV ([email protected]) for more details.
Figure 7: Screenshot from the MAIP documentation page (https://chembl.gitbook.io/malaria-project/using-maip-results#result-summary) showing the model validation results. For each validation set, the numbers of active and inactive compounds are indicated. The scores used to measure the enrichment factors are shown in parentheses.