Similarity-Based Pairing Outperforms the Exhaustive Pairing. As illustrated in Fig. 3, the similarity-based pairing gives rise to pairs with similarities ranging from 0.2 to 1.0 and two peaks, one at 0.8 and another at 0.4; overall, the similarity values of the resulting pairs are relatively evenly distributed between 0.3 and 0.8. The ΔlogD of the resulting pairs follows a normal distribution centered at 0 and ranging from −4 to 4. The data points at a similarity of 1 mainly correspond to stereoisomers, and occasionally the two paired compounds are indistinguishable by the ECFP4 fingerprint (Fig. 4).
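The last point can be made concrete with a minimal RDKit sketch (an illustration, not part of the original workflow): by default, the Morgan/ECFP4 fingerprint ignores chirality, so two enantiomers yield identical bit vectors and thus a Tanimoto similarity of 1.

```python
# Illustrative only: ECFP4 (Morgan, radius 2) ignores chirality by default,
# so the two enantiomers of alanine produce identical fingerprints.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

s_ala = Chem.MolFromSmiles("C[C@H](N)C(=O)O")   # (S)-alanine
r_ala = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")  # (R)-alanine

fp_s = AllChem.GetMorganFingerprintAsBitVect(s_ala, radius=2, nBits=2048)
fp_r = AllChem.GetMorganFingerprintAsBitVect(r_ala, radius=2, nBits=2048)

print(DataStructs.TanimotoSimilarity(fp_s, fp_r))  # 1.0: indistinguishable
```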
The similar property principle in cheminformatics states that compounds with similar chemical structures tend to have similar properties.34 There is indeed a rather weak trend that the distribution of the experimental ΔlogD narrows as the similarity of the compound pair increases. To better understand the effect of similarity on prediction accuracy, each compound in the test set was paired with every compound in the training set, and the pairwise property difference was then predicted by the trained MLP-ΔFP model. The prediction error of each pair was measured against the similarity of the two compounds in that pair (bottom panel of Fig. 3). Notably, it becomes more pronounced that the prediction error is smaller when the reference compound (i.e., the compound from the training set) is more similar to the test compound, as evidenced by the lines depicting the 95th percentile of the distribution. For physicochemical properties such as logD, a single heavy-atom change, such as introducing an ionizable amine or an alcohol, can drastically alter the property even though the resulting compound is very similar to the parent one, giving rise to the property cliffs manifested as a large property difference between two similar compounds (Fig. 3). Arguably, such effects are largely transferable and hence predictable, underlying the concept of matched molecular pair analysis in medicinal chemistry.19, 20 The analysis of the other two datasets shows qualitatively similar observations (Figure S3).
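The pairing analysis can be sketched as follows. This is a hedged illustration: `delta_model` is a hypothetical stand-in for the trained MLP-ΔFP, and the logD values are made up for demonstration.

```python
# Hedged sketch of the error-versus-similarity analysis: each test compound is
# paired with every training compound, the delta model predicts the pairwise
# property difference, and the error of the inferred absolute value is logged
# against the Tanimoto similarity of the pair.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

train = [("CCO", -0.31), ("CCCO", 0.25), ("c1ccccc1O", 1.46)]  # illustrative (SMILES, logD)
train_fps = [ecfp4(smi) for smi, _ in train]

def delta_model(fp_test, fp_ref):
    return 0.0  # placeholder for the trained MLP-ΔFP: predicts logD(test) - logD(ref)

sim_vs_error = []
for smi, y_true in [("CCCCO", 0.88)]:                          # illustrative test compound
    fp = ecfp4(smi)
    sims = DataStructs.BulkTanimotoSimilarity(fp, train_fps)
    for sim, fp_ref, (_, y_ref) in zip(sims, train_fps, train):
        y_pred = y_ref + delta_model(fp, fp_ref)               # inferred absolute logD
        sim_vs_error.append((sim, abs(y_pred - y_true)))
```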
Given the observation that reference compounds similar to a test compound yield a small prediction error, we investigated the impact of the number of reference compounds (or shots) on the prediction accuracy via the n-shot learning strategy. The compounds in the training set were ranked by their similarity to a test compound, and the top n compounds were chosen as references to infer the absolute property of the test compound. Notably, one-shot learning does not yield the lowest RMSE in comparison with the ensemble-based learning, even though the single reference is the most similar to the test compound (Fig. 5). The performance of the n-shot learning is stable for both the lipophilicity and ESOL datasets. The prediction performance deteriorates after eight reference compounds as the similarity to the test compounds decreases, recapitulating the impact of the similarity principle. This effect is most pronounced on the freesolv dataset, where the performance drops drastically after four reference compounds.
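A minimal sketch of this inference step is given below, assuming a trained pairwise model `delta_model` as above. The absolute property is recovered as the mean over the top-n references, and the standard deviation across those n estimates is kept for the uncertainty analysis discussed later.

```python
# Hedged sketch of n-shot inference: rank the training set by similarity to
# the test compound, take the top n as references, and average the inferred
# absolute values; the spread across references doubles as an uncertainty.
import numpy as np
from rdkit import DataStructs

def n_shot_predict(fp_test, train_fps, train_y, delta_model, n=8):
    sims = np.asarray(DataStructs.BulkTanimotoSimilarity(fp_test, train_fps))
    top = np.argsort(sims)[::-1][:n]         # indices of the n most similar references
    preds = [train_y[i] + delta_model(fp_test, train_fps[i]) for i in top]
    return float(np.mean(preds)), float(np.std(preds))
```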
Next we compare the performance of the n-shot learning when the MLP-ΔFP model was trained on pairs generated by the similarity-based pairing and the exhaustive pairing, respectively (Fig. 6). The number of reference compounds n was determined to give the lowest RMSE for each of the three datasets per pairing method. The similarity-based pairing consistently performs on par with the exhaustive pairing on all three datasets. In terms of computational cost, the similarity-based pairing significantly reduces the training time by cutting the number of training pairs from O(n²) to O(n).
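The contrast between the two pairing schemes can be sketched as follows (a simplified illustration; the neighbour search here is brute force for clarity, but the number of training pairs, which dominates the training time, grows linearly rather than quadratically with the dataset size).

```python
# Hedged sketch: exhaustive pairing enumerates all ~n^2/2 pairs, whereas
# similarity-based pairing keeps only the k nearest neighbours per compound,
# giving ~n*k training pairs, i.e. linear in n for a fixed small k.
from itertools import combinations
import numpy as np
from rdkit import DataStructs

def exhaustive_pairs(fps):
    return list(combinations(range(len(fps)), 2))      # O(n^2) pairs

def similarity_pairs(fps, k=5):
    pairs = []
    for i, fp in enumerate(fps):
        sims = np.asarray(DataStructs.BulkTanimotoSimilarity(fp, fps))
        sims[i] = -1.0                                 # never pair a compound with itself
        for j in np.argsort(sims)[::-1][:k]:           # k most similar partners
            pairs.append((i, int(j)))
    return pairs                                       # O(n*k) pairs
```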
Deep-Learning-Extracted Features Outperform the ECFP4 Fingerprint. The predefined ECFP4 fingerprint is rich in chemical information, but it is not task-specific and, in some cases, fails to distinguish between paired compounds (Fig. 4). Deep-learning models have been shown to outperform the ECFP4 fingerprint by extracting task-specific features from SMILES strings alone.7, 35, 36 To investigate whether the similarity-based pairing also applies to a deep-learning model, we incorporated the pretrained Chemformer into a Siamese neural network. We first investigated the effect of dropout on the model performance. Dropout is a machine learning technique that randomly zeroes elements of the input tensor with a given probability, and it has proven effective for regularization and preventing overfitting during training. In a Siamese network, however, the two branches draw independent dropout masks, so the effective dropout rate on the difference between the two hidden states is roughly doubled (see the sketch after Table 2). As shown in Table 2, the performance degrades significantly on the ESOL dataset at a dropout rate of 0.17, the rate that otherwise yields the best performance of the Chemformer. For simplicity, we refer to the performance of the Chemformer-SNN with no dropout in the following discussion.
Table 2
The effect of the dropout rate on the performance of the Chemformer-SNN.
| Dropout rate | Lipophilicity RMSE | Lipophilicity r² | Freesolv RMSE | Freesolv r² | ESOL RMSE | ESOL r² |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.62 | 0.73 | 1.12 | 0.91 | 0.73 | 0.88 |
| 0.05 | 0.61 | 0.74 | 1.10 | 0.91 | 0.84 | 0.84 |
| 0.1 | 0.61 | 0.74 | 1.08 | 0.92 | 0.88 | 0.82 |
| 0.17 | 0.63 | 0.72 | 1.09 | 0.92 | 0.90 | 0.81 |
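The doubling effect can be seen in a minimal PyTorch sketch of a Siamese head (our illustration, not the authors' exact architecture): each branch draws its own dropout mask, so an element of h1 − h2 is perturbed whenever either mask fires, with probability 1 − (1 − p)² ≈ 2p for small p.

```python
# Hedged PyTorch sketch of a Siamese regression head with a shared encoder.
import torch
import torch.nn as nn

class SiameseHead(nn.Module):
    def __init__(self, encoder, hidden_dim=512, p=0.1):
        super().__init__()
        self.encoder = encoder               # shared weights, e.g. a pretrained Chemformer
        self.dropout = nn.Dropout(p)
        self.out = nn.Linear(hidden_dim, 1)  # regresses the pairwise property difference

    def forward(self, x1, x2):
        h1 = self.dropout(self.encoder(x1))  # branch 1 draws its own dropout mask
        h2 = self.dropout(self.encoder(x2))  # branch 2 draws an independent mask
        return self.out(h1 - h2)             # an element is perturbed if either mask fired
```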
The prediction performance of the Chemformer-SNN with the n-shot learning becomes stable on all three datasets after five reference compounds (Fig. 7). In contrast to the MLP-ΔFP, there is no significant deterioration in the prediction performance as the number of reference compounds increases up to 20. The results of all models are summarized in Table 3. The transformer-based models, both Chemformer and Chemformer-SNN, outperform the two MLP models on all three datasets. The performance of the Chemformer-SNN is comparable to that of the Chemformer on the lipophilicity and freesolv datasets, and slightly worse on the ESOL dataset. The overall performance of the Chemformer-SNN is comparable to that of state-of-the-art machine learning models,2, 9, 31 suggesting that the similarity-based pairing is applicable to training a deep-learning-based Siamese neural network. Encouragingly, the similarity-based pairing together with the n-shot learning also improves the performance of random forest, particularly on the freesolv dataset.
Table 3
Summary of the prediction performance from the 10-fold cross-validation.ᵃ
| Model | Lipophilicity RMSE | Lipophilicity r² | Freesolv RMSE | Freesolv r² | ESOL RMSE | ESOL r² |
| --- | --- | --- | --- | --- | --- | --- |
| MLP-FP | 0.75 | 0.62 | 1.57 | 0.82 | 0.84 | 0.84 |
| MLP-ΔFP | 0.74 | 0.62 | 1.60 | 0.82 | 0.81 | 0.84 |
| RF-FP | 0.77 | 0.58 | 1.91 | 0.75 | 0.92 | 0.81 |
| RF-ΔFP | 0.74 | 0.62 | 1.62 | 0.81 | 0.83 | 0.84 |
| Chemformer | 0.58 | 0.76 | 1.07 | 0.91 | 0.58 | 0.92 |
| Chemformer-SNN | 0.62 | 0.74 | 1.12 | 0.91 | 0.73 | 0.88 |
ᵃ See Table S1 for a summary of the different models studied here.
Uncertainty Quantification. The n-shot learning provides a convenient way to quantify the uncertainty of a prediction: the standard deviation across the n reference-based estimates, where a high standard deviation indicates low confidence in the prediction. To visualize the uncertainty, confidence curves are adopted, which display how the error varies as compounds are sequentially removed from the lowest to the highest confidence.25 As shown in Fig. 8, the RMSE decreases on all three datasets when compounds with low confidence are sequentially removed, making the relationship between high confidence and small prediction error evident. For example, removing the 20% of compounds with the highest uncertainty decreases the RMSE from 1.1 to 0.7 on the freesolv dataset. Concomitantly, the average similarity of the reference compounds to the test compounds increases with the confidence, in line with the similarity principle. Intriguingly, when fewer than 10% of the compounds were left, an increase in RMSE was observed, most pronounced on the ESOL dataset. This could be ascribed to statistical noise from evaluating the RMSE on an insufficient number of compounds, which can be strongly affected by activity cliffs37, 38 or nonadditivity.39
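A confidence curve of the kind shown in Fig. 8 can be computed in a few lines (a sketch over assumed numpy arrays, with the n-shot standard deviation as the uncertainty):

```python
# Hedged sketch of a confidence curve: sort by uncertainty, then recompute the
# RMSE as the least confident compounds are removed one at a time.
import numpy as np

def confidence_curve(y_true, y_pred, y_std):
    order = np.argsort(y_std)                   # most confident predictions first
    y_true, y_pred = y_true[order], y_pred[order]
    fractions, rmses = [], []
    for kept in range(len(y_true), 0, -1):      # drop the least confident compound each step
        err = y_pred[:kept] - y_true[:kept]
        fractions.append(kept / len(y_true))
        rmses.append(float(np.sqrt(np.mean(err ** 2))))
    return fractions, rmses
```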
A detailed view of the correlation between the uncertainty and the average similarity of the reference compounds reveals a general trend that the uncertainty increases as the similarity decreases, most prominently on the lipophilicity dataset (Fig. 9). However, outliers do exist. High uncertainty at high similarity could indicate activity cliffs or nonlinear SAR contributions (e.g., the nonadditivity from double-transformation cycles). Intriguingly, low uncertainty at low similarity is observed as well.
Implications of the Similarity Principle in Machine Learning. To further evaluate the impact of the similarity principle on machine learning, we compared the prediction errors at different similarity cutoffs. For all models, a test compound is excluded from the evaluation if its highest similarity to any compound in the training set is below the given cutoff. At cutoffs of 0.3, 0.35, 0.4, 0.45, and 0.5, this excludes 1.4%, 4.7%, 9.7%, 15.6%, and 21.1% of the test compounds for lipophilicity; 5.7%, 9.3%, 16.9%, 23.7%, and 33.4% for freesolv; and 5.1%, 8.7%, 14.9%, 21.3%, and 30.9% for ESOL, respectively. As shown in Fig. 10, the RMSE decreases with increasing similarity for all models, and the correlation coefficient r² increases correspondingly, signifying the role of the similarity principle in machine learning. Our observations corroborate previous findings that the prediction error for a molecule depends strongly on its distance to the training molecules.30, 40, 41 The dependence of the prediction performance on the similarity is striking for both the MLP-ΔFP and the Chemformer-SNN. The similarity-based pairing is designed to capture the transferable effect of a small chemical transformation, inspired by the concept of matched molecular pair analysis. When the two paired compounds are very dissimilar, poor predictions can be expected, since the transformation then concerns the two molecules as a whole rather than a few local variations.
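The cutoff filtering itself amounts to the following sketch (assumed fingerprint lists, brute-force nearest-neighbour search for clarity):

```python
# Hedged sketch: keep a test compound only if its nearest training neighbour
# reaches the similarity cutoff, then evaluate RMSE/r2 on the retained subset.
from rdkit import DataStructs

def retained_indices(test_fps, train_fps, cutoff=0.4):
    keep = []
    for i, fp in enumerate(test_fps):
        if max(DataStructs.BulkTanimotoSimilarity(fp, train_fps)) >= cutoff:
            keep.append(i)
    return keep
```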