Similarity-Based Pairing Outperforms the Exhaustive Pairing. As illustrated in Fig. 3, the similarity-based pairing gives rise to pairs with similarities ranging from 0.2 to 1.0 and two peaks, one at 0.8 and another at 0.4; overall, the similarity values of the resulting pairs are relatively evenly distributed between 0.3 and 0.8. The ΔlogD of the resulting pairs follows a normal distribution centered at 0 and ranging from −4 to 4. The data points at a similarity of 1 mainly correspond to stereoisomers, and occasionally the two paired compounds are indistinguishable by the ECFP4 fingerprint (Fig. 4).
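The last point can be made concrete with a minimal RDKit sketch (an illustration, not part of the original workflow): by default, the Morgan/ECFP4 fingerprint ignores chirality, so two enantiomers yield identical bit vectors and thus a Tanimoto similarity of 1.

```python
# Illustrative only: ECFP4 (Morgan, radius 2) ignores chirality by default,
# so the two enantiomers of alanine produce identical fingerprints.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

s_ala = Chem.MolFromSmiles("C[C@H](N)C(=O)O")   # (S)-alanine
r_ala = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")  # (R)-alanine

fp_s = AllChem.GetMorganFingerprintAsBitVect(s_ala, radius=2, nBits=2048)
fp_r = AllChem.GetMorganFingerprintAsBitVect(r_ala, radius=2, nBits=2048)

print(DataStructs.TanimotoSimilarity(fp_s, fp_r))  # 1.0: indistinguishable
```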
The similar property principle in cheminformatics states that compounds with similar chemical structures tend to have similar properties.34 There is indeed a rather weak trend that the distribution of the experimental ΔlogD narrows as the similarity of the compound pair increases. To better understand the effect of similarity on prediction accuracy, each compound in the test set was paired with every compound in the training set, and the pairwise property difference was then predicted by the trained MLP-ΔFP model. The prediction error of each pair was measured against the similarity of the two compounds in that pair (bottom panel of Fig. 3). Notably, it becomes more pronounced that the prediction error is smaller when the reference compound (i.e., the compound from the training set) is more similar to the test compound, as evidenced by the lines depicting the 95th percentile of the distribution. For physicochemical properties such as logD, a single heavy-atom change, such as introducing an ionizable amine or an alcohol, can drastically alter the property even though the resulting compound is very similar to the parent one, giving rise to the property cliffs manifested as a large property difference between two similar compounds (Fig. 3). Arguably, such effects are largely transferable and hence predictable, underlying the concept of matched molecular pair analysis in medicinal chemistry.19, 20 The analysis of the other two datasets shows qualitatively similar observations (Figure S3).
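The pairing analysis can be sketched as follows. This is a hedged illustration: `delta_model` is a hypothetical stand-in for the trained MLP-ΔFP, and the logD values are made up for demonstration.

```python
# Hedged sketch of the error-versus-similarity analysis: each test compound is
# paired with every training compound, the delta model predicts the pairwise
# property difference, and the error of the inferred absolute value is logged
# against the Tanimoto similarity of the pair.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

train = [("CCO", -0.31), ("CCCO", 0.25), ("c1ccccc1O", 1.46)]  # illustrative (SMILES, logD)
train_fps = [ecfp4(smi) for smi, _ in train]

def delta_model(fp_test, fp_ref):
    return 0.0  # placeholder for the trained MLP-ΔFP: predicts logD(test) - logD(ref)

sim_vs_error = []
for smi, y_true in [("CCCCO", 0.88)]:                          # illustrative test compound
    fp = ecfp4(smi)
    sims = DataStructs.BulkTanimotoSimilarity(fp, train_fps)
    for sim, fp_ref, (_, y_ref) in zip(sims, train_fps, train):
        y_pred = y_ref + delta_model(fp, fp_ref)               # inferred absolute logD
        sim_vs_error.append((sim, abs(y_pred - y_true)))
```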
Given the observation that reference compounds similar to a test compound yield a small prediction error, we investigated the impact of the number of reference compounds (or shots) on the prediction accuracy via the n-shot learning strategy. The compounds in the training set were ranked by their similarity to a test compound, and the top n compounds were chosen as references to infer the absolute property of the test compound. Notably, one-shot learning does not yield the lowest RMSE in comparison with the ensemble-based learning, even though the single reference is the most similar to the test compound (Fig. 5). The performance of the n-shot learning is stable for both the lipophilicity and ESOL datasets. The prediction performance deteriorates after eight reference compounds as the similarity to the test compounds decreases, recapitulating the impact of the similarity principle. This effect is most pronounced on the freesolv dataset, where the performance drops drastically after four reference compounds.
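A minimal sketch of this inference step is given below, assuming a trained pairwise model `delta_model` as above. The absolute property is recovered as the mean over the top-n references, and the standard deviation across those n estimates is kept for the uncertainty analysis discussed later.

```python
# Hedged sketch of n-shot inference: rank the training set by similarity to
# the test compound, take the top n as references, and average the inferred
# absolute values; the spread across references doubles as an uncertainty.
import numpy as np
from rdkit import DataStructs

def n_shot_predict(fp_test, train_fps, train_y, delta_model, n=8):
    sims = np.asarray(DataStructs.BulkTanimotoSimilarity(fp_test, train_fps))
    top = np.argsort(sims)[::-1][:n]         # indices of the n most similar references
    preds = [train_y[i] + delta_model(fp_test, train_fps[i]) for i in top]
    return float(np.mean(preds)), float(np.std(preds))
```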
Next we compare the performance of the n-shot learning when the MLP-ΔFP model was trained on pairs generated by the similarity-based pairing and the exhaustive pairing, respectively (Fig. 6). The number of reference compounds n was determined to give the lowest RMSE for each of the three datasets per pairing method. The similarity-based pairing consistently performs on par with the exhaustive pairing on all three datasets. In terms of computational cost, the similarity-based pairing significantly reduces the training time by cutting the number of training pairs from O(n²) to O(n).
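The contrast between the two pairing schemes can be sketched as follows (a simplified illustration; the neighbour search here is brute force for clarity, but the number of training pairs, which dominates the training time, grows linearly rather than quadratically with the dataset size).

```python
# Hedged sketch: exhaustive pairing enumerates all ~n^2/2 pairs, whereas
# similarity-based pairing keeps only the k nearest neighbours per compound,
# giving ~n*k training pairs, i.e. linear in n for a fixed small k.
from itertools import combinations
import numpy as np
from rdkit import DataStructs

def exhaustive_pairs(fps):
    return list(combinations(range(len(fps)), 2))      # O(n^2) pairs

def similarity_pairs(fps, k=5):
    pairs = []
    for i, fp in enumerate(fps):
        sims = np.asarray(DataStructs.BulkTanimotoSimilarity(fp, fps))
        sims[i] = -1.0                                 # never pair a compound with itself
        for j in np.argsort(sims)[::-1][:k]:           # k most similar partners
            pairs.append((i, int(j)))
    return pairs                                       # O(n*k) pairs
```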
Deep-Learning-Extracted Features Outperform the ECFP4 Fingerprint. The predefined ECFP4 fingerprint is rich in chemical information, but it is not task-specific and, in some cases, fails to distinguish between paired compounds (Fig. 4). Deep-learning models have been shown to outperform the ECFP4 fingerprint by extracting task-specific features from SMILES strings alone.7, 35, 36 To investigate whether the similarity-based pairing also applies to a deep-learning model, we incorporated the pretrained Chemformer into a Siamese neural network. We first investigated the effect of dropout on the model performance. Dropout is a machine learning technique that randomly zeroes elements of the input tensor with a given probability, and it has proven effective for regularization and preventing overfitting during training. In a Siamese network, however, the two branches draw independent dropout masks, so the effective dropout rate on the difference between the two hidden states is roughly doubled (see the sketch after Table 2). As shown in Table 2, the performance degrades significantly on the ESOL dataset at a dropout rate of 0.17, the rate that otherwise yields the best performance of the Chemformer. For simplicity, we refer to the performance of the Chemformer-SNN with no dropout in the following discussion.
Table 2
The effect of the dropout rate on the performance of the Chemformer-SNN.
| Dropout rate | Lipophilicity RMSE | Lipophilicity r² | Freesolv RMSE | Freesolv r² | ESOL RMSE | ESOL r² |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.62 | 0.73 | 1.12 | 0.91 | 0.73 | 0.88 |
| 0.05 | 0.61 | 0.74 | 1.10 | 0.91 | 0.84 | 0.84 |
| 0.1 | 0.61 | 0.74 | 1.08 | 0.92 | 0.88 | 0.82 |
| 0.17 | 0.63 | 0.72 | 1.09 | 0.92 | 0.90 | 0.81 |
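The doubling effect can be seen in a minimal PyTorch sketch of a Siamese head (our illustration, not the authors' exact architecture): each branch draws its own dropout mask, so an element of h1 − h2 is perturbed whenever either mask fires, with probability 1 − (1 − p)² ≈ 2p for small p.

```python
# Hedged PyTorch sketch of a Siamese regression head with a shared encoder.
import torch
import torch.nn as nn

class SiameseHead(nn.Module):
    def __init__(self, encoder, hidden_dim=512, p=0.1):
        super().__init__()
        self.encoder = encoder               # shared weights, e.g. a pretrained Chemformer
        self.dropout = nn.Dropout(p)
        self.out = nn.Linear(hidden_dim, 1)  # regresses the pairwise property difference

    def forward(self, x1, x2):
        h1 = self.dropout(self.encoder(x1))  # branch 1 draws its own dropout mask
        h2 = self.dropout(self.encoder(x2))  # branch 2 draws an independent mask
        return self.out(h1 - h2)             # an element is perturbed if either mask fired
```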
The prediction performance of the Chemformer-SNN with the n-shot learning becomes stable on all three datasets after five reference compounds (Fig. 7). In contrast to the MLP-ΔFP, there is no significant deterioration in the prediction performance as the number of reference compounds increases up to 20. The results of all models are summarized in Table 3. The transformer-based models, both Chemformer and Chemformer-SNN, outperform the two MLP models on all three datasets. The performance of the Chemformer-SNN is comparable to that of the Chemformer on the lipophilicity and freesolv datasets, and slightly worse on the ESOL dataset. The overall performance of the Chemformer-SNN is comparable to that of state-of-the-art machine learning models,2, 9, 31 suggesting that the similarity-based pairing is applicable to training a deep-learning-based Siamese neural network. Encouragingly, the similarity-based pairing together with the n-shot learning also improves the performance of random forest, particularly on the freesolv dataset.
Table 3
Summary of the prediction performance from the 10-fold cross-validation.ᵃ
| Model | Lipophilicity RMSE | Lipophilicity r² | Freesolv RMSE | Freesolv r² | ESOL RMSE | ESOL r² |
| --- | --- | --- | --- | --- | --- | --- |
| MLP-FP | 0.75 | 0.62 | 1.57 | 0.82 | 0.84 | 0.84 |
| MLP-ΔFP | 0.74 | 0.62 | 1.60 | 0.82 | 0.81 | 0.84 |
| RF-FP | 0.77 | 0.58 | 1.91 | 0.75 | 0.92 | 0.81 |
| RF-ΔFP | 0.74 | 0.62 | 1.62 | 0.81 | 0.83 | 0.84 |
| Chemformer | 0.58 | 0.76 | 1.07 | 0.91 | 0.58 | 0.92 |
| Chemformer-SNN | 0.62 | 0.74 | 1.12 | 0.91 | 0.73 | 0.88 |
ᵃ See Table S1 for a summary of the different models studied here.
Uncertainty Quantification. The n-shot learning provides a convenient way to quantify the uncertainty of a prediction: the standard deviation across the n reference-based estimates, where a high standard deviation indicates low confidence in the prediction. To visualize the uncertainty, confidence curves are adopted, which display how the error varies as compounds are sequentially removed from the lowest to the highest confidence.25 As shown in Fig. 8, the RMSE decreases on all three datasets when compounds with low confidence are sequentially removed, making the relationship between high confidence and small prediction error evident. For example, removing the 20% of compounds with the highest uncertainty decreases the RMSE from 1.1 to 0.7 on the freesolv dataset. Concomitantly, the average similarity of the reference compounds to the test compounds increases with the confidence, in line with the similarity principle. Intriguingly, when fewer than 10% of the compounds were left, an increase in RMSE was observed, most pronounced on the ESOL dataset. This could be ascribed to statistical noise from evaluating the RMSE on an insufficient number of compounds, which can be strongly affected by activity cliffs37, 38 or nonadditivity.39
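A confidence curve of the kind shown in Fig. 8 can be computed in a few lines (a sketch over assumed numpy arrays, with the n-shot standard deviation as the uncertainty):

```python
# Hedged sketch of a confidence curve: sort by uncertainty, then recompute the
# RMSE as the least confident compounds are removed one at a time.
import numpy as np

def confidence_curve(y_true, y_pred, y_std):
    order = np.argsort(y_std)                   # most confident predictions first
    y_true, y_pred = y_true[order], y_pred[order]
    fractions, rmses = [], []
    for kept in range(len(y_true), 0, -1):      # drop the least confident compound each step
        err = y_pred[:kept] - y_true[:kept]
        fractions.append(kept / len(y_true))
        rmses.append(float(np.sqrt(np.mean(err ** 2))))
    return fractions, rmses
```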
A detailed view of the correlation between the uncertainty and the average similarity of the reference compounds reveals a general trend that the uncertainty increases as the similarity decreases, most prominently on the lipophilicity dataset (Fig. 9). However, outliers do exist. High uncertainty at high similarity could indicate activity cliffs or nonlinear SAR contributions (e.g., the nonadditivity from double-transformation cycles). Intriguingly, low uncertainty at low similarity is observed as well.
Implications of the Similarity Principle in Machine Learning. To further evaluate the impact of the similarity principle on machine learning, we compared the prediction errors at different similarity cutoffs. For all models, a test compound is excluded from the evaluation if its highest similarity to any compound in the training set is below the given cutoff. At cutoffs of 0.3, 0.35, 0.4, 0.45, and 0.5, this excludes 1.4%, 4.7%, 9.7%, 15.6%, and 21.1% of the test compounds for lipophilicity; 5.7%, 9.3%, 16.9%, 23.7%, and 33.4% for freesolv; and 5.1%, 8.7%, 14.9%, 21.3%, and 30.9% for ESOL, respectively. As shown in Fig. 10, the RMSE decreases with increasing similarity for all models, and the correlation coefficient r² increases correspondingly, signifying the role of the similarity principle in machine learning. Our observations corroborate previous findings that the prediction error for a molecule depends strongly on its distance to the training molecules.30, 40, 41 The dependence of the prediction performance on the similarity is striking for both the MLP-ΔFP and the Chemformer-SNN. The similarity-based pairing is designed to capture the transferable effect of a small chemical transformation, inspired by the concept of matched molecular pair analysis. When the two paired compounds are very dissimilar, poor predictions can be expected, since the transformation then concerns the two molecules as a whole rather than a few local variations.
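The cutoff filtering itself amounts to the following sketch (assumed fingerprint lists, brute-force nearest-neighbour search for clarity):

```python
# Hedged sketch: keep a test compound only if its nearest training neighbour
# reaches the similarity cutoff, then evaluate RMSE/r2 on the retained subset.
from rdkit import DataStructs

def retained_indices(test_fps, train_fps, cutoff=0.4):
    keep = []
    for i, fp in enumerate(test_fps):
        if max(DataStructs.BulkTanimotoSimilarity(fp, train_fps)) >= cutoff:
            keep.append(i)
    return keep
```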