Although deep learning can automatically extract features in relatively simple tasks such as image analysis, the construction of appropriate representations remains essential for molecular property prediction because of the intricate complexity of molecules. Additionally, generating labeled data for supervised learning in the molecular sciences is often expensive, time-consuming, and ethically constrained, resulting in small, diverse datasets that are challenging to learn from. In this work, we develop a self-supervised learning approach that uses a masking strategy to pre-train transformer models on more than 700 million unlabeled molecules drawn from multiple databases. The intrinsic chemical logic learned in this way enables the extraction of predictive representations from task-specific molecular sequences through a fine-tuning process. To understand the importance of self-supervised learning from unlabeled molecules, we assemble three models pre-trained on different combinations of these databases. Moreover, we propose a new protocol that uses data traits to automatically select the optimal model for a given predictive task. To validate the proposed representation and protocol, we consider 10 benchmark datasets in addition to 38 ligand-based virtual screening datasets. Extensive validation indicates that the proposed representation and selection protocol achieve superb performance.
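
To make the masking strategy concrete, the sketch below corrupts a tokenized SMILES string in the BERT style: a random subset of tokens is replaced with a mask symbol, and the original tokens become the targets the transformer must recover during pre-training. This is a minimal illustration under stated assumptions, not the exact implementation used in this work; the regex tokenizer, the 15% masking ratio, and the names tokenize, mask_tokens, and MASK are all illustrative.

```python
import random
import re

# Simplified SMILES tokenizer: bracket atoms, common two-letter elements,
# single characters. The tokenizer used in this work may differ.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|Si|@@|[A-Za-z]|\d|[=#()+\-/\\%]")
MASK = "[MASK]"


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens for masked pre-training."""
    return TOKEN_RE.findall(smiles)


def mask_tokens(tokens: list[str], ratio: float = 0.15, seed: int | None = None):
    """Randomly replace a fraction of tokens with MASK and record the original
    tokens as prediction targets (BERT-style masked-token objective)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < ratio:
            inputs.append(MASK)
            labels.append(tok)    # the pre-trained model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)   # position excluded from the loss
    return inputs, labels


if __name__ == "__main__":
    smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used here only as example input
    corrupted, targets = mask_tokens(tokenize(smiles), seed=0)
    print(corrupted)
    print(targets)
```

Feeding such (corrupted sequence, target) pairs to a transformer encoder with a cross-entropy loss over the masked positions gives a self-supervised pre-training objective of this kind; fine-tuning then replaces the masked-token head with a task-specific predictor.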