Predicting the properties of a chemical molecule is of great importance in many applications, including drug discovery and material design. Machine learning-based models promise to enable more accurate and faster molecular property predictions than current state-of-the-art techniques, such as Density Functional Theory calculations or wet-lab experiments. Various supervised machine learning models, including graph neural networks, have demonstrated promising performance on molecular property prediction tasks. However, the vast chemical space and the limited availability of property labels make supervised learning challenging, calling for learning a general-purpose molecular representation. Recently, unsupervised transformer-based language models pre-trained on large unlabeled corpora have produced state-of-the-art results on many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. The model couples a linear attention mechanism with highly distributed training on SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation outperforms existing supervised and self-supervised graph neural network and language-model baselines on several classification and regression tasks from ten benchmark datasets, while performing competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict a variety of distinct molecular properties, including quantum-chemical properties.
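
To make the architectural ingredients named above concrete, the following is a minimal, self-contained sketch (not the authors' implementation) of one encoder attention layer that combines rotary positional embeddings with a linear, kernel feature-map attention mechanism, applied to a character-level tokenization of a SMILES string. The elu(x)+1 feature map, the toy tokenizer, and all dimensions and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotary_embedding(x: torch.Tensor) -> torch.Tensor:
    """Apply rotary position encoding to x of shape (batch, heads, seq, head_dim)."""
    b, h, n, d = x.shape
    half = d // 2
    freqs = 10000 ** (-torch.arange(0, half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(n, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()            # (seq, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class LinearAttention(nn.Module):
    """Self-attention with linear complexity in sequence length, rotary Q/K."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.reshape(b, n, self.heads, self.head_dim).transpose(1, 2)
        q, k, v = map(split, (q, k, v))
        q, k = rotary_embedding(q), rotary_embedding(k)
        q, k = F.elu(q) + 1, F.elu(k) + 1            # positive feature maps
        kv = torch.einsum("bhnd,bhne->bhde", k, v)   # sum_n phi(k_n) v_n^T
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        return self.out(out.transpose(1, 2).reshape(b, n, -1))


# Toy usage: embed a character-level SMILES string and run one attention layer.
smiles = "CC(=O)Nc1ccccc1"                           # acetanilide
vocab = {ch: i for i, ch in enumerate(sorted(set(smiles)))}
tokens = torch.tensor([[vocab[c] for c in smiles]])
embed = nn.Embedding(len(vocab), 64)
attn = LinearAttention(dim=64, heads=4)
print(attn(embed(tokens)).shape)                     # torch.Size([1, 15, 64])
```

Because the key-value summary `kv` is accumulated once and reused for every query position, the cost of this attention grows linearly rather than quadratically with SMILES length, which is what makes pre-training on very large molecule corpora tractable.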