Data pre-processing
The data cleaning procedures implemented in this study closely follow those established by Hu et al.[33]. Four rules, illustrated in the code sketch after the list, were applied to all raw data:
(1) Molecules were limited to those containing only the elements H, C, N, O, F, P, S, Cl, Br, and I;
(2) Molecules containing isotopes were excluded;
(3) Duplicates were removed;
(4) Molecules with a molecular weight (MW) below 200 or above 500, or with a total atom count under 10, were removed.
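For concreteness, a minimal sketch of these four rules using RDKit follows; the function names are ours, and RDKit's default heavy-atom count stands in for the total atom count, so this is an illustration of the rules rather than the study's exact implementation.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

ALLOWED_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}

def passes_filters(smiles):
    """Apply rules (1), (2), and (4) to a single SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                                    # unparsable input
        return False
    for atom in mol.GetAtoms():
        if atom.GetSymbol() not in ALLOWED_ELEMENTS:   # rule (1)
            return False
        if atom.GetIsotope() != 0:                     # rule (2)
            return False
    mw = Descriptors.MolWt(mol)                        # rule (4)
    if mw < 200 or mw > 500 or mol.GetNumAtoms() < 10:
        return False
    return True

def clean(smiles_list):
    """Rule (3): deduplicate on the canonical SMILES representation."""
    seen, kept = set(), []
    for s in filter(passes_filters, smiles_list):
        canonical = Chem.MolToSmiles(Chem.MolFromSmiles(s))
        if canonical not in seen:
            seen.add(canonical)
            kept.append(canonical)
    return kept
```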
All molecules were then converted into the canonical Simplified Molecular Input Line Entry System (SMILES) format with atom chirality information preserved. To improve diversity, the string-based edit distance was used to compute the similarity between SMILES strings, and a string was kept only if its similarity to every other retained string was below 0.8. Furthermore, a vocabulary was built for converting the input SMILES strings into tokens for ORGAN, and SMILES strings containing out-of-vocabulary tokens were removed. Further details about the vocabulary are available in Table S1.
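The diversity filter described above might be implemented as below; the normalized Levenshtein similarity and the greedy keep-first strategy are our assumptions about the exact procedure.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Normalized similarity in [0, 1]; 1.0 means identical strings.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def diversity_filter(smiles_list, threshold=0.8):
    """Keep a SMILES only if it is < threshold similar to all kept ones."""
    kept = []
    for s in smiles_list:
        if all(similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```

Note that pairwise comparison is quadratic in the dataset size, so a production pipeline would likely batch or index this step.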
Algorithm of ORGAN
Goodfellow et al. introduced Generative Adversarial Networks (GANs), a framework that trains a generator (\(G\)) to mimic a data distribution and a discriminator (\(D\)) to judge whether a sample is real or generated[35]. The two networks compete: \(G\) aims to generate data resembling the real dataset, while \(D\) distinguishes real from generated data. Training proceeds until \(G\) produces data that \(D\) can no longer distinguish from real samples, demonstrating GANs' ability to create highly authentic data.
Key to this framework is the adversarial interaction between \(G\) and \(D\), formalized in Eq. 1. \(G\) seeks to minimize \(\log(1 - D(G(z)))\), i.e., to convince \(D\) that its generated data \(G(z)\) are authentic. Simultaneously, \(D\) aims to correctly classify real data \(x\) and generated data, sharpening its ability to detect \(G\)'s outputs. Training converges when the distribution of the generated data mirrors that of the real data, at which point \(G\) can consistently fool \(D\).
$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$
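To ground Eq. 1, the following is a minimal PyTorch sketch of one alternating generator/discriminator update; `G`, `D`, the optimizers, and the latent dimension are assumed, illustrative names rather than ORGAN's actual code.

```python
import torch

def gan_step(G, D, real_x, opt_G, opt_D, z_dim=128):
    """One alternating update of the Eq. 1 minimax game (illustrative)."""
    batch = real_x.size(0)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch, z_dim)
    fake_x = G(z).detach()                      # block gradients into G
    d_loss = -(torch.log(D(real_x) + 1e-8).mean()
               + torch.log(1 - D(fake_x) + 1e-8).mean())
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: minimize log(1 - D(G(z)))
    z = torch.randn(batch, z_dim)
    g_loss = torch.log(1 - D(G(z)) + 1e-8).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```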
The ORGAN algorithm extends seqGAN[28, 36]: a generator \(G_{\theta}\) produces sequences \(Y_{1:T}\), and a discriminator \(D_{\phi}\) classifies them against real sequences. For discrete data, the sampling process is inherently non-differentiable. This challenge can be overcome by training \(G_{\theta}\) with reinforcement learning, treating it as an agent in an environment that optimizes its sequence-generation policy. A defined reward function, \(R(Y_{1:T})\), motivates \(G_{\theta}\) to produce sequences that both satisfy quality criteria and remain indistinguishable to \(D_{\phi}\), mixing an adversarial reward with domain objectives (Eq. 5).
Given any partial sequence \(Y_{1:t-1}\), corresponding to state \(s_t\), the agent must select an action: the next token \(y_t\). The agent's decision-making follows a stochastic policy, denoted by \(G_{\theta}(y_t \mid Y_{1:t-1})\), with the primary objective of maximizing the expected long-term reward. This lets the generator learn strategies for producing high-quality sequences in a discrete data space. The objective function is represented as follows:
$$J(\theta) = \mathbb{E}\left[R(Y_{1:T}) \mid s_0, \theta\right] = \sum_{y_1 \in \mathcal{Y}} G_{\theta}(y_1 \mid s_0) \cdot Q(s_0, y_1) \tag{2}$$
Here, \(s_0\) is a fixed initial state. The action-value function \(Q(s,a)\) gives the expected reward for taking action \(a\) under the current policy and then following that policy to complete the sequence. For a complete sequence \(Y_{1:T}\), \(Q(s = Y_{1:T-1}, a = y_T) = R(Y_{1:T})\); for partial sequences, \(Q\) must also account for the expected future return once the sequence is completed. To estimate it, \(N\) iterations of Monte Carlo search are conducted, producing rollout sequences under the guidance of policy \(G_{\theta}\):
$$\mathrm{MC}^{G_{\theta}}(Y_{1:t}; N) = \{Y_{1:T}^{1}, \dots, Y_{1:T}^{N}\} \tag{3}$$
In this setup, \(Y_{1:t}^{n} = Y_{1:t}\), and \(Y_{t+1:T}^{n}\) is generated by random sampling under policy \(G_{\theta}\). The action-value function \(Q(s,a)\) is then updated as:
$$Q(Y_{1:t-1}, y_{t}) = \begin{cases} \dfrac{1}{N}\sum_{n=1}^{N} R(Y_{1:T}^{n}), \quad Y_{1:T}^{n} \in \mathrm{MC}^{G_{\theta}}(Y_{1:t}; N), & \text{if } t < T, \\ R(Y_{1:T}), & \text{if } t = T. \end{cases} \tag{4}$$
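The rollout estimate in Eqs. 3-4 can be written compactly as below; `generator.sample_completion` and `reward` are assumed interfaces of our own naming, not ORGAN's actual API.

```python
def q_value(generator, reward, prefix, next_token, T, N=16):
    """Estimate Q(Y_{1:t-1}, y_t) per Eq. 4 using N Monte Carlo rollouts (Eq. 3).

    generator.sample_completion(partial, max_len) is assumed to finish a
    partial token sequence by sampling from the current policy G_theta;
    reward maps a complete sequence Y_{1:T} to R(Y_{1:T}).
    """
    partial = list(prefix) + [next_token]     # Y_{1:t}
    if len(partial) == T:                     # t = T: the sequence is complete
        return reward(partial)
    # t < T: complete the sequence N times and average the terminal rewards
    rollouts = [generator.sample_completion(partial, max_len=T)
                for _ in range(N)]
    return sum(reward(y) for y in rollouts) / N
```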
The reward function combines feedback from \(D_{\phi}\) with other metrics, weighted by \(\lambda\). This blending enables dynamic training and helps prevent mode collapse by penalizing repetition and promoting diversity in the generated sequences.
$$R(Y_{1:T}) = \lambda D_{\phi}(Y_{1:T}) + (1 - \lambda) O_{i}(Y_{1:T}) \tag{5}$$
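Eq. 5 amounts to a one-line blend; the sketch below assumes `D` returns the discriminator's "real" probability and `objective` returns a task metric scaled to [0, 1].

```python
def mixed_reward(sequence, D, objective, lam=0.5):
    """Eq. 5: blend adversarial feedback with a domain objective O_i.

    lam = 1 recovers a purely adversarial (seqGAN-style) reward, while
    lam = 0 optimizes the domain objective alone.
    """
    return lam * D(sequence) + (1.0 - lam) * objective(sequence)
```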
Details of training ORGAN
To enhance ORGAN's ability to generate compounds with development potential, 1 million "real" samples were collected from the ZINC database.
Pre-training of ORGAN's generator and discriminator was required before adversarial training. The generator was first trained on a diverse subset of 800,000 molecules, with 200,000 set aside for validation to track progress, using a batch size of 64 and stopping when the validation loss stagnated.
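As a rough illustration of this stage, the sketch below performs next-token maximum-likelihood pre-training with early stopping; the data loader, the model's logits interface, and the patience value are our assumptions.

```python
import torch
import torch.nn.functional as F

def pretrain_generator(G, loader, optimizer, patience=3):
    """Maximum-likelihood pre-training sketch: next-token cross-entropy.

    loader yields (batch, T) token tensors; G maps an input prefix to
    per-step vocabulary logits. Training stops once the epoch loss has
    failed to improve for `patience` consecutive epochs.
    """
    best, stale = float("inf"), 0
    while stale < patience:
        total = 0.0
        for tokens in loader:
            logits = G(tokens[:, :-1])             # predict each next token
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                tokens[:, 1:].reshape(-1))
            optimizer.zero_grad(); loss.backward(); optimizer.step()
            total += loss.item()
        best, stale = (total, 0) if total < best else (best, stale + 1)
    return G
```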
Initial discriminator training used "real" (positive) and generated (negative) samples, split 8:2 into training and validation sets, with a batch size of 64; training stopped when the validation loss no longer improved.
The generator is trained with the policy-gradient method, optimizing its parameters to maximize the cumulative reward. This reward combines feedback from the discriminator with domain-specific objective functions defined by the task. In this study, solubility and molecular docking score serve as the objectives and are linearly combined with specified weights, guiding the generator to produce samples that meet both criteria.
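The linear combination of objectives might look like the following; the scorer callables and weights are placeholders for the study's actual solubility and docking functions.

```python
def combined_objective(smiles, scorers, weights):
    """Weighted linear combination of per-molecule objective scores.

    scorers: callables mapping a SMILES string to a score in [0, 1]
             (e.g. a solubility estimator and a normalized docking score);
    weights: matching linear weights chosen for the task.
    """
    return sum(w * f(smiles) for f, w in zip(scorers, weights))
```

For example, `combined_objective(s, [sol, dock], [0.5, 0.5])` would average hypothetical solubility and docking scorers with equal weight.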
The discriminator is trained with an adversarial loss to differentiate real data from data generated by the generator, together with a classification loss computed over the multi-class tokens of SMILES sequences. This dual objective lets the discriminator both separate "real" from "fake" data and classify real data, improving its evaluation of generated samples. Additionally, the discriminator employs convolutional layers to extract features from sequences and compares generated and real sequences at the feature level, encouraging the generated data to match the real data statistically.
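A text-CNN discriminator of the kind described here could be sketched as follows; the embedding size, filter widths, and filter counts are illustrative choices, not ORGAN's published hyperparameters.

```python
import torch
import torch.nn as nn

class ConvDiscriminator(nn.Module):
    """Illustrative CNN over SMILES token sequences (dimensions are ours)."""

    def __init__(self, vocab_size, emb_dim=64, n_filters=100, widths=(3, 5, 7)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # One 1-D convolution per filter width, as in text-CNN discriminators
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, w, padding=w // 2) for w in widths)
        self.head = nn.Linear(n_filters * len(widths), 1)

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)     # (batch, emb_dim, seq_len)
        # Max-pool each feature map over time to get sequence-level features
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        logits = self.head(torch.cat(feats, dim=1))
        return torch.sigmoid(logits).squeeze(1)    # P("real") per sequence
```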
During adversarial training, the generator is updated after every 64 generated sequences (for solubility as the objective) or 32 sequences (for affinity as the objective), using these sample batches to improve both discriminative accuracy and generator efficiency.
Discriminator updates mirror the generator settings, training on balanced batches of "fake" and "real" samples and terminating when the per-epoch loss plateaus. A constant learning rate of 0.0001 is applied across all training sessions, using the Adam optimizer for its reliability and efficiency.
In molecular generation tasks, where molecular sequences are discrete and the sampling process is non-differentiable, the generator is optimized with the policy-gradient method, as sketched below.
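A minimal REINFORCE-style update consistent with Eqs. 2-4, assuming the per-token log-probabilities of the sampled tokens and their Monte Carlo Q estimates are already available:

```python
import torch

def policy_gradient_step(log_probs, q_values, optimizer):
    """One REINFORCE-style update: ascend E[Q * grad log G_theta(y_t | s_t)].

    log_probs: (batch, T) log G_theta(y_t | Y_{1:t-1}) for the sampled tokens
    q_values:  (batch, T) Monte Carlo Q estimates (treated as constants)
    """
    # Negate so that gradient descent on `loss` ascends the expected reward
    loss = -(log_probs * q_values.detach()).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```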