Drug discovery and development are two of the most important tasks of the pharmaceutical industry. To meet customers’ increasing demands while guaranteeing good potency and minimal side effects, the structures of new drug molecules have become increasingly complicated. Meanwhile, drugs of such complexity cannot be manufactured within an acceptable time using existing R&D technology; that is, extensive pharmaceutical R&D activity is required, and a relatively short discovery-development-deployment cycle is desirable[1].
Over recent decades, innovations in organic synthesis, long a rate-limiting factor, have significantly enabled the discovery and development of important life-changing medicines, thereby improving the health of patients worldwide[2, 3]. Innovations and excellence in organic synthesis are expected to remain the most powerful driver for all phases of drug discovery and development. Recently, chemists have enthusiastically applied advanced machine learning and artificial intelligence (AI) technologies to the synthesis of drug molecules, for example, the AI-driven discovery of drug molecules[4-6], automated planning of synthetic routes[7-9], machine learning-driven optimization of reaction conditions[10-12] and autonomous assembly of synthetic processes[13-15].
Synthesis planning, regarded as the central element of organic synthesis, can be traced back to the 1960s[16]. Traditional computer-aided approaches to synthesis planning suffer from various disadvantages, such as low efficiency, poor repeatability and high experimental cost. Additionally, the continuing discovery of new compounds and reactions urgently calls for novel synthesis planning approaches: the numbers of reactions and compounds currently contained in the Reaxys database exceed 40 million and 100 million, respectively. During the last three years, a variety of machine learning and AI methods, such as random forests, automated reasoning, support vector machines and, more recently, deep learning, have demonstrated their capacity for organic molecule discovery, design and production[13, 17-19]. Clearly, applying these new technologies to end-to-end organic molecule discovery and development will be key to achieving fully automated synthesis planning[20].
Retrosynthetic analysis is a canonical technique used to plan the synthesis of small organic molecules for drug discovery[21]. Generally, retrosynthetic analysis consists of four steps: (1) determine the target compound; (2) disconnect bonds in the target compound that, according to known chemical knowledge, are considered easy to form, treating each disconnection as the reverse of a reaction, and thereby search for possible precursors; (3) repeat step 2 for all precursors to form a synthesis tree, and expand it until all precursors are available; and (4) evaluate each branch of the synthesis tree individually and take the most promising branch as the optimal route. Because the same group of precursors may react at different groups of reaction sites, the real reaction in a branch may differ from the expected one; step 4 is therefore critical, but also the most difficult. Forward reaction prediction is necessary to guarantee the correct evaluation of each branch. The main aim of this paper is to present machine learning approaches for assisting forward reaction prediction.
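The four steps above can be sketched as a recursive search. The disconnection rules, availability check and route-scoring function below are hypothetical placeholders for illustration only, not the system described in this paper:

```python
# Sketch of the four-step retrosynthetic analysis (hypothetical helpers).
def plan(target, disconnect, available, score, depth=0, max_depth=3):
    """Return the best-scoring synthesis route (list of reactions) for target."""
    # Step 3 termination: the precursor is available, so no further expansion.
    if available(target) or depth == max_depth:
        return []
    routes = []
    # Step 2: disconnect bonds to propose precursor sets (reverse reactions).
    for precursors in disconnect(target):
        route = [(precursors, target)]
        for p in precursors:
            route += plan(p, disconnect, available, score, depth + 1, max_depth)
        routes.append(route)
    # Step 4: evaluate every branch and keep the most promising one.
    return max(routes, key=score) if routes else []

# Toy usage: "AB" disconnects into "A" + "B", both of which are available.
toy_disconnect = lambda m: [[m[:1], m[1:]]] if len(m) > 1 else []
toy_available = lambda m: len(m) == 1
route = plan("AB", toy_disconnect, toy_available, score=len)
```

The toy route contains a single reverse reaction, precursors ["A", "B"] forming "AB"; a real planner would rank branches by a learned score rather than route length.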
Existing approaches to forward reaction prediction can be categorized as template-based or template-free. For example, Coley et al.[17] applied reaction templates to reactants to generate as many candidate reactions as possible, which were then used to train a neural network. Among template-free methods, Schwaller et al.[22] likened chemical reactions, from reactants to products, to translations from one language to another, so that forward reaction prediction could be cast as machine translation and solved by training a sequence-to-sequence recurrent neural network that directly takes the reactants’ simplified molecular input line entry specification (SMILES) as input.
Because the research target is to discover new reactions that are meaningful for organic synthesis according to existing reaction mechanisms, this paper adopts the template-based approach to fully reuse the reaction rules summarized from experimental results to date. Specifically, a novel hard-threshold-based deep neural network is adopted to improve the accuracy of forward reaction prediction: hard-threshold activation and the target propagation algorithm are implemented by introducing mixed convex-combinatorial optimization, and comparative tests are conducted to explore the optimal hyperparameter set. The remainder of the paper is structured as follows: forward reaction prediction is described first, followed by a brief description of the hard-threshold neural network; next, results and discussion are presented; finally, conclusions are drawn.
Forward reaction prediction
The overall approach for template-based forward reaction prediction is summarized in Fig. 1.
Given a certain group of reactants, reaction templates extracted from a known set of popular templates are applied to generate a group of candidate reactions. The candidate reactions are then converted into vectors, which are used to train the hard-threshold-based deep neural network. The data augmentation strategy, the vectorized description of the reaction and candidate reaction selection are the key components.
Data augmentation strategy
Because the existing chemical knowledge databases contain only real reactions that take place in practice, a data augmentation strategy is needed to expand the database so that the neural network can learn to identify real reactions. Specifically, real reactions are first transformed into SMILES, which are further converted into a template set in the SMARTS format via a heuristic algorithm. Then, all feasible popular templates are applied to each group of reactants in the real reactions to generate a large number of fake reactions that cannot actually occur in practice. Finally, the augmented reaction dataset, comprising real reactions labeled as positive examples and fake reactions labeled as negative examples, is provided to the hard-threshold-based deep neural network.
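The augmentation loop can be sketched as follows. The `apply_template` argument is a hypothetical stand-in for SMARTS-based template application (e.g. via a cheminformatics toolkit); here a trivial string rewrite plays that role:

```python
# Data augmentation sketch: real reactions are positives; applying every
# template to each reactant set yields candidates, and any candidate whose
# product differs from the recorded one is labeled as a negative example.
def augment(real_reactions, templates, apply_template):
    dataset = []
    for reactants, true_product in real_reactions:
        dataset.append((reactants, true_product, 1))  # recorded real reaction
        for t in templates:
            for product in apply_template(t, reactants):
                if product != true_product:
                    dataset.append((reactants, product, 0))  # fake reaction
    return dataset

# Toy stand-in: a "template" is just a suffix appended to the joined reactants.
toy_apply = lambda t, rs: ["".join(rs) + t]
data = augment([(["A", "B"], "AB+ol")], ["+ol", "+one"], toy_apply)
```

In the toy run, the template that reproduces the recorded product contributes no negative example, while every other template does.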
Vectorized description of the reaction
Because a neural network traditionally requires vector format input, the augmented reactions must be converted to vectors in an appropriate manner. Clearly, the strategy for choosing features to construct the vectors significantly affects the final prediction accuracy.
Following the “Edit Vector” format[17], an atom is described as a 32-dimensional feature vector and a bond is described as a four-dimensional feature vector. Each reaction is first decomposed into four types of basic Edits, that is, hydrogen loss, hydrogen gain, bond loss, and bond gain, and then described as a combination of feature vectors. Note that the loss and gain of non-hydrogen atoms are considered as the loss and gain of the bond between two atoms, respectively. As a result, the Edit Vector for either the bond loss or bond gain includes information from the two atoms and the bond. For example, hydrogen loss is described as a corresponding 32-dimensional atom feature vector, whereas bond loss is composed of two corresponding atom feature vectors and the corresponding bond feature vector, which thus can be described as a 68-dimensional feature vector.
To explain exactly how the Edit Vector describes a reaction, a simplified example based on the chemical reaction in Fig. 2 follows. The atom feature vector in this example has six dimensions, [is_carbon, is_nitrogen, is_oxygen, is_chlorine, num_Hs, num_non-H_atoms], chosen from the aforementioned 32 dimensions, and the bond feature vector has four dimensions, [is_single, is_aromatic, is_double, is_triple].
(See Supplementary Files for Figure 2 Expressions)
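Using the simplified six-dimensional atom and four-dimensional bond features named above, the construction of Edit Vectors can be sketched as follows; the particular atoms and feature values are illustrative, not taken from Fig. 2:

```python
# Simplified Edit Vectors for the features named above:
# atom = [is_C, is_N, is_O, is_Cl, num_Hs, num_non_H_atoms]
# bond = [is_single, is_aromatic, is_double, is_triple]
carbon   = [1, 0, 0, 0, 2, 2]   # e.g. a CH2 carbon with two heavy neighbors
chlorine = [0, 0, 0, 1, 0, 1]   # a chlorine bonded to one heavy atom
single   = [1, 0, 0, 0]         # a single bond

# A hydrogen-loss Edit is just the atom vector of the atom losing the H.
h_loss = list(carbon)

# A bond-loss Edit concatenates both atom vectors and the bond vector,
# giving 6 + 6 + 4 = 16 dimensions here (32 + 32 + 4 = 68 in the full model).
bond_loss = carbon + chlorine + single
```

The same concatenation pattern yields the bond-gain Edit, and hydrogen gain mirrors hydrogen loss.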
Candidate reaction selection
The selection step uses a complex neural network that consists of several subnetworks, as shown in Fig. 3. For each candidate reaction, the aforementioned four Edit Vectors are calculated and provided as input to four corresponding subnetworks. The sum of the outputs from the four subnetworks is then fed to the lower integrating subnetwork to produce a scalar score. These steps are repeated for all candidate reactions, and all the scores are normalized using the softmax method to estimate the probability of occurrence of each candidate reaction. Finally, all candidate reactions are sorted by probability score, and the candidate that ranks first is taken as the prediction result. The prediction is correct if its outcome has the same SMILES as the recorded reaction’s, and incorrect otherwise.
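The scoring-and-ranking procedure can be sketched as below. The subnetworks are stood in for by plain dot products and the integrating subnetwork by a simple sum; these are illustrative simplifications of the architecture in Fig. 3, not its actual implementation:

```python
import math

# Scoring sketch: each candidate's four Edit Vectors are scored by
# stand-in "subnetworks" (dot products here), the outputs are summed,
# and the resulting scores are softmax-normalized across candidates.
def score_candidates(candidates, subnet_weights):
    raw = []
    for edits in candidates:  # edits: four Edit Vectors per candidate
        s = sum(sum(w * x for w, x in zip(wts, vec))
                for wts, vec in zip(subnet_weights, edits))
        raw.append(s)  # stand-in for the integrating subnetwork's scalar
    # Softmax over all candidates gives an occurrence probability per reaction.
    m = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in raw]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The top-ranked candidate is taken as the predicted outcome.
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs, best

# Toy usage: two candidates, four one-dimensional Edit Vectors each.
probs, best = score_candidates([[[1.0]] * 4, [[0.0]] * 4], [[1.0]] * 4)
```

The first candidate accumulates the larger summed score and therefore wins the softmax ranking.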
To improve the prediction accuracy, a hybrid model that uses the Edit Vector and extended-connectivity fingerprint (ECFP)[23] is also considered in this paper. The only difference between the two models is that an extra subnetwork without a hidden layer, which evaluates the ECFP, is added to the hybrid model, as shown in the middle of Fig. 3. The output of the ECFP subnetwork is multiplied by ε when the subnetwork outputs are summed before input to the final integrating subnetwork, where ε is the mixing factor. By adjusting ε, the proportion of the ECFP subnetwork’s output can be precisely controlled.
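The role of the mixing factor can be illustrated with a minimal sketch; the scalar outputs below are arbitrary placeholders for the subnetwork outputs of Fig. 3:

```python
# Hybrid-model sketch: the ECFP subnetwork's output is scaled by the mixing
# factor epsilon before joining the Edit Vector subnetworks' summed output.
def hybrid_score(edit_outputs, ecfp_output, epsilon):
    return sum(edit_outputs) + epsilon * ecfp_output

# epsilon = 0 recovers the pure Edit Vector model; larger epsilon gives the
# ECFP subnetwork a proportionally larger say in the final score.
pure = hybrid_score([1.0, 2.0, 0.5, 0.5], ecfp_output=3.0, epsilon=0.0)
mixed = hybrid_score([1.0, 2.0, 0.5, 0.5], ecfp_output=3.0, epsilon=0.5)
```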
Hard-threshold neural network
Neural networks, particularly deep neural networks, are currently the most popular machine learning algorithms and have powerful fitting capability[24]. However, as network size ceaselessly expands, a series of problems, particularly vanishing and exploding gradients, often occur. To escape this dilemma, a hard-threshold neural network is applied in this paper to predict the outcomes of organic synthesis.
Constructing a hard-threshold neural network
“Hard-threshold neural network” refers to a neural network with hard-threshold activation, which includes the step activation and staircase activation shown in Figs. 4a and b, respectively; staircase activation is the sum of several step activations. Hard-threshold activation was used in early work on binary classification, before the modern neural network had been proposed. Its derivative is zero almost everywhere, which effectively avoids vanishing and exploding gradients. Additionally, the scale of its output is almost fixed and insensitive to the scale of its input, which helps avoid certain abnormal propagation and simplifies computation. However, the zero derivative also prevents a hard-threshold network from being trained using traditional backpropagation.
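The two activations can be written in a few lines; the ±1 output convention and the particular thresholds are illustrative choices, not prescribed by the text:

```python
# Step activation and staircase activation (a sum of shifted steps).
def step(z, threshold=0.0):
    """Hard threshold: outputs +1 or -1; derivative is zero almost everywhere."""
    return 1.0 if z >= threshold else -1.0

def staircase(z, thresholds=(-1.0, 0.0, 1.0)):
    """Sum of step activations: a piecewise-constant, multi-level output."""
    return sum(step(z, t) for t in thresholds)
```

Note how the output scale is fixed by construction: `step` always returns ±1 and `staircase` one of a few discrete levels, regardless of how large the pre-activation grows.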
Target propagation algorithm
A new training algorithm is required for the hard-threshold neural network to bypass the zero derivative of hard-threshold activation.
Based on the “target propagation” concept[25], a new target propagation algorithm called FTPROP-MB was recently proposed[26]. Essentially, because a perceptron with step activation is trainable, the hard-threshold neural network could also be trainable if it could be decomposed into perceptrons. Specifically, for every hard-threshold activation layer, a target vector td is introduced to represent what the dth layer is supposed to output. After the normal forward propagation, FTPROP-MB first determines td for each layer and then introduces a layer loss Ld, whose gradient is used, as in training a perceptron, to update the weights.
Considering that the output of hard-threshold activation is a set of discrete values, the determination of td can be reduced to a combinatorial optimization problem. In detail, the question of how to optimize td regarding the overall loss and layer loss can be expressed in a standard form as follows: (see Equation 1 in the Supplementary Files)
where Wd and zd represent the weights and pre-activation output of the dth layer, respectively. The search space is large and discrete because every component of td is restricted to -1 or +1, so common search algorithms cannot determine the optimal solution in a reasonable time. Because the layer loss is typically convex, FTPROP-MB determines the target vector td of the dth layer using a heuristic: compute the derivative of the (d+1)th layer’s loss Ld+1 with respect to the dth layer’s output hd, and set td to the opposite sign of this derivative. This method can be mathematically formulated as: (see Equation 2 in the Supplementary Files)
When the layer loss function is convex, the negative partial derivative of Ld+1 with respect to hdj points toward the global minimum of Ld+1. Consider hdj = -1 as an example. If r(hdj) = -1, the partial derivative of Ld+1 is positive, so Ld+1 clearly increases if hdj flips to +1 while the other components of hd are fixed; thus tdj = r(hdj) = -1. By contrast, when r(hdj) = +1, the partial derivative of Ld+1 is negative, and it is not known exactly whether Ld+1 would increase or decrease if hdj flipped to +1. However, the disagreement between hdj and r(hdj) indicates that the current value of hdj is not trustworthy, so a natural choice is to drive zdj toward zero by adjusting tdj, making it more likely for hdj to flip; hence tdj = r(hdj) = +1.
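The sign heuristic above reduces to one line per component; this is a schematic of the target-setting rule, with the tie-breaking convention at a zero derivative chosen arbitrarily here:

```python
# FTPROP-style target setting sketch: each target component takes the sign
# of the negative partial derivative of the next layer's loss, so the layer
# is then trained (like a perceptron) to move its output downhill on that loss.
def set_targets(grad_next_loss_wrt_h):
    """grad_next_loss_wrt_h: dL_{d+1}/dh_d per component; returns t_d in {-1, +1}."""
    # Positive derivative -> decreasing h_d lowers the loss -> target -1;
    # negative derivative -> target +1. A zero derivative defaults to -1 here.
    return [1.0 if g < 0 else -1.0 for g in grad_next_loss_wrt_h]
```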
To summarize, training an n-layer hard-threshold neural network involves both a convex optimization problem on the weights and a combinatorial optimization problem on the target vectors; hence, a mixed convex-combinatorial optimization problem is formed. A block diagram of the target propagation algorithm is shown in Fig. 5.
Layer loss function
It remains to choose the layer loss function. According to related work[27], it is acceptable to adopt the soft hinge loss, weighted according to the gradient, as shown in Fig. 6.
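As a rough illustration of such a layer loss, the sketch below uses a softplus-smoothed hinge on the margin t·z; this particular functional form and the per-component weight are assumptions for illustration, not necessarily the exact loss of [27] shown in Fig. 6:

```python
import math

# Layer-loss sketch: a smooth ("soft") hinge on the margin t*z, optionally
# weighted per component. Unlike a hard hinge, it is differentiable
# everywhere, so the layer's weights can be updated by gradient descent.
def soft_hinge(z, t, weight=1.0):
    """z: pre-activation, t: target in {-1, +1}, weight: per-component scale."""
    return weight * math.log1p(math.exp(1.0 - t * z))
```

The loss shrinks monotonically as the pre-activation agrees more strongly with its target, which is the property the target propagation step relies on.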