Fragment databases generation
ChEMBL database (version 22) was used as a source of structures for the databases of interchangeable fragments. 1 557 992 distinct structures containing only organic elements (C, N, O, S, P, F, Cl, Br, I, B) remained after curation procedure. The curation was performed using the previously developed protocol [36] which includes Chemaxon Stardardizer and Checker [37] and RDKit [38] sanitization checks.. To estimate the context radius’ effect on the generated compounds we generated two databases with context radius from 1 to 5. The first database was generated from all ChEMBL compounds. The second one was generated from compounds without PAINS (1 464 907 compounds). The number of distinct context and fragment combinations grew linearly with context radius increase (Table 2).
Table 2. The number of distinct fragments and corresponding contexts in databases generated from the whole ChEMBL data set and its PAINS-less subset.
radius
|
ChEMBL
|
PAINS-less ChEMBL
|
1
|
35 833 160
|
34 240 810
|
2
|
41 676 473
|
39 800 734
|
3
|
51 730 960
|
49 403 630
|
4
|
62 821 316
|
59 971 431
|
5
|
74 168 168
|
70 717 996
|
To investigate the hypothesis that limiting synthetic complexity in fragmented structures improves synthetic accessibility of generated compounds we created fragment libraries from ChEMBL compounds having SCScore below a specified threshold (2, 2.5, 3, 3.5). The fixed context radius 3 was chosen as a reference because at this radius most functional groups are distinguishable and therefore structural replacements are reasonably specific. At smaller radius some functional groups cannot be distinguished, e.g. N-substituted amide (*-N-C(=O)) and amino (*-N-C-C) groups. Larger radius in this case would result in too specific replacements because it would be able to distinguish amides differently substituted at α-carbon atom. The number of compounds and resulted fragment and context pairs substantially decreased with lowering the SCScore threshold (Table 3).
Table 3. Statistics of the initial ChEMBL data set and filtered ones by SCscore values and the number of resulted distinct fragment and context pairs.
Data set
|
number of compounds
|
number of distinct fragments & contexts of radius 3
|
ChEMBL
|
1 557 992
|
51 730 960
|
SCscore <= 3.5
|
552 162
|
20 514 883
|
SCscore <= 3
|
284 461
|
10 661 179
|
SCscore <= 2.5
|
111 365
|
4 091 634
|
SCscore <= 2
|
27 916
|
951 993
|
Parent compounds selection
DrugBank compounds were selected as a source of parent compounds for simulation of a local exploration of a chemical space. 6002 compounds were left after curation of the whole database using the same protocol mentioned above. They were ranked according to their SCScore values and each twelfth compound was selected to create a subset of 500 compounds.
The MUTATE operation was applied to the selected 500 compounds. We set the minimum size of the replaced fragment to 0 to enable replacement of hydrogens. The maximum size of the replaced fragment was up to 10 heavy atoms. The size of a replacing fragment could be smaller or larger that the replaced fragment by at most ±2 heavy atoms.
Influence of a context radius on generated compound sets
The database of fragments generated from the whole ChEMBL data set was used to study influence of a context radius on generated sets of compound. Starting from the selected 500 DrugBank compounds the corresponding number of data sets was generated; after that average novelty, diversity and synthetic complexity were calculated for each data set. As we expected the number of generated compounds substantially dropped with increase of context radius because a smaller number of matching context-fragment pairs were found in a fragment database. Average novelty and diversity decreased more smoothly and generated data sets still covered a large range of these values. Synthetic complexity seemed to be the least affected but it also decreased with increase of context radius (Figure 4).
The general distributions of data set parameters depicted in Figure 4 struggle to give a complete picture of properties of the generated data sets. Therefore, we calculated difference plots to emphasize the changes of parameters depending on a context radius for each data set. We selected data sets generated at the context radius 3 as reference points and calculated differences between corresponding parameters of corresponding data sets generated at other context radius and the reference data sets (Figure 5). The results revealed the same trend in a more pronounced way. Despite of substantial drop in the number of generated compounds up to 1 million, novelty, diversity and synthetic complexity mainly were not decreased much with increasing of the context radius from 1 to 3. Increase in the context radius to 4 or 5 resulted in a continuation of decrease of the number of generated compounds on up to 100 000 and slight decrease of average novelty, diversity and synthetic complexity.
Control over synthetic complexity
We examined two possible strategies to reduce synthetic complexity of generated compounds. The first one is based on an idea that constructing structures from frequently occurring fragments results in more synthetically feasible compounds [32]. We used the whole ChEMBL fragment database limiting replacements by the fragment occurrence in the database: no limits or minimum 10, 100 or 1000 occurrences. The second strategy is based on fragment databases generated from more synthetically accessible compounds selected according to SCScore [35]. Here we used databases generated from compounds having SCScore value not greater than 2, 2.5, 3 or 3.5. No restrictions on fragment occurrence were applied in this case. 500 DrugBank compounds were used as a starting structures and the corresponding number of compound sets were generated using each of these strategies.
Table 4 shows the number of times each generation strategy resulted in a data set with best average synthetic complexity, novelty and diversity. The strategy, which used whole ChEMBL fragment database with no restriction, (ChEMBL & 0) resulted in the highest novelty and diversity of enumerated data sets in about 70% of cases, whereas the lowest average synthetic complexity was observed in only 13 cases (2.6%). According to synthetic complexity of generated compounds the top strategy with the greatest number of wins (205 cases or 41%) was based on the fragment database generated from compounds with SCScore ≤ 2. Limitation of fragment replacements based on fragment occurrence improved synthetic complexity of generated compounds, but substantially decreased the number of output compounds. The second best strategy resulted in 169 data sets (33.8%) with lowest synthetic complexity used fragments occurred at least 1000 times in the whole ChEMBL database. However, the average number of output compounds was 15 in this case whereas the strategy used the fragment database generated from compounds with SCScore ≤ 2 resulted in 430 compounds in average. The latter strategy outperformed all strategies based on fragment occurrence limitation by the number of generated compounds and their synthetic complexity.
Table 4. Comparison of different generation strategies.
fragment database & occurrence
|
wins (by lower mean SCscore)
|
wins (by higher mean novelty)
|
wins (by higher mean diversity)
|
mean/median number of compounds
|
ChEMBL & 0
|
13
|
357
|
347
|
7812 / 3800
|
ChEMBL & 10
|
5
|
2
|
47
|
695 / 323
|
ChEMBL & 100
|
17
|
1
|
7
|
98 / 66
|
ChEMBL & 1000
|
169
|
0
|
4
|
19 / 15
|
SCScore ≤ 3.5 & 0
|
16
|
55
|
61
|
5331 / 2540
|
SCScore ≤ 3 & 0
|
19
|
29
|
24
|
4031 / 1898
|
SCScore ≤ 2.5 & 0
|
56
|
16
|
9
|
2473 / 1122
|
SCScore ≤ 2 & 0
|
205
|
40
|
1
|
1040 / 430
|
total number of compounds:
|
500
|
500
|
500
|
|
Synthetic complexity reduced more pronounced in cases where fragment databases generated from more synthetically accessible compounds were used (Figure 6). In the case of the fragment database created from compounds with SCScore ≤ 2 the average decrease was 0.26 that is comparable to the value 0.25 imposed by the authors of SCScore as an objective during optimization to separate reactants and products. Therefore, it might be expected that compounds having SCScore value lower by 0.25 would require one less step to be synthesized.
Control over chemotypes of generated compounds
Radius of considered fragments context implicitly determines the chemotypes of generated structures. Within the developed approach we cannot create the fragments that did not occur in the input database and have the size equal or less than a chosen radius. Here size means the longest distance between two atoms in a fragment, where the distance is the shortest path between two atoms. So, by increasing the radius one can make a generation of structures more conservative relatively to chemotypes of compounds used for generation of a fragment database. This can be useful in a scenario, where compounds with undesirable patterns are removed from the input database and a fragment database is generated with a reasonably large radius to avoid generation of compounds with these undesirable fragments (chemotypes). At the same time increasing the radius can decrease the number of generated compounds and their diversity and novelty. Therefore, the choice of the radius is a trade-off between these options.
We simulated unrestricted iterative stochastic exploration of a chemical space starting from a benzene molecule to demonstrate this ability. On each iteration the MUTATE operation was applied to input compounds to perform all possible replacements. Generated compounds with molecular mass greater than 500 were discarded. The remaining compounds were ordered by molecular mass and split into 5 bins. One compound was randomly selected from each bin. The selected five compounds passed to the next iteration. Totally 100 iterations were executed.
This procedure was run once per each context radius from 1 to 5 using the PAINS-less fragment database. Then compounds generated on all iterations of each run were collected and examined to contain PAINS fragments. As expected, the number of emerged PAINS patterns decreased with increase of context radius: 102 distinct PAINS patterns for radius 1 were detected, 52 for radius 2, 28 for radius 3, 26 for radius 44 and 1 for radius 5. The last one is dialkyl aniline moiety having a methoxy substituent at the position 4 (Figure 7). The longest distance in this pattern equals to 6 bonds between nitrogen atom and methoxy carbon atom, therefore the context of radius 5 could not fully cover it and it was reconstructed during structural mutations. Only one PAINS pattern was generated due to stochastic nature of the simulation but other PAINS fragments having the size greater than 5 may appear in structures generated with context radius 5. However, no smaller patterns were detected. This demonstrated that one can implicitly control chemotypes of generated compounds by increase of context radius. The full list of determined PAINS patterns is provided in Additional file 1.
Guacamol benchmarking
To demonstrate general applicability of the CReM approach we used Guacamol benchmarks. The Guacamol tasks are diverse and might require diverse search strategies – as it is difficult to expect that one strategy will demonstrate the optimal performance at every task. We implemented a single iterative algorithm and applied it to all Guacamol tasks not tuning it to separate tasks.
If the list of seed structures was empty the seed structures were chosen randomly from the list of SMILES supplied with Guacamol and represented the whole ChMEBL database. The size of a population selected on each iteration was set be equal to the size of the output population but not less than 10 compounds. To make the search adaptive we adjusted fragment size of replacement according to the current score of the population. If the score was equal or less than 0.3 (far from the goal) the replacing fragment can differ at most on ±10 heavy atoms from the replaced one. If the score was greater than 0.8 (close to the goal) the replacing fragment can differ at most on ±4 heavy atoms from the replaced one. Intermediate fragment sizes (5-9) were chosen if the score was within 0.3 - 0.8 range. This allows to quickly explore chemical space in the beginning and better tune structures at the end of generation. For each compound in a population up to 1000 randomly chosen replacements were applied. Compounds, which were already used for structure generation, were stored in a separate list and removed from the list of generated structures. Remaining top scored compounds were selected for the next iteration.
Since the implemented optimization procedure is local and can get stuck in local optima we implemented three levels of “patience”. At the first level if the best score was not improved after three consecutive iterations the fragment size was increased on ±1 and the number of randomly chosen replacements on 100 irrespectively to the current score. This makes small stepwise increase in chemical space exploration. If after 10 iterations no improvement was observed larger changes were applied: the fragment size was increased on ±10 and the number of replacements on 500. This would enable rougher exploration of a chemical space around best candidates. At the third level, if after 33 iteration no improvement was observed new seed compounds were randomly selected to restart the search but best found candidates were kept. This procedure was not applied if the seed structure was supplied with the task. The list of already visited compounds was cleared after any change of generator parameters whether this was caused by improving of the best score or by exceeding of one of “patience” levels.
Some of the target benchmark compounds contain complex ring systems. Therefore, due to current limitation of the implemented CReM approach to generate new ring systems the whole ChEMBL fragment database was used in this study. Maximum execution time of each task was set to 5 hours or maximum 1000 iterations were allowed.
The results demonstrated that the implemented search algorithm based on CReM approach compared well with the published reference approaches by achieving the highest score in 16 out of 20 tasks. However, the total score was slightly lower than the total score of Graph GA approach, which uses genetic algorithm on molecular graphs. This is mainly due to the considerable advantage demonstrated by Graph GA approach (0.891) over CReM-based approach (0.763) in the task of generation of molecules, which were as structurally dissimilar to sitagliptin as possible but had similar lipophilicity and topological polar surface area. Interestingly, the other reference approaches performed even worse in this task. Output results and tuning parameters are available in Additional files 2 and 3.
Table 5. Results for Guacamol benchmarks.
task
|
SMILES LSTM*
|
SMILES GA*
|
Graph GA*
|
Graph MCTS*
|
CReM
|
Celecoxib rediscovery
|
1.000
|
0.732
|
1.000
|
0.355
|
1.000
|
Troglitazone rediscovery
|
1.000
|
0.515
|
1.000
|
0.311
|
1.000
|
Thiothixene rediscovery
|
1.000
|
0.598
|
1.000
|
0.311
|
1.000
|
Aripiprazole similarity
|
1.000
|
0.834
|
1.000
|
0.380
|
1.000
|
Albuterol similarity
|
1.000
|
0.907
|
1.000
|
0.749
|
1.000
|
Mestranol similarity
|
1.000
|
0.79
|
1.000
|
0.402
|
1.000
|
C11H24
|
0.993
|
0.829
|
0.971
|
0.410
|
0.966
|
C9H10N2O2PF2Cl
|
0.879
|
0.889
|
0.982
|
0.631
|
0.940
|
Median molecules 1
|
0.438
|
0.334
|
0.406
|
0.225
|
0.371
|
Median molecules 2
|
0.422
|
0.38
|
0.432
|
0.170
|
0.434
|
Osimertinib MPO
|
0.907
|
0.886
|
0.953
|
0.784
|
0.995
|
Fexofenadine MPO
|
0.959
|
0.931
|
0.998
|
0.695
|
1.000
|
Ranolazine MPO
|
0.855
|
0.881
|
0.92
|
0.616
|
0.969
|
Perindopril MPO
|
0.808
|
0.661
|
0.792
|
0.385
|
0.815
|
Amlodipine MPO
|
0.894
|
0.722
|
0.894
|
0.533
|
0.902
|
Sitagliptin MPO
|
0.545
|
0.689
|
0.891
|
0.458
|
0.763
|
Zaleplon MPO
|
0.669
|
0.413
|
0.754
|
0.488
|
0.770
|
Valsartan SMARTS
|
0.978
|
0.552
|
0.990
|
0.04
|
0.994
|
Deco Hop
|
0.996
|
0.970
|
1.000
|
0.590
|
1.000
|
Scaffold Hop
|
0.998
|
0.885
|
1.000
|
0.478
|
1.000
|
total score
|
17.341
|
14.398
|
17.983
|
9.011
|
17.919
|
* results were taken from the ref [35]