The AIDD Workflow
The AIDD workflow is shown schematically in Fig. 1. The order in which the workflow steps are discussed below follows the path defined by the heavy green arrows.
The Dataset
The illustrative dataset used here consists of a series of TzP antimalarials that inhibit PfDHODH[12]. Relevant data was compiled from the public literature[12]–[19]. Figure 2 shows five of the early leads from that program, as well as three later analogs that proved more potent, selective, and metabolically stable.
Activity was reported as IC50 values for a solubilized form of PfDHODH lacking the N-terminal tail that anchors it to the mitochondrial membrane in vivo[12]. These were converted to inhibition constants using the Cheng-Prusoff relationship for competitive inhibitors[20] to be consistent with earlier design work targeting PfDHODH[21].
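For reference, the standard Cheng-Prusoff relationship for a competitive inhibitor converts an IC50 measured at substrate concentration [S] into an inhibition constant, given the substrate's Michaelis constant Km:

$$K_i = \frac{IC_{50}}{1 + [S]/K_m}$$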
Seed molecules
AIDD requires a set of seed molecules. These can be as arbitrary as an amino acid or benzene, though having some functionality in place – e.g., nicotinamide – will speed up the evolutionary process substantially. A more usual approach is to seed the process with a minimally substituted scaffold. That was done in the AIDD runs described here: the seed molecule was DSM12, in which the 2, 3’, 4’, and 5’ positions all bear hydrogens.
Objective Functions and Pareto Ranking
It is possible to use AIDD to generate molecules that optimize a single algebraic combination of predicted property values. We have found it more productive, however, to use a more general technique – Pareto ranking – applied across several objective functions that can themselves be multi-property. Mathematically speaking, a set S is Pareto optimal when no member of the set is completely “dominated” by any other member. In other words: no member of the set is inferior to any other single member with respect to all criteria[22]. Pareto ranks are assigned to members of a set S by successive extraction of Pareto optimal subsets. Members of the Pareto optimal subset of S are assigned a Pareto rank of 1 and are said to belong to the first “Pareto layer” (L1) or to be on the “Pareto frontier.” Members of the Pareto optimal subset remaining after the first Pareto layer has been removed (S1 = S – L1) have a Pareto rank of 2. The process continues until all members of S have been assigned a rank. This is distinct from Pareto ranking defined as the number of members of a set by which a member is dominated[23]; the latter variation has been used, for example, in partial-match pharmacophore generation[24].
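As an illustration of the ranking scheme just described, the following minimal Python sketch (not AIDD's implementation) assigns Pareto ranks by successive extraction of non-dominated subsets, assuming every objective has been oriented so that lower values are better:

```python
import numpy as np

def pareto_ranks(scores: np.ndarray) -> np.ndarray:
    """Assign Pareto ranks by successively extracting non-dominated subsets.

    scores: (n_molecules, n_objectives) array in which lower is better for
    every objective. Rank 1 is the Pareto frontier of the whole set; rank 2
    is the frontier of what remains after layer 1 is removed, and so on.
    """
    n = scores.shape[0]
    ranks = np.zeros(n, dtype=int)
    remaining = np.arange(n)
    rank = 1
    while remaining.size:
        sub = scores[remaining]
        # A point is dominated if some other point is at least as good in
        # every objective and strictly better in at least one.
        dominated = np.array([
            any((sub[j] <= sub[i]).all() and (sub[j] < sub[i]).any()
                for j in range(len(sub)) if j != i)
            for i in range(len(sub))
        ], dtype=bool)
        ranks[remaining[~dominated]] = rank   # current Pareto layer
        remaining = remaining[dominated]      # rank the rest on the next pass
        rank += 1
    return ranks
```

Maximized objectives (e.g., %Fb) would simply be negated before ranking, and any user-specified caps (see below) applied by clamping the values first.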
The use of Pareto ranking in AIDD has several advantages. For one, it makes it possible to explore multiple SARs simultaneously. For another, it expands the number of evolutionary paths available to get around “holes” in the chemistry space where no intermediate structures exist that are “good” in terms of individual criteria. It circumvents the need to rationally weight attributes that cannot be put on a single meaningful scale – i.e., that are fundamentally or practically incommensurate. Finally, it can help address the optimization challenges posed by the fundamentally chaotic nature of many desirable chemical spaces.
Several different kinds of property predictions can serve as objective functions in AIDD: predictions based on QSARs or QSPRs, synthetic accessibility estimates, Risk scores, and estimates based on HT-PK simulations. “Individual” objectives may themselves be functions of other models. Simple examples are the min, max, or average of groups of complementary or related models, but more complex algebraic functions are also supported. The default expectation is that regression models will have been appropriately transformed (e.g., as logKi or pKi rather than as Ki, where Ki is an equilibrium constant for binding, inhibition, etc.) and that Risk models are in an additive space, so as to make optimal use of additive out-of-scope (OoS) penalty factors (see below). Calls to external functions, including docking/affinity scores and 3D ligand shape matching, are also supported but are not explored here.
One essential feature of AIDD’s implementation of Pareto optimization is that users can specify a cap for each property beyond which further improvement (increase or decrease) confers no practical advantage. Such caps are important because the program can readily create molecules that are slightly superior to others in just one property but inferior in all (or most) other properties. The objective functions used in the experiments described here illustrate the major types available in AIDD:
Activity models
The approach used to construct artificial neural network ensemble (ANNE) models in ADMET Predictor has been described in detail elsewhere[25], [26]. The primary model used to predict PfDHODH inhibition by TzPs, using log-transformed Ki values, was built using 152 active TzPs. Each of the 33 ANNs in the ensemble contained four neurons in a single hidden layer (Fig. 3). Each took the same 10 descriptors as inputs, those descriptors having been selected by a genetic algorithm from among 122 well-represented, reasonably high-variance and relatively uncorrelated ones[26].
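For readers unfamiliar with ANN ensembles, the sketch below illustrates the general idea using scikit-learn rather than ADMET Predictor's own modeling engine: many small networks with a single four-neuron hidden layer are trained on the same ten descriptors and their outputs are aggregated (simple averaging is assumed here; the actual descriptor selection and training protocol differ).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_anne(X, y, n_members=33, n_hidden=4, seed=0):
    """Train an illustrative stand-in for an ANN ensemble (ANNE): n_members
    small networks that differ only in their random initialization."""
    rng = np.random.RandomState(seed)
    return [
        MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=5000,
                     random_state=rng.randint(2**31 - 1)).fit(X, y)
        for _ in range(n_members)
    ]

def predict_anne(members, X):
    """Ensemble prediction taken here as the mean of the member outputs."""
    return np.mean([m.predict(X) for m in members], axis=0)
```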
The 23 analogs (15%) assigned to the test set were selected by stratified random sampling across the activity range. The observed RMSEs are low enough to make logKi a reasonably reliable surrogate for the experimental activity of follow-up candidates, provided those predictions are in-scope. Structures for the analogs used to build the primary logKi model are provided in the Supplementary Materials, along with Ki values and the values of the descriptors used in the final ANNE models. The partitions between test set and training pools are also provided.
Only active analogs were used to build the logKi regression model, so new analogs generated might lie outside its applicability domain for reasons that cannot be accounted for in a quantitative model[27], [28]. The companion ActivityClass classification model was built on a subset of 150 analogs that fit a more generalized version of the scaffold shown in Fig. 2, one that allowed substitution at the 2’ and/or 6’ positions of the phenylamino ring and that included compounds for which only qualified activity values (e.g., “Ki > 1 µM”) were available. Of those 150, the 98 with measured Ki values below 1 µM were assigned to the “active” class, whereas those with Ki values above that cutoff were assigned to the “inactive” class. Structures for the analogs used to build the ActivityClass model are provided in the Supplementary Materials. A training pool of 120 compounds was selected by random sampling stratified across activity classes, and a genetic algorithm was used to optimize the ANN architecture. The relatively simple model produced had two hidden neurons and seven input descriptors and exhibited good predictive sensitivity (1.00) and specificity (0.90) for the 30 analogs in the held-out test set. The corresponding statistics for performance on the training pool were 0.96 and 0.98.
Risk models
The list of attributes that characterize a good drug candidate is long, including but not limited to: on-target activity, selectivity against related targets, solubility, permeability, bioavailability, synthesizability, and metabolic clearance. Unfortunately, the size of a Pareto frontier increases rapidly as the number of criteria considered increases. This “curse of dimensionality” makes Pareto optimization against more than five separate attributes impractical. Fortunately, most of those characteristics are not independent functions of molecular structure. This is true for many ADMET properties, which makes it useful to group them into lists of rules of thumb like “Lipinski’s Rule of 5”[29]. The ADMET Risk model provided in ADMET Predictor is an extension of Lipinski’s idea from four attributes linked to oral bioavailability (molecular weight, logP, number of NH and OH groups, and number of N and O atoms) to 22 rules that address different aspects of ADMET[25], with each rule violation adding up to one point to the ADMET Risk score. The rules are Boolean combinations of relations between various molecular attributes and model outputs, with risk thresholds calibrated against 2260 compounds selected from the World Drug Index (WDI) in a manner similar to that described by Lipinski et al. The ADMET Risk score obtained exceeds 7 (out of a maximum of 22) for 10% of the compounds in that reference subset.
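The additive scoring idea can be illustrated with the original four Lipinski rules; the sketch below is not the ADMET Risk implementation, which uses 22 calibrated (and optionally fuzzy) rules built from model outputs as well as simple descriptors:

```python
def rule_of_5_violations(mw: float, logp: float, n_oh_nh: int, n_n_o: int) -> int:
    """Count Rule-of-5 violations; each rule that fires adds one point,
    in the same spirit as an additive Risk score."""
    rules = [
        mw > 500.0,      # molecular weight
        logp > 5.0,      # lipophilicity
        n_oh_nh > 5,     # hydrogen-bond donors (OH and NH groups)
        n_n_o > 10,      # hydrogen-bond acceptors (N and O atoms)
    ]
    return sum(rules)

print(rule_of_5_violations(mw=560.0, logp=5.7, n_oh_nh=2, n_n_o=8))  # 2 violations
```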
“Fuzzy” risk thresholds[30] are supported in ADMET Predictor, in that thresholds can be specified as intervals as well as “crisp” point values. The penalty weight wi for each rule is specified as 1 by default and is added to the score for values on the “FALSE” side of the threshold (e.g., on the left for “<” relations), whereas there is no penalty increment for values on the “TRUE” side. Penalties scale linearly across the threshold interval. Rules used in the present work are included in the Supplementary Materials.
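A minimal sketch of how such a fuzzy threshold might work follows; the interval, weight, and orientation here are illustrative choices, not values taken from the ADMET Risk rule set:

```python
def fuzzy_rule_penalty(value: float, lo: float, hi: float,
                       weight: float = 1.0, violation_high: bool = True) -> float:
    """Illustrative fuzzy risk-rule penalty: no penalty on the passing side of
    the interval [lo, hi], the full weight on the violating side, and a linear
    ramp across the interval itself. violation_high=True means large values
    violate the rule; the orientation is chosen per rule."""
    if hi <= lo:
        raise ValueError("interval must satisfy lo < hi")
    frac = (value - lo) / (hi - lo)        # 0 at lo, 1 at hi
    frac = min(max(frac, 0.0), 1.0)        # clamp outside the interval
    return weight * (frac if violation_high else 1.0 - frac)

# e.g., a hypothetical rule penalizing molecular weight above ~500, ramping
# between 450 and 550: no penalty at 450 or below, half at 500, full at 550+.
print(fuzzy_rule_penalty(500.0, lo=450.0, hi=550.0))   # 0.5
```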
Estimated pharmacokinetic properties
Lipinski’s rules of thumb provide convenient broad-brush guidance to medicinal chemists as to factors that limit oral bioavailability, but modern software like GastroPlus® can run physiologically-based pharmacokinetic (PBPK) simulations that accurately estimate fraction bioavailable (%Fb) and fraction absorbed (%Fa)[31]. ADMET Predictor includes an HT-PK module that incorporates the full Advanced Compartmental Absorption and Transit (ACAT) model from GastroPlus linked to a liver compartment, renal clearance, and a central compartment representing the remaining organs. The HT-PK implementation uses ADMET Predictor property estimates and requires no user interaction. It is very fast and accurately reflects the PK predictions performed in GastroPlus[31], which makes its outputs suitable for use in AIDD’s panel of objective functions. It nicely complements Risk models in that it accounts for how ADMET property values interact rather than treating them independently.
Synthetic difficulty
SynthDiff is a generalized implementation of the SA score described by Ertl and Schuffenhauer[32] that provides an estimate of synthetic difficulty – “difficulty” rather than “accessibility” because a higher score indicates that the molecule is harder to make. Briefly, the raw score is a sum of a fragment frequency term and a complexity term. The first is calculated from how frequently a fragment is found in a one million molecule subset of PubChem, where each “fragment” is defined by a circular fingerprint with a topological radius of 3. Complexity calculation is based on the number of rings, stereochemistry, the number of macrocycles, the number of atoms and a molecular symmetry score. That raw score S is rescaled to a range from 0 to 10 by:
$$\mathrm{SynthDiff} = \frac{-20\,(S - 2.5)}{13}$$
An augmented version – SynthDiff+ – is available for AIDD that includes additional penalties for structural features that are particularly undesirable in a drug. To this end, structural alerts derived from the literature[33], [34] were collected in a user-editable file of SMARTS[35] queries whose contents are read into memory at start-up. When SynthDiff+ is included in AIDD’s panel of objective functions, the program counts the number of occurrences of such alerts in any candidate molecule and increases its SynthDiff according to:
$$\mathrm{SynthDiff+} = \mathrm{SynthDiff} + 4\left[1 - e^{-x^{2}/3.4}\right]$$
where x is the number of structural alerts found, with each occurrence contributing to the count separately. The functional form used is motivated by Fig. 1h of Bickerton et al.[36] and approximately mimics the curve shown there. The increase in the synthetic difficulty estimate approaches 4 as the alert count grows and the exponential term goes to zero.
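In code, the alert penalty amounts to the following; the function simply re-expresses the equation above (SynthDiff itself is computed by ADMET Predictor):

```python
import math

def synthdiff_plus(synthdiff: float, n_alerts: int) -> float:
    """Augment a SynthDiff estimate with a structural-alert penalty that
    saturates at +4 as the alert count grows (per the equation above)."""
    return synthdiff + 4.0 * (1.0 - math.exp(-(n_alerts ** 2) / 3.4))

# Penalty contributed by the first few alerts (SynthDiff taken as 0 here):
for x in range(4):
    print(x, round(synthdiff_plus(0.0, x), 2))   # 0.0, 1.02, 2.77, 3.72
```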
Parameter Settings
User-supplied parameters are:
- Out-of-scope (OoS) penalties for models (1 for risk models, 10 for standard models by default)
- pH for pH-dependent properties (pH = 7.4 by default)
- Species (human, mouse or rat) and dose (in mg) for PK properties
- Location of output files (log files, results, etc.)
- Frequency with which to write out generational “snapshots” (10% by default)
- Number of generations to run (ngen)
- Number of new candidates to consider for each generation (ncand)
- Size of the initial population (n1)
- Minimum size for each generation (nmin)
- Multithreading (run using multiple processors to increase speed)
ADMET Predictor flags OoS predictions to indicate that they may not be reliable. When an OoS prediction is used in a Risk rule, that rule is evaluated pessimistically, i.e., as though the prediction was a “bad” one. Any Risk penalty that would otherwise be added to the total Risk score is then attenuated by a multiplicative factor. This OoS Risk multiplier is set to 0.5 by default outside of AIDD, but that global default can be reset by the user to reflect how conservatively they want to treat such predictions. The corresponding penalty is specified separately for each AIDD run, however, because the evolutionary process might otherwise cause even small uncertainties to compromise results. The default AIDD risk penalty of 1 was used here, which means that any potential rule violation was fully penalized.
AIDD applies a separate additive penalty to OoS predictions for regression and classification models used in the Pareto optimization. The default value of 10 was used here; given that most objective functions are on a log scale or have relatively small maximum values, this value makes it easy to filter out compounds with OoS predictions in post-processing. Penalizing compounds in this way is preferable to simply discarding them; so long as the analog is not bested in every other good property by another compound, it will remain in the population and have a chance to produce “descendants” with similarly desirable properties but for which the corresponding activity prediction is in-scope.
The minimum generation size is only enforced after 100 or 0.5*ngen generations, whichever comes first. The reason for not imposing this constraint earlier is to allow the initial population to grow gradually with only the most Pareto optimal molecules; this generally leads to faster convergence and a better overall solution. A random number seed is required to initialize the roulette wheel selections. Note that the seed completely specifies the trajectory of a run only if multithreading is turned off; multithreading is on by default and was left on here. The option to disable it is provided for cases where strict reproducibility is required (e.g., in tutorials).
Primary structural filter and transform files
AIDD relies on input files to control how new candidate structures get generated, to make sure molecules that lack key structural features or possess fatally flawed ones do not get into the next generation, and to specify a required scaffold.
There is an option to include a scaffold query that must be satisfied by any new candidate. This query can be uploaded as a file, created in MedChem Designer, or entered as a SMARTS string. Note that SMARTS queries for even relatively simple scaffolds can be quite complicated. The one corresponding to the scaffold shown in Fig. 2 – which is included in the query file used for the TzP experiments described here (see Supplemental Information) – is a case in point. Besides supplying a required core structure, it lets the user specify constraints on where and how that scaffold can be substituted, and even allows variation within rings and at other variable positions.
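The mechanism is essentially a required substructure match. The RDKit-based sketch below (RDKit is used purely for illustration and is not the engine behind AIDD) applies a deliberately simplified, hypothetical phenylamino-pyrimidine query rather than the full TzP scaffold query from the Supplemental Information:

```python
from rdkit import Chem

# Simplified, hypothetical scaffold query (phenylamino-pyrimidine) standing in
# for the much more elaborate TzP scaffold SMARTS used in the actual runs.
scaffold_query = Chem.MolFromSmarts("c1ccc(cc1)Nc1ncccn1")

def passes_scaffold(smiles: str) -> bool:
    """Return True if the candidate contains the required scaffold."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and mol.HasSubstructMatch(scaffold_query)

print(passes_scaffold("Cc1ccc(Nc2ncccn2)cc1"))  # True: scaffold present
print(passes_scaffold("c1ccccc1"))              # False: no scaffold
```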
A file of SMARTS structural filter queries specifies substructures that must not appear in any acceptable product. The default set of queries provided is based on exclusion rules described by Brenk et al.[33] Some non-drug-like molecules will get through into the population over the course of the evolutionary process despite these precautions, but that is actually beneficial to a degree. Such “bad” molecules can be – and often are – subsequently transformed into drug-like analogs with good physicochemical and biochemical properties. Unacceptable analogs that do make it through to the final generation can be removed by applying more stringent post-processing filters. The default primary structural filter file provided with ADMET Predictor is well-suited to most AIDD applications, but it is impossible to anticipate every situation and business rule, so the entries are fully editable.
A transform file is required to define the set T of SMIRKS[37] expressions from which molecular transformations are drawn and applied to molecules from the evolving population to generate new candidate structures. The default file provided with ADMET Predictor has over 150 transforms, including “add” transforms that replace hydrogen with a functional group (a chlorine atom, an amide or a thiazole ring, for example) as well as others that delete those groups or swap one for another. There are also transforms that create or break rings and that change bond saturation. Each transform is accompanied by an embedded description of what it does and does not do, so the user can make an informed decision about whether to disable it from the interface or comment it out in the transform file – some transforms may run counter to organizational policies, for example. Likewise, new transforms can be added by editing the file, and transforms can be turned on or off through the AIDD interface within ADMET Predictor. At the beginning of each run, AIDD writes out a file (RxnIDs.txt) to the destination folder that contains an indexed list of names for the transforms used in that run; a separate list of transforms disabled during set-up is also generated. The same set of transforms was used for all of the experiments described here; a copy of the RxnIDs.txt file generated is included in the Supplementary Material.
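To make the mechanics concrete, here is a hedged RDKit sketch of applying a single “add chlorine” transform written as reaction SMARTS; the SMIRKS shown is a simplified stand-in, not an entry copied from the default transform file, and RDKit is not the engine AIDD uses:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Simplified "add chlorine" transform: replace one H on an sp3 carbon with Cl.
add_cl = AllChem.ReactionFromSmarts("[C;!H0:1]>>[C:1]Cl")

parent = Chem.MolFromSmiles("CCC")             # propane, as in the example below
products = set()
for (prod,) in add_cl.RunReactants((parent,)):
    Chem.SanitizeMol(prod)                     # recompute valences and H counts
    products.add(Chem.MolToSmiles(prod))       # canonical SMILES removes duplicates

print(sorted(products))   # two distinct products: 1- and 2-chloropropane
```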
Initialization
The molecular evolution engine at the core of AIDD is primed by populating the zeroth generation (G0) and the candidate pool P with seed molecules and setting the generation index i to 0. The number of new candidate molecules generated for each generation – kmax – is set initially to n1; it gets reset to ncand after the first generation has been populated with n1 molecules. Roulette wheel sampling is used to bias molecule and transform selection within AIDD; all weights are initially set to 1. Uniform random sampling is used for selecting among any alternative transformation products that may be produced (see below).
The molecular evolution cycle
A parent molecule is randomly chosen from Gi, and a transform is randomly chosen from T. Both selections are by “roulette wheel,” i.e., using weights as described below. Using the selected transform and parent molecule, the program generates all possible products from that combination of parent and transform. Both 1- and 2-chloropropane will be generated as products, for example, when the “add chlorine” transform is applied to propane. Each product molecule generated is checked against the scaffold query Q0 and subjected to the other filter criteria. It is also checked to make sure that it is not in the current population and that it is not a molecule whose properties have already been evaluated.
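“Roulette wheel” selection simply means weighted random sampling, as in the short sketch below; the weights shown are placeholders, and AIDD's actual weight-update rules are described below:

```python
import random

def roulette_select(items, weights, rng=random):
    """Pick one item with probability proportional to its weight."""
    spin = rng.uniform(0.0, sum(weights))
    running = 0.0
    for item, weight in zip(items, weights):
        running += weight
        if running >= spin:
            return item
    return items[-1]   # guard against floating-point round-off

# e.g., a parent whose weight reflects its Pareto rank and prior "children"
parent = roulette_select(["mol_A", "mol_B", "mol_C"], [1.0, 0.5, 2.0])
```

(Python’s random.choices offers the same behavior directly; the explicit loop is shown only to make the mechanism clear.)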
If multiple alternative products survive the primary filtering step, one is selected at random and added to the current population. The selection weight of its parent is then decremented, since one of its children has been added to P. If nothing survives the filtering step, that combination of parent and transform(s) is never tried again. In the unlikely event that all parent/transform combinations fail to generate the required number of candidate molecules, the program will resort to trying pairs of reaction transforms.
Products that make it through the filters are immediately added back to G0 to provide more analogs to work from. Once kmax new analogs have been generated, they are submitted for property evaluation. As part of that calculation, any target property whose corresponding ADMET model is out-of-scope for a given analog is penalized by adding that property’s OoS penalty to the calculated value (for minimized properties) or subtracting it (for maximized properties). The multiplicative OoS penalties for Risk rules are applied within those models.
Once properties for the kmax new analogs in the population have been calculated, the generational index i gets incremented and the molecules in the first Pareto rank are extracted from P to become the next generation (Gi). If i is less than or equal to the minimum of 100 and one-half of ngen (i.e., nx in Fig. 1), the next round of candidate generation begins immediately. Once i exceeds nx, however, additional ranks are extracted from P and successively added to Gi until its size exceeds nmin.
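The generation-assembly step can be sketched as follows, reusing Pareto ranks as produced by the earlier pareto_ranks() sketch; all names here are illustrative, and the bookkeeping in AIDD itself is more involved:

```python
def build_generation(pool_ids, ranks, i, nx, nmin):
    """Take the first Pareto layer; once the generation index i exceeds nx,
    keep adding successive layers until the generation holds more than nmin
    molecules."""
    generation = [m for m, r in zip(pool_ids, ranks) if r == 1]
    next_rank = 2
    while i > nx and len(generation) <= nmin and next_rank <= max(ranks):
        generation += [m for m, r in zip(pool_ids, ranks) if r == next_rank]
        next_rank += 1
    return generation
```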
Note that new analogs get added to the candidate pool immediately after they are generated and pass the primary filters, but they do not pass into the next generation until and unless they survive the Pareto selection pruning process. Hence, they may get selected for transformation while candidates are still being generated. If that happens, their “children” may have been improved in ways that let them survive into the next generation even if their progenitor does not. Indeed, most of the molecules in the first generation will be secondary or tertiary products when a relatively small number (n0) of seed structures are provided to the program. As a result, the evolutionary process is neither fully generational nor fully steady-state in nature but lies somewhere in between. Effects from small differences in the order in which parallel processes complete are the reason that multithreaded runs with identical seed structures and number seeds may (and generally do) diverge somewhat.
If i is less than ngen, the roulette wheel selection weights for molecules in Gi are updated. Weights for a molecule are based on the number of its children, its Pareto rank, and the sum of ranks across all objectives (its Borda Rank[38]). The transform selection weights are adjusted downward based on the number of times the molecules they produced failed to pass the scaffold and substructure filters. The cycle then repeats. When i reaches ngen, the program exits the evolution cycle, and the final generation is written out as an ADMET Predictor file.
Post-processing
The process is not quite complete when the evolution cycle ends; care must still be taken to evaluate the final list. The post-processing steps include, but are not limited to: 1) tautomer standardization and property recalculation; 2) application of more stringent secondary filters; 3) generation of classes; and 4) prioritization for testing.
Tautomer standardization is not carried out as part of the AIDD run, in part because it would slow analog generation down, but also because a minor tautomeric form may evolve into more stable analogs. If a particular minor tautomer appears several times while the thermodynamically dominant form does not, that should encourage a chemist to look for an alternative, more stable bioisosteric replacement that the program did not happen to find. After running tautomer standardization (one click in ADMET Predictor), properties need to be recalculated for any structures that were altered, as well as for any structures whose predicted values were capped to avoid extremes in Pareto sampling. After property recalculation, it is important to remove analogs from the final population for which predictions are out-of-scope (automatically highlighted in red).
A simple method for conducting post-processing is to use the recalculated AIDD objective function values or – if no structures were changed during tautomer standardization – those written out along with the structures. This is accomplished within the ADMET Predictor spreadsheet using slider bars. In this work, filters requiring a minimum %Fb of 70, a maximum logKi of -7.5 (i.e., a minimum pKi of 7.5, or about 30 nM), a maximum ADMET Risk of 6 and a maximum SynthDiff of 5 worked well for the TzPs. Applying them yielded a fully filtered population of 200–300 analogs, which is a convenient number for manual inspection.
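The same thresholds can of course be applied programmatically to an exported property table; the pandas sketch below uses hypothetical file and column names rather than the actual ADMET Predictor spreadsheet headers:

```python
import pandas as pd

# Hypothetical export of the final generation; column names are assumptions.
df = pd.read_csv("aidd_final_generation.csv")

filtered = df[
    (df["PctFb"] >= 70.0)       # minimum predicted %Fb
    & (df["logKi"] <= -7.5)     # i.e., pKi >= 7.5, roughly 30 nM
    & (df["ADMET_Risk"] <= 6.0)
    & (df["SynthDiff"] <= 5.0)
]
print(f"{len(filtered)} analogs survive the post-processing thresholds")
```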
It is also advisable to apply a more stringent substructural filter to remove molecules with undesirable substructures that passed the more lenient primary filters applied during the evolutionary process. Again, the primary filters need to be relatively lenient to keep flawed but otherwise promising candidates in the evolving population. The secondary filtering benefits from incorporating complementary “hard” and “soft” filters. Molecules flagged by the “hard” filter are summarily discarded, whereas “soft” violations are inspected manually. Molecules containing halogens directly bonded to heteroatoms are an example of a “hard” violation, whereas acetals and aminals might be “soft” violations; piperonyl groups and some stable cyclic versions could be acceptable forms of the latter. For simplicity’s sake, “soft” violations were all discarded for the work described here. The structural filter file is supplied in the Supplementary Information.
Part of the challenge of multi-objective optimization of drug candidates is that small structural changes often improve a molecule in one respect while making it less desirable in some other way. This can lead to the generation of clumps of quite similar molecules in the AIDD output – not the “methyl, ethyl, butyl, futile” of blind analog synthesis programs, but still potentially a nuisance. The class generation tool in ADMET Predictor groups products by scaffold, which aids this part of post-processing. Several class generation methods are available, including the method used in this work, which is a “Framework” approach with scaffolds based on ring systems and connecting chains, similar to the “Murcko assemblies” described by Bemis and Murcko (Bemis 1996). Other available methods include classification by Ring-anchored systems, Chain-anchored systems, ECFP fingerprint, and custom scaffolds.
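The grouping can be approximated outside the program with Bemis-Murcko frameworks; the RDKit sketch below is analogous in spirit to the “Framework” classing used here but is not the ADMET Predictor implementation:

```python
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def group_by_framework(smiles_list):
    """Group structures by their Bemis-Murcko framework (ring systems plus
    connecting chains)."""
    classes = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        core = MurckoScaffold.GetScaffoldForMol(mol)
        classes[Chem.MolToSmiles(core)].append(smi)
    return classes

groups = group_by_framework(["Cc1ccc(Nc2ncccn2)cc1", "CCc1ccc(Nc2ncccn2)cc1", "CCCCO"])
# the two anilino-pyrimidines share one framework; the ring-free butanol does not
```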
ADMET Predictor provides a multitude of tools to help the drug discovery team select candidates from the final list for synthesis and testing. These include interactive 2D and 3D property plots that are color-coded by a third or fourth property, pop-up structure windows, and associated selection tools. Such tools are an important part of the workflow because they afford medicinal chemists useful context when picking out examples for synthesis. In addition, a linked sketching app that returns on-the-fly property predictions helps chemists identify structural “tweaks” that make molecules more attractive synthesis targets without unduly compromising the combination of predicted properties for which they were selected by AIDD. This step is critical in “the real world,” in part because of how it engages medicinal chemists from the project team.