Data Preparation
Compounds’ SMILES strings were retrieved from the DILIrank database (1036 compounds) (15), and the SIDER 4.1 database (1430 compounds) (16). Side effects are recorded in SIDER using the preferred terms of the MedDRA (Medical Dictionary for Regulatory Activities), which provides a hierarchical organization of adverse events. Starting with the entire SIDER dataset (SIDER 4.1), all compounds with at least one reported side effect contained in the MedDRA’s System Organ Class hepatobiliary disorders were discarded to keep only drugs for which no liver-related side effects have been reported. The SMILES retrieved from DILIrank and SIDER were standardized using the Python package Standardiser of Atkinson et al. (2016) (54). This involved the removal of counterions and solvents and the neutralization of the remaining fragments if necessary. Moreover, tautomers were standardized according to the rules implemented in the standardizer. Subsequently, compounds that fell into at least one of the following categories were discarded: mixtures of more than one active ingredient, inorganic molecules, metal-organic compounds, and compounds with a molecular weight above 1 kDa. If compounds were present in both the DILIrank and the SIDER dataset, the compound from the SIDER inactive dataset was removed to avoid duplicate entries. The final set contained 923 compounds composed as follows: DILIrank: 174 vMost-DILI-Concern, 260 vLess-DILI-Concern, 227 vNo-DILI-Concern, SIDER: 262 compounds without reported liver-related side effects.
ECFP4 (17) hashed to 2,048 bits were generated using the Python library RDKit (version 2019.03.1.0) (55). 1,189 1D and 2D molecular descriptors were generated using the Python package Mordred (18). For the generation of models the values of the molecular descriptors were scaled to a Gaussian distribution with zero mean and unit variance using the StandardScaler function in the scikit-learn Python library (version 0.21.2) (56). Bioactivity for 1,673 human protein targets was predicted using the PIDGINv3 software (19–21). 10 µM was chosen as the bioactivity cut-off to consider highly and marginally active compounds. To get a prediction for every compound-target pair, no threshold for the applicability domain was applied. For 6 out of the 923 drugs (4 of them from SIDER) no protein target prediction was made, since their structures could not be standardized internally in the PIDGINv3 software.
In order to implement a OCO-CV scheme (to ensure similar compounds were not in different folds), we performed hierarchical clustering of compounds. Based on pairwise Tanimoto similarities calculated using ECFP4, a tree was generated using hierarchical clustering with the Nearest Point Algorithm implemented in SciPy (version 1.2.1) (57). Clusters were generated by cutting the hierarchical tree at a distance of 0.5, which resulted in compounds with a Tanimoto similarity of at least 0.5 being in the same cluster.
Model Generation
Overview
We used the SVM and RF implementations in the scikit-learn Python library (version 0.21.2) to train binary classification models for DILI. Models were developed for all three input feature spaces (fingerprints, protein targets, and Mordred molecular descriptors). In addition, we generated models using different subsets of data considering different DILIrank classes as well as additional inactives from the SIDER database (Table 1).
Model Hyperparameter Grid Search
Firstly, for SVMs (58) we used a classifier as implemented in the sklearn Python library and performed a hyperparameter grid search over the following parameters: Kernel: [‘linear’], Class weight: [‘balanced’], Decision function: [‘one vs. rest’], Shrinking: [‘True’], C: [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1]. Of the possible SVM kernels implemented sklearn, we only evaluated ‘linear’ as it alone allows for easy interpretation of model feature importances. Secondly, for RFs (59) we used a classifier as implemented in the sklearn Python library and performed a hyperparameter grid search over the following parameters: Bootstrap: [‘True’], Class weight: [‘balanced subsample’], Max. Tree depth: [10, 15, None], Min. samples per leaf: [1,2,3], Number estimators: [100, 200, 300, 400, 500, 750, 1000].
Training procedure
The predictive performance of models was evaluated using a 5-fold cross-validation inside a 10-fold training scheme. The original dataset was split into 10 stratified folds using the StratifiedKFold function in scikit-learn with parameters: n_splits=10, and shuffle=True to assess the impact of different training data on model predictive performance. Within each training fold, an internal grid search over model hyperparameters was conducted using an internal 5-fold cross-validation to select the best model per training fold. This best model in each fold, assessed by balanced accuracy, was then used to predict for the holdout external test set. A LOCO-CV scheme was implemented with the GroupKFold function in scikit-learn with clusters only containing compounds with a ECFP4 Tanimoto similarity (60) greater than or equal to 0.5. This cross-validation scheme was utilized irrespective of the descriptor type. In addition, baseline models were trained and evaluated using the same procedure as mentioned, but with prior y-scrambling of the output labels (3 runs of different random scramblings). To further evaluate model predictive performance, an FDA validation set composed of 49 compounds (a subset of 55 - the CAMDA organizers removed 6, of which were unknown to the authors) previously labeled as vAmbiguous-DILI-Concern, but later relabeled as DILI positive or DILI negative by the FDA was used as an additional test set.
To evaluate the relationship between the chemical similarity of a compound to its 5 nearest neighbors in the training set and model correct classification rate a LOO-CV was conducted for the best performing model (SVM, DILIrank (-vLessConcern, (Figure 2a). This required the calculation of the Tanimoto similarities between the training dataset and the left-out compound using ECFP4 fingerprints and predicting its DILI label within each LOO-CV fold. Furthermore, the Tanimoto 5 nearest neighbor inter-similarity of the FDA validation set to the training set was compared to the corresponding similarities of the external test set (Figure 2b). As we previously evaluated the performance of the model (SVM, DILIrank (-vLessConcern)) with 10 distinct external test sets (in a 10-fold cross-validation scheme; see above), the similarities are averaged across all 10 pairs of training and test set.
Balanced accuracy (equation 1) was primarily used to assess the predictive performance of the models. We also utilized specificity to compare models to those previously published in the literature (Table 3). Those metrics were calculated from a confusion matrix consisting of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).
Interpretation of protein targets
Proteins targets with higher binding probability in DILI positive compounds were determined using a one-sided Wilcoxon rank-sum test with a FDR of <0.05. Among those, proteins features with high median importance across the 10 train-test splits were identified for RF and SVM models, respectively. The feature importance for RF models implemented in scikit-learn (56) describes the decrease of node impurity achieved by a feature, averaged over all trees in the forest, as a fraction, so that the importance of all features included in the model sum up to 1 (59). In SVM models using linear kernels the importance of features is reflected by the magnitude of their coefficients describing the hyperplane (58). The sign indicates which class is favored by the presence of a given feature. The values used for further analysis were the median importance of a feature across the 10 train-test splits.
Over-enrichment analysis was performed using the ReactomePA R package (version 1.26.0) (23) and Reactome pathway maps. To this end, Uniprot IDs were mapped to Entrez gene IDs with the biomaRt package (61) and the whole list of PIDGIN target proteins was used as background. Only gene sets with a containing 10 or more genes were considered and p-values were adjusted using the Benjamini-Hochberg procedure. The analysis was performed using various feature importance thresholds scanning across the top quantile of absolute feature importance values.
Structural alerts
Derivation of structural alerts
Two algorithms for structural alert derivation were used, with the DILIrank (-vLessConcern) dataset as input - MoSS and SARpy. MoSS is a graph-based depth-first search method used for chemical substructure mining (24) and we used a KNIME (v3.7.2.) (63) implementation of MoSS in the current study. It derives potential structural alerts as “subgraphs" with only heavy atoms, which are neither SMILES nor SMARTS. Users might decide to approximate the subgraphs with SMARTS in order to match the substructure to molecules (denoted by SMILES). This program searches for frequent molecular substructures and discriminative fragments in a set of molecule graphs. In a graph, a vertex is a representation of an atom and an edge is a representation of a bond. Each vertex has attributes related to atom type, charge, and whether it is a part of an aromatic ring. Edges indicate the bond type. The search starts from the root of graph tree being a single atom and follows recursively through atoms linked to leaf atoms with subsequent bonds. Substructures are then created based on each state of the graph tree and are pruned if the substructure occurrence in the active class is lower than the defined minimum focus support (MFS).
In order to find a discriminative fragment, two thresholds should be defined by the user. The first one is the aforementioned MFS used for pruning and the second one is minimum complement support (MCS) i.e. the substructure occurrence in the inactive class. The following KNIME MoSS settings were chosen: ‘1%’ MFS (the minimum fraction of the fragment-contained chemicals in the DILI positive class - the true positive rate), ‘0.01%’ MCS (the maximum fraction of the fragment-contained drugs in the DILI negative class - the false positive rate). In addition, only substructures in which the number of bonds ranged from ‘2’ to ‘15’ were kept. Pure carbon fragments were ‘ignored’ and ring mining was ‘applied’.
SARpy is a string-based search method used for chemical substructure mining (25). Briefly, structural alerts in the form of SMARTS strings are generated by recursively breaking every combination of bonds working directly on the SMILES strings of the input dataset. Fragments are then internally validated against all compounds in the dataset, and then a reduced set of substructure “rules” is extracted. In this work’s implementation of SARpy (v.1.0) the fragmentize function parameters minAtoms and maxAtoms were set to 2 and 15 respectively, and the ‘target’ (i.e. DILI positive or DILI negative) was set to None. Structural alerts for DILI were extracted using the extract function with the parameters: 5 minHits, 1 minLR, and 0 minPrecision. These settings are identical to those used by Yang et al. (2017) (62), except that a precision threshold was not applied in order to generate a larger compendium of structural alerts to analyze.
Evaluation of structural alerts
Structural alerts’ SMARTS were matched to compounds’ SMILES using the RDKit HasSubstructMatch function in Python (RDKit 2018.09.3.0). Precision (equation 2) and coverage in DILI positive compounds were both used to assess the predictive performance of the structural alerts. In addition, significance measured by p-value was also calculated for each structural alert using the SciPy (version 1.3.0) stats module fisher_exact function with alternative parameter set to ‘greater’.
As previously mentioned, because MoSS utilizes a graph-based search approach it may not consider the slight difference between aromatic and aliphatic atoms, leading to mismatches when matching its substructures to SMILES. For example, in MoSS, ‘N-C’ can match both aminofuran (NC1=CC=CO1) and aminotetrahydrofuran (NC1CCCO1). However, in SMARTS, ‘C’ is different from ‘c’, so RDKit will not match ‘N-C’ with aminofuran, because the carbons are aromatic. Despite this, the significance of these structural alerts for all structural alerts is based on the presence calculated using RDKit.
To investigate a substructure’s presence in already approved and marketed drugs, substructures were matched to compounds in the DrugBank database (48) (v.5.1.4) using the RDKit (version 2018.09.1) (55) HasSubstructMatch function. This involved firstly standardizing compounds’ SMILES using the Python package Standardiser of Atkinson et al. (2016) (54). As some SMILES could not be standardized, this step reduced the total number of DrugBank compounds in the analysis from 2411 to 2136.