The blood-brain barrier (BBB) is a physiological barrier that maintains brain homeostasis by controlling the exchange of molecules between the blood and the brain [1]. Consequently, the BBB blocks the passage of many molecules into the brain, including administered drugs. This is beneficial when the target of the drug resides outside the brain, since it prevents undesirable drug interactions and the ensuing phenotypic side effects. However, for drugs targeting central nervous system (CNS) diseases, transport across the BBB is mandatory [2]. Therefore, the ability of drug candidates to cross the BBB has to be studied by pharmaceutical companies during drug discovery. To this end, numerous in silico BBB models have been implemented to predict the behavior of drugs across the barrier [3]. These predictive models can be used during the early phases of drug discovery and hence allow companies to save the time and money otherwise lost to failed drug investigations. Two types of in silico BBB models exist in the literature: binary models, which aim at qualitatively predicting whether drugs cross the BBB (BBB+) or not (BBB-), and quantitative models, which attempt to quantify the permeability of the barrier to a given drug by computing the logarithm of the ratio of the concentration of the drug in the brain to that in blood (logBB) or its penetration rate (PR) [3]. For instance, K. Raja et al. [4] proposed two stepwise regression models, one for the prediction of logBB values and the other for PR values. Other quantitative models are reviewed in [3]. While such models assign specific logBB/PR values to each drug, binary models have so far reached a higher prediction accuracy and provide a preliminary insight into the behavior of candidate drugs, which is sufficient in early drug discovery stages.
Binarization of drug permeability across the BBB is predominantly performed by applying empirical thresholds to logBB values [5–9]. However, S. Kunwittaya et al. [6] have shown that varying the logBB threshold changes the prediction accuracy. Therefore, binary BBB models based on logBB values are prone to biases introduced by the threshold setting. On the other hand, Adenot and Lahana [10] introduced a dataset based on the activity of the drug in the CNS: if a drug is CNS active, then it is necessarily BBB+. The converse does not hold, since some drugs can cross the BBB but still show no activity in the CNS. Even though identifying BBB- drugs based on CNS activity is consequently a challenging task, CNS activity-based datasets require no threshold setting and hence do not introduce the previously mentioned biases.
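The threshold sensitivity described above can be illustrated with a minimal sketch. The logBB values, the two cutoffs, and the convention that a value at or above the cutoff counts as BBB+ are all hypothetical choices made for illustration; they are not taken from the cited studies.

```python
import math


def log_bb(c_brain, c_blood):
    """Return logBB = log10(C_brain / C_blood) for positive concentrations."""
    return math.log10(c_brain / c_blood)


def binarize(logbb_values, threshold):
    """Label a compound BBB+ when its logBB is at or above the threshold.

    The ">=" convention is an assumption made for this sketch.
    """
    return ["BBB+" if value >= threshold else "BBB-" for value in logbb_values]


# Hypothetical logBB values for five made-up compounds (not from any dataset).
logbb_values = [0.45, -0.30, -0.85, 0.05, -1.40]

# Two illustrative cutoffs: the same compounds receive different labels.
for threshold in (-1.0, 0.0):
    print(threshold, binarize(logbb_values, threshold))
```

With these made-up values, two of the five compounds switch class when the cutoff moves from -1.0 to 0.0, so a classifier trained on either labeling would be learning a different target.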
Machine learning is ubiquitously applied in binary BBB models. Different types of classifiers have been trained in the literature, including Support Vector Machines (SVM) [6, 8, 11, 12], Linear Discriminant Analysis (LDA) [13], Artificial Neural Networks (ANN) [6], Multi-Layer Perceptrons (MLP) [8, 9], k-Nearest Neighbors (k-NN) [8], Decision Trees (DT) [6, 7], and Random Forests (RF) [5, 8, 9]. Other studies apply consensus models, which train and combine multiple classifiers [8, 9]. While consensus models mitigate the overfitting problem of single classifiers, they naturally require high computational power, especially when dealing with high-dimensional data. The features used to train these classifiers are often molecular descriptors, which are chemical properties describing the drugs [3]. Some studies also add the fingerprints of the molecules in order to improve prediction [8, 9, 12]. On the other hand, a novel approach uses drug side effects and indications for BBB penetration prediction [14]. This model achieved excellent prediction performance but relies on high-level phenotypes, which prevents the extraction of significant biological explanations concerning drug interaction with the BBB.
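One simple way to combine classifiers into a consensus model is majority voting, sketched below. This is only an assumed combination rule for illustration; the cited consensus studies may combine their classifiers differently. The three prediction lists are hypothetical outputs, not results from any trained model.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists (aligned by compound) by majority vote.

    Assumes an odd number of classifiers and binary labels, so no ties occur.
    """
    consensus = []
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        consensus.append(label)
    return consensus


# Hypothetical predictions of three classifiers for the same four compounds.
svm_pred = ["BBB+", "BBB-", "BBB+", "BBB-"]
rf_pred = ["BBB+", "BBB+", "BBB-", "BBB-"]
knn_pred = ["BBB-", "BBB+", "BBB+", "BBB-"]

print(majority_vote([svm_pred, rf_pred, knn_pred]))
```

The cost noted above is visible in the structure: every member classifier has to be trained and evaluated on the full feature set before a single consensus label is produced.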
Molecular descriptors remain the staple of classification-based BBB models. However, the high dimensionality of descriptor-based data is still challenging. Selecting the most relevant features is crucial: by reducing the size of the feature vectors, it improves prediction performance on the one hand and speeds up computation on the other. In order to study the effect of the chosen features on classification performance, Y. Yuan et al. [12] compared the performance of SVM models trained on feature vectors containing different molecular descriptors, fingerprints, or a combination of both. Since trying all possible combinations of features dramatically increases the required computational time and power, an effective feature selection algorithm is needed. In this context, D. Zhang et al. [9] applied a genetic algorithm (GA) for the selection of the appropriate features and the optimization of SVM parameters. Nevertheless, choosing the most suitable algorithm for a given application is an important step, since different algorithms may converge to different feature subsets and consequently affect the prediction results. Hence, this study compares the effect of GA to that of the sequential feature selection (SFS) algorithm on different classifiers applied in the reported in silico BBB models.
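To make the SFS idea concrete, the following is a minimal sketch of the forward variant, which greedily adds one feature at a time. It is not the implementation used in this study. The `score` callable and the toy relevance values are hypothetical; in practice the score would typically be a classifier's validation accuracy on the candidate subset.

```python
def sequential_forward_selection(n_features, score, k):
    """Greedily pick k feature indices, adding the best-scoring one each round."""
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < k:
        # Evaluate each candidate together with the features chosen so far.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy score: per-feature relevance, with features 0 and 1
# treated as redundant (a penalty applies when both are selected).
RELEVANCE = [0.9, 0.8, 0.1, 0.5]
REDUNDANCY_PENALTY = 0.7


def toy_score(subset):
    total = sum(RELEVANCE[f] for f in subset)
    if 0 in subset and 1 in subset:
        total -= REDUNDANCY_PENALTY
    return total


print(sequential_forward_selection(len(RELEVANCE), toy_score, k=2))
```

In this toy setting the search picks feature 0 first and then prefers feature 3 over the individually stronger but redundant feature 1. Because each step is greedy and never revisits earlier choices, SFS evaluates far fewer subsets than an exhaustive search, but it can settle on a different subset than a stochastic population-based search such as a GA.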