2.1. Dataset overview
The dataset for this study was compiled through a literature review using the Web of Science database with the search keywords “solubility” and “binary system”. This search yielded 27,000 solubility data points covering 125 small-molecule solutes, 44 solvents, 110 binary solvent mixtures, and 379 unique solute-binary solvent systems. About 30% of the solutes are FDA-approved drugs [35]; the remaining compounds are mainly pharmaceutical intermediates, drug metabolites, and compounds with therapeutic potential, which broaden the dataset’s coverage. To the best of the authors’ knowledge, this is the most extensive open-access dataset on solubility in binary solvent systems available to date. Figure 2 provides an overview of the dataset. In Fig. 2a, the data points are grouped into quartiles based on their solubility values, with each of the four quartile bars containing approximately 7,000 data points. To illustrate how temperature affects solubility, each quartile bar is shaded with a color gradient that shifts from light to dark hues, representing investigation temperatures from lower to higher values. As expected, higher temperatures are usually associated with increased solubility [36–38]. For example, in the lowest solubility quartile ([−5.48, −0.42]), only 13.8% of samples were recorded at higher temperatures (above 313.15 K); this proportion nearly doubles to 27.7% in the fourth quartile, which corresponds to the highest solubility levels ([1.01, 1.99]).
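The quartile-and-temperature breakdown described above can be reproduced on any solubility table. The sketch below uses synthetic LogS and temperature values (not the actual dataset) to show the equal-count binning and the per-quartile fraction of high-temperature measurements:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the dataset: LogS values and investigation temperatures (K).
df = pd.DataFrame({
    "logS": rng.normal(0.0, 1.5, 1000),
    "T_K": rng.uniform(278.15, 333.15, 1000),
})

# Bin the data into four equal-count solubility quartiles, as in Fig. 2a.
df["quartile"] = pd.qcut(df["logS"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Fraction of measurements above 313.15 K within each quartile.
high_T_frac = (df["T_K"] > 313.15).groupby(df["quartile"]).mean()
print(high_T_frac)
```

With real data, a skew of high-temperature points toward the upper quartiles reproduces the trend reported above.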
The dataset was then divided into two parts: 80% for the training set and 20% for the test set, ensuring that no solute-binary solvent system was repeated across the subsets. The training set was used for feature engineering and model hyperparameter optimization, while the test set was used for model evaluation to identify the optimal models. Figures 2b to 2e show the distributions of four key attributes of the solute-solvent systems (investigation temperature, solute molecular weight, solvent melting temperature, and reported solubility) across both subsets. These plots demonstrate that the test set aligns with the training set in both scope and distribution, ensuring that the model evaluation is representative and thorough.
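A group-aware 80/20 split of this kind can be obtained with scikit-learn’s GroupShuffleSplit, where each group id identifies one solute-binary solvent system. The data below are synthetic placeholders, not the study’s dataset:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))       # placeholder feature matrix
y = rng.normal(size=n)            # placeholder LogS targets
# One group id per solute-binary solvent system (hypothetical ids).
groups = rng.integers(0, 60, size=n)

# 80/20 split that keeps every system entirely within one subset.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups))

print(len(train_idx), len(test_idx))
```

Splitting on group ids rather than rows is what prevents a system measured at several temperatures or ratios from leaking between the subsets.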
2.2. Feature engineering
Feature engineering is a critical step in the development of ML models, as it involves identifying the most effective set of features that the models use to learn the relationship between the inputs and the target [39–42]. In the context of this study, the feature set initially collected from the literature was limited to solubility measurement parameters only, including solvent ratios and temperature of investigation. However, to predict the solubility in solvent mixtures, it was necessary to also include descriptors of the solute and the two solvents involved.
This research utilized a combination of experimental and computational features to describe the solutes and solvents under study. Compound melting temperature was selected as a crucial experimental attribute because of its well-established effect on solubility [43–45] and its widespread availability from open-access sources such as the Chemical Abstracts Service (CAS) database [46] and supplier websites. Beyond melting temperatures, the dataset was enriched with computational descriptors for both solutes and solvents, incorporating MACCS molecular fingerprints and RDKit features to provide a comprehensive characterization. These features were generated with the RDKit package [47], a cheminformatics toolkit that processes molecular structures encoded in SMILES notation. Such computational descriptors are widely applied in fields including drug discovery [48–51] and materials science [24, 52–57]. Beyond their widespread use, MACCS fingerprints and RDKit features were also chosen for their interpretability: these descriptors either map directly to specific molecular substructures or correspond to computed molecular properties. Such interpretability plays a crucial role in understanding the models’ predictions and in gaining insights from the model analysis.
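As a minimal sketch (the exact descriptor set used by the authors is defined in their released source code), MACCS fingerprints and RDKit descriptors can be generated from a SMILES string as follows, using aspirin as an example solute:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, MACCSkeys

# Aspirin (ASA) as an example; any valid SMILES works the same way.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# 167-bit MACCS structural-key fingerprint.
fp = MACCSkeys.GenMACCSKeys(mol)
maccs_bits = list(fp)

# A few RDKit physicochemical descriptors; the full set (roughly 200
# entries) is enumerated in Descriptors.descList.
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
min_estate = Descriptors.MinEStateIndex(mol)

print(len(maccs_bits), round(mw, 2), round(logp, 2))
```

Concatenating the solute’s and both solvents’ vectors with the measurement parameters yields the per-system feature row described above.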
However, the inclusion of these computational features substantially increased the dataset’s dimensionality, introducing over 1,000 features for each solute-binary solvent system. High-dimensional data can pose challenges, including the risk of overfitting and increased computational demands for model development [58–60]. To mitigate these challenges while preserving the information content, a standard refinement process based on feature variance and correlation was implemented. First, features with zero variance were removed, as their constant values across all observations provide no discriminative capability and thus cannot improve the models’ predictive performance. Next, features highly correlated with others in the dataset (Pearson correlation > 0.8 [61–63]) were removed to reduce redundancy, which can burden the model and diminish its ability to generalize. These steps condensed the dataset to a more manageable 370 features, approximately one-third of its original size. This refined dataset was used in subsequent studies.
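The two filtering steps can be sketched in a few lines of pandas. This is a generic implementation under the stated thresholds, not the authors’ exact code; the demo frame and its column names are invented for illustration:

```python
import numpy as np
import pandas as pd

def refine_features(df: pd.DataFrame, corr_cutoff: float = 0.8) -> pd.DataFrame:
    # Step 1: drop zero-variance features (constant columns carry no information).
    df = df.loc[:, df.var() > 0]
    # Step 2: drop one member of every highly correlated pair (|Pearson r| > cutoff),
    # scanning only the upper triangle so each pair is considered once.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > corr_cutoff).any()]
    return df.drop(columns=drop)

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 3))
demo = pd.DataFrame({
    "f1": base[:, 0],
    "f2": base[:, 0] * 2.0 + 0.01 * base[:, 1],  # nearly duplicates f1
    "f3": base[:, 2],
    "f4": np.ones(200),                           # zero variance
})
refined = refine_features(demo)
print(list(refined.columns))
```

On the demo frame, the constant column and one member of the correlated pair are removed, mirroring the roughly two-thirds reduction reported above.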
Beyond these measures, principal component analysis (PCA) was also evaluated as an additional step for feature reduction. PCA is a widely used technique for managing high-dimensional data by transforming a large set of variables into a smaller number of principal components [64–66]. As illustrated in Figure S1, 86 principal components were sufficient to explain the majority of the dataset’s variance (> 95%). However, PCA-based dimensionality reduction has limitations, including reduced model interpretability [67, 68] and potential information loss during the transformation [69, 70]. As shown in Figure S2, applying PCA for further feature reduction decreased model performance; the PCA-processed dataset was therefore not used further in this study.
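Selecting the smallest number of components that explains more than 95% of the variance is directly supported by scikit-learn’s PCA when a fractional n_components is given. The matrix below is synthetic, standing in for the 370-feature set:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic feature matrix with ~20 underlying directions of variation.
latent = rng.normal(size=(500, 20))
mixing = rng.normal(size=(20, 100))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 100))

# Keep the fewest components whose cumulative explained variance exceeds 95%.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```

On the real dataset this procedure yielded the 86 components cited above; here the synthetic data collapses to roughly its 20 latent directions.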
2.3. ML model development and evaluation
Using the refined dataset, a selection of ML models was trained and fine-tuned on the training set before their performance was assessed on the test set. Training employed ten-fold group cross-validation, which iterates the training and evaluation process across all folds. This approach uses the available data comprehensively, with the accuracy from each fold averaged to determine the overall model performance. To further improve performance, the model hyperparameters were tuned via Bayesian optimization [71] to identify the optimal ML configuration. Bayesian optimization was found to be more efficient than traditional random and full grid searches because it builds a probabilistic model mapping hyperparameters to a performance metric, enabling a strategic search rather than a brute-force one [72–74]. In addition, its ability to handle continuous parameter spaces provides a finer search resolution than grid-based approaches constrained to discrete values [75, 76]. The Bayesian optimization ran for 100 iterations to search the hyperparameter space (Table S1); the hyperparameters identified for each model are summarized in Table S2.
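The inner cross-validation loop that the optimizer scores can be sketched with scikit-learn’s GroupKFold. In the sketch below the data are synthetic, GradientBoostingRegressor stands in for XGB/LightGBM, and the Bayesian optimizer (e.g., scikit-optimize or hyperopt) would simply call this loop once per proposed hyperparameter set for each of the 100 iterations:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 8))
y = X[:, 0] * 1.5 - X[:, 3] + 0.1 * rng.normal(size=n)  # synthetic targets
groups = rng.integers(0, 50, size=n)  # one id per solute-solvent system

# Ten-fold group cross-validation: all rows of a system stay in one fold.
fold_mae = []
for train_idx, val_idx in GroupKFold(n_splits=10).split(X, y, groups):
    model = GradientBoostingRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

cv_mae = float(np.mean(fold_mae))  # averaged fold score reported to the optimizer
print(cv_mae)
```

Grouping by system in the folds, as in the train/test split, keeps the cross-validated score an honest estimate of performance on unseen solute-solvent combinations.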
Following hyperparameter tuning, all models configured with their optimized hyperparameters were evaluated on the test set, with their performance shown in Fig. 3. Figure 3a shows the absolute error between the model predictions and the actual targets, arranging the models in ascending order of MAE. Notably, the XGB and LightGBM models exhibited the lowest MAE, signifying superior performance relative to the other models analyzed. Their performance was further measured by additional metrics, as shown in Fig. 3b, where the XGB and LightGBM models consistently exceeded the other models across all metrics. This effectiveness of gradient-boosted tree models (e.g., XGB and LightGBM) aligns with literature findings on chemical tabular data within similar data regimes [22, 33, 77–79]. In addition, the accuracy of these models is comparable to, if not better than, that of benchmark ML studies in the literature, where LogS MAE values for new solute-solvent combinations are typically between 0.4 and 0.5 [25, 26]. This performance is especially notable given that most prior solubility studies focus on less complicated systems, such as single solvents or constant temperature, whereas our model was trained to predict solubility in binary solvent systems measured over a range of temperatures.
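For reference, the four error metrics used throughout the evaluation reduce to a few NumPy operations; the toy values below are illustrative, not measured results:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, MedAE, MSE, and RMSE between targets and predictions."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = float(np.mean(err ** 2))
    return {
        "MAE": float(np.mean(np.abs(err))),     # mean absolute error
        "MedAE": float(np.median(np.abs(err))), # median absolute error
        "MSE": mse,                             # mean square error
        "RMSE": float(np.sqrt(mse)),            # root mean square error
    }

m = regression_metrics([0.0, 1.0, -2.0, 0.5], [0.2, 0.6, -2.0, 1.0])
print(m)
```

MedAE is reported alongside MAE because it is robust to the few large-error outliers visible in the boxplots of Fig. 3a.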
Figure 3. (a) Distribution of the absolute error between experimental values and predictions for nine models, based on evaluations with the test set. Each boxplot highlights the mean absolute error (MAE) and median absolute error (MedAE) using white circles and black lines, respectively. Outliers are omitted for clarity; detailed plots including outliers are available in Figure S3. (b) Performance of these models summarized using four metrics: MAE, MedAE, root mean square error (RMSE), and mean square error (MSE). Among the evaluated models, the two gradient-boosted tree models, XGB and LightGBM, showed superior performance across all considered metrics.
2.4. Prospective study
The performance of the developed XGB and LightGBM models was further assessed through an experimental prospective solubility study on four small-molecule compounds: aspirin (ASA), acetaminophen (ACM), 3-indolepropionic acid (IPA), and celecoxib (CXB). Among these compounds, ASA, ACM, and CXB are FDA-approved drugs, and IPA has been evaluated for its neuroprotective and anti-inflammatory properties in indications such as brain injury [80–82], stroke [83–85], and neurodegenerative diseases [86–88]. As shown in Fig. 4, these molecules were purposefully selected for their varied structural features, molecular weights, and LogP values. In addition, ACM and CXB were previously included in our dataset, while ASA and IPA represent novel solutes. The solubility measurements for these molecules were performed in ethanol/water mixtures, presenting the models with solute-binary solvent systems they had not previously encountered. To examine the effects of measurement parameters on solubility and model predictions, the studies were performed at two temperatures (298.15 and 313.15 K) and three solvent ratios (ethanol weight fraction = 0.2, 0.6, and 0.8). In total, 24 solubility studies were performed in triplicate, as detailed in Fig. 4e.
The solubility measurement results are shown in Figs. 5a to 5d, where slashed and non-slashed bars represent results at the higher (313.15 K) and lower (298.15 K) temperatures, respectively. The measured solubility values spanned from −3 to 1.5 LogS (g/100 g), consistent with the range observed in the dataset. As expected, the results show that the measurement parameters significantly influence solute solubility, with solubility increasing with temperature or with the ethanol content of the solvent mixture. Following the solubility measurements, the models’ ability to predict solubility for these solutes was evaluated using four metrics for each compound. Figures 5e to 5g indicate that the XGB model is effective at predicting solubility for ASA, ACM, and IPA, with MAEs of 0.23, 0.18, and 0.39, respectively. The model’s accuracy for ACM may be due to its presence in the training dataset: although ACM’s solubility in ethanol/water mixtures was not included, data on ACM’s solubility in other solvents likely contributed to the model’s proficiency in handling new solute-solvent combinations. Despite ASA and IPA being new to the model, their predicted solubilities were notably accurate and on par with the top-performing models described in existing research, which have focused on solubility in single solvents under both constant and variable temperature conditions [25, 26]. This accuracy likely stems from the fact that most of the key features of ASA and IPA fall within the scope of the dataset (detailed in the following paragraph). Such alignment enables the model to generalize effectively to these two new compounds, leveraging the diverse and extensive range of features in the existing dataset, which encompasses more than 100 distinct solutes.
However, the XGB model’s predictions for CXB were less accurate despite CXB being present in the dataset. To explore the underlying reasons for this discrepancy, we examined the 15 solute features deemed most influential to the XGB model’s decisions (Fig. 6a). The values of these features for the four compounds (ASA, ACM, IPA, and CXB) were then plotted against the overall distribution of all solutes within the dataset. As shown in Figure S5, CXB deviates significantly from the dataset’s distribution in the majority (11 out of 15) of these crucial features, either falling below the first quartile, exceeding the third quartile, or being considered an outlier. This deviation is considerably greater than that observed for ASA (2 out of 15), ACM (3 out of 15), and IPA (4 out of 15), whose solubility predictions were more accurate. Figures 6b to 6e highlight four representative features that illustrate this divergence. For instance, Fig. 6c shows that the majority of solutes in the dataset have a MinEStateIndex between −2 and 1, with ASA, ACM, and IPA fitting well within this range, whereas CXB stands out as an outlier. The MinEStateIndex is an electrotopological state feature calculated using the rdkit.Chem.EState.EState module [89]. These features, originating from the work of Kier and Hall [90], have been extensively employed as key compound descriptors in solubility prediction [91–93]. CXB’s deviation in this and other features likely limited the model’s capacity to accurately predict its solubility.
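The out-of-distribution check described here can be sketched generically: for each influential feature (in practice taken from the trained model’s feature importances), count how many of a new solute’s values fall outside the training distribution’s interquartile fences. The data, compound stand-ins, and feature names below are all hypothetical:

```python
import numpy as np
import pandas as pd

def count_out_of_distribution(train_feats: pd.DataFrame, new_feats: pd.Series) -> int:
    """Count features of a new solute lying outside the training IQR fences."""
    q1 = train_feats.quantile(0.25)
    q3 = train_feats.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outside = (new_feats < lower) | (new_feats > upper)
    return int(outside.sum())

rng = np.random.default_rng(0)
# Hypothetical training distribution over 15 influential features.
train = pd.DataFrame(rng.normal(0, 1, size=(120, 15)),
                     columns=[f"feat_{i}" for i in range(15)])
# A CXB-like solute with extreme values, and an in-distribution ASA-like solute.
cxb_like = pd.Series(6.0, index=train.columns)
asa_like = pd.Series(0.1, index=train.columns)

print(count_out_of_distribution(train, cxb_like),
      count_out_of_distribution(train, asa_like))
```

A high count, as observed for CXB, flags a solute whose key features sit outside the model’s training domain, warning that its prediction should be treated with caution.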
Considering the comparable performance of the LightGBM and XGB models during the development stage, a parallel analysis was conducted for the LightGBM model. This included both validation and interpretation with respect to critical solute attributes. In model validation, LightGBM demonstrated superior accuracy in predicting the solubilities of ASA, ACM, and IPA compared to CXB, mirroring results observed for the XGB model. Despite LightGBM’s marginally lower performance than XGB, marked by a slight increase in MAE of 0.1 to 0.2 for ASA, ACM, and IPA predictions, it surpassed XGB in accuracy for CXB solubility predictions. Further examination of the feature distribution highlighted by LightGBM revealed a pattern consistent with the analysis of the XGB model, indicating significant deviation in CXB for most of the critical features relative to ASA, ACM, and IPA. These results are summarized in Figures S6, S7, S8, and S9.
In summary, the current models exhibit considerable potential in predicting drug solubility, especially for drugs whose features align with the dataset. This is evidenced by their performance on test sets derived from the literature and through experimental prospective studies. These models not only meet or exceed the capabilities of existing ML models, as elaborated above, but also offer a compelling alternative to traditional mechanistic prediction methods such as COSMO-RS. COSMO-RS, a quantum chemistry-based simulation technique, is frequently employed for solubility prediction [94, 95]. However, studies have shown that COSMO-RS still faces challenges, with reported accuracies (LogS, g/100 g and mol/L) of around 0.7 MAE and about 1.0 RMSE [15, 25, 96, 97]. Additionally, COSMO-RS’s reliance on proprietary software and on enthalpy of fusion data further limits its application. In contrast, the ML models developed in this study require only the chemical structures and melting temperatures of the solutes and solvents to generate predictions.
One limitation of our dataset is that it largely comprises molecules of lower molecular weight and relatively high aqueous solubility. Drugs that are approved or currently under development span a wide range of physicochemical properties and thus may not be fully represented by the molecules in our dataset. In response to this limitation, and to further enhance future model development, we are openly sharing the dataset and source code. We hope that other research groups will build upon our dataset, broadening its scope to encompass a more extensive range of drug properties; such expansion is anticipated to refine the models’ predictive accuracy across a more diverse array of compounds. By making these resources available [98], we aim to catalyze advancements in the field, encouraging collaborative efforts to overcome existing challenges and push the boundaries of what is currently achievable in predictive modeling.