Despite the increasing use of omics analysis, few studies have compared features from different profiling modalities in terms of variability, redundancy, and MOA dependency. Analyses of transcriptomic and phenomic datasets can enhance our understanding of the strengths and limitations of profiling assays for evaluating the MOAs of compounds6. To this end, our study examined the diversity of transcriptomic and phenomic profiles by feature extraction and similarity-matrix analysis. Furthermore, using the L1000 and Cell Painting datasets, we analyzed how the performance of machine learning models for MOA prediction depends on the feature type and on the MOA of the compounds.
Our analysis revealed that transcriptomic features have higher diversity than phenomic features. Data visualization with the t-SNE or Isomap algorithm showed that compounds were most dispersed after transcriptomic feature extraction, whereas phenomic feature extraction led to the formation of aggregated clusters of multiple compounds. Furthermore, the heatmap of pairwise feature correlations showed overall higher scores among phenomic features than among transcriptomic features. These results indicate that Cell Painting contains more redundant measurements than L1000. Feature redundancy in Cell Painting was also observed in another study, in which only 1,020 features were selected for analysis1. Currently, phenomic profiling lacks representative platforms, standardized computational pipelines, and comprehensive publicly available datasets. Experimental and computational advances, such as robust analysis of the spatial and functional changes in cell status and the use of packages compatible with images from various platforms, will enhance the usefulness of morphology datasets3, 23.
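As an illustration of how the redundancy comparison above could be quantified, the following minimal sketch summarizes a pairwise feature-correlation heatmap as a single number: the mean absolute Pearson correlation over all distinct feature pairs. It is not the pipeline used in this study. The function name `mean_abs_feature_correlation` and the synthetic matrices are assumptions introduced for illustration only and do not represent L1000 or Cell Painting data.

```python
import numpy as np


def mean_abs_feature_correlation(profiles: np.ndarray) -> float:
    """Mean absolute Pearson correlation over all distinct feature pairs.

    profiles has shape (n_compounds, n_features). Larger values mean the
    features carry more overlapping (redundant) information.
    """
    corr = np.corrcoef(profiles, rowvar=False)   # (n_features, n_features)
    upper = np.triu_indices_from(corr, k=1)      # each feature pair once
    return float(np.mean(np.abs(corr[upper])))


# Synthetic stand-ins (assumption: NOT the real assay data).
rng = np.random.default_rng(0)
n_compounds, n_features = 200, 30

# Case 1: features that vary independently of one another.
independent = rng.normal(size=(n_compounds, n_features))

# Case 2: features driven by only three hidden factors plus small noise,
# mimicking many measurements of the same underlying change.
latent = rng.normal(size=(n_compounds, 3))
loadings = rng.normal(size=(3, n_features))
redundant = latent @ loadings + 0.1 * rng.normal(size=(n_compounds, n_features))

score_independent = mean_abs_feature_correlation(independent)
score_redundant = mean_abs_feature_correlation(redundant)
print(f"independent: {score_independent:.2f}  redundant: {score_redundant:.2f}")
```

In this toy setting the factor-driven matrix scores higher than the independent one, which is the pattern the heatmap comparison describes for phenomic versus transcriptomic features.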
As certain MOAs may alter cell morphology with relatively few changes in gene expression and vice versa, we examined whether compounds associated with the same MOA have more highly correlated features than randomly paired compounds (Fig. 3D). Despite the variation in the correlation scores of compounds, we observed a tendency for phenomic features to capture the MOAs of adrenergic receptor antagonists and glucocorticoid receptor agonists and for transcriptomic features to capture the MOA of beta-adrenergic receptor agonists. Meanwhile, compound similarity analysis identified new highly correlated compound pairs. Experimental validation of correlated compound pairs is challenging, but previous studies provide supporting evidence for some of them. Doxylamine was highly correlated with cetirizine, which, like doxylamine, is a histamine receptor antagonist. Doxylamine was also correlated with mepivacaine and LFM-A12 based on the similarity in merged features (Table 1). Both doxylamine and mepivacaine produce spinal motor and sensory blockade24. LFM-A12 is a specific inhibitor of the EGFR tyrosine kinase, and doxylamine potentially inhibits the expression of lncRNA in non-homologous end joining pathway 1 (LINP1), which is regulated by the EGF signaling pathway25, 26. Given the common use of similarity-based repurposing, the integration of omics features can aid in the identification of new MOAs for compounds.
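The same-MOA versus random-pair comparison described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the analysis code of this study: the function `same_vs_other_moa_correlation`, the MOA labels `moa_a` to `moa_c`, and the synthetic profiles are hypothetical, and pairs with different MOA labels are used here as a simple stand-in for the random-pair background.

```python
import numpy as np


def same_vs_other_moa_correlation(profiles, moa_labels):
    """Compare profile correlations of same-MOA pairs with other pairs.

    profiles: (n_compounds, n_features) array, one row per compound.
    moa_labels: one MOA label per compound.
    Returns (mean correlation of same-MOA pairs,
             mean correlation of different-MOA pairs).
    """
    labels = np.asarray(moa_labels)
    corr = np.corrcoef(profiles)                 # compound-by-compound matrix
    i, j = np.triu_indices(len(labels), k=1)     # each compound pair once
    same = labels[i] == labels[j]
    pair_corr = corr[i, j]
    return float(pair_corr[same].mean()), float(pair_corr[~same].mean())


# Synthetic stand-in data (assumption): three hypothetical MOAs, each with
# its own signature, and ten noisy compounds per MOA.
rng = np.random.default_rng(1)
signatures = rng.normal(size=(3, 50))
group = np.repeat(np.arange(3), 10)
moa_labels = np.array(["moa_a", "moa_b", "moa_c"])[group]
profiles = signatures[group] + 0.5 * rng.normal(size=(30, 50))

same_mean, other_mean = same_vs_other_moa_correlation(profiles, moa_labels)
print(f"same-MOA pairs: {same_mean:.2f}  different-MOA pairs: {other_mean:.2f}")
```

Running the comparison separately per MOA and per feature type (transcriptomic, phenomic, merged) would show which modality separates a given MOA from the background, which is the kind of MOA-dependent tendency noted above.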
To improve the machine learning model for MOA prediction, we ran multiple trials that evaluated distance-based or tree-based algorithms, training with transcriptomic, phenomic, or merged features, and training on the features in the top 10% by importance, but performance remained low (Fig. 4). The XGBoost and Extra Tree algorithms overfitted the training set, possibly because of the disparity between the number of variables and the number of samples (also called the curse of dimensionality)20. Training KNN models with all transcriptomic and phenomic features yielded accuracies of 0.03–0.08, similar to those obtained with the features in the top 10% by importance. A recent study also demonstrated low performance of machine learning models trained on the L1000 or Cell Painting features of 1,327 compounds with 511 MOAs using deep learning and ensemble architectures (area under the precision-recall curve of 0.04)1. Thus, a systematic approach encompassing MOA annotation, robust acquisition of experimental results, and computational support for normalization, feature selection, and training algorithms would be required to use profiling readouts for MOA prediction. Notably, glucocorticoid receptor agonist was the MOA best predicted by the selected features from transcriptomic, phenomic, and concatenated profiles (Table 2), indicating that the MOA-specific response of a profiling assay can inform the planning of profiling experiments.
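The workflow discussed above (ranking features by importance, keeping the top 10%, and training a KNN classifier) can be sketched with scikit-learn as below. This is a minimal sketch under stated assumptions rather than the configuration used in this study: the data come from `make_classification`, not from L1000 or Cell Painting, and the number of trees, the number of neighbors, and the split ratio are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in (assumption): 300 "compounds", 200 features, 3 classes.
X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Rank features with a tree ensemble fitted on the training split only,
# so the held-out compounds do not influence feature selection.
trees = ExtraTreesClassifier(n_estimators=200, random_state=0)
trees.fit(X_train, y_train)

# A large gap between these two scores is the overfitting signature.
tree_train_acc = trees.score(X_train, y_train)
tree_test_acc = trees.score(X_test, y_test)

# Keep the 10% of features with the highest importance.
n_keep = max(1, X.shape[1] // 10)
top_idx = np.argsort(trees.feature_importances_)[::-1][:n_keep]


def knn_accuracy(columns):
    """Held-out accuracy of a 5-nearest-neighbor model on the given columns."""
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train[:, columns], y_train)
    return knn.score(X_test[:, columns], y_test)


acc_all = knn_accuracy(np.arange(X.shape[1]))
acc_top = knn_accuracy(top_idx)
print(f"trees train/test: {tree_train_acc:.2f}/{tree_test_acc:.2f}")
print(f"KNN all features: {acc_all:.2f}  KNN top 10%: {acc_top:.2f}")
```

With real profiling data the number of MOA classes is far larger and each class has few compounds, so held-out accuracies would be expected to be much lower than in this toy example.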
The present study had several limitations. First, the two profiling assays used in our study were performed separately, so the differing experimental conditions may have increased the variance in the measured perturbations of each compound. Currently, comprehensive datasets that include a large number of compounds under the relevant cell condition have limited availability. However, data-sharing policies and the preference of biotech companies for unbiased screening methods will likely accelerate large-scale analyses of various profiling assays. Second, the lack of a dominant platform for phenomic data processing can introduce additional variation in a compound’s measured perturbation. Despite these limitations, our analysis of transcriptomic and phenomic datasets can aid in the design of profiling assays to evaluate the MOAs of compounds. In summary, our results demonstrated that the L1000 transcriptomic features were more diverse than the Cell Painting phenomic features. The use of unsupervised and supervised machine learning suggests that these profiling assays can identify new drug pairs based on similarity and predict a distinct set of MOAs.