Bio-primed machine learning to enhance discovery of relevant biomarkers

doi:10.21203/rs.3.rs-5139890/v1

This work is licensed under a CC BY 4.0 License

Precision medicine relies on identifying reliable biomarkers for gene dependencies to tailor individualized therapeutic strategies. The advent of high-throughput technologies presents unprecedented opportunities to explore molecular disease mechanisms but also challenges due to high dimensionality and collinearity among features. Traditional statistical methods often fall short in this context, necessitating novel computational approaches that harness the full potential of big data in bioinformatics. Here, we introduce a novel machine learning approach extending the Least Absolute Shrinkage and Selection Operator (LASSO) regression framework to incorporate biological knowledge, such as protein-protein interaction databases, into the regularization process. This bio-primed approach prioritizes variables that are both statistically significant and biologically relevant. Applying our method to multiple dependency datasets, we identified biomarkers which traditional methods overlooked. Our biologically informed LASSO method effectively identifies relevant biomarkers from high-dimensional collinear data, bridging the gap between statistical rigor and biological insight. This method holds promise for advancing personalized medicine by uncovering novel therapeutic targets and understanding the complex interplay of genetic and molecular factors in disease.

Biological sciences/Cancer

Biological sciences/Computational biology and bioinformatics

Health sciences/Biomarkers

Health sciences/Molecular medicine

Health sciences/Oncology

Machine learning

Precision Medicine

Biomarkers

Gene Dependency

LASSO

Protein-Protein Interaction

In precision medicine, identifying reliable biomarkers for gene dependencies is paramount for tailoring individualized therapeutic strategies (Behan et al. 2019; Chan et al. 2019). The advent of high-throughput technologies has ushered in an era of 'big data' in bioinformatics, offering unprecedented opportunities to explore the molecular underpinnings of disease at a granular level (Wafi and Mirnezami 2018). However, this wealth of data also presents significant challenges, particularly due to its high dimensionality and the collinearity among molecular features (Dunkler, Sánchez-Cabo, and Heinze 2011). Traditional statistical methods often fall short in effectively analyzing such complex datasets (“‘Omics’ Data and Levels of Evidence for Biomarker Discovery” 2009), necessitating the development of novel computational approaches that can harness the full potential of this information while mitigating inherent limitations (Behera et al. 2018; Hédou et al. 2024).

Least Absolute Shrinkage and Selection Operator (LASSO) regression has emerged as a powerful tool for feature selection and regularization in high-dimensional data analysis (Tibshirani 2018; Santosa and Symes 2006). By imposing a penalty on the absolute size of the regression coefficients, LASSO facilitates the identification of a subset of predictive features, thereby enhancing model interpretability and reducing the risk of overfitting. However, while LASSO can effectively handle datasets with numerous collinear variables, it does not inherently account for the underlying biological context of the features it selects. This limitation underscores the need for methodological advancements that integrate domain-specific knowledge into the feature selection process, ensuring that the identified biomarkers are both statistically robust and biologically relevant.

For example, the Cancer Dependency Map (DepMap) is a comprehensive resource that aims to identify and catalog genetic dependencies and vulnerabilities across a wide range of cancer cell lines. The DepMap web portal's Predictability tab provides insights into how a given gene dependency or compound sensitivity profile relates to the baseline genomic and molecular (‘omic’) features of cell lines (Dempster et al. 2020). This portal offers two distinct omics models: the ‘core’ model utilizes the most comprehensive and unbiased sets of available omic features, whereas the ‘related’ model employs a refined set of features associated with genes related to the target gene. Both models have drawbacks: the ‘related’ model cannot discover novel biology, and the ‘core’ model may select features that have slightly stronger statistical evidence yet no relation to the gene of interest.

In this manuscript, we introduce a novel machine learning approach that extends the LASSO regression framework to incorporate biological knowledge, such as protein-protein interaction (PPI) databases, into the regularization process. Inspired by previous work (Li and Jackson 2015; Zuo et al. 2017), our approach leverages existing biological information to guide feature selection, prioritizing variables that are not only statistically significant but also biologically interconnected. By doing so, we aim to extract relevant biomarkers for gene dependencies from high-dimensional collinear molecular data, offering a more nuanced understanding of the molecular drivers of disease. This method holds promise for identifying novel therapeutic targets and advancing personalized medicine, where understanding the complex interplay between genetic and molecular factors is crucial for developing effective, individualized treatments.

By bridging the gap between statistical rigor and biological insight, our method represents a significant step forward in unlocking the full potential of 'big data' in bioinformatics and precision medicine.

LASSO implementation

In cases of high-dimensional data, where the number of predictors greatly exceeds the number of observations, traditional regression techniques often produce poor predictive results due to overfitting, collinearity, and sparsity (Dunkler, Sánchez-Cabo, and Heinze 2011). The LASSO framework is a regularization technique which employs a penalization process to remove uninformative parameters by shrinking their coefficients towards zero (Hastie, Tibshirani, and Friedman 2013). The baseline LASSO model has its hyperparameter lambda ($\lambda$) optimized using ten-fold cross-validation. For the analyses described in this paper, we fixed alpha, the ridge penalty parameter, equal to zero. Methods described for baseline and bio-primed LASSO processes were developed and implemented in R (Wirtschaftsuniversität Wien Department of Statistics and Mathematics 2008) with the aid of the glmnet (Friedman, Hastie, and Tibshirani 2010) package.

Bio-primed regularization

Expanding upon previous work (Zuo et al. 2017; Li and Jackson 2015), we extend the baseline LASSO model by incorporating prior knowledge into the feature selection procedure. The process is tailored to favor features that are biologically relevant, not just statistically associated. Towards this end, we define a feature-specific regularization factor (µ) that represents the importance of the feature to the outcome variable. Features with strong prior evidence will have a µ value close to 1, while those with no evidence will have a µ value of zero. Features with small values for µ will incur a greater penalty to their coefficients. For the analysis presented here, µ values are derived from the PPI score provided by the STRING database.

We also introduce a second tuning parameter called phi (Φ) which accounts for the overall importance of the prior knowledge. The regularization penalty for the $j$-th feature is $\mathrm{RegPen}_j = \Phi(1-\mu_j)$, giving the overall bio-primed LASSO penalty as:

$$\lambda \sum_{j=1}^{p} \mathrm{RegPen}_j \left|\beta_j\right|$$

For the analyses presented here, $\mu_j$ values are defined by the $j$-th PPI score provided by the STRING database. STRING scores are scaled as a proportion of the maximum score to give $\mu_j$, so that each feature's penalty is proportional to $(1-\mu_j)$.
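The penalty definition above can be illustrated with a minimal sketch. It is not the authors' R implementation: the function names, the placeholder gene names, and the toy scores are hypothetical, and it assumes that a feature absent from STRING is given a raw score of 0 and that scaling means dividing by the largest score among the input features.

```python
def bio_primed_penalties(string_scores, phi):
    """Return RegPen_j = phi * (1 - mu_j) for every feature.

    string_scores maps a feature name to its raw STRING score with the
    target gene (assumption: 0 means no prior evidence).
    """
    max_score = max(string_scores.values())
    if max_score <= 0:
        # No prior evidence for any feature: mu_j = 0, so all get phi.
        return {gene: phi for gene in string_scores}
    penalties = {}
    for gene, score in string_scores.items():
        mu = score / max_score            # scaled to [0, 1]
        penalties[gene] = phi * (1.0 - mu)
    return penalties


def bio_primed_penalty_term(lam, penalties, coefs):
    """Evaluate lambda * sum_j RegPen_j * |beta_j| for given coefficients."""
    return lam * sum(penalties[gene] * abs(beta) for gene, beta in coefs.items())


# Toy, made-up scores for three placeholder features.
toy_scores = {"GENE_A": 999, "GENE_B": 700, "GENE_C": 0}
toy_penalties = bio_primed_penalties(toy_scores, phi=0.65)
```

In this toy case the feature with the highest score receives no penalty weight, while the feature without prior evidence receives the full weight Φ.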

Following the standard $\lambda$ optimization procedure, root mean square error (RMSE) was calculated using ten-fold cross-validation to derive the optimal Φ. The optimal Φ is identified by the inflection point of the Φ versus RMSE function. The final bio-primed model is fitted using the inferred hyperparameters $\lambda$ and Φ.
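The cross-validation step can be sketched as follows. This is a simplified stand-in for the glmnet-based procedure, not a reproduction of it: it assumes centered data with no intercept, a squared-error loss scaled by 1/(2n), a plain coordinate descent solver, and deterministic interleaved folds. All function names are hypothetical. The sketch only produces the Φ versus RMSE curve; how the inflection point is located on that curve is not specified in the text and is left out.

```python
import math


def soft_threshold(value, threshold):
    """Soft-thresholding operator used by LASSO coordinate descent."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero column carries no information
            rho = sum(col[i] * (residual[i] + col[i] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam * weights[j]) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new_beta
    return beta


def cv_rmse(X, y, lam, weights, k=10):
    """k-fold cross-validated RMSE using interleaved folds (i % k)."""
    n = len(X)
    squared_error = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        beta = weighted_lasso([X[i] for i in train], [y[i] for i in train], lam, weights)
        for i in test:
            prediction = sum(b * x for b, x in zip(beta, X[i]))
            squared_error += (y[i] - prediction) ** 2
    return math.sqrt(squared_error / n)


def rmse_by_phi(X, y, lam, mu, phi_grid, k=10):
    """Return [(phi, RMSE)] with per-feature weights phi * (1 - mu_j)."""
    curve = []
    for phi in phi_grid:
        weights = [phi * (1.0 - m) for m in mu]
        curve.append((phi, cv_rmse(X, y, lam, weights, k)))
    return curve
```

Here `lam` is held at the value already chosen for the baseline model, mirroring the two-step order described above, and only Φ varies across the grid.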

STRING database

PPI network data was downloaded from the STRING Database (Szklarczyk et al. 2021) website (www.string-db.org) in February 2024. STRING collects and scores evidence from several sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. All interaction evidence that contributes to a given network is benchmarked and scored, and the scores are integrated into a final ‘combined score’. This score is scaled between zero and one and provides an estimate of STRING’s confidence that a proposed association is biologically meaningful given all the contributing evidence. The confidence score for Homo sapiens was integrated into the LASSO regularization. Protein identifiers were mapped to human gene symbols using the biomaRt R package (Durinck et al. 2009).

Dependency data

Chronos (Dempster et al. 2021) and Demeter2 (D2; McFarland et al. 2018) dependency data, as well as molecular data including copy number (CN) variation and RNA expression profiles, were downloaded from the DepMap portal (www.depmap.org) in July 2022 (version 22Q2). Drug sensitivity data was downloaded from the DepMap portal in July 2024 (version 24Q2). Genomic information including chromosome name and location was derived using the biomaRt R package (Durinck et al. 2009).

Statistical analysis

Target genes for the comprehensive CN biomarker analysis were defined using the following criteria: 1) fewer than 10% of the D2 dependency values for a given target gene were missing, 2) D2 dependency values were lower than −0.5 in at least 5% of cells and 3) the absolute skewness value was greater than 0.5. These selection criteria resulted in 453 target dependencies. Co-dependency was statistically evaluated by calculating the Pearson correlation between two dependency scores.
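The three selection criteria and the co-dependency measure can be sketched as below. Several details are assumptions, since the text does not state them: missing values are represented as `None`, the 5% fraction is computed over non-missing cell lines, and skewness is the population moment estimator. The function names are hypothetical.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def is_selective_target(scores, max_missing=0.10, threshold=-0.5,
                        min_dependent=0.05, min_abs_skew=0.5):
    """Apply the three criteria to one gene's dependency scores."""
    observed = [s for s in scores if s is not None]
    if not observed:
        return False
    # Criterion 1: fewer than 10% missing values.
    if 1.0 - len(observed) / len(scores) >= max_missing:
        return False
    # Criterion 2: score below the threshold in at least 5% of cell lines.
    dependent = sum(1 for s in observed if s < threshold) / len(observed)
    if dependent < min_dependent:
        return False
    # Criterion 3: absolute skewness greater than 0.5.
    return abs(skewness(observed)) > min_abs_skew


def pearson(x, y):
    """Pearson correlation between two equally long score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)
```

A gene that is dispensable in most cell lines but strongly required in a small subset passes all three checks, which matches the selective profile the criteria are meant to capture.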

Method overview

To enhance discovery of relevant biomarkers, we extended the commonly used machine learning approach LASSO to incorporate biological information about the features by applying specialized regularization. In a typical dependency biomarker analysis, the dependent variable is the dependency score of the target gene and the independent variable is a genome-wide molecular profile, such as CN variation (Fig. 1A). Sparsity promoting regularization techniques, such as LASSO, are a popular choice for biomarker discovery because these methods aim to identify a small set of highly informative features in high dimensional data such as molecular profiles (Hédou et al. 2024). Therefore, we adapted the following two-step procedure to the LASSO regularization.

In the standard LASSO model, the regularization parameter $\lambda$ is optimized using cross-validation (Fig. 1B). The $\lambda$ parameter defines the amount of shrinkage each input feature receives. Optimization of this parameter concludes the regularization of the traditional LASSO, and we term the resulting model the baseline LASSO model. Upon optimization of $\lambda$, we introduce a novel parameter Φ which represents the magnitude of prior evidence linking each feature to the target gene. This evidence may be derived from PPI databases such as STRING DB but is not limited to this data. The Φ parameter is optimized following an analogous cross-validation procedure. We term the resulting model using the optimized $\lambda$ and Φ parameters the bio-primed LASSO model. For selecting informative biomarkers, the feature coefficients of each model were assessed and interpreted.

Predict MYC dependency using RNA expression biomarkers

We first applied our method to the Chronos dependency data set derived from genome-wide CRISPR knockout experiments for 17,386 genes across 1,048 cancer cell lines (Dempster et al. 2021). We set out to find RNA expression biomarkers to predict the dependency of oncogene c-Myc (MYC). RNA expression data was filtered to 12,182 expressed genes and subsequently z-score normalized. This set of genes was used as input features to discover relevant biomarkers to predict MYC dependency.

Using 10-fold cross validation, a value of 0.65 was inferred for the Φ parameter (Fig S1). A total of 188 features were assigned non-zero coefficients in the bio-primed model and deemed as relevant biomarkers (Fig. 2A). The largest coefficient, which represents the most informative feature, was assigned to RNA expression of the MYC gene itself. Both the baseline and bio-primed models identified MYC RNA expression as the major predictor, consistent with the paradigm of oncogene addiction (Weinstein 2002). Additionally, significant correlation was observed between the coefficients derived from the two models for the remaining predictors (Fig. 2B). We next calculated the correlation between each input feature and the target dependency. Overlaying this information on top of the coefficients derived from the two models revealed that predictors with positive and negative LASSO coefficients also showed positive or negative correlation, respectively (Fig. 2C). As expected for an oncogene, MYC RNA expression levels were negatively correlated with MYC dependency. Cell lines with elevated RNA expression of MYC were more dependent on MYC (Fig. 2D).

Of note, a fraction of RNA biomarkers including STAT5A and NCBP2 received non-zero coefficients exclusively in the bio-primed and not the baseline model (Fig. 2C). STAT5A, a member of the Signal Transducer and Activator of Transcription (STAT) family, has previously been identified as a potent inducer of MYC (Villarino et al. 2022; Preston et al. 2015; Lord et al. 2000). As an inducer of MYC, the oncogene addiction model suggests that elevated STAT5 should be a biomarker in MYC-driven cancers. Correspondingly, we observed increased MYC dependency in cell lines with high expression of STAT5 RNA (Fig. 2D).

NCBP2 (also known as Nuclear Cap-Binding Protein Subunit 2) is a component of the cap binding complex and is required for the recruitment of splicing machinery to nascent mRNA (Görnemann et al. 2005; Mazza et al. 2001). We, and others, have previously shown that MYC driven cancers are vulnerable to perturbation of the spliceosome (Hsu et al. 2015; Koh et al. 2015). We observed that cell lines with low levels of NCBP2 RNA were more dependent on MYC (Fig. 2D). The observed correlation between the RNA levels of STAT5A and NCBP2 with MYC dependency implies these genes as relevant biomarkers of MYC dependency.

To assess the computational robustness of our method, we performed a second independent run using the same input features and MYC dependency as the outcome variable. The bio-primed model’s coefficients derived from these two independent runs showed strong correlation, demonstrating high reproducibility across runs (Fig S2).

To further assess the robustness of our model to noise in the provided biological network information, we manually set the evidence score for MYC to 0, the minimum evidence score value. In this way, the model will not be encouraged to favor MYC RNA expression as one of the biomarkers. Nonetheless, our model assigned a large coefficient to MYC RNA expression, demonstrating that our model facilitates discovery of novel associations without prior supporting biological data and is robust to incomplete network annotations (Fig S3).

Predict EGFR dependency using copy number biomarkers

We next applied our method to the Demeter2 (D2) dependency data set. This gene dependency data was derived from genome-wide short hairpin RNA screen experiments for 17,309 genes across 707 cancer cell lines (Tsherniak et al. 2017). As a second use case, we set out to discover CN biomarkers predicting EGFR dependency as measured using the D2 score. Linkage disequilibrium (LD) in CN profiles makes it particularly difficult to extract relevant biomarkers since many genes will carry comparable statistical evidence of association.

We first calculated the correlation coefficient between EGFR dependency and the CN estimate of each gene. Visualization of this genome-wide correlation profile revealed a strong negative correlation between EGFR CN and EGFR dependency (Fig. 3A). As expected from an oncogene, amplification of EGFR CN conferred EGFR dependence on the cells (Fig S4).

Interestingly, we observed a second peak with moderate negative correlation on chromosome 11 (highlighted in a purple box). Focusing on this locus revealed many genes with strong negative correlation between CN and EGFR dependency (Fig. 3B). The strong LD structure makes it difficult to select a specific biomarker from this region based on correlation coefficients alone. The baseline model assigned a single non-zero coefficient to the USP35 gene in this region. To the best of our knowledge, there exists no reported connection between USP35 and EGFR. STRING also did not assign an association score for these two genes. We believe that the reason the baseline model picked the USP35 gene is due to a spurious association with the underlying driver gene. The bio-primed model, on the other hand, identified GAB2 CN as the most informative biomarker in this region based on the magnitude of the LASSO coefficient (Fig. 3B). GAB2 (GRB2-associated binding protein 2) is an adaptor protein that plays a critical role in transmitting signals from receptor tyrosine kinases, such as EGFR, to downstream pathways involved in cell proliferation, survival, and migration (Adams, Aydin, and Celebi 2012). Amplification of GAB2 can lead to increased activation of the PI3K/AKT and MAPK pathways, both of which are downstream effectors of EGFR signaling (Gu et al. 1998), supporting the idea that GAB2 amplification may potentiate oncogenic processes driven by EGFR and increase the dependency of cancer cells on EGFR activity. Indeed, stratifying the cell lines by EGFR and GAB2 CN gain revealed that simultaneous gain of EGFR and GAB2 CN significantly increased dependency on EGFR, linking the GAB2 CN to EGFR dependency (Fig. 3C).

These data suggest that patients with GAB2 amplification may be more sensitive to drugs targeting EGFR. To explore this hypothesis, we analyzed existing drug sensitivity data provided by the DepMap resource. We correlated drug sensitivity profiles for 545 drugs from the Cancer Target Discovery and Development Network with GAB2 CN profiles. Several of the most strongly associated drug sensitivities were indeed EGFR inhibitors (Fig. 3D). For example, cell lines with GAB2 amplification but CN neutral EGFR showed increased sensitivity to EGFR inhibitor Afatinib (Fig. 3E).

Biologically informed biomarkers show stronger co-dependency

To systematically evaluate our method, we used the D2 dependency data. We first identified a total of 453 highly selective genes that showed strong dependency in a subset of cell lines. Given their selective dependency profile, these genes represent promising cancer drug targets. We hypothesized that the selective dependency profile may be driven by CN aberrations of the gene itself or of biologically relevant genes. Therefore, each of these gene dependencies was subjected to CN biomarker analysis using the baseline and bio-primed model approaches, with the goal of identifying the underlying genomic aberrations driving the dependencies. Out of 453 target genes, 432 identified at least one predictive CN biomarker in either approach (Table S1).

Next, we set out to compare the biomarkers derived from the baseline and bio-primed models. For each gene and to ensure discriminative power, two sets of mutually exclusive biomarkers were defined: 1) The top 20 biomarkers with a positive coefficient derived from the bio-primed model and not identified using the baseline model. 2) The top 20 biomarkers with a positive coefficient derived from the baseline model and not identified using the bio-primed model. For each of these gene sets, we calculated the co-dependency between each target and the corresponding biomarkers using Pearson correlation.
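The construction of the two mutually exclusive biomarker sets can be sketched as follows. The sketch assumes that "identified" means a non-zero coefficient in the other model and that "top" means ranked by coefficient size. The function name, the placeholder feature names, and the toy coefficients are hypothetical.

```python
def exclusive_top_biomarkers(primary_coefs, other_coefs, top_n=20):
    """Top positive-coefficient features of one model that the other
    model did not select (zero or absent coefficient)."""
    selected_by_other = {g for g, c in other_coefs.items() if c != 0.0}
    candidates = [
        (gene, coef)
        for gene, coef in primary_coefs.items()
        if coef > 0.0 and gene not in selected_by_other
    ]
    # Largest coefficients first, then keep at most top_n features.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in candidates[:top_n]]


# Toy coefficients for placeholder features.
bio_primed = {"F1": 0.5, "F2": 0.3, "F3": -0.2, "F4": 0.1}
baseline = {"F1": 0.4, "F2": 0.0, "F5": 0.2}

bio_primed_only = exclusive_top_biomarkers(bio_primed, baseline)
baseline_only = exclusive_top_biomarkers(baseline, bio_primed)
```

Calling the same function with the two models swapped yields both sets, so a feature selected by both models appears in neither.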

For example, UTP4, a key component of the processome, a large ribonucleoprotein complex involved in the early steps of ribosome biogenesis (Freed et al. 2012), showed multiple peaks of correlations between CN and dependency (Fig. 4A). The bio-primed model identified UTP4 in the peak with the strongest genome-wide association on chromosome 16. This result was consistent with the so-called CYCLOPS model, which posits that partial loss of CN yields cancer specific liabilities (Paolella et al. 2017). The baseline approach failed to identify UTP4 and instead selected unrelated biomarkers near the UTP4 locus. To the best of our knowledge, none of these markers have previously been linked to UTP4 biology.

To assess co-dependency, we next calculated the correlation between UTP4 dependency and dependency of CN biomarkers with a positive coefficient derived from the baseline or bio-primed models. Biomarkers from the bio-primed approach showed significantly greater co-dependency compared to biomarkers derived from the baseline approach (Wilcoxon test, p < 0.01, Fig. 4B). None of the baseline biomarkers showed significant co-dependency with UTP4. The bio-primed model identified DDX10 and BRIX1 as relevant CN biomarkers within genome-wide correlation peaks. These two biomarkers were exclusively identified using the bio-primed approach and showed significant correlation between UTP4 dependency and CN as well as co-dependency with UTP4 (Fig. 4C). Of note, DDX10 is a DEAD-box RNA helicase that also plays a role in ribosome biogenesis by participating in the processing of pre-rRNA (Wild et al. 2010) and BRIX1 is also known to be involved in ribosome biogenesis (Eisenhaber, Wechselberger, and Kreil 2001), demonstrating that the bio-primed model selected biomarkers that are directly relevant to the biological function of the target dependency.

We generalized this approach across the remaining 431 genes. This comprehensive analysis demonstrated that biomarkers derived from the bio-primed model showed significantly stronger co-dependency with the target compared to biomarkers derived from the baseline model (Fig. 4D). Thus, the biomarkers identified by the bio-primed approach are more relevant to the target biology compared to the baseline approach. A full list of results may be found in Table S1.

The primary aim of this study was to develop and validate a novel machine learning approach that integrates biological knowledge into the LASSO regression framework, enhancing the identification of biomarkers for gene dependencies in high-dimensional molecular data. By incorporating PPI data into the regularization process, our bio-primed LASSO model addresses the limitations of traditional statistical methods, which often struggle with the high dimensionality and collinearity inherent in 'omics' datasets. Through this biologically informed approach, we sought to prioritize features that are not only statistically significant but also biologically relevant, ultimately facilitating the discovery of novel therapeutic targets and advancing the field of precision medicine.

In our analysis aimed at predicting MYC dependency, RNA levels of STAT5A and NCBP2 emerged as significant predictors. Both STAT5A and NCBP2 had previously been linked to MYC biology and our bio-primed model identified these two genes as relevant biomarkers of MYC dependency while the baseline model failed to do so.

In our second analysis, we found that CN gains of GAB2 significantly enhance EGFR dependency, suggesting a potential synergistic relationship between GAB2 amplification and EGFR signaling in cancer. Of note, we checked the “Predictability” tab on the DepMap website for EGFR dependency. Neither GAB2 CN nor any other GAB2 molecular profile was included in the top predictive features, indicating that our approach discovered a biomarker that was missed by other approaches. Our findings underscore the importance of GAB2 as a modulator of EGFR dependency. Importantly, we observed increased sensitivity to EGFR inhibitors in cell lines with GAB2 amplification even in an EGFR neutral background, suggesting that patients with GAB2 amplification may benefit from EGFR inhibitors.

Our comprehensive analysis demonstrated that the bio-primed model selected biomarkers that are directly relevant to the biological function of the target dependency. We exemplified this by studying CN biomarkers for UTP4. Biologically relevant genes not only represent robust biomarkers but could also be leveraged to find synthetic lethal interactions (O’Neil, Bailey, and Hieter 2017). Synthetic lethality refers to a situation where the simultaneous occurrence of aberrations in two or more genes leads to cell death, whereas an aberration in just one of these genes does not affect cell viability. This concept is particularly important in cancer research, as it offers a strategy to selectively kill cancer cells by targeting a gene that is synthetically lethal with a genomic aberration specific to the cancer. In our study, the biologically relevant genes identified by our bio-primed LASSO model could serve as candidates for synthetic lethal partners, opening avenues for novel cancer treatments.

One limitation of our method is its reliance on pre-existing biological databases, such as the STRING PPI network, for bio-primed regularization. While the incorporation of biological knowledge is a strength, it is dependent on the completeness and accuracy of these external data sources. Any gaps, biases, or errors in these databases may influence feature selection. Further, our method focuses on interaction partners, which might limit the discovery of novel dependencies that arise from more complex, multi-layered biological pathways not captured by current interaction databases. Moreover, the use of STRING confidence scores may oversimplify complex biological relationships, potentially missing subtle yet important interactions.

Another important aspect to consider is the computational efficiency of our implementation. The native glmnet package uses C++ for the optimization of the $\lambda$ parameter, which is highly efficient for large-scale data processing. In contrast, our R-based implementation optimizes for the Φ parameter directly in R. Future work may involve translating the Φ optimization process into a more efficient language to improve scalability and execution speed, particularly when analyzing large molecular datasets or when running iterative cross-validation.

Our proposed method represents a generalizable approach that can be applied to various settings beyond the specific examples demonstrated in this study. While we have highlighted the use of PPI networks to inform the regularization process, the framework is flexible and can incorporate different types of biological associations depending on the context and available data. For instance, regulatory networks, gene co-expression networks, or epigenetic modification maps could be integrated to guide feature selection in a manner that reflects the underlying biological processes relevant to the research question. This adaptability allows our method to be tailored to diverse applications, whether it be identifying biomarkers for drug sensitivity, predicting gene dependencies, or understanding complex disease mechanisms. By leveraging relevant biological knowledge, our approach enhances the interpretability and relevance of the selected features, thereby improving the robustness and applicability of the findings in various domains of biomedical research.

Author Contribution

Conceptualization: L.M.S. and A.R. Experiment implementation: D.H. and L.M.S. Result investigation: J.R.Z., J.K.M., N.J.N., D.H., E.B., K.K., and L.M.S. Funding acquisition: T.F.W. Supervision: L.M.S. and T.F.W. Writing – original draft: L.M.S. and D.H. Writing – review & editing: A.R., J.R.Z., J.K.M., N.J.N., E.B., K.K., T.F.W. and L.M.S. All authors reviewed the manuscript.

Data Availability

Our method, including data sets analyzed in this study, is freely accessible via GitHub (https://github.com/dmhenke/BioPrimeLASSO).

Adams, Sarah J., Iraz T. Aydin, and Julide T. Celebi. 2012. “GAB2—a Scaffolding Protein in Cancer.” Molecular Cancer Research: MCR 10 (10): 1265–70.
Behan, Fiona M., Francesco Iorio, Gabriele Picco, Emanuel Gonçalves, Charlotte M. Beaver, Giorgia Migliardi, Rita Santos, et al. 2019. “Prioritization of Cancer Therapeutic Targets Using CRISPR-Cas9 Screens.” Nature 568 (7753): 511–16.
Behera, Himansu Sekhar, Janmenjoy Nayak, Bighnaraj Naik, and Ajith Abraham. 2018. Computational Intelligence in Data Mining: Proceedings of the International Conference on CIDM 2017. Springer.
Chan, Edmond M., Tsukasa Shibue, James M. McFarland, Benjamin Gaeta, Mahmoud Ghandi, Nancy Dumont, Alfredo Gonzalez, et al. 2019. “WRN Helicase Is a Synthetic Lethal Target in Microsatellite Unstable Cancers.” Nature 568 (7753): 551–56.
Dempster, Joshua M., Isabella Boyle, Francisca Vazquez, David E. Root, Jesse S. Boehm, William C. Hahn, Aviad Tsherniak, and James M. McFarland. 2021. “Chronos: A Cell Population Dynamics Model of CRISPR Experiments That Improves Inference of Gene Fitness Effects.” Genome Biology 22 (1): 1–23.
Dempster, Joshua M., John M. Krill-Burger, James M. McFarland, Allison Warren, Jesse S. Boehm, Francisca Vazquez, William C. Hahn, Todd R. Golub, and Aviad Tsherniak. 2020. “Gene Expression Has More Power for Predicting in Vitro Cancer Cell Vulnerabilities than Genomics.” bioRxiv. https://doi.org/10.1101/2020.02.21.959627.
Dunkler, Daniela, Fátima Sánchez-Cabo, and Georg Heinze. 2011. “Statistical Analysis Principles for Omics Data.” Methods in Molecular Biology 719:113–31.
Durinck, Steffen, Paul T. Spellman, Ewan Birney, and Wolfgang Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor Package biomaRt.” Nature Protocols 4 (8): 1184–91.
Eisenhaber, F., C. Wechselberger, and G. Kreil. 2001. “The Brix Domain Protein Family -- a Key to the Ribosomal Biogenesis Pathway?” Trends in Biochemical Sciences 26 (6): 345–47.
Freed, Emily F., José-Luis Prieto, Kathleen L. McCann, Brian McStay, and Susan J. Baserga. 2012. “NOL11, Implicated in the Pathogenesis of North American Indian Childhood Cirrhosis, Is Required for Pre-rRNA Transcription and Processing.” PLoS Genetics 8 (8): e1002892.
Friedman, Jerome H., Trevor Hastie, and Rob Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (February):1–22.
Görnemann, Janina, Kimberly M. Kotovic, Katja Hujer, and Karla M. Neugebauer. 2005. “Cotranscriptional Spliceosome Assembly Occurs in a Stepwise Fashion and Requires the Cap Binding Complex.” Molecular Cell 19 (1): 53–63.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2013. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.
Hédou, Julien, Ivana Marić, Grégoire Bellan, Jakob Einhaus, Dyani K. Gaudillière, Francois-Xavier Ladant, Franck Verdonk, et al. 2024. “Discovery of Sparse, Reliable Omic Biomarkers with Stabl.” Nature Biotechnology, January. https://doi.org/10.1038/s41587-023-02033-x.
Hsu, Tiffany Y-T, Lukas M. Simon, Nicholas J. Neill, Richard Marcotte, Azin Sayad, Christopher S. Bland, Gloria V. Echeverria, et al. 2015. “The Spliceosome Is a Therapeutic Vulnerability in MYC-Driven Cancer.” Nature 525 (7569): 384–88.
Koh, Cheryl M., Marco Bezzi, Diana H. P. Low, Wei Xia Ang, Shun Xie Teo, Florence P. H. Gay, Muthafar Al-Haddawi, et al. 2015. “MYC Regulates the Core Pre-mRNA Splicing Machinery as an Essential Step in Lymphomagenesis.” Nature 523 (7558): 96–100.
Li, Yupeng, and Scott A. Jackson. 2015. “Gene Network Reconstruction by Integration of Prior Biological Knowledge.” G3 5 (6): 1075–79.
Lord, J. D., B. C. McIntosh, P. D. Greenberg, and B. H. Nelson. 2000. “The IL-2 Receptor Promotes Lymphocyte Proliferation and Induction of the c-Myc, Bcl-2, and Bcl-X Genes through the Trans-Activation Domain of Stat5.” Journal of Immunology 164 (5): 2533–41.
Mazza, Catherine, Mutsuhito Ohno, Alexandra Segref, Iain W. Mattaj, and Stephen Cusack. 2001. “Crystal Structure of the Human Nuclear Cap Binding Complex.” Molecular Cell 8 (2): 383–96.
McFarland, James M., Zandra V. Ho, Guillaume Kugener, Joshua M. Dempster, Phillip G. Montgomery, Jordan G. Bryan, John M. Krill-Burger, et al. 2018. “Improved Estimation of Cancer Dependencies from Large-Scale RNAi Screens Using Model-Based Normalization and Data Integration.” Nature Communications 9 (1): 1–13.
“‘Omics’ Data and Levels of Evidence for Biomarker Discovery.” 2009. Genomics 93 (1): 13–16.
O’Neil, Nigel J., Melanie L. Bailey, and Philip Hieter. 2017. “Synthetic Lethality and Cancer.” Nature Reviews. Genetics 18 (10): 613–23.
Paolella, Brenton R., William J. Gibson, Laura M. Urbanski, John A. Alberta, Travis I. Zack, Pratiti Bandopadhayay, Caitlin A. Nichols, et al. 2017. “Copy-Number and Gene Dependency Analysis Reveals Partial Copy Loss of Wild-Type SF3B1 as a Novel Cancer Vulnerability.” eLife 6 (February). https://doi.org/10.7554/eLife.23268.
Preston, Gavin C., Linda V. Sinclair, Aneesa Kaskar, Jens L. Hukelmann, Maria N. Navarro, Isabel Ferrero, H. Robson MacDonald, Victoria H. Cowling, and Doreen A. Cantrell. 2015. “Single Cell Tuning of Myc Expression by Antigen Receptor Signal Strength and Interleukin-2 in T Lymphocytes.” The EMBO Journal 34 (15): 2008–24.
Santosa, Fadil, and William W. Symes. 2006. “Linear Inversion of Band-Limited Reflection Seismograms.” SIAM Journal on Scientific and Statistical Computing, July. https://doi.org/10.1137/0907087.
Szklarczyk, Damian, Annika L. Gable, Katerina C. Nastou, David Lyon, Rebecca Kirsch, Sampo Pyysalo, Nadezhda T. Doncheva, et al. 2021. “The STRING Database in 2021: Customizable Protein-Protein Networks, and Functional Characterization of User-Uploaded Gene/measurement Sets.” Nucleic Acids Research 49 (D1): D605–12.
Tibshirani, Robert. 2018. “Regression Shrinkage and Selection Via the Lasso.” Journal of the Royal Statistical Society. Series B, Statistical Methodology 58 (1): 267–88.
Tsherniak, Aviad, Francisca Vazquez, Phil G. Montgomery, Barbara A. Weir, Gregory Kryukov, Glenn S. Cowley, Stanley Gill, et al. 2017. “Defining a Cancer Dependency Map.” Cell 170 (3): 564–76.e16.
Villarino, Alejandro V., Arian Dj Laurence, Fred P. Davis, Luis Nivelo, Stephen R. Brooks, Hong-Wei Sun, Kan Jiang, et al. 2022. “A Central Role for STAT5 in the Transcriptional Programing of T Helper Cell Metabolism.” Science Immunology 7 (77): eabl9467.
Wafi, Arsalan, and Reza Mirnezami. 2018. “Translational -Omics: Future Potential and Current Challenges in Precision Medicine.” Methods 151 (December):3–11.
Weinstein, I. Bernard. 2002. “Cancer. Addiction to Oncogenes--the Achilles Heal of Cancer.” Science 297 (5578): 63–64.
Wild, Thomas, Peter Horvath, Emanuel Wyler, Barbara Widmann, Lukas Badertscher, Ivo Zemp, Karol Kozak, Gabor Csucs, Elsebet Lund, and Ulrike Kutay. 2010. “A Protein Inventory of Human Ribosome Biogenesis Reveals an Essential Function of Exportin 5 in 60S Subunit Export.” PLoS Biology 8 (10): e1000522.
Wirtschaftsuniversität Wien Department of Statistics and Mathematics. 2008. The R Project for Statistical Computing.
Zuo, Yiming, Yi Cui, Guoqiang Yu, Ruijiang Li, and Habtom W. Ressom. 2017. “Incorporating Prior Biological Knowledge for Network-Based Differential Gene Expression Analysis Using Differentially Weighted Graphical LASSO.” BMC Bioinformatics 18 (1): 99.

No competing interests reported.

Figure S1. Scatter plot shows the median standardized root mean squared error (y-axis) across a range of Φ values (x-axis). Vertical dashed line represents the inflection point, identified as the optimal Φ.
Figure S2. Scatter plot shows the coefficients derived from two independent runs of the bio-primed model.
Figure S3. Scatter plot shows the coefficients derived from the original model (x-axis) and a model where the STRING confidence score for MYC was set to 0 (y-axis).
Figure S4. Scatter plot with associated Pearson correlation coefficient and p-value shows EGFR CN (x-axis) and EGFR dependency (y-axis).
tables1d2cnvanalysis.csv

Reviewers agreed at journal
21 Nov, 2024
Reviews received at journal
21 Nov, 2024
Reviews received at journal
19 Nov, 2024
Reviewers agreed at journal
28 Oct, 2024
Reviewers invited by journal
26 Oct, 2024
Editor assigned by journal
29 Sep, 2024
Submission checks completed at journal
27 Sep, 2024
First submitted to journal
23 Sep, 2024
