Evaluation of large language models for discovery of gene set function

doi:10.21203/rs.3.rs-3270331/v1

Analysis

https://doi.org/10.21203/rs.3.rs-3270331/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 28 Nov, 2024

Read the published version in Nature Methods →

Version 1

Gene set analysis is a mainstay of functional genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of biological context. Here we evaluate the ability of OpenAI’s GPT-4, a large language model (LLM), to develop hypotheses about common gene functions from its embedded biomedical knowledge. We created a GPT-4 pipeline to label gene sets with names that summarize their consensus functions, substantiated by analysis text and citations. Benchmarking against named gene sets in the Gene Ontology, GPT-4 generated very similar names in 50% of cases, while in most remaining cases it recovered the name of a more general concept. In gene sets discovered in ’omics data, GPT-4 names were more informative than gene set enrichment, with supporting statements and citations that largely verified in human review. The ability to rapidly synthesize common gene functions positions LLMs as valuable functional genomics assistants.

Biological sciences/Genetics/Functional genomics/Gene expression profiling

Biological sciences/Molecular biology/Proteomics

Biological sciences/Molecular biology/Transcriptomics

Biological sciences/Computational biology and bioinformatics/Gene ontology

Competing interests: There is a potential competing interest. TI is a co-founder of, a member of the advisory board of, and has an equity interest in Data4Cure and Serinus Biosciences. TI is a consultant for and has an equity interest in Ideaya BioSciences and Light Horse Therapeutics. The terms of these arrangements have been reviewed and approved by UC San Diego in accordance with its conflict of interest policies.
