Direct mapping of Peptide-to-Spectra-Matches to genome information facilitates qualifying proteomics information

doi:10.21203/rs.3.rs-199254/v1

Research Article

https://doi.org/10.21203/rs.3.rs-199254/v1

This work is licensed under a CC BY 4.0 License

You are reading this latest preprint version

Background: Small proteins have received increasing attention in recent years. In particular, they have been implicated as signals that contribute to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins, because annotation pipelines often exclude short open reading frames (ORFs) or over-predict hypothetical proteins based on simple models. The validation of novel proteins, and in particular of small proteins (sProteins), therefore requires additional evidence. Proteogenomics is considered the gold standard for this purpose. It extends beyond established annotations and includes all possible ORFs as potential sources of peptides, thus allowing the discovery of novel, unannotated proteins. Typically, this yields large numbers of putative novel small proteins, a large fraction of which are false-positive predictions.

Results: We observe that the number and quality of the Peptide-to-Spectra-Matches (PSMs) that map to a candidate ORF can be highly informative for distinguishing proteins from spurious ORF annotations. We report here on a workflow that aggregates PSM quality information and local context into simple descriptors and reliably separates likely proteins from the large pool of false-positive, i.e., most likely untranslated, ORFs. We investigated the artificial gut microbiome model SIHUMIx, comprising eight different species, for which we validate 5114 proteins that had previously been annotated only as hypothetical ORFs. In addition, we identified 37 non-annotated protein candidates for which we found evidence at both the proteomic and the transcriptomic level. About half (19) of these candidates have close functional homologs in other species. Another 12 candidates have homologs designated as hypothetical proteins in other species. The remaining six candidates are short (< 100 AA) and are most likely bona fide novel proteins.

Conclusions: The aggregation of PSM quality information for predicted ORFs provides a robust and efficient method to identify novel proteins in proteomics data. In particular, the workflow is also capable of identifying small proteins and frameshift variants. Since PSMs are explicitly mapped to genomic locations, it furthermore facilitates integration with transcriptomics data and other sources of genome-level information.
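
The two ideas named in the abstract, mapping each PSM to a genomic location and aggregating PSM information per candidate ORF, can be illustrated with a minimal sketch. This is not the authors' workflow: the `Orf` and `Psm` records, the `score` field (assumed here to be "higher is better"), and the four descriptors are hypothetical choices made for illustration only. The sketch also assumes unspliced bacterial ORFs with 1-based inclusive coordinates and peptide offsets counted from the first translated codon.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Orf:
    orf_id: str
    start: int   # leftmost genomic position, 1-based inclusive
    end: int     # rightmost genomic position, 1-based inclusive
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class Psm:
    orf_id: str
    aa_offset: int  # 0-based index of the peptide's first residue in the ORF translation
    aa_length: int  # peptide length in amino acids
    score: float    # hypothetical quality score, higher is better


def peptide_to_genome(orf, aa_offset, aa_length):
    """Return the (start, end) genomic interval covered by a peptide.

    Assumes an unspliced ORF, so each residue spans exactly three bases.
    """
    nt_offset = 3 * aa_offset
    nt_length = 3 * aa_length
    if orf.strand == "+":
        start = orf.start + nt_offset
        return start, start + nt_length - 1
    # Reverse strand: translation runs from orf.end towards orf.start.
    end = orf.end - nt_offset
    return end - nt_length + 1, end


def summarize_orfs(orfs, psms):
    """Aggregate PSMs into simple per-ORF descriptors (illustrative only)."""
    hits_by_orf = defaultdict(list)
    for psm in psms:
        hits_by_orf[psm.orf_id].append(psm)

    summary = {}
    for orf_id, hits in hits_by_orf.items():
        orf = orfs[orf_id]
        # Distinct genomic intervals stand in for distinct peptides.
        intervals = {peptide_to_genome(orf, p.aa_offset, p.aa_length) for p in hits}
        scores = [p.score for p in hits]
        summary[orf_id] = {
            "n_psms": len(hits),
            "n_distinct_intervals": len(intervals),
            "mean_score": mean(scores),
            "best_score": max(scores),
        }
    return summary


# Toy data, invented for this example.
orfs = {
    "orf1": Orf("orf1", 101, 400, "+"),
    "orf2": Orf("orf2", 101, 400, "-"),
}
psms = [
    Psm("orf1", 2, 5, 0.9),
    Psm("orf1", 2, 5, 0.7),
    Psm("orf1", 20, 8, 0.8),
    Psm("orf2", 2, 5, 0.6),
]
summary = summarize_orfs(orfs, psms)
```

Because every PSM ends up as a genomic interval, the same coordinates could in principle be intersected with transcriptomic coverage or other genome-level tracks; how the actual workflow weighs such evidence is described in the full manuscript, not here.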

Bioinformatics

small Proteins

metaproteogenomics

Peptide-to-Spectra-Matches

microbial communities

No competing interests reported.

protmapsupplement.pdf

Editorial decision: Major revision
09 Mar, 2021
Reviews received at journal
05 Mar, 2021
Reviewers agreed at journal
25 Feb, 2021
Reviews received at journal
24 Feb, 2021
Reviewers agreed at journal
17 Feb, 2021
Reviewers invited by journal
17 Feb, 2021
Editor assigned by journal
17 Feb, 2021
Editor invited by journal
17 Feb, 2021
Submission checks completed at journal
16 Feb, 2021
First submitted to journal
02 Feb, 2021

Status:

Version 1
