quantms: A cloud-based pipeline for proteomics reanalysis enables the quantification of 17521 proteins in 9,502 human samples.

doi:10.21203/rs.3.rs-3002027/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Public proteomics data is rapidly increasing, creating a computational challenge for large-scale reanalysis. Here, we introduce quantms, an open-source cloud-based pipeline for massively parallel proteomics data analysis. We used quantms to reanalyze 56 of the largest datasets, comprising 26801 instrument files from 9502 human samples, to quantify 17521 proteins based on 1.02 million unique peptides. Because quantms is based on standard file formats, it improves the reproducibility and deposition of the data to ProteomeXchange.

Bioinformatics

Mass Spectrometry

Computational Biology

proteomics

In recent years, the field of proteomics has seen unprecedented growth in publicly available datasets, with a trend towards studies that analyze a larger number of samples. As of April 2023, the number of public datasets stored in the PRIDE database exceeded 22,000, including a remarkable increase in large datasets containing more than 100 instrument files, from 100 submissions in 2014 to 600 in 2022. In parallel, a range of transformative improvements in proteomic data processing software has been introduced, enabling a deeper and more precise look into the proteome. Reprocessing old data with the new tools therefore yields additional biological and biomedical insights ¹,². However, the growth in dataset size presents a significant computational bottleneck, making it challenging to re-analyze large experiments on conventional workstations. The automated analysis of publicly accessible quantitative proteomics data is further limited by the lack of metadata for the phenotypes, the samples, and the instrument operation. Although some of these challenges were tackled in earlier works ³⁻⁵, research groups still cannot perform automated large-scale quantitative analysis in the cloud and on distributed architectures. To address this challenge, the field requires innovative and scalable bioinformatics solutions that leverage sample metadata to automatically quantify peptides and proteins, perform absolute or differential expression analysis, and provide extensive quality control output.

Here we introduce quantms (https://github.com/bigbio/quantms), the first open-source cloud-based pipeline for massively parallel proteomic data analysis. It supports three major types of experiments - data-dependent acquisition label-free (DDA-LFQ), isobaric tandem mass tag-based (DDA-plex), and data-independent acquisition (DIA-LFQ) - and is highly flexible and modular, to accommodate the diversity of quantitative proteomics approaches. To enable traceable and reproducible analysis, quantms is entirely based on standardized open file formats and reproducible execution environments, adhering strictly to FAIR principles ⁶. The pipeline is fully documented following nf-core guidelines ⁷, making it a valuable resource in the field of proteomics.

A quantms analysis starts with the instrument files in the standard mass spectrometry format (mzML) and the protein sequence database (Fig. 1A). The workflow uses the Sample and Data Relationship Format (SDRF) ⁸ to ensure that workflow modules are executed with all relevant internal parameters, including the sample variables under study and mass spectrometry-related parameters. The quantms pipeline branches into three sub-workflows for DDA-LFQ, DDA-plex (Fig. 1B), and DIA experiments (Fig. 1C). Unlike conventional desktop tools such as MaxQuant or Proteome Discoverer, quantms automatically distributes computation using the Nextflow workflow engine ⁹ on one or more computers, depending on the number of instrument files and samples. To parallelize the steps that can be performed independently, the workflow streams each instrument file, as annotated in the tab-delimited SDRF file, to individual nodes of the computing infrastructure, such as a cloud or high-performance computing (HPC) cluster. In the final step, quantms aggregates the processed data to globally infer proteins, estimate protein false-discovery rates, and quantify proteins. All sub-workflows export the final results into the mzTab standard format, facilitating the submission of the results to ProteomeXchange (Fig. 1D). All these analyses run automatically, fully reproducibly, and without manual intervention. quantms is integrated with MSstats ¹⁰ and a new Python tool (pmultiqc), enabling differential expression analysis and the generation of quality control reports (Fig. 1E).
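The routing of instrument files described above can be illustrated with a minimal Python sketch. It is not the quantms implementation: the toy SDRF content, the simplified column values, and the helper names `route_row` and `plan_tasks` are assumptions made for illustration. It only shows how per-file tasks for each sub-workflow could be derived from SDRF annotations.

```python
import csv
import io
from collections import defaultdict

# Toy SDRF content (assumption): real SDRF files carry many more columns
# and use richer, ontology-backed cell values.
SDRF_TEXT = (
    "source name\tcomment[label]\t"
    "comment[proteomics data acquisition method]\tcomment[data file]\n"
    "sample 1\tlabel free sample\tData-Dependent Acquisition\trun01.mzML\n"
    "sample 2\tlabel free sample\tData-Dependent Acquisition\trun02.mzML\n"
    "sample 3\tTMT126\tData-Dependent Acquisition\trun03.mzML\n"
    "sample 4\tTMT127N\tData-Dependent Acquisition\trun03.mzML\n"
    "sample 5\tlabel free sample\tData-Independent Acquisition\trun04.mzML\n"
)


def route_row(row):
    """Pick a sub-workflow for one SDRF row (hypothetical rule set)."""
    acquisition = row["comment[proteomics data acquisition method]"].lower()
    label = row["comment[label]"].lower()
    if "independent" in acquisition:
        return "DIA-LFQ"
    if label.startswith("label free"):
        return "DDA-LFQ"
    return "DDA-plex"


def plan_tasks(sdrf_text):
    """Group instrument files by sub-workflow; each file is one parallel task."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    tasks = defaultdict(set)
    for row in reader:
        tasks[route_row(row)].add(row["comment[data file]"])
    return {branch: sorted(files) for branch, files in tasks.items()}


tasks = plan_tasks(SDRF_TEXT)
for branch, files in sorted(tasks.items()):
    print(branch, files)
```

In this sketch a multiplexed file (run03.mzML) appears in several SDRF rows, one per channel, but yields a single task, because the unit of parallel work is the instrument file and not the sample.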

We extensively benchmarked quantms in comparison to MaxQuant on DDA-LFQ (Supplementary Note 1, 2) ¹¹ and DDA-plex datasets (Supplementary Note 3). MaxQuant, the most popular tool for intensity-based quantitation, has previously been used for public data reanalysis by ProteomicsDB ¹², MassIVE.quant ⁴ and ExpressionAtlas ¹³. In summary, quantms quantifies a higher number of proteins than MaxQuant for all datasets with the same accuracy (lower coefficients of variation); however, for low concentrations, quantms underestimated the true fold changes (Supplementary Note 2, Fig. 3). In terms of scalability and performance, major differences are observed between MaxQuant and quantms. When the number of instrument files and samples grows (over 1000 MS runs), quantms can perform 40 times faster than MaxQuant (Supplementary Note 4). quantms benefits from the parallelization and distribution of MS runs in some of the processing steps (peptide search, Percolator, merging of multiple search engines), decreasing the time needed to process big submissions. In addition, we benchmarked the DIA workflow using the dataset PXD026600 (Supplementary Note 5) and found that quantms can accurately quantify spiked-in UPS proteins at different concentrations.
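The two quantities used in this comparison, the coefficient of variation across replicates and the fold change between conditions, can be computed as in the following sketch. The intensity values and function names are invented for illustration and are not taken from the benchmark datasets.

```python
import math
import statistics


def coefficient_of_variation(intensities):
    """Sample standard deviation divided by the mean of replicate intensities."""
    return statistics.stdev(intensities) / statistics.mean(intensities)


def log2_fold_change(numerator, denominator):
    """log2 ratio of the mean intensities of two conditions."""
    return math.log2(statistics.mean(numerator) / statistics.mean(denominator))


# Made-up replicate intensities for one protein in two conditions.
replicates_a = [100.0, 110.0, 90.0]
replicates_b = [400.0, 420.0, 380.0]

cv_a = coefficient_of_variation(replicates_a)
fc = log2_fold_change(replicates_b, replicates_a)
print(f"CV(A) = {cv_a:.2f}, log2 fold change B/A = {fc:.2f}")
```

A lower coefficient of variation indicates more reproducible quantification across replicates, while a measured fold change smaller than the spiked-in ratio corresponds to the underestimation reported for low concentrations.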

To demonstrate quantms performance and scalability, we analyzed 91 human datasets: 35 differential expression datasets and 56 intensity-based absolute quantification (IBAQ) datasets (Supplementary Note 7, Table 6). The IBAQ datasets comprise 56 public human datasets, 9,502 samples, and 26801 instrument files. Among them are multiple large-scale human studies (PXD000561, PXD000865, PXD010154, PXD016999, PXD030304). For all the DDA studies, both search engines (Comet and MSGF+) were used, a stringent FDR of 1% was applied at the PSM and protein levels across each dataset, and at least two unique peptides were required to quantify a protein. Figure 2A shows a barplot with the number of unique peptides for the 17652 quantified proteins (Fig. 2B). Of these, 17378 were quantified in experiments from normal tissues, and 15159 in cell line experiments. It is worth highlighting that 5453 proteins were quantified in human plasma experiments, an increase of approximately 24% compared to the PeptideAtlas Plasma identification build (Supplementary Note 7, Table 7). The IBAQ values computed with quantms are highly correlated for all tissues with those in ProteomicsDB ¹² (Fig. 2C). Moreover, the present study yielded more than 499 proteins not previously quantified in ProteomicsDB or PaxDb ¹⁴ (Fig. 2D). The samples were analyzed on a high-performance computing cluster (EMBL-EBI Cluster), taking an average of 10 hours per dataset and about 2 minutes per instrument run (Supplementary Note 7, Table 6).
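As an illustration of the IBAQ measure, the sketch below divides the summed peptide intensities of a protein by its number of theoretically observable tryptic peptides, and skips proteins with fewer than two quantified peptides. This is the general textbook form of the calculation under stated assumptions: the cleavage rule (after K or R, not before P, no missed cleavages), the peptide length bounds of 6 to 30 residues, and the toy sequence and intensities are all hypothetical and may differ from what quantms applies.

```python
import re


def tryptic_peptides(sequence, min_len=6, max_len=30):
    """In silico tryptic digest: cleave after K/R unless followed by P.

    The length bounds are an assumption made for this sketch.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def ibaq(peptide_intensities, sequence, min_peptides=2):
    """Summed peptide intensity divided by the number of observable peptides.

    Returns None when fewer than `min_peptides` peptides were quantified
    or when the digest yields no observable peptide.
    """
    observable = len(tryptic_peptides(sequence))
    if len(peptide_intensities) < min_peptides or observable == 0:
        return None
    return sum(peptide_intensities.values()) / observable


# Toy protein sequence and made-up peptide intensities.
protein_sequence = "AAAAAAKCCCCCCCRPDDDDDDKEEK"
intensities = {"AAAAAAK": 3.0e6, "CCCCCCCRPDDDDDDK": 1.0e6}

print(tryptic_peptides(protein_sequence))
print(ibaq(intensities, protein_sequence))
```

Normalizing by the number of observable peptides makes intensities comparable between proteins of different length, which is what allows the values to be compared across tissues and resources.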

quantms not only allows data processing for three major quantification approaches, but also automates the deployment and installation of the tools employed by the workflow and converts all outputs to standard file formats, improving the reproducibility, portability, and deposition of the data to PRIDE and ProteomeXchange. It also supports direct quantitative reprocessing of any publicly available dataset in ProteomeXchange on any cloud or HPC infrastructure. Finally, quantms is a modular and open-source workflow, which enables the inclusion and extension of new (sub-)workflows and pipelines for proteomics data processing. Additional documentation about the workflow, the parameters and examples can be found at: https://quantms.readthedocs.io/en/latest/.

DDA peptide identification

All branches of the workflow start with parsing the SDRF and additional user-specified options to split input files by their acquisition and labelling type and to check and infer necessary parameters. For both LFQ and plex workflows, input files are then converted into mzML, if necessary, and indexed. The peptide identification step for DDA-LFQ and DDA-plex approaches is shared in quantms, and two search engines are supported: Comet and MSGF+. These tools can be used separately or in tandem; combining them increases the number of identifications by 5% on average (Supplementary Note 1). The workflow offers a distribution-fitting approach (reminiscent of PeptideProphet) and Percolator as methods to calculate a posterior (error) probability for each peptide spectrum match (PSM). Then, the ConsensusID tool combines the PSMs from multiple search engines into a final score for each PSM. After ConsensusID, file-wide PSM-level q-values are taken from Percolator or calculated according to OpenMS’ target-decoy strategy based on the output probabilities. The workflow performs protein inference using one of two algorithms (a Bayesian approach ¹⁵ or aggregation) and FDR filtering using pickedFDR ¹⁶, with the same underlying algorithms as in the LFQ branch. For post-translational modification studies, the LuciPHOr2 tool ¹⁷ can be used to compute a site-level localization score and the associated false localization rate.
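The q-value calculation mentioned above can be sketched with a generic target-decoy procedure. This is a simplified illustration and not the OpenMS or Percolator code: the FDR estimate used here (decoys divided by targets at each score threshold) is one common convention, and the scores are invented.

```python
def q_values(psms):
    """Compute target-decoy q-values.

    `psms` is a list of (score, is_decoy) tuples where a higher score is
    better. Returns one q-value per PSM, in input order.
    """
    order = sorted(range(len(psms)), key=lambda i: psms[i][0], reverse=True)

    # FDR estimate at each score threshold, walking from best to worst.
    fdrs = []
    targets = decoys = 0
    for index in order:
        if psms[index][1]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # The q-value is the minimum FDR at this or any more permissive threshold.
    result = [0.0] * len(psms)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        result[order[rank]] = running_min
    return result


# Made-up PSM scores; True marks a decoy hit.
example_psms = [(9.0, False), (8.0, False), (7.0, True),
                (6.0, False), (5.0, False), (4.0, True)]
example_q = q_values(example_psms)
print(example_q)
```

Filtering at a 1% FDR then amounts to keeping the PSMs whose q-value is at most 0.01; in this toy example only the two best-scoring PSMs would pass.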

DDA Label-free protein quantification

Two methods are available for label-free peptide/protein quantification: spectral counting and intensity-based quantification. We developed the tool proteomicsLFQ as part of the OpenMS framework ¹⁸ to perform LFQ-based quantification. For intensity-based quantification, proteomicsLFQ uses a hybrid quantification strategy that combines targeted extraction of elution profiles based on the precursors of identified peptides with an untargeted, averagine model-based feature detection approach. Chromatographic retention time alignment leverages the sample fraction annotation from the experimental design file to reduce chromatographic shifts between corresponding fractions in different instrument files. If match-between-runs is applied, peptide annotations are transferred from identified peptides in one run to unidentified features in other runs. An optional quantification step aims to fill the remaining missing quantitative values by running a targeted extraction based on peptide precursors that have been quantified successfully in most runs. Quantified peptides and inferred proteins are written to the standardized mzTab format, as well as to MSstats and Triqler input files for downstream statistical analysis.
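The idea behind match-between-runs can be illustrated with the following sketch, which transfers a peptide annotation to an unidentified feature when its m/z and aligned retention time fall within tolerance of an identification from another run. The tolerances (10 ppm, 30 s), the data layout, and the function name are assumptions chosen for illustration; the algorithm in proteomicsLFQ is more elaborate.

```python
def transfer_identifications(identified, unidentified, ppm_tol=10.0, rt_tol=30.0):
    """Annotate unidentified features using identifications from another run.

    `identified` holds dicts with peptide, mz and rt; `unidentified` holds
    dicts with mz, rt and intensity. Retention times are assumed to be
    aligned already. Tolerances are hypothetical defaults.
    """
    transferred = []
    for feature in unidentified:
        best = None
        for ident in identified:
            ppm_error = abs(feature["mz"] - ident["mz"]) / ident["mz"] * 1e6
            rt_error = abs(feature["rt"] - ident["rt"])
            if ppm_error <= ppm_tol and rt_error <= rt_tol:
                # Keep the candidate closest in retention time.
                if best is None or rt_error < best[0]:
                    best = (rt_error, ident["peptide"])
        if best is not None:
            transferred.append({**feature, "peptide": best[1], "transferred": True})
    return transferred


# Made-up identifications from run A and unidentified features from run B.
run_a_ids = [
    {"peptide": "LVNELTEFAK", "mz": 582.319, "rt": 1200.0},
    {"peptide": "YLYEIAR", "mz": 464.250, "rt": 900.0},
]
run_b_features = [
    {"mz": 582.321, "rt": 1210.0, "intensity": 5.0e6},
    {"mz": 464.300, "rt": 905.0, "intensity": 2.0e6},
]

matches = transfer_identifications(run_a_ids, run_b_features)
print(matches)
```

Only the first feature is annotated here: the second lies within the retention time window but deviates by roughly 100 ppm in m/z, so no annotation is transferred.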

DDA-plex protein quantification

quantms quantification of isobaric-labelled peptides and proteins starts by reading the DDA peptide identification results into the OpenMS tool IsobaricAnalyzer. Using isotope correction matrices, this tool extracts and normalizes reporter ion intensities from MS2 and MS3 spectra. quantms currently supports 4-plex and 8-plex iTRAQ labelling, as well as TMT 6-plex, 10-plex, 11-plex, and 16-plex. After protein inference and quantification, the results are again stored in standardized output formats and forwarded to downstream analysis. Three gold-standard datasets previously evaluated by TMT quantification tools were used to benchmark quantms (Supplementary Note 3). In all benchmarks, quantms performs comparably to MaxQuant and the other tools used for quantification, such as Proteome Discoverer or IsoProt (Supplementary Note 3, dataset PXD005486). In addition, we evaluated the dataset PXD007683, a two-proteome mixture in known concentrations analyzed using TMT and LFQ approaches. For both approaches, quantms quantified more proteins than MaxQuant, and both tools separated human and yeast proteins equally well (Supplementary Note 3).
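Isotope correction can be viewed as solving a small linear system: the observed reporter intensities are modelled as the correction matrix applied to the true channel intensities. The sketch below shows this for three hypothetical channels using plain Gaussian elimination. The matrix values are made up, real correction factors come from the reagent lot, and the solver used by IsobaricAnalyzer may differ.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        residual = aug[row][n] - sum(aug[row][k] * solution[k]
                                     for k in range(row + 1, n))
        solution[row] = residual / aug[row][row]
    return solution


# Hypothetical correction matrix: entry [i][j] is the fraction of the signal
# from channel j that is observed in channel i (each column sums to at most 1).
correction = [
    [0.95, 0.03, 0.00],
    [0.05, 0.94, 0.02],
    [0.00, 0.03, 0.98],
]
true_intensities = [1000.0, 2000.0, 500.0]

# Simulate what the instrument would report, then undo the impurity mixing.
observed = [sum(c * t for c, t in zip(row, true_intensities)) for row in correction]
corrected = solve_linear_system(correction, observed)
print(observed)
print([round(value, 3) for value in corrected])
```

The simulated observed intensities differ from the true ones because each channel leaks into its neighbours; solving the system recovers the original values up to floating-point error.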

DIA protein identification and quantification

For data-independent acquisition data analysis, quantms parallelizes the DIA-NN tool ¹⁹, distributing the multiple steps that DIA-NN performs on a dataset across compute nodes (Fig. 1C). The first step of the pipeline converts the protein sequence database (FASTA) into an in silico-predicted spectral library. Each instrument file in mzML is then searched against this library (first DIA assembly), resulting in a set of identified precursors. A full library of identified precursors is then created by merging all the individual searches (experimental library). A final fast identification/quantification step runs on a single node, where all the MS runs are searched against the merged experimental library (Fig. 1D). We evaluated the DIA workflow on the dataset PXD026600, an E. coli sample with UPS1 proteins spiked in at different concentrations (Supplementary Note 5). The workflow achieved nearly perfect performance (quantified all 48 UPS proteins) at 4 high concentrations. In addition, at most concentrations, the workflow achieved a perfect distinction between the two classes compared, namely UPS1 proteins (differentially expressed) and E. coli proteins (fixed background), but the accuracy naturally drops for lower concentrations, due to fewer identifications and noisier quantification (Supplementary Note 5).
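The scatter and gather structure of the DIA branch can be sketched as follows. No DIA-NN call is made: `first_pass_search` is a hypothetical stand-in that looks up invented per-run results, and the q-value cut-off of 0.01 is an assumption. The sketch only shows per-run searches running independently before their results are merged into one experimental library.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented precursors standing in for the in silico-predicted library.
PREDICTED_LIBRARY = {"PEPTIDEA/2", "PEPTIDEB/2", "PEPTIDEC/3", "PEPTIDED/2"}

# Invented first-pass results: precursor -> q-value, one dict per run.
FIRST_PASS_RESULTS = {
    "run01.mzML": {"PEPTIDEA/2": 0.001, "PEPTIDEB/2": 0.2},
    "run02.mzML": {"PEPTIDEB/2": 0.005, "PEPTIDEC/3": 0.008},
    "run03.mzML": {"PEPTIDEA/2": 0.002, "PEPTIDED/2": 0.5},
}


def first_pass_search(run, q_cutoff=0.01):
    """Hypothetical stand-in for searching one run against the predicted library."""
    hits = FIRST_PASS_RESULTS[run]
    return {precursor for precursor, q in hits.items()
            if precursor in PREDICTED_LIBRARY and q <= q_cutoff}


def build_experimental_library(runs):
    """Scatter the per-run searches, then gather the union of confident precursors."""
    with ThreadPoolExecutor() as pool:
        per_run = list(pool.map(first_pass_search, runs))
    return set().union(*per_run)


experimental_library = build_experimental_library(sorted(FIRST_PASS_RESULTS))
print(sorted(experimental_library))
```

A precursor enters the merged library if it is confidently identified in at least one run, so the final search on a single node works against a much smaller library than the predicted one.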

Downstream analysis and quality control

MSstats and quantms are fully integrated for differential expression data analysis. The workflow generates input for the MSstats R package, and if differential expression analysis is performed, the MSstats plots and output files are automatically produced. MSstats was selected after benchmarking MSstats and other R packages with quantms for multiple LFQ datasets ¹¹. Factor values/conditions, and biological and technical replicates under study are translated from the original SDRF (provided as input format) to MSstats columns. In cases where multiple SDRFs are used to study multiple conditions or factor values, the pipeline reuses steps that have already been executed with no changes in parameters, and only executes the steps that differ due to the SDRF being used (such as the quantification step in proteomicsLFQ). Users can automatically perform the differential expression analysis using MSstats (https://quantms.readthedocs.io/en/latest/msstats.html). The workflow detects whether the experiment is LFQ (DIA or DDA) or TMT and uses the corresponding MSstats package (MSstats or MSstatsTMT). By default, the MSstats step generates a list of plots, including a volcano plot, a quality control (QC) plot, and a comparison plot (e.g. http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/differential-expression/PXD004683/msstatstmt/). Configurable parameters for the MSstats data processing step include the summary method, the log fold-change threshold, etc.
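The translation from SDRF annotations to MSstats design columns, and the choice between MSstats and MSstatsTMT, can be sketched as below. The mapping of `source name` to BioReplicate, the factor column name, and both helper functions are assumptions made for illustration; the actual conversion in quantms handles fractions, technical replicates and channels in more detail.

```python
def sdrf_row_to_design(row, factor_column):
    """Map one SDRF row to the Run/Condition/BioReplicate design columns."""
    return {
        "Run": row["comment[data file]"],
        "Condition": row[factor_column],
        # Assumption: each source name is treated as one biological replicate.
        "BioReplicate": row["source name"],
    }


def choose_msstats_package(labels):
    """Pick the R package from the labels annotated in the SDRF."""
    if all(label.lower().startswith("label free") for label in labels):
        return "MSstats"
    return "MSstatsTMT"


# Made-up SDRF rows for a two-condition label-free experiment.
sdrf_rows = [
    {"source name": "patient 1", "comment[label]": "label free sample",
     "comment[data file]": "run01.mzML", "factor value[disease]": "control"},
    {"source name": "patient 2", "comment[label]": "label free sample",
     "comment[data file]": "run02.mzML", "factor value[disease]": "tumor"},
]

design = [sdrf_row_to_design(row, "factor value[disease]") for row in sdrf_rows]
package = choose_msstats_package(row["comment[label]"] for row in sdrf_rows)
print(package)
for entry in design:
    print(entry)
```

Because the design is derived from the SDRF alone, supplying a second SDRF with a different factor column changes only this mapping and the steps downstream of it.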

To ensure high-quality data, we developed pmultiqc (https://github.com/bigbio/pmultiqc), which is part of the quantms tool ecosystem (Fig. 1E, Supplementary Note 5). pmultiqc generates a quality control report for each analyzed dataset, using the mzTab, SDRF, and other intermediate files. The report includes different plots that display the number of peptides identified per protein, the distribution of PSM posterior error probabilities and search engine scores, or the MS2/MS3 identification rate.

Portability and Deployment

All quantms tools are available as versioned BioConda packages and BioContainers, and the workflow has been developed with Nextflow following nf-core ⁷ guidelines, enabling compatibility with an ecosystem of infrastructures including Amazon Web Services (AWS), Google Cloud Platform (GCP), Kubernetes, and HPC clusters. Due to its implementation as an nf-core/Nextflow workflow, quantms allows resuming failed process executions, re-allocating resources (e.g., memory and CPU) depending on the demands of the tool, and monitoring the workflow.

Interoperability and ProteomeXchange support

quantms processing steps are based on standard file formats. The input formats are SDRF and mzML and the main result files are exported into mzTab. To export DIA and DDA results into mzTab, new controlled vocabulary terms and external reference files were introduced. In addition, the pipeline automatically generates other file formats that can be used for downstream analysis, such as MSstats and Triqler inputs. Results from quantms can be readily submitted to PRIDE and ProteomeXchange as COMPLETE submissions.
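Because mzTab is a tab-delimited text format in which every line starts with a section prefix (for example MTD for metadata, PRH and PRT for the protein header and rows), results can be read with a few lines of code. The sketch below parses a toy fragment; the fragment is heavily shortened and its values are invented, so it is not a valid complete mzTab file.

```python
# Shortened, invented mzTab fragment (assumption): real files contain many
# more metadata lines and columns.
MZTAB_TEXT = (
    "MTD\tmzTab-version\t1.0.0\n"
    "MTD\tdescription\ttoy example\n"
    "PRH\taccession\tdescription\tprotein_abundance_study_variable[1]\n"
    "PRT\tP02768\tAlbumin\t1.5e9\n"
    "PRT\tP68871\tHemoglobin subunit beta\t3.2e8\n"
)


def read_metadata(lines):
    """Collect MTD lines into a key/value dictionary."""
    metadata = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]
    return metadata


def read_section(lines, header_prefix, row_prefix):
    """Return the rows of one mzTab table as dictionaries keyed by column name."""
    header = None
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == header_prefix:
            header = fields[1:]
        elif fields[0] == row_prefix and header is not None:
            rows.append(dict(zip(header, fields[1:])))
    return rows


mztab_lines = MZTAB_TEXT.splitlines()
metadata = read_metadata(mztab_lines)
proteins = read_section(mztab_lines, "PRH", "PRT")
print(metadata["mzTab-version"], [p["accession"] for p in proteins])
```

The same `read_section` helper could be pointed at the peptide (PEH/PEP) or PSM (PSH/PSM) tables, which follow the same header and row convention.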

Data availability

The datasets reanalyzed in the present study can be found in the PRIDE database ¹³ FTP (http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/).

Code availability

All software, algorithms and tools are available on GitHub: quantms (https://github.com/bigbio/quantms) and pmultiqc (https://github.com/bigbio/pmultiqc). The full documentation of quantms is available at https://quantms.readthedocs.io/en/latest/.

Acknowledgements

Y.PR. was funded by the EU H2020 project EPIC-XS [823839], Wellcome grants (208391/Z/17/Z, 223745/Z/21/Z) and EMBL core funding. M. B. and C. D. were funded by the National Key Research and Development Program of China (2018YFA0507504). V.D. was supported by the Federal Ministry of Education and Research (BMBF), as part of the National Research Initiatives for Mass Spectrometry in Systems Medicine (“MSCoreSys”), under grant agreement 161L0221.

1. Levitsky, L.I. et al. Massive Proteogenomic Reanalysis of Publicly Available Proteomic Datasets of Human Tissues in Search for Protein Recoding via Adenosine-to-Inosine RNA Editing. J Proteome Res (2023).
2. Jarnuczak, A.F. et al. An integrated landscape of protein expression in human cancer. Sci Data 8, 115 (2021).
3. Feng, J. et al. Firmiana: towards a one-stop proteomic cloud platform for data processing and analysis. Nat Biotechnol 35, 409–412 (2017).
4. Choi, M. et al. MassIVE.quant: a community resource of quantitative mass spectrometry-based proteomics datasets. Nat Methods 17, 981–984 (2020).
5. Vaudel, M. et al. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat Biotechnol 33, 22–24 (2015).
6. Wilkinson, M.D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).
7. Ewels, P.A. et al. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol 38, 276–278 (2020).
8. Dai, C. et al. A proteomics sample metadata representation for multiomics integration and big data analysis. Nat Commun 12, 5854 (2021).
9. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat Biotechnol 35, 316–319 (2017).
10. Choi, M. et al. MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics 30, 2524–2526 (2014).
11. Bai, M. et al. LFQ-Based Peptide and Protein Intensity Differential Expression Analysis. J Proteome Res (2023).
12. Lautenbacher, L. et al. ProteomicsDB: toward a FAIR open-source resource for life-science research. Nucleic Acids Res 50, D1541–D1552 (2022).
13. Perez-Riverol, Y. et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res 50, D543–D552 (2022).
14. Wang, M., Herrmann, C.J., Simonovic, M., Szklarczyk, D. & von Mering, C. Version 4.0 of PaxDb: Protein abundance data, integrated across model organisms, tissues, and cell-lines. Proteomics 15, 3163–3168 (2015).
15. Pfeuffer, J. et al. EPIFANY: A Method for Efficient High-Confidence Protein Inference. J Proteome Res 19, 1060–1072 (2020).
16. Savitski, M.M., Wilhelm, M., Hahne, H., Kuster, B. & Bantscheff, M. A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets. Mol Cell Proteomics 14, 2394–2404 (2015).
17. Fermin, D., Avtonomov, D., Choi, H. & Nesvizhskii, A.I. LuciPHOr2: site localization of generic post-translational modifications from tandem mass spectrometry data. Bioinformatics 31, 1141–1143 (2015).
18. Rost, H.L. et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Methods 13, 741–748 (2016).
19. Demichev, V., Messner, C.B., Vernardis, S.I., Lilley, K.S. & Ralser, M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods 17, 41–44 (2020).

Competing interests: The authors declare no competing interests.

supplementary.pdf
Supplementary Notes
