In recent years, the field of proteomics has seen unprecedented growth in publicly available datasets, with a trend towards studies that analyze a larger number of samples. As of April 2023, the number of public datasets stored in the PRIDE database exceeded 22,000, including a remarkable increase in large datasets containing more than 100 instrument files, from 100 submissions in 2014 to 600 in 2022. In parallel, a range of transformative improvements in proteomic data processing software has been introduced, enabling a deeper and more precise look into the proteome. Reprocessing existing data with the new tools can therefore yield additional biological and biomedical insights 1, 2. However, the growth in dataset size creates a significant computational bottleneck, making it challenging to re-analyze large experiments on conventional workstations. The automated analysis of publicly accessible quantitative proteomics data is further limited by the lack of metadata describing the phenotypes, the samples, and the instrument operation. Although some of these challenges have been tackled in earlier works 3–5, research groups still cannot perform automated large-scale quantitative analysis in the cloud and on distributed architectures. To address this challenge, the field requires innovative and scalable bioinformatics solutions that leverage sample metadata to automatically quantify peptides and proteins, perform absolute or differential expression analysis, and provide extensive quality control output.
Here we introduce quantms (https://github.com/bigbio/quantms), the first open-source cloud-based pipeline for massively parallel proteomic data analysis. It supports three major types of experiments - data-dependent acquisition label-free (DDA-LFQ), isobaric tandem mass tag-based (DDA-plex), and data-independent acquisition (DIA-LFQ) - and is highly flexible and modular, to accommodate the diversity of quantitative proteomics approaches. To enable traceable and reproducible analysis, quantms is entirely based on standardized open file formats and reproducible execution environments, adhering strictly to FAIR principles 6. The pipeline is fully documented following nf-core guidelines 7, making it a valuable resource in the field of proteomics.
A quantms analysis starts with the instrument files in the standard mzML format and the protein sequence database (Fig. 1A). The workflow uses the Sample and Data Relationship Format (SDRF) 8 to ensure that each workflow module is executed with all relevant internal parameters, including the sample variables under study and mass spectrometry-related parameters. The quantms pipeline branches into three sub-workflows for DDA-LFQ, DDA-plex (Fig. 1B), and DIA experiments (Fig. 1C). Unlike conventional desktop tools such as MaxQuant or Proteome Discoverer, quantms automatically distributes computation using the Nextflow workflow engine 9 on one or more computers, depending on the number of instrument files and samples. To parallelize the steps that can be performed independently, the workflow streams each instrument file, as annotated in the tab-delimited SDRF file, to individual nodes of the computing infrastructure, such as a cloud or high-performance computing (HPC) cluster. In the final step, quantms aggregates the processed data to globally infer proteins, estimate protein false-discovery rates, and quantify proteins. All sub-workflows export the final results in the mzTab standard format, facilitating the submission of the results to ProteomeXchange (Fig. 1D). All these analyses run automatically, fully reproducibly, and without manual intervention. quantms is integrated with MSstats 10 and a new Python tool (pmultiqc), enabling differential expression analysis and the generation of quality control reports (Fig. 1E).
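The scatter-and-gather pattern described above can be illustrated with a minimal Python sketch. It is not the quantms implementation (which relies on Nextflow); the toy SDRF table, the functions `process_run` and `aggregate`, and the fake PSM counts are hypothetical and serve only to show how one record per instrument file can be processed independently and then combined at the dataset level.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Toy SDRF-like table (tab-delimited); real SDRF files carry many more columns.
SDRF_TEXT = (
    "source name\tcomment[data file]\tcomment[label]\n"
    "sample_1\trun_01.mzML\tlabel free sample\n"
    "sample_2\trun_02.mzML\tlabel free sample\n"
    "sample_3\trun_03.mzML\tlabel free sample\n"
)


def read_runs(sdrf_text):
    """Return one record per instrument file listed in the SDRF table."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    return [
        {"sample": row["source name"], "file": row["comment[data file]"]}
        for row in reader
    ]


def process_run(run):
    """Hypothetical per-file step (stands in for search and rescoring)."""
    # A real step would read the mzML file; here we return a fake PSM count.
    return {"file": run["file"], "sample": run["sample"], "n_psms": len(run["file"])}


def aggregate(results):
    """Hypothetical gather step: combine per-file results for the dataset."""
    return {
        "n_runs": len(results),
        "total_psms": sum(r["n_psms"] for r in results),
        "samples": sorted({r["sample"] for r in results}),
    }


# Scatter: each instrument file is handled independently by a worker.
runs = read_runs(SDRF_TEXT)
with ThreadPoolExecutor(max_workers=4) as pool:
    per_run = list(pool.map(process_run, runs))

# Gather: dataset-level aggregation happens once, after all runs finish.
summary = aggregate(per_run)
print(summary)
```

In this sketch a thread pool on a single machine stands in for the nodes of a cloud or HPC cluster; the point is only that the per-file step has no dependency on other files, whereas the aggregation step requires all of them.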
We extensively benchmarked quantms against MaxQuant on DDA-LFQ (Supplementary Note 1, 2) 11 and DDA-plex datasets (Supplementary Note 3). MaxQuant, the most popular tool for intensity-based quantitation, has previously been used for public data reanalysis by ProteomicsDB 12, MassIVE.quant 4 and ExpressionAtlas 13. In summary, quantms quantified more proteins than MaxQuant in all datasets with the same accuracy (lower coefficients of variation); however, at low concentrations, quantms underestimated the true fold changes (Supplementary Note 2, Fig. 3). In terms of scalability and performance, we observed major differences between MaxQuant and quantms. When the number of instrument files and samples grows (over 1000 MS runs), quantms can be 40 times faster than MaxQuant (Supplementary Note 4). quantms benefits from the parallelization and distribution of MS runs in some of the processing steps (peptide search, Percolator, merging of multiple search engine results), which decreases the time needed to process large submissions. In addition, we benchmarked the DIA workflow using the dataset PXD026600 (Supplementary Note 5) and found that quantms can accurately quantify spike-in UPS proteins at different concentrations.
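As an illustration of the two benchmark metrics mentioned above, the following sketch computes a coefficient of variation across replicates and an observed log2 fold change that is compared with an expected spike-in ratio. The intensities and the 4:1 ratio are made-up values, not results from the benchmark, and the function names are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def log2_fold_change(group_a, group_b):
    """log2 ratio of the mean intensities of two conditions."""
    return math.log2(statistics.mean(group_a) / statistics.mean(group_b))


# Made-up replicate intensities for one spiked-in protein.
high = [400.0, 420.0, 380.0]
low = [120.0, 130.0, 125.0]

cv_high = coefficient_of_variation(high)
observed = log2_fold_change(high, low)
expected = math.log2(4.0)  # assumed true spike-in ratio of 4:1

print(f"CV (high condition): {cv_high:.3f}")
print(f"observed log2FC: {observed:.3f}, expected log2FC: {expected:.3f}")
```

In this toy case the observed fold change is smaller than the expected one, which is the kind of underestimation a spike-in benchmark is designed to reveal.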
To demonstrate the performance and scalability of quantms, we analyzed 91 human datasets: 35 differential expression and 56 intensity-based absolute quantification (IBAQ-based) datasets (Supplementary Note 7, Table 6). The absolute (IBAQ) quantification datasets comprise 56 public human datasets, 9,502 samples, and 26,801 instrument files. Among these datasets are multiple large-scale human studies (PXD000561, PXD000865, PXD010154, PXD016999, PXD030304). For all the DDA studies, both search engines (Comet and MS-GF+) were used, a stringent 1% FDR threshold was applied at the PSM and protein levels, computed at the dataset level, and at least two unique peptides were required to quantify a protein. Figure 2A shows a barplot with the number of unique peptides for the 17,652 quantified proteins (Fig. 2B). Of these, 17,378 were quantified in experiments from normal tissues, and 15,159 in cell line experiments. It is worth highlighting that 5,453 proteins were quantified in human plasma experiments, an increase of approximately 24% compared to the PeptideAtlas Plasma identification build (Supplementary Note 7, Table 7). The IBAQ values computed with quantms are highly correlated for all tissues with those of ProteomicsDB 10 (Fig. 2C). Moreover, the present study yielded more than 499 proteins not previously quantified in ProteomicsDB or PaxDB 14 (Fig. 2D). The samples were analyzed on a high-performance computing cluster (EMBL-EBI Cluster), taking on average 10 hours per dataset and about 2 minutes per instrument run (Supplementary Note 7, Table 6).
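The two-unique-peptide rule and the IBAQ calculation can be sketched as follows. This is a simplified illustration rather than the code used in the study: it assumes the common definition of IBAQ as the summed peptide intensity of a protein divided by its number of theoretically observable peptides, sums only unique peptides (a simplifying choice), and uses made-up peptide records and theoretical peptide counts.

```python
from collections import defaultdict

# Made-up peptide-level records: (protein, peptide, intensity, is_unique)
peptides = [
    ("P1", "AAAK", 100.0, True),
    ("P1", "CCCR", 300.0, True),
    ("P2", "DDDK", 500.0, True),
    ("P2", "EEEK", 200.0, False),  # shared peptide, not counted as unique
]

# Assumed number of theoretically observable peptides per protein.
theoretical_peptides = {"P1": 4, "P2": 5}


def ibaq(peptide_records, n_theoretical, min_unique=2):
    """Return IBAQ values for proteins with at least `min_unique` unique peptides."""
    unique = defaultdict(set)
    total = defaultdict(float)
    for protein, peptide, intensity, is_unique in peptide_records:
        if is_unique:  # simplification: only unique peptides contribute
            unique[protein].add(peptide)
            total[protein] += intensity
    return {
        protein: total[protein] / n_theoretical[protein]
        for protein in unique
        if len(unique[protein]) >= min_unique
    }


ibaq_values = ibaq(peptides, theoretical_peptides)
print(ibaq_values)
```

Here P2 is excluded because only one of its peptides is unique, while P1 passes the filter and receives an IBAQ value from its two unique peptides.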
quantms not only supports data processing for three major quantification approaches, but also automates the deployment and installation of the tools employed by the workflow and converts all outputs to standard file formats, improving the reproducibility, portability, and deposition of the data to PRIDE and ProteomeXchange. It also supports direct quantitative reprocessing of any publicly available dataset in ProteomeXchange, on any cloud or HPC infrastructure. Finally, quantms is a modular and open-source workflow, which enables the inclusion and extension of new (sub-)workflows and pipelines for proteomics data processing. Additional documentation about the workflow, its parameters, and examples can be found at: https://quantms.readthedocs.io/en/latest/.