Datasets
Nine-species dataset: We acquired the nine-species dataset from Tran et al.18, comprising approximately 1.52 million PSMs generated by HCD following trypsin digestion. PSMs were identified by database search and filtered to a 1% PSM-level FDR to ensure high data integrity. Modification settings included carbamidomethylation of cysteine (C) residues as a fixed modification, alongside oxidation of methionine (M) and deamidation of asparagine (N) and glutamine (Q) as variable modifications. Isoleucine (I) and leucine (L) residues remained indistinguishable in this analysis. To evaluate the model’s performance, we implemented a leave-one-out cross-validation framework, designating one species as the test set and the remainder as the training set. Additional details about the dataset are documented in Supplementary Table 3.
GraphNovo dataset: We acquired the GraphNovo dataset from Mao et al.21, which is structured into two distinct parts: the training set includes samples from HeLa and Cerebellum, totaling 1,659,763 PSMs, while the test set encompasses samples from A. thaliana, C. elegans, and E. coli, each contributing 12,500 PSMs. All spectra were generated using HCD after trypsin digestion. The PSMs were identified through database searches and validated to a stringent 1% PSM-level FDR. The protocol set carbamidomethylation of cysteine (C) residues as a fixed modification and oxidation of methionine (M) as a variable modification. Isoleucine (I) and leucine (L) residues were not differentiated in this analysis. Additional details about the dataset are provided in Supplementary Table 3.
MassIVE-KB dataset: We acquired the MassIVE-KB dataset26, which encompasses 30 million high-quality PSMs filtered to a near-zero (~0%) PSM-level FDR, with each charge and peptide combination limited to no more than 100 PSMs. Generated predominantly through trypsin digestion, the dataset incorporates carbamidomethylation of cysteine (C) residues as a fixed modification and includes seven variable modifications: oxidation of methionine (M), deamidation of asparagine (N) and glutamine (Q), N-terminal acetylation, N-terminal carbamylation, N-terminal NH3 loss, and the combination of N-terminal carbamylation and NH3 loss. To assess the π-xNovo model's performance, we utilized the A. thaliana, C. elegans, and E. coli data from the GraphNovo dataset as test sets. The overlap in PSMs between MassIVE-KB and these species was 31, 130, and 6, respectively, an average overlap of no more than 0.42% per test set; this minimal overlap is well within experimental error and does not affect our conclusions.
DHP dataset: We downloaded 2,460 ‘.mzML’ files from the deep human proteome (DHP) sequencing project (ProteomeXchange ID: PXD024364) from Sinitcyn et al.27, encompassing approximately 161 million spectra. These spectra were generated from six human cell lines: hES1, HeLa S3, HepG2, GM12878, K562, and HUVEC. Samples underwent enzymatic digestion with six parallel enzymes (LysC, LysN, AspN, chymotrypsin, GluC, and trypsin) and were analyzed using three fragmentation methods: HCD, collision-induced dissociation (CID), and ETD. Spectra from electron-transfer/higher-energy collisional dissociation (ETHCD) were classified as ETD. From the ‘msms.txt’ files for each cell line, we extracted 16,746,362 PSMs for HCD, 2,389,865 for ETD, and 1,482,573 for ETHCD. To evaluate the π-xNovo model, we used test sets of 10,000 randomly selected spectra from each cell line and enzyme combination, yielding 360,000 PSMs for HCD and 300,000 PSMs for ETD. Additionally, we retrieved 174,945 PSMs from ‘msms.txt’ files contained within ‘txt.zip’ that are not associated with downloadable ‘.mzML’ files, bringing the total PSMs identified in the project to 20,793,745. The fixed modification was carbamidomethylation of cysteine (C) residues; variable modifications included oxidation of methionine (M) and N-terminal acetylation. We also acquired ‘proteins.fa’ for these cell lines from ‘variationExtraction.zip’. By integrating these data with UniProt canonical (version 2017_02; UP000005640_9606), UniProt isoform (UP000005640_9606_additional), Ensembl canonical (version 86; GRCh38.pep.all), and Ensembl isoform (GRCh38.pep.abinitio), we established a comprehensive protein database comprising 143,897 protein sequences.
Protein identification
The π-xNovo models, trained specifically on HCD and ETD spectra, were applied to approximately 140 million previously unidentified spectra. We retained PSMs ranging from 5 to 40 amino acids in length. Using π-xNovo-QC, we identified a total of 18,982,605 high-confidence PSMs. Subsequent sequence alignment using BLAST33 (version 2.5.0) compared peptides derived from traditional database searches with those from de novo peptide sequencing. This analysis yielded 10,352,965 matches for peptides identified through database searches and 16,122,348 for those obtained via de novo sequencing. Among the peptides identified through database searches, 1,408,794 matched known protein sequences, with 1,353,978 achieving perfect identity (100% matched). De novo sequencing identified 2,259,435 peptides matching known protein sequences, with 735,719 achieving perfect identity, highlighting the capability of de novo methods to expand the proteomic landscape.
Protein coverage calculation
Utilizing the designated protein ID, we retrieved all corresponding peptides with 100% matching percentage. Protein sequence coverage was calculated as the proportion of unique amino acids within the retrieved peptides relative to the total amino acid count of the full protein sequence. Each amino acid was counted only once towards coverage, regardless of its recurrence across different peptides.
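The coverage calculation above can be sketched as follows (a hypothetical helper, not the authors' code): every residue position covered by at least one 100%-identity peptide is counted once, regardless of how many peptides cover it.

```python
# Sketch of the protein-coverage calculation: unique covered residue
# positions divided by the protein length.

def protein_coverage(protein_seq: str, peptides: list[str]) -> float:
    covered = set()
    for pep in peptides:
        start = protein_seq.find(pep)
        while start != -1:  # count every occurrence of the peptide
            covered.update(range(start, start + len(pep)))
            start = protein_seq.find(pep, start + 1)
    return len(covered) / len(protein_seq)

# Overlapping peptides cover 8 of 10 residues -> coverage 0.8
print(protein_coverage("MKTAYIAKQR", ["MKTAY", "AYIAK"]))  # 0.8
```

Because positions are collected in a set, overlapping peptides do not inflate the coverage figure.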
Single amino acid polymorphism (SAP) analysis
We retrieved data on potential positions for nonsynonymous mutations and their corresponding amino acid substitutions from the ‘proteins.fa’ file for each cell line. We then cross-referenced peptides from the alignment results that exhibited single amino acid variations with the nonsynonymous mutation data. This rigorous approach identified a distinct subset of peptides with mutations, thereby providing valuable insights into the mutation landscape of the proteins analyzed.
Exon-skipping splicing event analysis
We devised a methodology to identify and analyze potential exons from the transcript's cDNA (GRCh38.cdna.all.fa), utilizing annotations from the hg38 reference human genome (GRCh38111.gff3). Our objective was to construct a comprehensive database of all splicing sequences, focusing on splicing between two exons. The methodology comprised the following steps:
- Identification of Exons: Using the transcript ID, we cataloged all exons, detailing their start and end positions, and extracted each exon sequence.
- Exon Splicing: We concatenated non-contiguous exons, limiting our analysis to connections between two exons, and performed a three-frame translation to identify all possible open reading frames (ORFs).
- Translation to Peptide Sequences: Utilizing the Biopython library34, we translated all identified ORF nucleotide sequences into peptide sequences, preserving information about exon junctions (positioned precisely between two amino acids or at a specific amino acid), thereby establishing a database for splicing event peptides.
- Sequence Alignment: We employed BLAST33 (version 2.5.0) to align peptide sequences derived from both database searches and de novo peptide sequencing.
- Detection of Splicing Events: We scrutinized matches exhibiting a perfect identity percentage (100%) to verify the occurrence of peptides at exon junctions. Confirmation of peptides at these junctions signified successful detection of cross-exon splicing events.
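The splicing and translation steps above can be sketched as follows. This is an illustrative stand-in, not the authors' pipeline: it uses a plain codon table instead of Biopython, the example exon sequences are invented, and the junction bookkeeping (integer position means between two residues, fractional means inside a residue) is our reading of the description.

```python
# Sketch of exon concatenation, three-frame translation, and junction
# tracking for the splicing-event peptide database.
from itertools import product

# Standard codon table built from the canonical TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa
               for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}

def splice_and_translate(exon_a: str, exon_b: str):
    """Join two non-contiguous exons and translate in three frames,
    recording where the exon junction falls in each peptide."""
    joined = exon_a + exon_b
    junction_nt = len(exon_a)  # nucleotide index of the exon-exon boundary
    results = []
    for frame in range(3):
        codons = [joined[i:i + 3] for i in range(frame, len(joined) - 2, 3)]
        peptide = "".join(CODON_TABLE[c] for c in codons)
        # Integer -> junction lies between two residues; fractional ->
        # junction falls inside a specific residue.
        junction_aa = (junction_nt - frame) / 3
        results.append((frame, peptide, junction_aa))
    return results

for frame, pep, jaa in splice_and_translate("ATGGCT", "GGTTAA"):
    print(frame, pep, jaa)
```

Peptides that BLAST later matches at 100% identity across such a junction are the candidate exon-skipping events.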
Data preprocessing
All spectra underwent stringent preprocessing to ensure data quality and computational efficiency:
- Removal of Outliers: Peaks outside the m/z range of 50.0 Da to 2500.0 Da were excluded to eliminate anomalies and enhance analysis reliability.
- Removal of Low-Intensity Peaks: Peaks with intensities less than 10% of the highest peak intensity within the spectrum were excluded to focus on the most significant signals.
- Retention of Top 150 Peaks: Only the top 150 peaks by intensity were retained for each spectrum, optimizing data management and computational efficiency.
- Normalization of Intensity: The intensities of the retained peaks were normalized to a uniform scale to facilitate consistent comparative analysis across all samples.
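The four steps can be sketched in NumPy as follows. Function and parameter names are ours, and normalizing to unit base-peak intensity is an assumption, since the exact scale is not specified.

```python
# Sketch of the spectrum preprocessing pipeline described above.
import numpy as np

def preprocess_spectrum(mz, intensity, mz_range=(50.0, 2500.0),
                        min_rel=0.10, top_k=150):
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # 1. Remove outliers outside the allowed m/z window.
    keep = (mz >= mz_range[0]) & (mz <= mz_range[1])
    mz, intensity = mz[keep], intensity[keep]
    # 2. Remove peaks below the stated 10% of the base-peak intensity.
    if intensity.size:
        keep = intensity >= min_rel * intensity.max()
        mz, intensity = mz[keep], intensity[keep]
    # 3. Retain only the top_k most intense peaks.
    if intensity.size > top_k:
        order = np.sort(np.argsort(intensity)[-top_k:])  # keep m/z order
        mz, intensity = mz[order], intensity[order]
    # 4. Normalize intensities (unit base peak, an assumption).
    if intensity.size:
        intensity = intensity / intensity.max()
    return mz, intensity
```

The steps are order-dependent: the relative-intensity filter is applied before the top-150 cut, so the threshold is computed from the original base peak.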
To rigorously assess noise presence and backbone ion absence in spectral data, we used the Pyteomics framework35 to categorize spectral peaks systematically. For each fragmentation site, we considered a comprehensive set of 30 potential backbone ion types. Specifically, for HCD spectra, this categorization spans two charge states (1+ and 2+), five neutral-loss states (none, NH3, H2O, NH3-H2O, and H2O-H2O), and three primary fragmentation types (a, b, y). For ETD spectra, additional fragmentation types (c, y, z + 1) were included. A peak was matched to a specific backbone ion type if the m/z discrepancy was less than 0.1 Da. Peaks that did not correspond to any recognized backbone ion type were classified as noise, and the absence of a corresponding backbone ion at a fragmentation site was noted as a missing site. The noise factor (\(n_f\)) was defined as the ratio of noise peaks to the total number of peaks within a spectrum. An amino acid was deemed to lack sufficient fragmentation data if the fragmentation sites both preceding and following it were absent. Peptide sequence coverage (PSC) was defined as the ratio of amino acids directly inferred from the spectral data to the total amino acids within the peptide. The PSM score, which evaluates the match quality between an observed spectrum and a theoretical peptide, was formulated as follows:
$$\:PSM\:score=\frac{PSC}{1+\left(k\times\left(1-PSC\right)+c\right)\times{n}_{f}}$$
where \(k\) and \(c\) are hyperparameters adjusted to modulate the impact of the noise factor, with default values of 1.0 and 0, respectively.
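The formula is a direct transcription into code (function name ours); the defaults match the stated values of k = 1.0 and c = 0.

```python
# PSM score: peptide sequence coverage penalized by the noise factor.

def psm_score(psc: float, n_f: float, k: float = 1.0, c: float = 0.0) -> float:
    return psc / (1.0 + (k * (1.0 - psc) + c) * n_f)

# A fully covered peptide in a noise-free spectrum scores 1.0;
# lower coverage and more noise both reduce the score.
print(psm_score(psc=1.0, n_f=0.0))
print(psm_score(psc=0.5, n_f=0.5))
```

Note how the noise penalty grows as coverage drops: with c = 0, a fully covered peptide is unaffected by noise, while poorly covered peptides are penalized more strongly.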
Evaluation metrics
Peptide recall, amino acid precision, and amino acid recall are employed to evaluate the performance of the π-xNovo model. Accuracy, sensitivity, specificity, and precision are utilized to assess the discriminative capability of π-xNovo-QC in relation to peptide predictions. The definitions and formulas for these metrics are as follows:
$$\:Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
$$\:Sensitivity=\frac{TP}{TP+FN}$$
$$\:Specificity=\frac{TN}{TN+FP}$$
$$\:Precision=\frac{TP}{TP+FP}$$
Where:
TP (True Positives): credible peptide predictions that are correctly classified as credible.
TN (True Negatives): non-credible predictions that are correctly classified as not credible.
FP (False Positives): non-credible predictions that are incorrectly classified as credible.
FN (False Negatives): credible predictions that are incorrectly classified as not credible.
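The four formulas above can be computed together from the confusion-matrix counts (function name and counts are illustrative):

```python
# Confusion-matrix metrics used to evaluate π-xNovo-QC.

def qc_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

print(qc_metrics(tp=80, tn=90, fp=10, fn=20))
```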
The framework of π-xNovo model
The π-xNovo model leverages the transformer architecture14 (Fig. 2a). The model begins by encoding the preprocessed spectrum, represented as \(\:{\{{s}_{j}=({m}_{j},{I}_{j})\}}_{j=1}^{L}\), where \(\:{m}_{j}\) denotes the m/z and \(\:{I}_{j}\) represents the intensity of each peak. This process involves mapping and aggregating these values to construct the initial spectral representation \(\:S\). A specialized sine mass encoder, \(\:f\), projects each peak’s m/z into a \(\:d\)-dimensional feature vector. The mapping function \(\:{f}_{i}\) for the \(\:i\)-th dimension of the feature vector is defined as follows:
$$\:{f}_{i}=\left\{\begin{array}{cc}\text{sin}\left({m}_{j}/\left(\frac{{r}_{min}}{2\pi\:}{\left(\frac{{\lambda\:}_{max}}{{r}_{min}}\right)}^{2i/d}\right)\right)&\:if\:i\le\:d/2\\\text{cos}\left({m}_{j}/\left(\frac{{r}_{min}}{2\pi\:}{\left(\frac{{\lambda\:}_{max}}{{r}_{min}}\right)}^{2i/d}\right)\right)&\:if\:i>d/2\end{array}\right.$$
where \(\:{\lambda\:}_{max}=\text{10,000}\) represents the maximal anticipated m/z value in the spectrum, and \(\:{r}_{min}=0.001\) signifies the finest achievable resolution within the spectrum. Additionally, a linear layer converts each peak’s intensity into a corresponding \(\:d\)-dimensional feature vector.
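A NumPy sketch of the sine mass encoder follows. It assumes the standard sinusoidal-encoding reading of the formula, in which each sin/cos pair shares one wavelength indexed by i over the first d/2 dimensions; the function name is ours.

```python
# Sinusoidal mass encoding with lambda_max = 10,000 and r_min = 0.001.
import numpy as np

def sine_mass_encoding(mz, d=512, lambda_max=10_000.0, r_min=0.001):
    mz = np.asarray(mz, dtype=float)[:, None]   # shape (L, 1)
    i = np.arange(d // 2)                        # one index per sin/cos pair
    # Wavelengths span r_min/(2*pi) up to ~lambda_max/(2*pi).
    wavelength = (r_min / (2 * np.pi)) * (lambda_max / r_min) ** (2 * i / d)
    return np.concatenate([np.sin(mz / wavelength),
                           np.cos(mz / wavelength)], axis=1)  # shape (L, d)

enc = sine_mass_encoding([175.119, 500.0], d=8)
print(enc.shape)  # (2, 8)
```

The geometric spread of wavelengths lets the encoder resolve both small mass differences (near r_min) and large ones (up to lambda_max) in a single vector.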
The MS/MS data enrichment process enhances the spectrum’s completeness by incorporating critical carboxyl-terminal and precursor information into the initial feature representation, \(\:S\). The feature vector for the carboxyl-terminal data is computed by adding a sine-encoded mass of 19.018 to a \(\:d\)-dimensional hidden vector, thereby improving the spectrum's biochemical fidelity. The precursor encoder constructs the feature by mapping and integrating the precursor mass and charge information. This is achieved by employing the same sine mass encoder to project the precursor mass into a \(\:d\)-dimensional feature vector and an embedding layer to map the charge information into a \(\:d\)-dimensional feature vector. The culmination of this enriched data integration is represented in the final spectral representation, \(\:\widehat{S}\), which is subsequently processed through a sophisticated nine-layer spectrum encoder (Fig. 2b).
The joint masking mechanism integrates masks into both the final spectral representation, \(\:\widehat{S}\), and the initial peptide representation, \(\:{D}^{0}\), thereby enhancing the model's feature extraction capabilities. During training, a multilayer perceptron processes \(\:\widehat{S}\) to generate two complementary sets of spectrum masks, \(\:{M}_{1}\) and \(\:{M}_{2}\). These masks are employed concurrently to optimize peptide prediction and improve the model's learning efficiency. The final optimization objective is defined by the following loss function:
$$\:loss=0.5\times\:(CE\left(D\left({A}_{\le\:k1},\:\widehat{S},\:{M}_{1}\right),\:{A}_{K}\right)+CE\left(D\left({A}_{\le\:k2},\:\widehat{S},\:{M}_{2}\right),\:{A}_{K}\right))$$
In this formulation, \(\:{A}_{\le\:k1}\) and \(\:{A}_{\le\:k2}\) represent the amino acid sequences predicted under the influence of the spectrum masks \(\:{M}_{1}\) and \(\:{M}_{2}\), respectively. The function \(\:CE\) denotes the cross-entropy loss, \(\:D\:\)represents the decoder translating spectral data into peptide sequences, and \(\:{A}_{K}\) corresponds to the true peptide label. This approach ensures that each component of the masking mechanism effectively contributes to minimizing the prediction error, thereby enhancing the model's predictive accuracy.
The amino acid encoder integrates both the content and positional data of amino acids into the initial peptide representation, \(\:{D}^{0}\), through an embedding layer. This layer projects the amino acids into a \(\:d\)-dimensional feature vector, capturing both the chemical properties and the sequence positioning of the amino acids. Concurrently, the sine encoder, also used for processing the spectrum, projects the positional information of the amino acids into a \(\:d\)-dimensional feature vector with predefined parameters \(\:{\lambda\:}_{max}=\text{10,000}\) and \(\:{r}_{min}=1\). The precursor is incorporated into \(\:{D}^{0}\) as a start marker, facilitating accurate peptide sequence modeling. To enhance feature extraction capabilities, a random mask mechanism is applied to \(\:{D}^{0}\)(Fig. 2c). This mechanism randomly masks segments of the input data, simulating various sequence scenarios and thereby improving the model’s generalization capabilities. During testing, the previously predicted amino acids are sequentially fed into the decoder to predict the subsequent amino acid. A greedy search algorithm selects the highest-scoring peptide sequence from all possible candidates, thereby optimizing prediction accuracy. Additionally, relative position encoding is implemented in both the spectrum encoder and the decoder to maintain a stable context for model training, mitigating overfitting and enhancing the model's consistency and learning efficiency36.
Interpretable matrix calculation
In the model's decoder, the encoder-decoder attention mechanism, implemented via a multi-head attention module, enables focused interaction between each amino acid and all positions within the spectrum representation \(\:\widehat{S}\). This dynamic focusing capability allows the decoder to selectively highlight peaks most relevant to the amino acid under consideration, leveraging comprehensive spectral information. The attention mechanism operates in a multi-head format, facilitating parallel processing across multiple representational subspaces. Each attention head applies distinct linear transformations to both spectrum and peptide representations, subsequently computing independent cross-correlation matrices between them. Specifically, the operation for the \(\:i\)-th head in layer \(\:l\) is defined as:
$$\:{head}_{i}^{l}\:score=Attention\:score({D}^{{\prime\:}l}{W}_{i}^{lQ},\widehat{S}{W}_{i}^{lK})$$
where \(\:{W}_{i}^{lQ}\) and \(\:{W}_{i}^{lK}\) denote the learnable weight matrices for the query and key components of the \(\:i\)-th head in layer \(\:l\), respectively. The intermediate representation \(\:{D}^{{\prime\:}l}\) is generated as follows:
$$\:{D}^{{\prime\:}l}=Attention({D}^{l-1}{W}^{Q},{D}^{l-1}{W}^{K},{D}^{l-1}{W}^{V}\:)$$
where \(\:{W}^{Q},{W}^{K}\) and \(\:{W}^{V}\) are the corresponding learnable matrices for the previous layer. The attention function and attention score function are formulated as:
$$\:Attention\left(Q,K,V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$
$$\:Attention\:score\left(Q,K\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)$$
where \(\:{d}_{k}\) represents the dimensionality of the key vectors, set to 512. To assess the interpretability of amino acid predictions, the model aggregates an interpretable matrix from the 72 cross-correlation matrices (one per head across all decoder layers), using either a maximum-value or average-value method. This interpretable matrix, essential for the QC's evaluation of amino acid credibility, is mathematically represented as:
$$\:Interpretable\:Matrix=Max\left(Concat\left({head}_{i}^{l}\:score\right)\right)$$
or
$$\:Interpretable\:Matrix=Mean\left(Concat\left({head}_{i}^{l}\:score\right)\right).$$
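The score computation and its aggregation can be sketched in NumPy. Random matrices stand in for the learned query/key projections, so only the shapes and the max/mean reduction over the 72 head matrices reflect the description above.

```python
# Attention-score matrices and their aggregation into an interpretable matrix.
import numpy as np

def attention_score(Q, K):
    """softmax(Q K^T / sqrt(d_k)): one cross-correlation matrix per head."""
    d_k = K.shape[-1]
    logits = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k, n_heads, pep_len, spec_len = 64, 72, 6, 20
Q = rng.normal(size=(n_heads, pep_len, d_k))   # stand-in for D'^l W_i^lQ
K = rng.normal(size=(n_heads, spec_len, d_k))  # stand-in for S_hat W_i^lK

head_scores = attention_score(Q, K)            # (72, pep_len, spec_len)
interpretable_max = head_scores.max(axis=0)    # Max over the 72 matrices
interpretable_mean = head_scores.mean(axis=0)  # or Mean over them
print(interpretable_max.shape)  # (6, 20)
```

Each row of the resulting matrix shows which spectrum peaks the model attended to when predicting the corresponding amino acid.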
Theoretical backbone ion set generation rules
π-xNovo-QC rigorously evaluates the plausibility of amino acid predictions by analyzing theoretical backbone ions corresponding to amino acids within a defined radius, 𝑅, around the 𝑁-th amino acid (ranging from 𝑁−𝑅 to 𝑁+𝑅). Detailed methodologies for these calculations are provided in Supplementary Fig. 22. The system adjusts the dependency radius 𝑅 in response to differential backbone ion loss observed under various experimental conditions, such as the notable loss of b1 and b2 ions in HCD spectra. This adjustment accounts for positional amino acids (PosAA). Ion loss considerations for both N-terminal and C-terminal segments of peptides are uniformly applied. Each fragmentation site is scrutinized for 30 potential types of backbone ions, incorporating two charge states (1+ and 2+), five neutral-loss states (none, NH3, H2O, H2O-NH3, and H2O-H2O), and three primary fragmentation types (a, b, y) pertinent to HCD spectra. For ETD spectra, additional fragmentation types (c, y, z + 1) are considered.
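Theoretical backbone ion generation can be sketched as follows. This is a self-contained illustration covering only singly charged b/y ions without neutral losses (a small subset of the 30 ion types above); the residue-mass table is truncated and all names are ours.

```python
# Theoretical b/y ion m/z values for a peptide, from monoisotopic
# residue masses (subset shown; singly charged, no neutral losses).

MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496}
H2O, PROTON = 18.01056, 1.00728

def by_ions(peptide: str) -> dict:
    ions = {}
    prefix = 0.0
    total = sum(MASS[aa] for aa in peptide) + H2O  # neutral peptide mass
    for i, aa in enumerate(peptide[:-1], start=1):
        prefix += MASS[aa]
        ions[f"b{i}"] = prefix + PROTON                        # b ion, 1+
        ions[f"y{len(peptide) - i}"] = total - prefix + PROTON # complementary y, 1+
    return ions

ions = by_ions("PAK")
print(round(ions["b2"], 3), round(ions["y1"], 3))
```

In the full scheme, each of these sites would additionally be expanded across charge 2+, the neutral-loss variants, and the remaining ion series (a for HCD; c and z + 1 for ETD).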
The π-xNovo-prob QC system based on amino acid probabilities
The π-xNovo-prob QC system leverages the average of all amino acid probabilities generated by the π-xNovo model to derive a confidence score for each peptide. By setting an appropriate threshold, this system effectively classifies predicted peptides based on their confidence scores, thereby enhancing the model’s discriminative capacity.
Training strategy
By default, the training configuration for the π-xNovo model includes a batch size of 128, an embedding dimension (d) of 512, a feed-forward network dimension of 1024, and 8 attention heads, over 30 epochs. The learning rate is managed through a combined approach of linear warm-up followed by cosine annealing, defined as:
$$\:lr=bas{e}_{lr}\times\:\text{min}\left(\frac{i}{0.5N},\:0.5\times\:\text{cos}\left(\frac{\pi\:\times\:i}{\alpha\:\times\:N}\right)\right),\:i=0,1,\dots\:,N$$
where \(\:N\) represents the total number of iterations, \(\:bas{e}_{lr}\)(the peak learning rate) is set at 0.0005, and \(\:\alpha\:\) is an empirically determined factor set at 1.1, ensuring the model retains its learning efficiency in the latter stages of training. Detailed specifications and parameter settings for the π-xNovo model across various experimental setups are provided in Supplementary Table 8. Each model’s training and testing procedures were conducted on a single Tesla V100 GPU equipped with 32GB of memory.
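The schedule transcribes directly into code (function name ours): early in training the linear warm-up term is smaller and therefore selected by the min, and later the cosine term takes over.

```python
# Learning-rate schedule: linear warm-up combined with cosine annealing
# via min(...), with base_lr = 5e-4 and alpha = 1.1 as stated.
import math

def lr_at(i: int, N: int, base_lr: float = 5e-4, alpha: float = 1.1) -> float:
    warmup = i / (0.5 * N)                         # linear ramp over first half
    cosine = 0.5 * math.cos(math.pi * i / (alpha * N))
    return base_lr * min(warmup, cosine)

# Early iteration: warm-up term dominates (0.02 vs ~0.5 for the cosine term).
print(lr_at(10, N=1000))
```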