Datasets
Nine-species dataset: We acquired the nine-species dataset from Tran et al.18, comprising approximately 1.52 million PSMs generated by HCD following trypsin digestion. PSMs were identified by database search and filtered to a 1% PSM-level FDR to ensure high data integrity. Modification settings included carbamidomethylation of cysteine (C) residues as a fixed modification, alongside oxidation of methionine (M) and deamidation of asparagine (N) and glutamine (Q) as variable modifications. Isoleucine (I) and leucine (L) residues remained indistinguishable in this analysis. To evaluate the model’s performance, we implemented a leave-one-out cross-validation framework, designating one species as the test set and the remainder as the training set. Additional details about the dataset are documented in Supplementary Table 3.
GraphNovo dataset: We acquired the GraphNovo dataset from Mao et al.21, which is structured into two distinct parts: the training set includes samples from HeLa and Cerebellum, totaling 1,659,763 PSMs, while the test set encompasses samples from A. thaliana, C. elegans, and E. coli, each contributing 12,500 PSMs. All spectra were generated using HCD after trypsin digestion. The PSMs were identified through database searches and validated to a stringent 1% PSM-level FDR. The protocol set carbamidomethylation of cysteine (C) residues as a fixed modification and oxidation of methionine (M) as a variable modification. Isoleucine (I) and leucine (L) residues were not differentiated in this analysis. Additional details about the dataset are provided in Supplementary Table 3.
MassIVE-KB dataset: We acquired the MassIVE-KB dataset26, which encompasses 30 million high-quality PSMs filtered to a near-zero (~0%) PSM-level FDR, with each charge and peptide combination limited to no more than 100 PSMs. Generated predominantly through trypsin digestion, the dataset incorporates carbamidomethylation of cysteine (C) residues as a fixed modification and includes seven variable modifications: oxidation of methionine (M), deamidation of asparagine (N) and glutamine (Q), N-terminal acetylation, N-terminal carbamylation, N-terminal NH3 loss, and the combination of N-terminal carbamylation and NH3 loss. To assess the π-xNovo model's performance, we utilized the A. thaliana, C. elegans, and E. coli data from the GraphNovo dataset as test sets. The overlap in PSMs between MassIVE-KB and these species was 31, 130, and 6, respectively, an average overlap of no more than 0.42% per test set; this minimal overlap is well within experimental error and does not affect our conclusions.
DHP dataset: We downloaded 2,460 ‘.mzML’ files from the deep human proteome (DHP) sequencing project (ProteomeXchange ID: PXD024364) from Sinitcyn et al.27, encompassing approximately 161 million spectra. These spectra were generated from six human cell lines: hES1, HeLa S3, HepG2, GM12878, K562, and HUVEC. Samples underwent enzymatic digestion with six parallel enzymes (LysC, LysN, AspN, chymotrypsin, GluC, and trypsin) and were analyzed using three fragmentation methods: HCD, collision-induced dissociation (CID), and ETD. Spectra from electron-transfer/higher-energy collisional dissociation (ETHCD) were classified as ETD. From the ‘msms.txt’ files for each cell line, we extracted 16,746,362 PSMs for HCD, 2,389,865 for ETD, and 1,482,573 for ETHCD. To evaluate the π-xNovo model, we used test sets of 10,000 randomly selected spectra from each cell line and enzyme combination, yielding 360,000 PSMs for HCD and 300,000 PSMs for ETD. Additionally, we retrieved 174,945 PSMs from ‘msms.txt’ files contained within ‘txt.zip’ that are not associated with downloadable ‘.mzML’ files, bringing the total PSMs identified in the project to 20,793,745. The fixed modification was carbamidomethylation of cysteine (C) residues; variable modifications included oxidation of methionine (M) and N-terminal acetylation. We also acquired ‘proteins.fa’ for these cell lines from ‘variationExtraction.zip’. By integrating these data with UniProt canonical (version 2017_02; UP000005640_9606), UniProt isoform (UP000005640_9606_additional), Ensembl canonical (version 86; GRCh38.pep.all), and Ensembl isoform (GRCh38.pep.abinitio), we established a comprehensive protein database comprising 143,897 protein sequences.
Protein identification
The π-xNovo models, trained specifically on HCD and ETD spectra, were applied to approximately 140 million previously unidentified spectra. We retained PSMs ranging from 5 to 40 amino acids in length. Using π-xNovo-QC, we identified a total of 18,982,605 high-confidence PSMs. Subsequent sequence alignment using BLAST33 (version 2.5.0) compared peptides derived from traditional database searches with those from de novo peptide sequencing. This analysis yielded 10,352,965 matches for peptides identified through database searches and 16,122,348 for those obtained via de novo sequencing. Among the peptides identified through database searches, 1,408,794 matched known protein sequences, with 1,353,978 achieving perfect identity (100% matched). De novo sequencing identified 2,259,435 peptides matching known protein sequences, with 735,719 achieving perfect identity, highlighting the capability of de novo methods to expand the proteomic landscape.
Protein coverage calculation
Utilizing the designated protein ID, we retrieved all corresponding peptides with 100% matching percentage. Protein sequence coverage was calculated as the proportion of unique amino acids within the retrieved peptides relative to the total amino acid count of the full protein sequence. Each amino acid was counted only once towards coverage, regardless of its recurrence across different peptides.
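The coverage calculation above can be sketched as follows (a hypothetical helper, not the authors' code): every residue position covered by at least one 100%-identity peptide is counted once, regardless of how many peptides cover it.

```python
# Sketch of the protein-coverage calculation: unique covered residue
# positions divided by the protein length.

def protein_coverage(protein_seq: str, peptides: list[str]) -> float:
    covered = set()
    for pep in peptides:
        start = protein_seq.find(pep)
        while start != -1:  # count every occurrence of the peptide
            covered.update(range(start, start + len(pep)))
            start = protein_seq.find(pep, start + 1)
    return len(covered) / len(protein_seq)

# Overlapping peptides cover 8 of 10 residues -> coverage 0.8
print(protein_coverage("MKTAYIAKQR", ["MKTAY", "AYIAK"]))  # 0.8
```

Because positions are collected in a set, overlapping peptides do not inflate the coverage figure.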
Single amino acid polymorphism (SAP) analysis
We retrieved data on potential positions for nonsynonymous mutations and their corresponding amino acid substitutions from the ‘proteins.fa’ file for each cell line. We then cross-referenced peptides from the alignment results that exhibited single amino acid variations with the nonsynonymous mutation data. This rigorous approach identified a distinct subset of peptides with mutations, thereby providing valuable insights into the mutation landscape of the proteins analyzed.
Exon-skipping splicing event analysis
We devised a methodology to identify and analyze potential exons from the transcript's cDNA (GRCh38.cdna.all.fa), utilizing annotations from the hg38 reference human genome (GRCh38111.gff3). Our objective was to construct a comprehensive database of all splicing sequences, focusing on splicing between two exons. The methodology comprised the following steps:
- Identification of Exons: Using the transcript ID, we cataloged all exons, detailing their start and end positions, and extracted each exon sequence.
- Exon Splicing: We concatenated non-contiguous exons, limiting our analysis to connections between two exons, and performed a three-frame translation to identify all possible open reading frames (ORFs).
- Translation to Peptide Sequences: Utilizing the Biopython library34, we translated all identified ORF nucleotide sequences into peptide sequences, preserving information about exon junctions (positioned precisely between two amino acids or at a specific amino acid), thereby establishing a database for splicing event peptides.
- Sequence Alignment: We employed BLAST33 (version 2.5.0) to align peptide sequences derived from both database searches and de novo peptide sequencing.
- Detection of Splicing Events: We scrutinized matches exhibiting a perfect identity percentage (100%) to verify the occurrence of peptides at exon junctions. Confirmation of peptides at these junctions signified successful detection of cross-exon splicing events.
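The splicing and translation steps above can be sketched as follows. This is an illustrative stand-in, not the authors' pipeline: it uses a plain codon table instead of Biopython, the example exon sequences are invented, and the junction bookkeeping (integer position means between two residues, fractional means inside a residue) is our reading of the description.

```python
# Sketch of exon concatenation, three-frame translation, and junction
# tracking for the splicing-event peptide database.
from itertools import product

# Standard codon table built from the canonical TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa
               for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}

def splice_and_translate(exon_a: str, exon_b: str):
    """Join two non-contiguous exons and translate in three frames,
    recording where the exon junction falls in each peptide."""
    joined = exon_a + exon_b
    junction_nt = len(exon_a)  # nucleotide index of the exon-exon boundary
    results = []
    for frame in range(3):
        codons = [joined[i:i + 3] for i in range(frame, len(joined) - 2, 3)]
        peptide = "".join(CODON_TABLE[c] for c in codons)
        # Integer -> junction lies between two residues; fractional ->
        # junction falls inside a specific residue.
        junction_aa = (junction_nt - frame) / 3
        results.append((frame, peptide, junction_aa))
    return results

for frame, pep, jaa in splice_and_translate("ATGGCT", "GGTTAA"):
    print(frame, pep, jaa)
```

Peptides that BLAST later matches at 100% identity across such a junction are the candidate exon-skipping events.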
Data preprocessing
All spectra underwent stringent preprocessing to ensure data quality and computational efficiency:
- Removal of Outliers: Peaks outside the m/z range of 50.0 Da to 2500.0 Da were excluded to eliminate anomalies and enhance analysis reliability.
- Removal of Low-Intensity Peaks: Peaks with intensities less than 10% of the highest peak intensity within the spectrum were excluded to focus on the most significant signals.
- Retention of Top 150 Peaks: Only the top 150 peaks by intensity were retained for each spectrum, optimizing data management and computational efficiency.
- Normalization of Intensity: The intensities of the retained peaks were normalized to a uniform scale to facilitate consistent comparative analysis across all samples.
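The four steps can be sketched in NumPy as follows. Function and parameter names are ours, and normalizing to unit base-peak intensity is an assumption, since the exact scale is not specified.

```python
# Sketch of the spectrum preprocessing pipeline described above.
import numpy as np

def preprocess_spectrum(mz, intensity, mz_range=(50.0, 2500.0),
                        min_rel=0.10, top_k=150):
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # 1. Remove outliers outside the allowed m/z window.
    keep = (mz >= mz_range[0]) & (mz <= mz_range[1])
    mz, intensity = mz[keep], intensity[keep]
    # 2. Remove peaks below the stated 10% of the base-peak intensity.
    if intensity.size:
        keep = intensity >= min_rel * intensity.max()
        mz, intensity = mz[keep], intensity[keep]
    # 3. Retain only the top_k most intense peaks.
    if intensity.size > top_k:
        order = np.sort(np.argsort(intensity)[-top_k:])  # keep m/z order
        mz, intensity = mz[order], intensity[order]
    # 4. Normalize intensities (unit base peak, an assumption).
    if intensity.size:
        intensity = intensity / intensity.max()
    return mz, intensity
```

The steps are order-dependent: the relative-intensity filter is applied before the top-150 cut, so the threshold is computed from the original base peak.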
To rigorously assess noise presence and backbone ion absence in spectral data, we used the Pyteomics framework35 to categorize spectral peaks systematically. For each fragmentation site, we considered a comprehensive set of 30 potential backbone ion types. Specifically, for HCD spectra, this categorization spans two charge states (1+ and 2+), five neutral-loss states (none, NH3, H2O, NH3-H2O, and H2O-H2O), and three primary fragmentation types (a, b, y). For ETD spectra, additional fragmentation types (c, y, z + 1) were included. A peak was matched to a specific backbone ion type if the m/z discrepancy was less than 0.1 Da. Peaks that did not correspond to any recognized backbone ion type were classified as noise, and the absence of a corresponding backbone ion at a fragmentation site was noted as a missing site. The noise factor (\(n_f\)) was defined as the ratio of noise peaks to the total number of peaks within a spectrum. An amino acid was deemed to lack sufficient fragmentation data if the fragmentation sites both preceding and following it were absent. Peptide sequence coverage (PSC) was defined as the ratio of amino acids directly inferred from the spectral data to the total amino acids within the peptide. The PSM score, which evaluates the match quality between an observed spectrum and a theoretical peptide, was formulated as follows:
$$\:PSM\:score=\frac{PSC}{1+\left(k\times\left(1-PSC\right)+c\right)\times{n}_{f}}$$
where \(k\) and \(c\) are hyperparameters adjusted to modulate the impact of the noise factor, with default values of 1.0 and 0, respectively.
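The formula is a direct transcription into code (function name ours); the defaults match the stated values of k = 1.0 and c = 0.

```python
# PSM score: peptide sequence coverage penalized by the noise factor.

def psm_score(psc: float, n_f: float, k: float = 1.0, c: float = 0.0) -> float:
    return psc / (1.0 + (k * (1.0 - psc) + c) * n_f)

# A fully covered peptide in a noise-free spectrum scores 1.0;
# lower coverage and more noise both reduce the score.
print(psm_score(psc=1.0, n_f=0.0))
print(psm_score(psc=0.5, n_f=0.5))
```

Note how the noise penalty grows as coverage drops: with c = 0, a fully covered peptide is unaffected by noise, while poorly covered peptides are penalized more strongly.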
Evaluation metrics
Peptide recall, amino acid precision, and amino acid recall are employed to evaluate the performance of the π-xNovo model. Accuracy, sensitivity, specificity, and precision are utilized to assess the discriminative capability of π-xNovo-QC in relation to peptide predictions. The definitions and formulas for these metrics are as follows:
$$\:Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
$$\:Sensitivity=\frac{TP}{TP+FN}$$
$$\:Specificity=\frac{TN}{TN+FP}$$
$$\:Precision=\frac{TP}{TP+FP}$$
Where:
TP (True Positives): credible peptide predictions that are correctly classified as credible.
TN (True Negatives): non-credible predictions that are correctly classified as not credible.
FP (False Positives): non-credible predictions that are incorrectly classified as credible.
FN (False Negatives): credible predictions that are incorrectly classified as not credible.
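The four formulas above can be computed together from the confusion-matrix counts (function name and counts are illustrative):

```python
# Confusion-matrix metrics used to evaluate π-xNovo-QC.

def qc_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

print(qc_metrics(tp=80, tn=90, fp=10, fn=20))
```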
The framework of π-xNovo model
The π-xNovo model leverages the transformer architecture14 (Fig. 2a). The model begins by encoding the preprocessed spectrum, represented as \(\:{\{{s}_{j}=({m}_{j},{I}_{j})\}}_{j=1}^{L}\), where \(\:{m}_{j}\) denotes the m/z and \(\:{I}_{j}\) represents the intensity of each peak. This process involves mapping and aggregating these values to construct the initial spectral representation \(\:S\). A specialized sine mass encoder, \(\:f\), projects each peak’s m/z into a \(\:d\)-dimensional feature vector. The mapping function \(\:{f}_{i}\) for the \(\:i\)-th dimension of the feature vector is defined as follows:
$$\:{f}_{i}=\left\{\begin{array}{cc}\text{sin}\left({m}_{j}/\left(\frac{{r}_{min}}{2\pi\:}{\left(\frac{{\lambda\:}_{max}}{{r}_{min}}\right)}^{2i/d}\right)\right)&\:if\:i\le\:d/2\\\text{cos}\left({m}_{j}/\left(\frac{{r}_{min}}{2\pi\:}{\left(\frac{{\lambda\:}_{max}}{{r}_{min}}\right)}^{2i/d}\right)\right)&\:if\:i>d/2\end{array}\right.$$
where \(\:{\lambda\:}_{max}=\text{10,000}\) represents the maximal anticipated m/z value in the spectrum, and \(\:{r}_{min}=0.001\) signifies the finest achievable resolution within the spectrum. Additionally, a linear layer converts each peak’s intensity into a corresponding \(\:d\)-dimensional feature vector.
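A NumPy sketch of the sine mass encoder follows. It assumes the standard sinusoidal-encoding reading of the formula, in which each sin/cos pair shares one wavelength indexed by i over the first d/2 dimensions; the function name is ours.

```python
# Sinusoidal mass encoding with lambda_max = 10,000 and r_min = 0.001.
import numpy as np

def sine_mass_encoding(mz, d=512, lambda_max=10_000.0, r_min=0.001):
    mz = np.asarray(mz, dtype=float)[:, None]   # shape (L, 1)
    i = np.arange(d // 2)                        # one index per sin/cos pair
    # Wavelengths span r_min/(2*pi) up to ~lambda_max/(2*pi).
    wavelength = (r_min / (2 * np.pi)) * (lambda_max / r_min) ** (2 * i / d)
    return np.concatenate([np.sin(mz / wavelength),
                           np.cos(mz / wavelength)], axis=1)  # shape (L, d)

enc = sine_mass_encoding([175.119, 500.0], d=8)
print(enc.shape)  # (2, 8)
```

The geometric spread of wavelengths lets the encoder resolve both small mass differences (near r_min) and large ones (up to lambda_max) in a single vector.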
The MS/MS data enrichment process enhances the spectrum’s completeness by incorporating critical carboxyl-terminal and precursor information into the initial feature representation, \(\:S\). The feature vector for the carboxyl-terminal data is computed by adding a sine-encoded mass of 19.018 to a \(\:d\)-dimensional hidden vector, thereby improving the spectrum's biochemical fidelity. The precursor encoder constructs the feature by mapping and integrating the precursor mass and charge information. This is achieved by employing the same sine mass encoder to project the precursor mass into a \(\:d\)-dimensional feature vector and an embedding layer to map the charge information into a \(\:d\)-dimensional feature vector. The culmination of this enriched data integration is represented in the final spectral representation, \(\:\widehat{S}\), which is subsequently processed through a sophisticated nine-layer spectrum encoder (Fig. 2b).
The joint masking mechanism integrates masks into both the final spectral representation, \(\:\widehat{S}\), and the initial peptide representation, \(\:{D}^{0}\), thereby enhancing the model's feature extraction capabilities. During training, a multilayer perceptron processes \(\:\widehat{S}\) to generate two complementary sets of spectrum masks, \(\:{M}_{1}\) and \(\:{M}_{2}\). These masks are employed concurrently to optimize peptide prediction and improve the model's learning efficiency. The final optimization objective is defined by the following loss function:
$$\:loss=0.5\times\:(CE\left(D\left({A}_{\le\:k1},\:\widehat{S},\:{M}_{1}\right),\:{A}_{K}\right)+CE\left(D\left({A}_{\le\:k2},\:\widehat{S},\:{M}_{2}\right),\:{A}_{K}\right))$$
In this formulation, \(\:{A}_{\le\:k1}\) and \(\:{A}_{\le\:k2}\) represent the amino acid sequences predicted under the influence of the spectrum masks \(\:{M}_{1}\) and \(\:{M}_{2}\), respectively. The function \(\:CE\) denotes the cross-entropy loss, \(\:D\:\)represents the decoder translating spectral data into peptide sequences, and \(\:{A}_{K}\) corresponds to the true peptide label. This approach ensures that each component of the masking mechanism effectively contributes to minimizing the prediction error, thereby enhancing the model's predictive accuracy.
The amino acid encoder integrates both the content and positional data of amino acids into the initial peptide representation, \(\:{D}^{0}\), through an embedding layer. This layer projects the amino acids into a \(\:d\)-dimensional feature vector, capturing both the chemical properties and the sequence positioning of the amino acids. Concurrently, the sine encoder, also used for processing the spectrum, projects the positional information of the amino acids into a \(\:d\)-dimensional feature vector with predefined parameters \(\:{\lambda\:}_{max}=\text{10,000}\) and \(\:{r}_{min}=1\). The precursor is incorporated into \(\:{D}^{0}\) as a start marker, facilitating accurate peptide sequence modeling. To enhance feature extraction capabilities, a random mask mechanism is applied to \(\:{D}^{0}\)(Fig. 2c). This mechanism randomly masks segments of the input data, simulating various sequence scenarios and thereby improving the model’s generalization capabilities. During testing, the previously predicted amino acids are sequentially fed into the decoder to predict the subsequent amino acid. A greedy search algorithm selects the highest-scoring peptide sequence from all possible candidates, thereby optimizing prediction accuracy. Additionally, relative position encoding is implemented in both the spectrum encoder and the decoder to maintain a stable context for model training, mitigating overfitting and enhancing the model's consistency and learning efficiency36.
Interpretable matrix calculation
In the model's decoder, the encoder-decoder attention mechanism, implemented via a multi-head attention module, enables focused interaction between each amino acid and all positions within the spectrum representation \(\:\widehat{S}\). This dynamic focusing capability allows the decoder to selectively highlight peaks most relevant to the amino acid under consideration, leveraging comprehensive spectral information. The attention mechanism operates in a multi-head format, facilitating parallel processing across multiple representational subspaces. Each attention head applies distinct linear transformations to both spectrum and peptide representations, subsequently computing independent cross-correlation matrices between them. Specifically, the operation for the \(\:i\)-th head in layer \(\:l\) is defined as:
$$\:{head}_{i}^{l}\:score=Attention\:score({D}^{{\prime\:}l}{W}_{i}^{lQ},\widehat{S}{W}_{i}^{lK})$$
where \(\:{W}_{i}^{lQ}\) and \(\:{W}_{i}^{lK}\) denote the learnable weight matrices for the query and key components of the \(\:i\)-th head in layer \(\:l\), respectively. The intermediate representation \(\:{D}^{{\prime\:}l}\) is generated as follows:
$$\:{D}^{{\prime\:}l}=Attention({D}^{l-1}{W}^{Q},{D}^{l-1}{W}^{K},{D}^{l-1}{W}^{V}\:)$$
where \(\:{W}^{Q},{W}^{K}\) and \(\:{W}^{V}\) are the corresponding learnable matrices for the previous layer. The attention function and attention score function are formulated as:
$$\:Attention\left(Q,K,V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$
$$\:Attention\:score\left(Q,K\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)$$
where \(\:{d}_{k}\) represents the dimensionality of the key vectors, set to 512. To assess the interpretability of amino acid predictions, the model aggregates an interpretable matrix from the 72 cross-correlation matrices (one per head across all decoder layers), using either a maximum-value or average-value method. This interpretable matrix, essential for the QC's evaluation of amino acid credibility, is mathematically represented as:
$$\:Interpretable\:Matrix=Max\left(Concat\left({head}_{i}^{l}\:score\right)\right)$$
or
$$\:Interpretable\:Matrix=Mean\left(Concat\left({head}_{i}^{l}\:score\right)\right).$$
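The score computation and its aggregation can be sketched in NumPy. Random matrices stand in for the learned query/key projections, so only the shapes and the max/mean reduction over the 72 head matrices reflect the description above.

```python
# Attention-score matrices and their aggregation into an interpretable matrix.
import numpy as np

def attention_score(Q, K):
    """softmax(Q K^T / sqrt(d_k)): one cross-correlation matrix per head."""
    d_k = K.shape[-1]
    logits = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k, n_heads, pep_len, spec_len = 64, 72, 6, 20
Q = rng.normal(size=(n_heads, pep_len, d_k))   # stand-in for D'^l W_i^lQ
K = rng.normal(size=(n_heads, spec_len, d_k))  # stand-in for S_hat W_i^lK

head_scores = attention_score(Q, K)            # (72, pep_len, spec_len)
interpretable_max = head_scores.max(axis=0)    # Max over the 72 matrices
interpretable_mean = head_scores.mean(axis=0)  # or Mean over them
print(interpretable_max.shape)  # (6, 20)
```

Each row of the resulting matrix shows which spectrum peaks the model attended to when predicting the corresponding amino acid.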
Theoretical backbone ion set generation rules
π-xNovo-QC rigorously evaluates the plausibility of amino acid predictions by analyzing theoretical backbone ions corresponding to amino acids within a defined radius, 𝑅, around the 𝑁-th amino acid (ranging from 𝑁−𝑅 to 𝑁+𝑅). Detailed methodologies for these calculations are provided in Supplementary Fig. 22. The system adjusts the dependency radius 𝑅 in response to differential backbone ion loss observed under various experimental conditions, such as the notable loss of b1 and b2 ions in HCD spectra. This adjustment accounts for positional amino acids (PosAA). Ion loss considerations for both N-terminal and C-terminal segments of peptides are uniformly applied. Each fragmentation site is scrutinized for 30 potential types of backbone ions, incorporating two charge states (1+ and 2+), five neutral-loss states (none, NH3, H2O, H2O-NH3, and H2O-H2O), and three primary fragmentation types (a, b, y) pertinent to HCD spectra. For ETD spectra, additional fragmentation types (c, y, z + 1) are considered.
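Theoretical backbone ion generation can be sketched as follows. This is a self-contained illustration covering only singly charged b/y ions without neutral losses (a small subset of the 30 ion types above); the residue-mass table is truncated and all names are ours.

```python
# Theoretical b/y ion m/z values for a peptide, from monoisotopic
# residue masses (subset shown; singly charged, no neutral losses).

MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496}
H2O, PROTON = 18.01056, 1.00728

def by_ions(peptide: str) -> dict:
    ions = {}
    prefix = 0.0
    total = sum(MASS[aa] for aa in peptide) + H2O  # neutral peptide mass
    for i, aa in enumerate(peptide[:-1], start=1):
        prefix += MASS[aa]
        ions[f"b{i}"] = prefix + PROTON                        # b ion, 1+
        ions[f"y{len(peptide) - i}"] = total - prefix + PROTON # complementary y, 1+
    return ions

ions = by_ions("PAK")
print(round(ions["b2"], 3), round(ions["y1"], 3))
```

In the full scheme, each of these sites would additionally be expanded across charge 2+, the neutral-loss variants, and the remaining ion series (a for HCD; c and z + 1 for ETD).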
The π-xNovo-prob QC system based on amino acid probabilities
The π-xNovo-prob QC system leverages the average of all amino acid probabilities generated by the π-xNovo model to derive a confidence score for each peptide. By setting an appropriate threshold, this system effectively classifies predicted peptides based on their confidence scores, thereby enhancing the model’s discriminative capacity.
Training strategy
By default, the training configuration for the π-xNovo model includes a batch size of 128, an embedding dimension (d) of 512, a feed-forward network dimension of 1024, and 8 attention heads, over 30 epochs. The learning rate is managed through a combined approach of linear warm-up followed by cosine annealing, defined as:
$$\:lr=bas{e}_{lr}\times\:\text{min}\left(\frac{i}{0.5N},\:0.5\times\:\text{cos}\left(\frac{\pi\:\times\:i}{\alpha\:\times\:N}\right)\right),\:i=0,1,\dots\:,N$$
where \(\:N\) represents the total number of iterations, \(\:bas{e}_{lr}\)(the peak learning rate) is set at 0.0005, and \(\:\alpha\:\) is an empirically determined factor set at 1.1, ensuring the model retains its learning efficiency in the latter stages of training. Detailed specifications and parameter settings for the π-xNovo model across various experimental setups are provided in Supplementary Table 8. Each model’s training and testing procedures were conducted on a single Tesla V100 GPU equipped with 32GB of memory.
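The schedule transcribes directly into code (function name ours): early in training the linear warm-up term is smaller and therefore selected by the min, and later the cosine term takes over.

```python
# Learning-rate schedule: linear warm-up combined with cosine annealing
# via min(...), with base_lr = 5e-4 and alpha = 1.1 as stated.
import math

def lr_at(i: int, N: int, base_lr: float = 5e-4, alpha: float = 1.1) -> float:
    warmup = i / (0.5 * N)                         # linear ramp over first half
    cosine = 0.5 * math.cos(math.pi * i / (alpha * N))
    return base_lr * min(warmup, cosine)

# Early iteration: warm-up term dominates (0.02 vs ~0.5 for the cosine term).
print(lr_at(10, N=1000))
```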