Sequence tag prediction models learn from MS2 peak intensities and show high accuracy
In this section, the prefix and suffix tag prediction models (Methods) implemented in ionbot are evaluated on the testing set by the Area Under the ROC Curve (AUC) and the Average Precision (AP) computed from the Precision-Recall (PR) curve. All models show very high predictive accuracy, with suffix tag models (AUC=99.9/AP=98.2 for HCD and AUC=99.9/AP=99.4 for HCDTMT) performing better than prefix tag models (AUC=99.8/AP=93.1 for HCD and AUC=99.9/AP=99.2 for HCDTMT) (Supplementary Fig. 1). It is worth noting that TMT-trained models show the highest predictive accuracy, especially for the prefix tags.
Furthermore, scoring an HCD testing set with an HCDTMT model and vice versa substantially decreases predictive accuracy. For the models trained on HCD and evaluated on HCDTMT, the prefix model dropped to AUC=98.9 and AP=58.7, while the suffix model showed a slight decrease to AUC=99.8 and AP=95.6. Notably, for models trained on HCDTMT and evaluated on HCD, the prefix model's prediction performance decreased much further to AUC=87.7 and AP=7, with a smaller decrease for the suffix model (AUC=99.5/AP=92.2).
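To make the two evaluation metrics concrete, the following minimal Python sketch computes AUC and AP from binary labels and model scores. It is an illustration only and not the evaluation code used for ionbot; the function names and the toy data are hypothetical. The values reported above correspond to these quantities multiplied by 100.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """AP as the mean precision measured at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical toy data: 1 = true tag, 0 = false tag.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(toy_labels, toy_scores), average_precision(toy_labels, toy_scores))
```

AP is computed only at the ranks of the positives, so it tends to react more strongly than AUC when true tags are rare among the scored candidates. This may partly explain why AP drops much further than AUC in the cross-condition evaluation.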
To further evaluate the models, the true PSMs identified by an open search were analyzed. For each true PSM a prefix and suffix tag ranking was computed by scoring all tags with the corresponding predictive models. The highest rank between the true prefix and suffix tag (determined by the matched peptide) is then recorded as a metric for how well the predicted sequence tags can reduce the search space (Methods). The vast majority of these ranks were within the top-10, with many ranked first for one of the two models (Supplementary Fig. 2).
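The ranking metric can be sketched as follows. This is a simplified illustration under stated assumptions: tag scores are given as a plain dictionary per model, "highest rank" is interpreted as the numerically smallest (best) rank, and all names and values are hypothetical. The exact procedure is described in Methods.

```python
def tag_rank(tag_scores, true_tag):
    """1-based rank of true_tag when all tags are sorted by descending score."""
    true_score = tag_scores[true_tag]
    return 1 + sum(1 for s in tag_scores.values() if s > true_score)


def best_tag_rank(prefix_scores, true_prefix, suffix_scores, true_suffix):
    """Best (smallest) rank reached by either the true prefix or the true suffix tag."""
    return min(tag_rank(prefix_scores, true_prefix),
               tag_rank(suffix_scores, true_suffix))


# Hypothetical scores for one PSM: the suffix model ranks the true tag first.
prefix_scores = {"PEP": 0.2, "TID": 0.9, "EKR": 0.5}
suffix_scores = {"IDE": 0.8, "DEK": 0.3}
print(best_tag_rank(prefix_scores, "PEP", suffix_scores, "IDE"))
```

A PSM with a best rank inside the top-10 would be retained if only the ten best-scoring tags per model were used to restrict the candidate peptides.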
Expanding the search space is crucial; we recommend no longer using closed searches
Open searches can match peptidoforms never considered in closed searches. However, at the same time, a larger search space leads to higher scoring decoy matches, thereby potentially increasing the PSM score threshold required to maintain a 1% FDR 16. In this section we investigate the difference between a closed and open ionbot search for the five evaluation datasets (Methods).
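The effect of higher-scoring decoys on the score threshold can be illustrated with a basic target-decoy estimate. The sketch below assumes the simple estimator FDR = decoys / targets among accepted PSMs and ignores tied scores; it is not the procedure implemented in ionbot (Methods), and all scores are hypothetical.

```python
def score_threshold(psms, fdr=0.01):
    """Lowest score cutoff at which decoys / targets among accepted PSMs is <= fdr.

    psms is an iterable of (score, is_decoy) tuples. Returns None if no cutoff
    satisfies the requested FDR.
    """
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: -p[0]):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= fdr:
            cutoff = score
    return cutoff


# Hypothetical data: 300 target PSMs with scores 1..300.
targets = [(float(s), False) for s in range(1, 301)]
closed_search = targets + [(1.5, True)]
# The larger search space adds higher-scoring decoy matches.
open_search = targets + [(250.5, True), (120.5, True), (60.5, True), (1.5, True)]

print(score_threshold(closed_search), score_threshold(open_search))
```

In this toy example the three additional decoy matches raise the cutoff from 1 to 61, so every target PSM scoring below 61 is no longer accepted at 1% FDR.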
Our findings confirm previous research showing that open searches considerably increase proteome coverage. The increase in PSM identifications is up to 56% for HEK239 (Fig. 1a), and even 74% for TMTCPTAC (Supplementary Fig. 3). At the unique peptide level, identification gains go up to 20% (HEK239). It is worth noting that this overall increase in the number of identifications will play an important role in accurate downstream protein inference and quantification, as well as in increasing the power of subsequent differential analyses.
Counting PSM and peptide identifications does not reveal all the differences between a closed and open search. We found that a substantial number of closed search identifications are no longer called in the corresponding open search. This is 11% of the closed search identifications for HEK239 (Fig. 1b) and goes up to 21% for the Breast dataset (Supplementary Fig. 5). Furthermore, many of these ‘lost’ matches are overruled by a better match in the open search, as indicated in the figures. It is likely that most of these overruled peptide matches are incorrect and have been forced upon closed search identifications due to the absence of the otherwise higher scoring, true peptide 17.
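The bookkeeping behind this comparison can be sketched as below. The sketch assumes each search result is reduced to a mapping from spectrum identifier to the peptidoform accepted at 1% FDR; the category names, identifiers and peptidoform strings are hypothetical.

```python
def compare_searches(closed, open_search):
    """Classify closed-search identifications by their fate in the open search.

    Both arguments map a spectrum identifier to the accepted peptidoform.
    """
    kept, overruled, dropped = [], [], []
    for spectrum, peptidoform in closed.items():
        if spectrum not in open_search:
            dropped.append(spectrum)        # no accepted match in the open search
        elif open_search[spectrum] == peptidoform:
            kept.append(spectrum)           # same match in both searches
        else:
            overruled.append(spectrum)      # replaced by a different match
    gained = [s for s in open_search if s not in closed]
    lost_fraction = (len(overruled) + len(dropped)) / len(closed)
    return {"kept": kept, "overruled": overruled, "dropped": dropped,
            "gained": gained, "lost_fraction": lost_fraction}


# Hypothetical results for four spectra.
closed = {"s1": "PEPTIDEK", "s2": "LESSERK", "s3": "ANOTHERK"}
open_search = {"s1": "PEPTIDEK", "s2": "LESSERK+Oxidation", "s4": "NEWPEPK"}
result = compare_searches(closed, open_search)
print(result)
```

Here `lost_fraction` corresponds to the share of closed search identifications that are no longer called, and `overruled` to the subset replaced by a different match in the open search.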
Notably, the majority of PSMs gained in open searches are explained by the wide (7.5Da) precursor mass error tolerance that ionbot allows for matches without unexpected modification. These precursor errors show a periodic pattern at 1Da intervals (Supplementary Fig. 6). It is therefore possible that incorrect peak picking at the isotope level accounts for these.
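A periodic pattern of this kind can be checked by expressing each precursor mass error as a whole number of isotope steps plus a small residual. The sketch below assumes a spacing of 1.003355 Da (the mass difference between 13C and 12C), which is a choice made for this illustration and not a value stated in the text; the example errors are hypothetical.

```python
C13_C12_DELTA = 1.003355  # Da; assumed isotope spacing for this illustration


def isotope_offset(mass_error_da):
    """Split a precursor mass error into (isotope steps, residual error in Da)."""
    steps = round(mass_error_da / C13_C12_DELTA)
    residual = mass_error_da - steps * C13_C12_DELTA
    return steps, residual


# Hypothetical precursor mass errors in Da.
for error in (2.0071, -1.0030, 0.0004):
    steps, residual = isotope_offset(error)
    print(f"{error:+.4f} Da -> {steps:+d} isotope step(s), residual {residual:+.4f} Da")
```

If incorrect isotope peak picking is indeed the cause, the residuals should cluster tightly around zero while the step counts stay small integers within the 7.5Da window.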
Prediction models trained on specific experimental conditions improve identification
For each predictive model implemented in ionbot (tag models, MS2PIP and DeepLC), there is a version trained on unlabeled HCD data and a version trained on TMT-labeled HCD data. In this section we apply ionbot with TMT-specific prediction models to the non-TMT-labeled evaluation datasets. Similarly, we apply ionbot without TMT-specific models to the TMTCPTAC dataset.
We observed a 19% decrease in PSM and a 16% decrease in peptide identifications when employing non-optimal predictive models in HEK239 (Fig. 1a). For the other datasets the loss in PSMs amounts to 16% for CD8T, 18% for Brain, 26% for Breast, and 16% for TMTCPTAC. At the peptide level similar losses are observed, with a loss of 13% for CD8T, 17% for Brain, 22% for Breast, and 20% for TMTCPTAC (Supplementary Fig. 3-4).
Predicted retention time and fragment ion intensities provide decisive PSM information
In this section we investigate the relevance of the DeepLC retention time prediction (RT-pred-error) and MS2PIP peak intensity prediction (intensity-correlation) features in the ionbot PSM scoring function (Methods). Grouping PSMs by peptidoform (Methods) to compute the corrected observed retention time, from which RT-pred-error is derived, clearly reduced the long elution time windows of peptidoforms identified by multiple spectra (Fig. 1d-e). This is especially true for peptides at the end of an LC run, where the issue can be even more problematic for the RT-pred-error feature.
Comparing open searches with and without the RT-pred-error feature in the PSM scoring function showed that consistently more PSMs were identified when the feature is added. At first, this gain appears to be relatively small, at 3.2% for HEK239 (Fig. 1a) and 2.4% for CD8T, 2.9% for Brain, 7.2% for Breast, and 1% for TMTCPTAC (Supplementary Fig. 1). It is worth noting, however, that the vast majority of true PSMs are (also) confirmed by the other sources of matching information, which leaves the retention time feature to correct only ambiguous situations that cannot be distinguished by any of the other sources. We found that PSM identifications unique to the search not using RT-pred-error show high retention time error in general, and that many PSMs that were overruled when adding the feature show high prediction error as well (Fig. 1f, Supplementary Fig. 8).
Similarly, we compared open searches with and without using the intensity-correlation feature in the scoring function. The latter search also does not use this correlation information as a biased PSM scoring function. We saw an increase in PSMs (6%) identified when adding the feature for HEK239 (Fig. 1a). For the other datasets the gain is 5% for CD8T, 6% for Brain and Breast, and 8% for TMTCPTAC (Supplementary Fig. 1). Yet here again, this information only gains importance in ambiguous situations that cannot be distinguished by any of the other sources. For HEK239, identifications called only by using the intensity-correlation feature show high correlations (Fig. 1g), while PSMs that are eliminated when the correlation feature is used show low overall correlations (Fig. 1i). Also, for overruled matches when intensity-correlation is used, the difference in correlation can be large, even though the vast majority shows only small differences in the higher correlation range (Fig. 1h). In these cases, it becomes difficult to decide on the correct match based on correlation, and other matching information is required to decide on the true match. The same conclusions were drawn for the other evaluation datasets (Supplementary Fig. 9).
We repeated the same experiment with both RT-pred-error and intensity-correlation omitted from the scoring function. Compared to this search, using both features resulted in a much more substantial increase in the number of PSM identifications, with 11% for HEK239 (Fig. 1a), 9% for CD8T, 11% for Brain, 16% for Breast, and 10% for TMTCPTAC (Supplementary Fig. 1). At the peptide level the gains amount to 9% for CD8T, 12% for HEK239, 14% for Brain, 12% for Breast, and 15% for TMTCPTAC (Supplementary Fig. 2).
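As an illustration of what an intensity-correlation value expresses, the sketch below computes a Pearson correlation between predicted and observed fragment ion intensities. The use of Pearson correlation, the function name and the intensity values are assumptions made for this example; the feature as implemented in ionbot is defined in Methods.

```python
import math


def pearson(predicted, observed):
    """Pearson correlation between two equally long intensity vectors."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


# Hypothetical normalized b- and y-ion intensities for two candidate matches
# competing for the same spectrum.
observed = [0.10, 0.80, 0.30, 0.05, 0.60]
predicted_a = [0.12, 0.75, 0.35, 0.02, 0.55]  # pattern close to the spectrum
predicted_b = [0.70, 0.10, 0.05, 0.60, 0.20]  # pattern unlike the spectrum

print(pearson(predicted_a, observed), pearson(predicted_b, observed))
```

When two candidates both correlate highly with the observed pattern, the difference between them is small, which is the situation in which other matching information has to decide.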
Considering retention time error and intensity correlation in the PSM scoring function not only increases the number of identifications, but also corrects and overrules incorrect matches based on the additional matching information that becomes available. For instance, for HEK239, 2.8% of the PSMs identified omitting both features were overruled by a better match when using them (Fig. 1c). Similar results were found for the other datasets (Supplementary Fig. 10).
Entrapment peptides confirm accuracy and stability of the ionbot open search FDR estimates
For the FDR estimates to be meaningful, the ionbot PSM scoring function should treat false matches against the decoy and target database equally, i.e., decoy matches should be representative random matches. For ionbot, this means that the PSM scoring function learned from experimental data should not be biased towards favoring matches against the target database.
To estimate a potential matching bias, we adopted the entrapment peptides approach (Methods). If a bias exists, we should observe more than the expected number of matches against the entrapment compared to the decoy database. In our experiment, there are about 10% more decoy peptides compared to entrapment peptides, so we expect to see this difference in the data. For the CD8T and HEK239 datasets, we observed about 6.2% and 6.6% more decoy matches respectively (Supplementary Fig. 11). As this is well below 10%, we conclude that the accuracy of the FDR estimates is high, implying that the ionbot PSM scoring function is not biased towards matches against the target database.
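The comparison above is simple arithmetic on two ratios, sketched below. All counts are hypothetical placeholders chosen only to reproduce the order of magnitude of the percentages in the text; they are not taken from the experiment.

```python
def relative_excess(n_decoy, n_entrapment):
    """Fraction by which the decoy count exceeds the entrapment count."""
    return n_decoy / n_entrapment - 1.0


# Hypothetical database sizes: about 10% more decoy than entrapment peptides.
expected = relative_excess(110_000, 100_000)

# Hypothetical match counts giving roughly 6.6% more decoy matches.
observed = relative_excess(1_066, 1_000)

print(f"expected excess {expected:.1%}, observed excess {observed:.1%}")
```

The observed excess of decoy over entrapment matches is then read against the excess expected from the database sizes alone.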
The ionbot engine compares favorably to other state-of-the-art open modification engines
Many open search engines exist, but few can produce sensitive identification results for large datasets that contain hundreds of thousands of spectra, mainly due to computational limitations. Two recent engines stand out in terms of performance: MSFragger and open-pFind (Methods).
At the PSM level, MSFragger is slightly more sensitive than open-pFind (Fig. 2a, Supplementary Fig. 12), mainly due to the 7.5Da wide error matches (data not shown) that are considered only in ionbot and MSFragger. At the peptide level, these differences become much smaller, with no obvious ranking for the identification engines. However, plotting PSM and peptide identification overlap reveals a notable level of disagreement between the search engines (Fig. 2b, Supplementary Fig. 13).
To obtain more insight into this disagreement, we looked at the intensity-correlation computed for PSMs uniquely identified by one of the engines. To avoid discussion about the unknown effect of specific modifications on the peak intensity pattern, we limited this investigation to identifications without an unexpected modification. We found that many identifications unique to open-pFind and/or MSFragger are questionable. For HEK239 (Fig. 2c), when we look at the PSMs unique to ionbot, the 25% with the lowest intensity-correlation still show correlations within [0.52,0.77] (excluding outliers). For the PSMs unique to open-pFind and MSFragger this interval is [0.12,0.58] and [0.15,0.57] respectively. The same observations were made for the other datasets, most extremely for TMTCPTAC, where more than 25% of the matches in open-pFind and MSFragger show a correlation below 0.59 and 0.63 respectively, whereas for ionbot all matches (except very few outliers) have correlations higher than 0.68 (Supplementary Fig. 14).
For PSMs identified by just two engines, the intersection between ionbot and open-pFind shows the highest intensity-correlation values, while the intersection between open-pFind and MSFragger clearly shows the lowest correlations, with 25% of these identifications having a correlation within [0.15,0.61] for the HEK239 dataset (Fig. 2c).
For the RT-pred-error information these differences are much smaller (Supplementary Fig. 15). For the intersections there are no obvious differences. But for the matches unique to each engine, the median retention time error for ionbot tends to be half of the median retention time error observed in the other engines.
Identification sensitivity is substantially increased by considering lower ranked, highly plausible co-eluting matches
To determine the best match for an MS2 spectrum, ionbot learns the PSM scoring function from the candidate match set and then selects the first-ranked match for each spectrum based on the computed scores. Next, a more accurate PSM score is computed from these first-ranked matches, and the statistical significance is determined for these first-ranked PSM scores (Methods).
Nevertheless, the candidate match set explicitly contains multiple candidates for many spectra (Methods). In this section we investigate computing the statistical significance of scores from all PSMs in the candidate match set, which can then result in multiple candidate peptides passing the 1% FDR threshold for a given spectrum. We found that even though the FDR threshold does not impose a limit on the number of possible matches, the vast majority of spectra with multiple identified matches had just two (Supplementary Fig. 16). The maximum number of different matches observed for an MS2 spectrum was six.
Next, we focused on those MS2 spectra that had exactly two matches passing the 1% FDR threshold. For each such spectrum, we computed the edit distance (Levenshtein distance) between the two matched peptide sequences. Fig. 3a plots these edit distances for HEK239 and reveals a clear bimodal distribution, with one distribution tightly centered around distance 1. The exact same observation was made for the other datasets (Supplementary Fig. 17). PSMs in this distribution are examples of highly similar peptide matches that either co-elute, or that cannot be distinguished using the available matching information. Table 1 shows representative examples of PSMs with edit distance 1 (IDs 1-5). For instance, for the spectrum with ID=1 there is a difference of one amino acid between the two peptide matches, with the difference in mass compensated by a methylation.
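For reference, the edit distance used here can be computed with the standard dynamic programming recurrence. The sketch below is a generic implementation and not the code used in the study; the peptide sequences are hypothetical examples.

```python
def levenshtein(a, b):
    """Minimum number of single-residue insertions, deletions or substitutions
    needed to turn sequence a into sequence b."""
    previous = list(range(len(b) + 1))
    for i, residue_a in enumerate(a, start=1):
        current = [i]
        for j, residue_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (residue_a != residue_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


# Hypothetical pairs of peptides matched to the same spectrum.
print(levenshtein("PEPTIDEK", "PEPTLDEK"))  # one substitution
print(levenshtein("PEPTIDEK", "AVLSGQR"))   # dissimilar sequences
```

Pairs with a distance of 1 fall in the narrow mode of the distribution, while pairs with a larger distance correspond to distinct peptides matched to the same spectrum.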
We then continued to investigate matches which we consider to be highly plausible co-eluting PSMs (edit distance > 2). In Table 1, the spectra with IDs 6 to 11 show very dissimilar co-eluting PSMs, all with low RT-pred-error and high intensity-correlation. Table 2 shows examples of spectra with four or even five different, highly plausible co-eluting matches, each providing potential evidence for different proteins. Universal Spectrum Identifiers 18 linking to spectrum annotations can be found in the Supplementary Notes. These further confirm the high plausibility of these co-eluting matches.
Considering all co-eluting matches (edit distance > 2) greatly increases identification sensitivity. For instance, for the HEK239 (Fig. 3b) and Brain (Supplementary Fig. 18) datasets, more than 26% additional unique peptide sequences were identified.
For HEK239, comparing different sources of matching information between the first-ranked PSMs and the lower-ranked co-eluting PSMs showed that the number of b- and y-ions matched is slightly lower for lower-ranked PSMs (Fig. 3c). However, lower-ranked PSMs also tend to be shorter (Fig. 3d). For intensity-correlation (Fig. 3f) and RT-pred-error (Fig. 3g), the lower-ranked PSMs again show highly plausible values as compared to the first-ranked PSMs. The difference is much more pronounced when we look at the intensity explained by the b- and y-ions (Fig. 3e), which is much lower for the lower-ranked PSMs, resulting in a lower PSM score (Fig. 3h). Nevertheless, based on the other sources of matching information, these PSMs are still highly plausible matches. The same conclusions could be drawn for the other datasets (Supplementary Fig. 19).
Finally, we found that a substantial percentage of lower-ranked co-eluting matches were identified as the first-ranked match by open-pFind and/or by MSFragger. For HEK239 we found that this was 42% (Fig. 3i). For the other datasets this was 42% for CD8T, 41% for Brain, 34% for Breast, and 30% for TMTCPTAC (Supplementary Fig. 20a). We also found that for PSMs unique to ionbot, the vast majority of matched peptides were identified as a first-ranked PSM by open-pFind and/or by MSFragger, but from another MS2 spectrum. For HEK239 this was 77% (Fig. 3j). For the other datasets this was 68% for CD8T, 77% for Brain, 62% for Breast, and 72% for TMTCPTAC.
We believe these findings provide strong evidence for the presence of co-eluting PSMs, and that it is remarkably straightforward to discover and study these in ionbot due to its data-driven peptide identification approach.