Machine learning-based exploration of enzyme-substrate networks: SET8-mediated methyllysine and its changing impact within cancer proteomes

doi:10.21203/rs.3.rs-3771179/v1

This work is licensed under a CC BY 4.0 License

Version 1

The exploration of post-translational modifications (PTMs) within the proteome is pivotal for advancing disease and cancer therapeutics. However, identifying genuine PTM sites amid numerous candidates is challenging. In this study, we developed an ML-hybrid search methodology that integrates machine learning (ML) models with high-throughput in vitro peptide synthesis to better predict enzyme-substrate selection. This model achieved a 37.4% experimentally validated precision, unveiling 885 SET8 candidate methylation sites in the human proteome and marking a 19-fold increase in precision over traditional in vitro methods. Mass spectrometry analysis confirmed the methylation status of several sites, which responded positively to SET8 overexpression in mammalian cells. This approach to substrate discovery has also shed light on the changing SET8-regulated substrate network in breast cancer, revealing a predicted gain (376) and loss (62) of substrates due to missense mutations. By unraveling enzyme selection features, this approach has the potential to advance enzyme-substrate discovery across diverse PTMs while capturing crucial biochemical substrate properties.

Biological sciences/Biochemistry/Enzymes/Transferases

Biological sciences/Computational biology and bioinformatics/Protein function predictions

Biological sciences/Computational biology and bioinformatics/Machine learning

Nearly two decades after the human genome¹ was decoded, the functional intricacies of the vast majority of human proteins remain a mystery. This challenge predominantly arises from the influence of post-translational modifications (PTMs), which are reversible chemical alterations with the potential to profoundly shape the function of the modified protein. With more than 500 distinct PTMs identified to date, the functional proteome transcends the approximately 20,000 proteins encoded by the human genome. This dynamic process involves covalently attaching functional chemical groups, such as methyl, phosphate, or acetyl, to specific amino acid (AA) residues within proteins. PTMs, which are mediated by specific modifying enzymes, can lead to substantial alterations in a protein's activity, stability, and folding². Notably, proteins leverage PTMs as a cellular messaging system, responding to various external cues and stressors. This dynamic interplay between proteins and PTMs facilitates cellular adaptation to diverse environments, ensuring the maintenance of cellular homeostasis. However, the dysregulation of PTMs has been implicated in conditions ranging from cancer to inflammatory and immune disorders, highlighting their critical role in both health and disease^3,4.

Central to understanding PTMs is recognizing the interaction networks between the modifying enzymes and their corresponding PTM-modified substrates (i.e., enzyme-substrate networks). This recognition illuminates the extent to which a modifying enzyme impacts the proteome. Yet, conventional discovery methodologies face significant challenges that have historically hindered the growth of these substrate networks. Peptide arrays and mass spectrometry (MS) analysis, although valuable, have their own set of limitations and biases^5,6. Peptide arrays offer high-throughput representation of protein segments, but fall short of capturing the full scope of PTM function as new modification sites are identified^7,8. In contrast, MS analysis provides a comprehensive view of cellular mechanics, but often necessitates challenging affinity or chemical enrichment steps, particularly for the discovery of lysine methylation and lysine methyltransferase (KMT) substrates^1,9,10.

Amid these challenges, the integration of artificial intelligence has emerged as a novel approach to dissecting protein function and enzyme-substrate selection. Although generalized in silico methods have paved the way for predicting PTMs^11–13, the application of deep learning, as seen in MusiteDeep¹⁴, is a transformative leap forward. However, the prediction of specific substrates for PTM-inducing enzymes remains a largely unexplored frontier^15–17. The fusion of in vitro experiments with in silico predictions, exemplified by methodologies such as the Bayesian framework employed to characterize the substrates of the protein-tyrosine phosphatase PTP1B using protein–protein interaction prediction, offers a promising avenue for substrate prediction¹⁸. However, such methods are limited by their reliance on databases that can be poorly representative of the enzyme of interest or of uncertain quality¹⁹. Enzyme-specific machine learning (ML)-based PTM prediction methods often require details about the structure or the metabolic networks of the enzyme^19,20. In this study, we move beyond traditional techniques by adopting an ML-hybrid ensemble approach to enzyme-substrate identification that is generalizable across diverse enzyme classes. We show that this paradigm can successfully identify enzyme-catalyzed PTM sites for lysine methylation (SET8) and (de)acetylation (SIRT1-7) modifying enzymes.

SET8, a member of the Su(var)3–9, Enhancer of zeste, Trithorax-homology (SET) family of KMT enzymes, serves as a representative model in this study. The previously identified SET8 recognition site, which spans ±4 amino acids around the central lysine, shows notable specificity for lysines located in unfolded regions of proteins²¹. This heightened specificity makes it difficult to distinguish SET8 methylation sites from peptide arrays alone, as biophysical features beyond a simple sequential representation are anticipated to be involved in substrate recognition. Considering these complexities, SET8 emerges as a prime candidate for a systematic machine learning-based approach to substrate identification. SET8 mono-methylates histone H4 lysine 20 (H4K20), an event implicated in DNA damage repair, DNA replication, and cell cycle control^22,23. SET8 also targets non-histone proteins, including K382 in the C-terminal protein domain of the p53 tumor suppressor, K248 of the proliferating cell nuclear antigen (PCNA), and two sites on the mitosis-associated protein Numb, K158 and K163^21,23,24. Additional substrates have been proposed, including UHRF1-K385 and α-tubulin-K311^25,26. SET8 has been shown to be overexpressed in bladder cancer, non-small cell and small cell lung carcinomas, pancreatic cancer, leukemia, and other diseases²³.

To broadly apply our ML-hybrid approach in delineating substrate specificity across diverse enzyme families and evaluating model performance in various enzyme classes, we investigate the substrate networks associated with the sirtuin (SIRT) family of nicotinamide adenine dinucleotide (NAD+)-dependent deacetylases. Comprising seven homologs denoted as SIRT1 to SIRT7, the SIRT family plays a pivotal role in diverse physiological processes, including inflammation, glucose and lipid metabolism, oxidative stress response, cell apoptosis, autophagy, cell proliferation, as well as cell migration and invasion²⁷. This family is broadly involved in histone and non-histone deacetylation, with significant variations in subcellular localization and catalytic activity levels observed among its members. In the context of cancer, the SIRT family has been implicated in a wide spectrum of malignancies, autoimmune disorders, cardiovascular diseases, and respiratory disorders²⁷. Consequently, the identification of substrates governing SIRT function holds timely and substantial relevance.

Here, we improved on conventional in vitro and in silico techniques of substrate discovery by employing a novel ML approach trained on a complete peptide representation of the modified methyl-lysine and acetyl-lysine proteomes. Unlike most ML predictors of PTMs, our “hybrid” approach begins with the experimental generation of enzyme-specific training data, rather than relying purely on an online database, in which just a handful of substrates exist²⁸. By chemically synthesizing a representative PTM proteome using peptide arrays, then subjecting them to in vitro enzymatic activity, we can characterize enzymatic PTM activity in a facile way. By developing a machine learning model and augmenting it with generalised PTM-specific prediction^11,29, we created ML-hybrid ensemble models unique to each enzyme that demonstrate significantly enhanced predictive accuracy in cell models (Fig. 1). The application of the ML-hybrid ensemble model to a proteome bearing missense mutations associated with breast cancer uncovers potential novel pathways of SET8-mediated function in cancer cells. Furthermore, the enzyme-substrate networks produced by ML-hybrid ensemble models specific to each SIRT family member reveal novel potential pathways of conserved and enzyme-specific interaction.

This pioneering ML-hybrid ensemble method not only outperforms conventional in vitro and in silico approaches, it also demonstrates broad applicability across diverse PTM-inducing enzymes. As we stand at the intersection of artificial intelligence and protein function, this novel paradigm not only sheds light on the intricate world of PTMs, but it also sets a precedent for future investigations into enzyme-substrate networks and protein function modulation.

SET8 expression and conventional substrate prediction with permutation array analysis

A well-characterized and highly active SET8_193-352 construct²⁸ was applied to the peptide array experiments. The purity (Figure 2A) and methyltransferase activity of the construct were assessed, the latter using the H4K20 peptide (GGAKRHRKVLRDNIQ) (Figure 2B). There are several generalized methods of identifying novel substrates for PTM-inducing enzymes. To accurately assess the ability of the proposed SET8 ML-hybrid ensemble methodology to determine novel substrates, a comparable array-based permutation motif was generated to identify potential substrates^7,21. The permutation array was created using the histone H4K20 sequence (±4 AAs, each sequence has 15 AAs in total) and was exposed to SET8 activity to identify peptide variants susceptible to methylation (Figure 2C, Supplementary Table 1). Densitometry results processed by PeSA2.0 yielded the following motif: [KPGCHIVD]XH[RVIKYSAHML]K[IVT]L[RDLGI]X (Supplementary Figure 1)²⁹. A search of the known methyllysine proteome (Supplementary Table 2) was performed with the scoring matrix, and a normalized score cutoff of 0.5, relative to the unmodified H4K20 peptide (assigned a score of 1), yielded 346 hits (Supplementary Table 3). Of these candidate substrate hits, just 26 peptides were validated as being methylated by SET8 in vitro by peptide array, indicating a method precision of 7.5% in this enriched methyllysine proteome dataset (Figure 2D, Supplementary Table 3).
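As a rough illustration of how a permutation-array motif can be applied to a sequence, the sketch below converts the reported motif into a regular expression and scans for lysine-centred 9-mers that fit it. This is a simplification and not the search used in the study: the study scored candidates with a scoring matrix and a normalized cutoff of 0.5, whereas a regular expression only gives a yes/no match. The function names are hypothetical.

```python
import re

# Motif reported from the permutation array, positions -4..+4 around the lysine.
MOTIF = "[KPGCHIVD]XH[RVIKYSAHML]K[IVT]L[RDLGI]X"


def motif_to_regex(motif):
    """Translate the motif into a compiled regex; 'X' (any residue) becomes '.'."""
    # A lookahead is used so that overlapping 9-mer windows are all reported.
    return re.compile("(?=(" + motif.replace("X", ".") + "))")


def find_motif_lysines(sequence, motif=MOTIF):
    """Return 0-based indices of central lysines whose +/-4 window fits the motif."""
    pattern = motif_to_regex(motif)
    # The central lysine sits 4 residues after the start of each 9-mer match.
    return [m.start() + 4 for m in pattern.finditer(sequence)]


h4k20_peptide = "GGAKRHRKVLRDNIQ"  # H4K20 peptide from the text
hits = find_motif_lysines(h4k20_peptide)
print(hits, [h4k20_peptide[i] for i in hits])
```

A binary match of this kind discards the graded preference at each position, which is one reason a scoring matrix with a cutoff is the more informative choice for ranking candidates.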

To accurately compare both the SET8 ML-hybrid ensemble model and permutation methods of substrate prediction, we next identified novel SET8 substrates from the dataset of surface-exposed lysine. This approach reveals the applicability of these approaches to the exploration of enzyme substrates beyond those that are currently known to be modified (e.g., missense cancer mutations). Using the scoring matrix from the permutation array (Figure 2C), a search of the surface-exposed lysine proteome was performed and yielded 15,961 sites contained within 2,424 proteins (Supplementary Table 4). A randomly selected subset of these positively predicted sites (n=100) identified two positive hits, indicating a precision rate of 2% (precision represents the quantity of validated positive predictions among all positive predictions) (Figure 2E).

Training set generation: SET8 substrates within the known methyllysine proteome

To apply an effective ML model, the initial dataset must provide sufficient samples of the positive case and the negative case, ideally in equal amounts^30,31. A randomly sampled subset (n=100) of the approximately 600,000 lysine-centric sites within the proteome tested with peptide arrays identified no sites of SET8 methylation. This observation is indicative of SET8’s highly specific recognition of substrates for methylation, and emphasizes the need for an improved approach for generating training data²¹. To efficiently enhance the likelihood of detecting positive sites, a targeted subset of the known methyllysine proteome was obtained from PhosphoSitePlus²⁷. This dataset contains modified lysines, including mono-, di-, and tri-methylation. Enzymes such as KMTs often contain conserved catalytic domains and act upon methylatable histone and non-histone substrates, meaning the methyllysine proteome should contain an enriched number of substrates for SET8; in contrast, the methylated status of the full proteome is unknown²⁴.

Analysis of the SET8-exposed peptide arrays comprising the methyllysine proteome obtained from PhosphoSitePlus²⁷ confirmed that the targeted subset contained peptide substrates of SET8. Specifically, of the 4,593 peptides tested, 213 (4.6%) were deemed to be positive for SET8 methylation (Supplementary Table 3). The 213 positive sites were identified across 179 proteins, indicating that several proteins harbored multiple SET8 methylation sites. To date, the commonly accepted substrates of SET8 are p53-K382, PCNA-K248, Numb-K158, and Numb-K163²⁴. The 213 sites identified in vitro with the targeted search alone expand upon these four substrates.
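The class imbalance in this training set (213 positives among 4,593 peptides) can be made concrete with a short sketch of random oversampling, the balancing technique named in the model-fitting section. The code below is a minimal standard-library illustration, not the authors' pipeline; the peptide identifiers are placeholders and the function name is hypothetical.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until both classes are equal in size.

    Assumes binary labels (0/1) and at least one sample in each class.
    """
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    if len(pos) < len(neg):
        minority, majority, minority_label = pos, neg, 1
    else:
        minority, majority, minority_label = neg, pos, 0
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(samples) + extra, list(labels) + [minority_label] * len(extra)


# Counts from the text: 213 SET8-positive peptides among 4,593 tested.
n_total, n_pos = 4593, 213
labels = [1] * n_pos + [0] * (n_total - n_pos)
samples = [f"peptide_{i}" for i in range(n_total)]  # placeholder identifiers

print(f"positive fraction before balancing: {n_pos / n_total:.1%}")
bal_samples, bal_labels = random_oversample(samples, labels)
print("positives / negatives after balancing:",
      sum(bal_labels), len(bal_labels) - sum(bal_labels))
```

In a cross-validated setting, oversampling is usually applied to the training folds only, so that duplicated positives do not leak into the held-out fold; whether the study did so is not stated in the text.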

SET8 base model fitting and fine-tuning

The lysine methylome for SET8 was numerically encoded with the application of MACCS keys, one-hot sequential encoding, and ProtDCal molecular descriptors (Supplementary Table 5)^32–36. The resulting set contains 483 features. With the stratified K-fold cross-validation method, the SET8 methyllysine proteome dataset was split into training and testing sets to effectively assess model fitting and prevent overfitting³⁷. The F-score was selected as the best way to measure the predictive performance of the model on the imbalanced dataset^30,31. A linear discriminant analysis, along with random oversampling of the positive class (i.e., sites positive for SET8 methylation), attained the highest F-score, which was 0.13^38,39. An m-threshold analysis was performed (Figure 2F), and both precision-recall and receiver operating characteristic (ROC) curves were generated (Figure 2G). The default threshold of 0.5 yielded an F-score of 0.13, a precision of 0.085, a recall of 0.24, and a specificity of 0.83. These metrics further demonstrate the benefit of the F-score, which is defined as the harmonic mean of precision and recall and therefore describes our positive identification rate, over specificity, which is inflated by the large number of correctly identified negatives.
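The relationship between the reported metrics can be checked directly. The sketch below computes the F-score as the harmonic mean of precision and recall and applies it to the base-model values quoted above; the function name is an illustrative choice.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Base-model values reported in the text at the default 0.5 threshold.
base_f1 = f_score(precision=0.085, recall=0.24)
print(round(base_f1, 2))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot obtain a high F-score from the abundant negative class alone, which is why it suits a dataset in which fewer than 5% of sites are positive.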

Feature importance was analyzed for the selected model, hereafter referred to as the base model. Features deemed crucial by the model for identifying positive sites of SET8 methylation were specific one-hot-encoded AA/position combinations, including tryptophan, cysteine, and tyrosine at positions +3, +4, and –6 from the central lysine, respectively. Additionally, the MACCS key corresponding to the aromatic bond between carbon and nitrogen (key 65), found in histidine, proline, and tryptophan, was deemed to be of high importance⁴⁰. Sulphur (key 88), found in cysteine and methionine, was also determined to be highly important to the model’s classification of positives⁴⁰. One-hot-encoded positions also played a crucial role in the classification of negatives, or sites not methylated by SET8. Specifically, relative to the central lysine (position 0), cysteine at –6, phenylalanine and methionine at +6, and methionine at –2 scored highly for feature importance in negative classification. One MACCS key was included as well, key 132, which represents AO-CH₂-A (where A represents any elemental symbol) and likely corresponds to the presence of aspartic acid, glutamic acid, serine, or threonine within the site⁴⁰.
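To make the AA/position features concrete, the following sketch one-hot encodes a lysine-centred window and names each feature by residue and offset (for example, W+3 for tryptophan three residues after the lysine). It assumes a 15-residue window (±7), consistent with the 15-AA peptides and the ±6 positions mentioned in the text; the feature ordering and naming are illustrative and not taken from the study.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
FLANK = 7  # assumed residues on each side of the central lysine (15-mer window)


def feature_names(flank=FLANK):
    """Names such as 'W+3' (tryptophan three residues after the lysine)."""
    return [f"{aa}{pos:+d}" for pos in range(-flank, flank + 1) for aa in AMINO_ACIDS]


def one_hot(window, flank=FLANK):
    """Encode a lysine-centred window as a flat 0/1 vector, one block of 20 per position."""
    if len(window) != 2 * flank + 1 or window[flank] != "K":
        raise ValueError("expected a lysine-centred window of length 2*flank+1")
    vec = []
    for residue in window:
        vec.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vec


names = feature_names()
vec = one_hot("GGAKRHRKVLRDNIQ")  # H4K20 peptide from the text
active = [n for n, v in zip(names, vec) if v]
print(len(vec), active)
```

Each window switches on exactly one feature per position, so a feature such as C-6 or M+6 can be read directly as "this residue at this offset", which is what makes the importance rankings above interpretable.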

SET8 ML-hybrid ensemble model construction

The ability of a lysine residue to undergo methylation is a prerequisite for any newly predicted SET8 substrate. To enhance the performance of the SET8 substrate prediction, or base model, a composite or ensemble model was constructed using MethylSight, the current state-of-the-art generalized predictor of lysine methylation¹¹. In the inaugural study, MethylSight identified 51 novel sites of histone methylation, and 89% of the sites were confirmed to physically exhibit methylated lysine¹¹. Much like the SET8 substrate prediction model previously described, MethylSight uses ProtDCal to characterize the 15-AA-long site surrounding a central lysine¹¹. Hence, it is well suited for integration with the SET8 substrate prediction model using stacked ensemble learning.

As with the initial model fitting, stratified K-fold cross-validation was applied to assess the performance of each model. Two features were applied: the SET8 substrate prediction score (described above); and the MethylSight score (i.e., the likelihood of methylation). The F-score was optimized with the application of a logistic regression model and SVM SMOTE oversampling^41,42,43. The simplicity of logistic regression was reflected in the single hyperparameter of 100 max iterations determined from the tuning process⁴¹. The ensemble model attained an F-score of 0.12, comparable to the 0.13 of the base model, while precision improved to 0.25 and specificity to 0.98; recall was 0.08. A comparison of performance metrics with classification threshold is illustrated in Figure 2H. To optimize the performance of the ensemble model, a threshold cutoff of 0.82 was applied. Given the gain in precision and specificity from the integration of methyllysine prediction into the ensemble model (hereafter referred to as the SET8 ML-hybrid ensemble model), the investigation proceeded with this hybrid model (Supplementary Tables 6 and 7).
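The stacking step can be sketched as a logistic function over the two input scores followed by the 0.82 cutoff. The coefficients below are hypothetical placeholders chosen only to make the example run; the fitted coefficients are not given in the text, and the function names are illustrative.

```python
import math

# Hypothetical coefficients for illustration only; not the fitted values.
INTERCEPT, W_BASE, W_METHYLSIGHT = -4.0, 3.0, 4.0
THRESHOLD = 0.82  # classification cutoff reported in the text


def ensemble_probability(base_score, methylsight_score):
    """Logistic combination of the base-model score and the MethylSight score."""
    z = INTERCEPT + W_BASE * base_score + W_METHYLSIGHT * methylsight_score
    return 1.0 / (1.0 + math.exp(-z))


def is_predicted_substrate(base_score, methylsight_score, threshold=THRESHOLD):
    """Call a site positive only if the combined probability reaches the cutoff."""
    return ensemble_probability(base_score, methylsight_score) >= threshold


# A site scoring highly on both inputs passes; one that looks like a SET8
# site but is unlikely to be methylatable does not.
print(is_predicted_substrate(0.9, 0.9), is_predicted_substrate(0.9, 0.2))
```

With only two inputs, the decision boundary is a straight line in the plane of the two scores, and raising the threshold shifts that line toward sites that both models rate highly, trading recall for precision.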

Proteome-wide prediction of SET8 substrates

Using our SET8 ML-hybrid ensemble model, experimental validation of the 2,367 predicted positive sites of SET8 methylation was completed by testing each site for in vitro methylation. Of these predictions, 885 sites permitted in vitro SET8 methyltransferase activity, representing a validated precision of 37.4%. The precision of this method is much improved over the 0% validated precision of the random search method and the 2% validated precision determined with the permutation array within the surface-exposed lysine proteome (Figure 2I). An analysis of the sequence composition of the predicted sites of SET8 methylation by the ML-hybrid ensemble model demonstrates that the known SET8 substrates differ from the predicted sites, with substantial variation observed in the latter (Figure 2J). The SET8 ML-hybrid ensemble model proved to be 100% accurate in identifying a subset (n=362) of predicted negative, lowest-scoring sites, as verified by peptide array experiments with SET8 (Supplementary Table 8). Based on these findings, it is clear our SET8 ML-hybrid ensemble model improves on the traditional substrate identification approach.
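The validated precision of each search strategy follows directly from the counts reported above, as does the roughly 19-fold improvement over the permutation array. A minimal sketch, using only numbers stated in the text:

```python
def precision(validated, tested):
    """Fraction of tested positive predictions that were experimentally validated."""
    return validated / tested


# (validated, tested) counts from the text for the surface-exposed lysine proteome.
methods = {
    "random search": (0, 100),
    "permutation array": (2, 100),
    "ML-hybrid ensemble": (885, 2367),
}

for name, (validated, tested) in methods.items():
    print(f"{name}: {precision(validated, tested):.1%}")

fold = precision(*methods["ML-hybrid ensemble"]) / precision(*methods["permutation array"])
print(f"fold improvement over permutation array: {fold:.1f}")
```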

A total of 2,367 positive SET8 methylation sites were predicted by the SET8 ML-hybrid ensemble model within the surface-exposed lysine dataset, representing sites within 1,203 proteins. To investigate the enriched biological functions of the 1,203 proteins (i.e., predicted SET8 substrate network), clustering analysis with GO annotations was performed using the spatial analysis of functional enrichment (SAFE) approach (Figure 3)^44,45. The HuRI proteome was selected for protein mapping because of its high-quality, high-confidence interactions among proteins⁴⁶.

A shared theme among the enriched biological processes is involvement in cell homeostasis, regulation, and control of the cell cycle (Figure 3A). Given the established involvement of SET8 with these cellular events, mediated through known substrates, the possibility that SET8 might participate in such processes through the methylation of other substrates identified by our SET8 ML-hybrid ensemble model is bolstered^23,24,28,47. Other affiliated processes include mRNA and RNA polyadenylation. Regulation through the polyadenylation of mRNA has been associated with other SET-domain-containing methyltransferases, specifically SET1 and SET2, through histone methylation⁴⁸. The involvement of SET8 in transcription modulation may also implicate it in polyadenylation regulation; however, a direct connection has not been reported²³. In summary, the substrates generated by the ML-hybrid ensemble model provide the potential to unveil new functional narratives for SET8 and its role(s) in disease. The efficacy of the SET8 ML-hybrid ensemble model is further demonstrated by the progression from the proteome isolated for surface-exposed lysine residues that contain 145,379 sites to the 2,367 predictions (Supplementary Tables 9 and 10), which resulted in 885 in vitro validated sites, as shown in Figure 3B.

Cell-based validation of SET8 substrate candidates

To validate SET8-influenced cellular methyllysine events, we used parallel reaction monitoring MS to assess, in a targeted manner, the in vitro SET8 substrates newly identified by our SET8 ML-hybrid ensemble model. To restrict the number of methylation sites monitored with this approach, we generated an isolation list that was constrained to the primary and secondary interactors of SET8, as described by the STRING database⁴⁹ (Supplementary Table 11). The 44 proteins within the network in Figure 4A contained 75 sites of predicted SET8 methylation that were verified by peptide array experiments. Of these 75 sites, 32 were predicted in silico to yield suitable digested peptides; these were targeted for MS monitoring in SET8-overexpressing HCT116 cells (Figure 4B; Supplementary Table 12). Of the 32 monitored sites, only nine were reliably detectable, and elevated levels of mono-methylation were observed in three (33%) of these substrates: SETD1B-K41, KAT6A-K314, and PRDM12-K269 (Figure 4C–4E; Supplementary Figure 3). Thus, the ML-hybrid ensemble model is able to identify novel substrates of possible SET8 methylation activity, as confirmed by site-targeted MS monitoring.

SET8 substrate discovery in cancer

Elevated expression of SET8 is linked to a high mortality rate in patients with breast cancer (Figure 5A)^50,51. However, the behavior of SET8 in cancerous cells remains unclear, and further investigation is required to uncover the functional role(s) SET8 plays in tumorigenesis. As cancer-associated mutations continue to diversify, mutation datasets serve as a valuable resource with which to elucidate the effect mutations have on protein structure and function. In the case of missense mutations, they may cause the gain or loss of methylatable lysine or make changes to neighboring residues, which then dictate the suitability of these sites for SET8 methylation⁵⁰.

To explore the possibility of gain or loss of SET8 substrates in breast cancer, missense mutations were downloaded from the COSMIC database (v.96) and applied to the human proteome. Of the initial mutations, 9,438 either occurred within seven AAs of a lysine residue (e.g., any residue) or resulted in the gain or loss of an individual lysine, directly impacting the creation or loss of a potential methylation site (Supplementary Table 13). The corresponding unmutated sites, except sites in which a lysine did not previously exist, were also assembled. Application of the SET8 ML-hybrid ensemble model to normal and breast cancer datasets predicted that most of the mutations (94.6%) would not affect SET8’s methylation behavior toward the site, likely because their structure is not changed dramatically by a single AA mutation. In contrast, 4.0% (376) of mutations resulted in a predicted gain of SET8 methylation, and 0.7% (62) resulted in a loss (Figure 5B). Of the 4.0% of sites predicted to gain SET8 substrate status, 46.8% (176) were the result of the mutation introducing a new lysine that is itself predicted to be methylated by SET8 (Figure 5C). MCODE clustering analysis of the total set of mutations revealed a directed subset of predicted substrate interactions that were highly interconnected with SET8 (Supplementary Figure 4). Mutations within the subset that resulted in a gain of predicted SET8 methylation were investigated for involvement in pathways implicated in breast cancer. Of particular interest was XPF (encoded by ERCC4), a protein associated with the vital cellular process of DNA damage repair⁵². Specifically, the XPF-S352A mutation led to the prediction of a new SET8 methylation site at XPF-K350. As detailed in Figure 5D, XPF is directly involved in DNA damage repair, including nucleotide excision, double-strand break, and interstrand cross-link repair pathways⁵³. 
Gap filling then proceeds with PCNA, a known substrate of SET8²⁴, further implicating SET8 in the NER pathway⁵⁴. Finally, ligation is performed with DNA ligase and the NER pathway for DNA damage is complete⁵³. Interestingly, SET8 has been implicated in DNA repair previously, specifically in 53BP1/BRCA1 double-stranded DNA repair through histone H4K20 mono-methylation⁵⁴. ERCC4 gene mutations have also been determined to affect XPF function within the NER pathway⁵². Beyond breast cancer, our SET8 ML-hybrid ensemble model was also applied to missense mutations present in pancreatic cancer (COSMIC database, v.96) (Supplementary Figure 5A and 5B; Supplementary Table 14).
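The gain/loss analysis described in this section can be sketched as follows: apply a missense mutation, collect every lysine within seven residues of it in either the wild-type or mutant sequence, score both versions of each site, and label the change. The predictor passed in below is a toy stand-in (alanine at +2), not the SET8 ML-hybrid ensemble model, and the sequences and function names are hypothetical.

```python
FLANK = 7  # residues on each side of the central lysine


def apply_missense(seq, position, new_aa):
    """Return seq with the residue at 1-based `position` replaced by `new_aa`."""
    i = position - 1
    return seq[:i] + new_aa + seq[i + 1:]


def lysine_window(seq, k_index, flank=FLANK, pad="-"):
    """Window centred on the 0-based k_index, padded with `pad` at the termini."""
    padded = pad * flank + seq + pad * flank
    return padded[k_index:k_index + 2 * flank + 1]


def affected_lysines(wt, mut, position, flank=FLANK):
    """0-based indices of lysines (in either sequence) within `flank` of the mutation."""
    i = position - 1
    lo, hi = max(0, i - flank), min(len(wt), i + flank + 1)
    return [j for j in range(lo, hi) if wt[j] == "K" or mut[j] == "K"]


def classify(wt_positive, mut_positive):
    """Label the predicted change in substrate status caused by a mutation."""
    if mut_positive and not wt_positive:
        return "gain"
    if wt_positive and not mut_positive:
        return "loss"
    return "unchanged"


def compare_sites(wt, position, new_aa, predict):
    """Map each affected lysine (1-based position) to gain/loss/unchanged."""
    mut = apply_missense(wt, position, new_aa)
    results = {}
    for j in affected_lysines(wt, mut, position):
        wt_pos = wt[j] == "K" and predict(lysine_window(wt, j))
        mut_pos = mut[j] == "K" and predict(lysine_window(mut, j))
        results[j + 1] = classify(wt_pos, mut_pos)
    return results


def toy_predict(window):
    """Toy rule for illustration only: positive if alanine sits at +2."""
    return window[FLANK + 2] == "A"


# Analogous in shape to XPF-S352A creating a predicted site at K350:
# a serine two residues after the lysine is mutated to alanine.
print(compare_sites("GGGGKGSGGGG", 7, "A", toy_predict))
# Loss of the lysine itself removes a (toy-)positive site.
print(compare_sites("GGGGKGAGGGG", 5, "R", toy_predict))
```

Scoring the wild-type and mutant windows with the same model keeps the comparison paired, so a site is reported only when the mutation itself moves the prediction across the decision threshold.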

The introduction of our ML-hybrid ensemble model marks a significant step forward in the identification of substrates for specific PTM-inducing enzymes. This methodology is successfully generalized across multiple enzyme classes and should be extendable to a broader spectrum of PTMs, including phosphorylation, sumoylation, and ubiquitination. The power of this approach lies in its ability to efficiently characterize PTM-specific subsets of the proteome, a task made feasible by their manageable size: a few thousand sites can be readily profiled in parallel using peptide arrays (Fig. 1). Moreover, this approach ensures greater success in positively identifying new enzyme substrates than established array-based methods^5,57.

Although many predictive models can identify sites of a particular PTM, few possess the capacity to pinpoint the enzyme responsible^11,15,58. In contrast, our ML-hybrid ensemble methodology is highly specific to the enzyme under investigation and reveals the predicted PTM site. Notably, the ensemble models for both SET8 and SIRT2 demonstrate exceptional performance when experimentally validated in cells overexpressing the target enzyme, a departure from many PTM prediction tools^58,59. The ML-hybrid ensemble model’s experimentally validated precision of 37–43% stands in stark contrast to the 2.0% validated precision achieved with a traditional permutation array in the in vitro search for new substrates within the human proteome. Further, the high precision of our models enables their facile use in directing substrate identification studies that rely on more targeted and intensive validation strategies.

The limited amount of labeled PTM-specific training data imposes restrictions on ML model complexity, but significantly reduces computational load. Deep learning algorithms, including neural networks, are precluded from predicting enzyme-specific substrates because of the dataset’s small size. However, this constraint aligns with the notion that deep learning in PTM prediction is most effective in a more generalized context, as seen in MusiteDeep⁵⁸. Most candidate sites for an enzyme within the proteome are, in fact, not substrates; as a result, there is an inherent class imbalance that narrows the range of applicable models. Although the abbreviated peptide sequences may not fully represent a full-length enzyme-substrate interaction, this limitation is not exclusive to our approach; conventional permutation array experiments employ similar methodologies when validating results²¹. The unique advantage of our ML-hybrid ensemble method lies in the extensive number of predictions (more than 2,300 for SET8), yielding a substantial number of validated predictions (885 for SET8), thereby expanding the list of possible substrates for MS-monitoring methods compared with conventional methods²¹.

The value of any predictive model of substrate selection lies in its ability to prioritize the exploration of new datasets and reveal novel insights. The targeted MS monitoring of predicted sites of SET8 methylation in cells highlighted the elevation of SETD1B-K41, KAT6A-K314, and PRDM12-K269 mono-methylation levels in SET8-overexpressing HCT116 cells (Fig. 3). These findings suggest the potential involvement of SET8 in gene activation and regulatory pathways. The change in SETD1B-K41me1 levels in SET8-overexpressing cells may link SET8 to the COMPASS complex, of which SETD1B is a part⁶⁰. KAT6A, commonly localized to CpG islands, may be influenced by SET8 mono-methylation at K314⁶¹, potentially impacting gene regulation. Furthermore, the association of both KAT6A and SETD1B with CXXC1 (CpG-binding protein) implicates SET8 in gene activation^61,62. The regulatory domain binding factor PRDM12 showed elevated K269me1 levels with SET8 overexpression⁶³. With additional investigation, this may suggest SET8’s involvement in gene activation and regulation, potentially through the methylation of PRDM12-K269⁶³.

To help reveal potential insights into the role(s) SET8 plays in breast cancer, our SET8 ML-hybrid ensemble model was applied to explore the creation and loss of potential substrates and thereby annotate a predicted breast-cancer-specific SET8 enzyme-substrate network. When applied to cancer mutation datasets, the ML-hybrid ensemble model has specific advantages. The sensitivity of the model is evident when comparing oncogenic and healthy proteome scores, which provide a unique perspective on oncogenic mutations. For example, the predicted gain of SET8 substrate methylation due to a missense mutation sheds light on potential proteins of interest, particularly the association of SET8 with breast cancer within the NER pathway through XPF (Fig. 4)⁵⁴. This implicates SET8 in the DNA repair process, which employs PCNA, another substrate of SET8, for gap filling⁵⁴. The XPF-S352A mutation and subsequent SET8 methylation at XPF-K350 could be hypothesized to affect the NER pathway, such that DNA damage repair may not be completed. Interestingly, the K350 site falls within the XPF-SLX4 interaction region⁶⁴. This interaction is associated with the Fanconi anemia DNA repair pathway, and defects in SLX4 function have been linked to breast cancer⁶⁵.

The generalizability of the ML-hybrid ensemble approach is demonstrated by the sirtuin family investigation. The rates of identified in vitro peptide substrates for each SIRT range from as low as 2.59% positive (97.4% negative) to 38.4% positive (61.6% negative). With the balancing methodology and the careful tuning of models, each SIRT-based model produced metrics as impressive as, if not more impressive than, those of the SET8 model. This is exemplified by the MS-validated precision of the SIRT2 model of 43.0%. The conserved and enzyme-specific substrates within each SIRT network further provide a unique insight into potential pathways of activity (Fig. 5). Interestingly, SIRT2 and SIRT3 share almost twice as many substrates as any other pair of SIRTs. This finding is supported by a prior investigation showing that each member of the pair may compensate when the other is deficient⁶⁶. The second largest overlap in predicted substrates occurs between SIRT5 and SIRT7. As the most recently discovered member, SIRT7 is underrepresented in the literature; however, some associations between SIRT5 and SIRT7 exist. The two have been implicated in disease states such as cardiac hypertrophy and inflammatory bowel diseases, as well as in the regulation of the NF-κB inflammation signalling pathway²⁷. The predicted substrates of the ML-hybrid ensemble model for SIRT7 may prove to be key in directing future investigations to uncover SIRT7’s cellular function(s).

In conclusion, our generalizable ML-hybrid ensemble approach represents a significant advance in the precise identification of enzyme-substrate networks and the features of substrate enzymes that influence selection. The methodology’s proven versatility offers promise for a wide array of PTM-inducing enzymes, thereby illuminating previously uncharted territories of protein function modulation. Although some limitations exist, including dataset size and an unavoidable class imbalance, our approach applies techniques to overcome such constraints to offer a powerful tool for exploring the intricate world of PTMs. The verification of predicted SET8 and SIRT2 substrates in cells underscores the model’s robustness, outperforming traditional methods and paving the way for a more holistic understanding of enzyme-substrate selection and involvement in cellular processes. The application of this model to cancer datasets provides a promising avenue with which to uncover critical insights into oncogenic mutations and the role of enzymes within these modified proteomes. Moreover, our ML-hybrid ensemble approach presents the capability to uncover the enzyme-substrate network of an entire family of enzymes to accurately predict the extent of an enzyme family’s functional role within the cell. With its potential for broader application and its capacity to shed light on unexplored facets of cellular regulation, our ML-hybrid ensemble model stands poised to have a profound impact on the study of enzyme-substrate networks and protein function modulation.

Peptide synthesis

Peptide SPOT arrays were synthesized on commercial aminated cellulose membranes (Intavis Inc.) with standard Fmoc (N-(9-fluorenyl)methoxycarbonyl) chemistry, automatically, using a Multipep synthesizer (Intavis Inc.)⁶⁷. Resulting arrays contained approximately 2 nmol peptide per SPOT, separated from the membrane surface by a flexible C-terminal 6-aminohexanoic acid linker. Post-treatment involved the cleavage of the protective groups from the side-chains with an acidic solution (51% water, 47.5% trifluoroacetic acid, and 1.5% tri-isopropylsilane)^67,68. Arrays were then washed in ethanol and dried for future use. Dried arrays were stored desiccated at 4°C.

Protein expression and purification

A bacterial expression construct of human SET8 encoding the catalytic site residues 191–352 was cloned into the parallel expression vector pHIS2 using BamHI and XhoI restriction sites⁴⁸. The construct was transformed into BL21 DE3 Escherichia coli and, after propagation, expression was induced with 0.3 mM isopropyl β-D-1-thiogalactopyranoside⁴⁸. Following protein extraction, affinity column purification of the soluble lysate was completed with 500 µL HisPur™ Ni-NTA Resin (ThermoFisher Scientific, Cat# 88221). Fractions were eluted using P500 buffer (50 mM NaHPO₄ (pH 7), 500 mM NaCl, 10% glycerol, 0.05% TritonX-100, 1 mM DTT, 500 mM imidazole) and then dialyzed into a storage buffer (20 mM Tris pH 7.5-8, 200 mM NaCl, 10% glycerol, 1 mM DTT). Protein purity was assessed with a 12% SDS-PAGE gel (Supplementary Fig. 1A) stained with Coomassie Brilliant Blue G-250 stain. Concentration was determined with a Bradford assay⁶⁹. Prior to storage, in vitro activity was confirmed using the Methyltransferase-Glo Assay with H4K20 peptide (GGAKRHRKVLRDNIQ) (Supplementary Fig. 1B), according to the manufacturer’s specifications (Promega, Cat# V7601). Recombinant SET8 protein was then snap frozen and stored at -80°C.

Training dataset generation

The annotated human lysine methylome was obtained from PhosphoSitePlus (accessed February 5th, 2020). This dataset contained sequence information for each lysine methylation site represented as 15-residue peptides centered on the methylated lysine (i.e., position 8)^28,70. The data were then cleaned using Python 3 and the Pandas package for efficient isolation of human protein sites of lysine methylation^71,72. Briefly, sites that were within 7 AAs of the beginning or end of the sequence were padded with alanine residues, and any duplicate sequences were removed (i.e., methylation events that occur within a conserved 15-AA sequence, but among unique proteins). The resulting dataset yielded a total of 4,593 human lysine methylation sites (Supplementary Table 2).
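The padding and de-duplication steps can be illustrated with a minimal sketch. It uses only the Python standard library rather than Pandas, and the function names `site_window` and `unique_windows` are hypothetical, not taken from the published code.

```python
def site_window(sequence, lys_index, flank=7, pad="A"):
    """Return the (2 * flank + 1)-residue window centred on a lysine.

    lys_index is 0-based. Positions that fall outside the protein are
    filled with `pad` (alanine), mirroring the padding described above.
    """
    if sequence[lys_index] != "K":
        raise ValueError("central residue must be a lysine")
    window = []
    for i in range(lys_index - flank, lys_index + flank + 1):
        window.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(window)


def unique_windows(sites, flank=7):
    """Build windows for (sequence, lys_index) pairs and drop duplicate
    windows, keeping first-seen order."""
    seen = {}
    for sequence, lys_index in sites:
        seen.setdefault(site_window(sequence, lys_index, flank), None)
    return list(seen)


# Example with the H4K20 peptide used in the activity assay.
h4 = "GGAKRHRKVLRDNIQ"
print(site_window(h4, 7))   # central lysine already at position 8
print(site_window(h4, 3))   # N-terminal side padded with alanine
```

For the lysine at index 3, four residues are missing on the N-terminal side, so the window begins with four alanines and the lysine still sits at position 8.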

Peptide arrays were then methylated using 1 µM SET8_191−325 with 5 µCi/mL S-[methyl-³H]-Adenosyl-L-methionine (SAM) in methylation buffer (50 mM Tris pH 8.5, 2 mM MgCl₂, 10 µM DTT) overnight at room temperature⁷³. After methylation, arrays were washed 6 × 3 min in buffer (100 mM NH₄HCO₃, 1% SDS), and then a 7% 2,5-diphenyloxazole solution in ethanol was sprayed generously over the array and left to air dry. This procedure was repeated a total of three times. Dried peptide arrays were then exposed to intensifying screens (Dupont, Cronex Lightning Plus) at -80°C for two weeks and imaged using a Typhoon™ FLA 7000 IP phosphorimager (General Electric). Peptide methylation was determined by SPOT densitometry of the array images using the Protein Array Analyzer (v.1.1.c) toolset for ImageJ (v.1.53). The peptide substrate dataset for the sirtuin family was adapted from Rauh et al., 2013⁵. Peptides which produced a signal ratio score of 1 or greater were classified as positive, and positions with masked AA residues were replaced with alanine (Supplementary Table 16).

Biochemical Feature Generation

Peptide libraries were represented numerically through one-hot encoding, a common method of vectorizing peptides^34,35. A matrix describing AA position and identity with a 1 or 0 is applied to the sequence, resulting in 300 features for SET8, and 260 for the sirtuins (Eq. 1).

\({x}_{AA,\,pos}=\begin{cases}1 & \text{if } seq\left[pos\right]=AA\\ 0 & \text{otherwise}\end{cases}\text{, for } pos\in\left[1,N\right]\text{, } AA\in\{A, C, D, \dots , W\}\) [1]
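Eq. 1 can be sketched in a few lines of Python. This is an illustrative implementation, not the published code; the residue ordering in `AMINO_ACIDS` is an assumption, and any fixed ordering of the 20 standard residues gives an equivalent encoding.

```python
# 20 standard amino acids; the ordering here is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(peptide, alphabet=AMINO_ACIDS):
    """Flatten a peptide into a position-by-residue binary vector (Eq. 1)."""
    features = []
    for residue in peptide:
        # One block of len(alphabet) indicators per sequence position.
        features.extend(1 if residue == aa else 0 for aa in alphabet)
    return features


vector = one_hot("GGAKRHRKVLRDNIQ")  # 15-residue SET8-style window
print(len(vector), sum(vector))
```

A 15-residue window gives 15 × 20 = 300 features and a 13-residue window gives 13 × 20 = 260, matching the feature counts reported for SET8 and the sirtuins. Exactly one indicator is set per position.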

To encode molecular structure, the Molecular Access System (MACCS) Keys of 166 binary fingerprints were applied³⁶. Created to describe molecular structure, MACCS Keys include predefined atom symbols, bond types, and atom properties³⁶. The RDKit package for Python (www.rdkit.org) was applied to generate MACCS Keys from the site sequence, representing the molecular structure for the AAs present.

Aggregate molecularly descriptive features were generated for the peptide sequence through ProtDCal (v4.5)³⁸. The 17 metrics selected included molecular weight, hydrophobicity, isoelectric point, free energy, and Levitt’s Probability for various protein conformations, among other descriptive sequence-based metrics³⁸ (Supplementary Table 5). Combined, the sequential information provided by one-hot encoding, along with the unique molecular elements represented by the MACCS keys and the molecularly descriptive features determined by ProtDCal, resulted in 483 features for SET8 and 443 for the sirtuins.
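As a check on these totals, assuming 20 standard amino acids per position and windows of 15 residues (SET8) and 13 residues (sirtuins), the one-hot, MACCS, and ProtDCal feature counts sum as follows:

\(15\times 20+166+17=483\) and \(13\times 20+166+17=443\)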

Machine Learning – Model Fitting

All procedures related to model fitting and data balancing were completed in Python 3 using the Scikit-Learn and Imbalanced-learn packages^71,74,75. Pandas and Numpy were applied for data handling and storage, and plots were generated with Matplotlib^72,76,77.

Class imbalance was present in all samples, although it ranged in degree for each enzyme studied. The SET8-methylated lysine peptide array data provided a class imbalance of 213 positives within the 4,593 sites tested for SET8 activity (4.6% positive). Imbalance, along with the relatively small size of the training data, was taken into careful consideration when selecting ML models³³. For proper comparison to a baseline model, a dummy classifier, which simply classifies input data at random, was first fit to the data. Applying the F-score as the primary selection criterion and a model complexity appropriate to the size and imbalance of our dataset as the secondary selection criterion, the selected models were linear discriminant analysis (LDA) and the decision tree classifier⁴⁰.

Class imbalance in which the positive class composed less than 30% of the training dataset was addressed with data balancing (sampling) methods; this was the case for all datasets except SIRT5 (38.4% positive, 61.6% negative). Both over- and undersampling techniques were tested to improve the F-score of our models, and the best-performing data balancing approach was selected (Supplementary Table 17). In oversampling, a selection of the positive class is replicated within the dataset, increasing the overall percentage of positive values⁷⁴. Alternatively, in undersampling, negative values are removed to reduce the overall percentage of the negative class within the training dataset.
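The two balancing strategies can be sketched with plain random sampling. This is only an illustration: the study compared several balancing methods from Imbalanced-learn (Supplementary Table 17), whereas the functions below, and the `target_ratio` parameter, are hypothetical stand-ins written with the standard library.

```python
import random


def random_oversample(X, y, target_ratio=1.0, seed=0):
    """Duplicate randomly chosen positives until
    positives / negatives reaches target_ratio."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    if not pos:
        raise ValueError("no positive examples to replicate")
    n_needed = max(0, int(target_ratio * len(neg)) - len(pos))
    extra = [rng.choice(pos) for _ in range(n_needed)]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, target_ratio=1.0, seed=0):
    """Discard randomly chosen negatives until
    positives / negatives reaches target_ratio."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n_keep = min(len(neg), int(len(pos) / target_ratio))
    keep = sorted(pos + rng.sample(neg, n_keep))
    return [X[i] for i in keep], [y[i] for i in keep]


X = list(range(10))
y = [1, 1] + [0] * 8            # 20% positive toy dataset
_, y_over = random_oversample(X, y)
_, y_under = random_undersample(X, y)
print(sum(y_over), len(y_over), sum(y_under), len(y_under))
```

Balancing should be applied only to the training portion of each cross-validation split, so that held-out folds keep the original class ratio.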

Machine Learning – Cross-validation

The fit of each model was assessed with repeated stratified K-fold cross-validation³⁹. Cross-validation involves the splitting of the labelled training set and related features into K folds. In this case, the label indicates whether our datapoint (i.e., modification site) is methylated. Due to the imbalanced nature of our data, stratification was applied to ensure the same percentage of positive and negative labels is represented within each fold.
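A minimal sketch of repeated stratified K-fold splitting follows. The study used Scikit-Learn for this step; the standard-library version below, with the hypothetical names `stratified_folds` and `repeated_stratified_splits`, is meant only to show how stratification keeps the class ratio nearly constant across folds.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign indices to k folds so each fold has a near-equal class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        # Deal the shuffled members of this class round-robin across folds.
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def repeated_stratified_splits(labels, k=10, repeats=3, seed=0):
    """Yield (train_idx, test_idx) for every fold of every repeat."""
    for r in range(repeats):
        for test in stratified_folds(labels, k, seed + r):
            held_out = set(test)
            train = [i for i in range(len(labels)) if i not in held_out]
            yield train, sorted(test)


# Class counts from the SET8 training data: 213 positives in 4,593 sites.
labels = [1] * 213 + [0] * (4593 - 213)
splits = list(repeated_stratified_splits(labels, k=10, repeats=3))
positives_per_fold = {sum(labels[i] for i in test) for _, test in splits}
print(len(splits), sorted(positives_per_fold))
```

With 213 positives and 10 folds, every held-out fold contains 21 or 22 positives, and the 3 repeats give 30 train/test splits in total.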

Many metrics may be used to assess classification performance; however, the nature of the data must be taken into careful consideration. In a dataset with fewer positive instances than negatives, the preferred metric is the F-score, as it accounts for true positive and false positive counts, as well as false negatives^33,39 (Eq. 2). To assess the fit of each model, stratified 10-fold cross-validation was applied to the lysine methylome with 3 repeats and using F-score.

\(F=\frac{2TP}{2TP+FN+FP}\) [2]
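Eq. 2 translates directly into code. The helper below is an illustrative sketch; the name `f_score` and the convention of returning 0.0 when there are no positives or positive predictions are assumptions.

```python
def f_score(y_true, y_pred):
    """F-score from binary labels and predictions, following Eq. 2."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fn + fp
    # Assumed convention: score 0.0 when the ratio is undefined.
    return 2 * tp / denominator if denominator else 0.0


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(f_score(y_true, y_pred))  # TP = 2, FP = 1, FN = 1, so F = 4/6
```

True negatives do not appear in the formula, which is why the F-score is not inflated by the large negative class in an imbalanced dataset.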

The best-scoring models were determined to be the linear discriminant analysis (LDA) method and the decision tree classification method. The LDA method determines a linear combination of features that optimally separates the two classes: “modified by our enzyme” and “not modified by our enzyme”. The LDA method has been applied to solve biochemical problems in the past, including the prediction of protein function and tertiary structure from sequence^78,79. One investigation employed LDA to predict the site of protein sumoylation, highlighting the benefit of LDA in the prediction of PTMs⁸⁰.

The decision tree classifier composes a model through the creation of nodes representing a test on the features provided. With each node, a branch descends to eventually point towards a classification of the input to one of the two classes as defined in the LDA method. Decision tree classification has been applied to the classification of enzymatic and non-enzymatic metal binding sites, as well as the familial classification of an unknown protein^81,82.

Machine Learning – Hyperparameter Tuning

The next step of model fitting involves the tuning of the model’s hyperparameters. These model-specific parameters control the manner in which the model is fit to the data. To effectively search for the optimal hyperparameters, a grid search was applied⁸³. Each combination of hyperparameters was tested with the repeated stratified K-fold cross-validation method, and the combination resulting in the highest F-score was returned^39,83. The balancing methods were also tuned through hyperparameter optimization to improve the model’s performance.
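The grid search can be sketched as an exhaustive loop over hyperparameter combinations. In this illustration `score_fn` is a placeholder for the mean cross-validated F-score of a model built with the given settings, and the hyperparameter names in the example grid are illustrative only; the source does not list which hyperparameters were tuned.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one.

    param_grid : dict mapping hyperparameter name -> list of candidate values
    score_fn   : callable taking a dict of settings and returning a score
                 (here a stand-in for the mean cross-validated F-score)
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy scoring function that peaks at max_depth = 3, min_samples_leaf = 2.
def toy_score(params):
    return -((params["max_depth"] - 3) ** 2) - (params["min_samples_leaf"] - 2) ** 2


grid = {"max_depth": [1, 2, 3, 4, 5], "min_samples_leaf": [1, 2, 4]}
print(grid_search(grid, toy_score))
```

The cost grows with the product of the candidate list lengths (15 evaluations here), each of which would involve a full round of repeated cross-validation in practice.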

Machine Learning – Optimizing Decision Threshold

Adjustment of the model’s decision threshold is required to fine-tune performance, particularly with imbalanced training data. When applied to new data, the trained model outputs a score representative of the probability that a data point is a member of the positive class. By default, a decision threshold of 0.5 is applied to this probability; however, the threshold can be tuned to trade off recall and precision³³. Recall measures the proportion of true positives that the model identifies; precision measures the proportion of positive predictions that are correct; and specificity quantifies accuracy over truly negative instances³³. For significantly imbalanced datasets, the best threshold for the model was determined by identifying the point at which the F-score was maximized^11,33,39. Otherwise, a decision threshold of 0.5 was maintained.
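Threshold selection by F-score maximization can be sketched as a scan over candidate cutoffs. This is an illustrative version under the assumption that a score at or above the threshold counts as a positive prediction; `best_threshold` is a hypothetical name.

```python
def best_threshold(y_true, scores, candidates=None):
    """Return (threshold, F-score) for the cutoff that maximizes the F-score.

    A site is predicted positive when its score is >= the threshold.
    By default every distinct observed score is tried as a cutoff.
    """
    if candidates is None:
        candidates = sorted(set(scores))
    best_t, best_f = 0.5, -1.0
    for t in candidates:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 1)
        denominator = 2 * tp + fp + fn
        f = 2 * tp / denominator if denominator else 0.0
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.9]
print(best_threshold(y_true, scores))
```

In this toy example the default cutoff of 0.5 misses the positive scored 0.35 (F = 0.8), whereas lowering the threshold to 0.35 recovers it at the cost of one false positive (F = 6/7). To avoid optimistic estimates, the threshold should be chosen on data not used for the final evaluation.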

Machine Learning – Ensemble Learning

To improve the overall performance of the models, ensemble learning algorithms were explored. In ensemble algorithms, a secondary ML predictor is applied in parallel. The lysine methylation predictor MethylSight was implemented as the secondary method for the SET8 model, as it generally predicts sites of lysine methylation, a prerequisite for a lysine to be a SET8 substrate¹¹. MethylSight uses support vector ML, and was validated experimentally to predict novel sites of methylation, including the identification of novel KDM5B substrate H2B-K43me2¹¹. For the sirtuin family, the generalized deep learning predictor MusiteDeep was applied as the secondary predictor. The N6-acetyl-lysine predictive function of MusiteDeep was employed to generally identify sites of lysine acetylation, much like MethylSight for SET8. MusiteDeep outperformed other representative ML and deep learning algorithms for N6-acetyl-lysine prediction²⁹.

The ensemble methods explored included hard voting, soft voting, and stacking. Each method takes in the probability score output by either predictor that a site is methylated. The hard voting method classifies through most votes, or scores above our specified positive cutoff thresholds for the positive class by either model⁸⁴. In soft voting, classification is determined through the mean of probabilities by either model⁸⁴. Finally, stacking utilizes the scores of each model as features of a third model, along with dataset balancing as previously outlined⁸⁴. Each method was tested, and F-scores were generated with repeated stratified K-fold cross-validation. For SET8, a holdout set from the MethylSight investigation was employed to provide an unbiased final validation of each ensemble method¹¹.
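The two voting schemes can be sketched for a pair of predictors. This is an illustration under stated assumptions: with only two voters, the number of votes required for a positive call is exposed as the hypothetical parameter `min_votes` (1 means either model suffices, 2 means both must agree), since the text does not fix this detail. Stacking is omitted because it requires training a third model.

```python
def hard_vote(p_primary, p_secondary, t_primary=0.5, t_secondary=0.5, min_votes=1):
    """Each predictor votes positive when its score meets its own threshold;
    the site is called positive when at least min_votes predictors agree."""
    votes = int(p_primary >= t_primary) + int(p_secondary >= t_secondary)
    return int(votes >= min_votes)


def soft_vote(p_primary, p_secondary, threshold=0.5):
    """Average the two probability scores, then apply a single threshold."""
    return int((p_primary + p_secondary) / 2 >= threshold)


# Enzyme-specific model scores 0.7, general PTM predictor scores 0.2.
print(hard_vote(0.7, 0.2))               # one vote is enough by default
print(hard_vote(0.7, 0.2, min_votes=2))  # both predictors must agree
print(soft_vote(0.7, 0.2))               # mean score 0.45 is below 0.5
```

Hard voting lets each model keep its own tuned threshold, while soft voting allows a confident score from one model to offset a weak score from the other.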

Experimental Dataset Generation and Evaluation

The human proteome was obtained from UniProt and analyzed with NetSurfP-2.0 to generate values for relative surface accessibility^85,86. Lysine residues with a relative surface accessibility value of over 0.2 were selected, and the surrounding AAs (±7 for SET8 and ±6 for the sirtuins, from the central lysine) were isolated from the full protein sequence to form the experimental datasets. As with the generation of the lysine methylome, sites less than 7 or 6 AAs from the beginning or end of a protein sequence were padded with alanine. Additionally, any duplicated sequences or sequences found within the training dataset were dropped.
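The site-selection step can be sketched as follows. The function name `accessible_lysine_windows` and its input format are hypothetical; in particular, the per-residue accessibility values are assumed to have already been parsed from the predictor output into a list aligned with the sequence.

```python
def accessible_lysine_windows(sequence, rsa, flank=7, cutoff=0.2, pad="A", exclude=()):
    """Return (position, window) pairs for surface-exposed lysines.

    sequence : protein sequence in one-letter code
    rsa      : per-residue relative surface accessibility, same length
    flank    : residues kept on each side of the lysine
    exclude  : windows to drop, e.g. those already in the training set
    Positions are 1-based; out-of-range positions are padded with `pad`.
    """
    if len(rsa) != len(sequence):
        raise ValueError("rsa must have one value per residue")
    seen = set(exclude)
    hits = []
    for i, residue in enumerate(sequence):
        if residue != "K" or rsa[i] <= cutoff:
            continue
        window = "".join(
            sequence[j] if 0 <= j < len(sequence) else pad
            for j in range(i - flank, i + flank + 1)
        )
        if window not in seen:  # drops duplicates and training-set sites
            seen.add(window)
            hits.append((i + 1, window))
    return hits


toy_sequence = "MKTAYIAKQR"  # toy sequence, not a real protein
toy_rsa = [0.5, 0.1, 0.3, 0.3, 0.3, 0.3, 0.3, 0.6, 0.3, 0.3]
print(accessible_lysine_windows(toy_sequence, toy_rsa))
print(accessible_lysine_windows(toy_sequence, toy_rsa, flank=6))
```

In the toy input the lysine at position 2 is buried (0.1) and is skipped, while the lysine at position 8 is exposed (0.6) and yields a 15-residue window, or a 13-residue window with `flank=6`.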

The SET8 ensemble learning model was applied to the resulting set, yielding 2,367 novel positive predictions of new lysine methylation sites that are also SET8 substrates (Supplementary Table 9). These predictions were mapped to the human binary protein interactome (HuRI), and abundant biological processes were showcased with the Spatial Analysis of Functional Enrichment (SAFE) package for Python 3^45,47,71. Additionally, all predicted positive sites were tested individually via peptide SPOT array experiments for SET8 in vitro methylation, as previously described. The resulting seven sirtuin models were applied and predicted various novel sites of deacetylation (Supplementary Table 18). The commonality of predictions between each SIRT was illustrated through Circos and UpSet plots as generated by the pyCircos and UpSetPlot packages for Python 3^71,87,88.

Cell line transfection

Cells were cultured in Dulbecco’s modified Eagle’s medium (DMEM, Gibco Cat# 11965092) supplemented with 10% heat-inactivated FBS and 1% penicillin/streptomycin at 37°C, 5% CO₂. Cell lines were regularly tested for Mycoplasma pneumoniae using a polymerase chain reaction (PCR) assay. For SET8 overexpression experiments, cells were transfected with 10 µg pcDNA3.1 HA-SET8 using jetOptimus transfection reagent (Polyplus, Cat# 101000025), following the manufacturer’s instructions. The cells were harvested 48 h after transfection, and cell pellets were snap frozen and stored at -20°C until further use. Cell pellets were lysed in extraction buffer containing 50 mM Tris (pH 8), 150 mM NaCl, 1 mM EDTA, 10% glycerol, 0.5% NP-40, and protease inhibitors (1 mM PMSF, 10 µM E-64, 1 µM Pepstatin, 1 µM Leupeptin, and 1 mM sodium orthovanadate). Protein concentration was determined by Bradford assay and samples were stored at -80°C until use. Conditions of SIRT2 overexpression in HCT116 cells have been described previously⁵⁶.

Mass Spectrometry

Protein concentration was determined by Pierce BCA Protein Assay (Thermo Fisher Scientific) and 500 µg of soluble lysate was digested with Arg-C (Roche, 11370529001). Cell lysates were first diluted in Arg-C digestion buffer (100 mM Tris-HCl, 10 mM CaCl₂, pH 7.6) followed by standard reduction and alkylation at room temperature. Briefly, proteins were first reduced with 3 mM TCEP for 45 minutes, alkylated with 15 mM iodoacetamide (IAA) for 60 minutes, followed by quenching of any unused IAA by incubation with 20 mM DTT for 45 minutes. Next, activation solution and Arg-C were added to a final concentration of 1X and a 1:200 protein-to-protease ratio, respectively. The digestion occurred overnight at 37°C with end-to-end rotation. Resultant peptides were desalted using C18 Spin Columns (Thermo Fisher Scientific, 89870) and C18-ZipTip (Millipore Sigma, ZTC18S096), dried by SpeedVac, and resuspended in 20 µL of MS-grade water + 0.1% formic acid.

To assess mono-methylation levels of newly identified in vitro SET8 substrates, parallel reaction monitoring (PRM)-MS was performed using a Q-Exactive Plus hybrid quadrupole-orbitrap mass spectrometer at the John L. Holmes Mass Spectrometry Facility at the University of Ottawa, as previously described⁸⁹. PRM-MS scanning was guided by an isolation list built in Skyline software⁹⁰. The isolation list was generated by first identifying the primary interactors of SET8, isolated from the STRING human interactome dataset using Python 3 and the Pandas package^50,71,72. The interactors of the primary interactors of SET8 (i.e., secondary interactors) were also included in this list. Proteins which contained the predicted and validated SET8 in vitro methylation sites were selected from within the primary and secondary interactors of SET8, and the resulting network was mapped with the NetworkX package for Python⁹¹. The verified sites contained within this subnetwork were then used to generate the isolation list for targeted mass spectrometry analysis (Supplementary Table 15). To quantify relative mono-methylation levels of detectable target peptides, Skyline was used to derive total peak areas for modified target substrates as well as for other unmodified peptides along the protein sequence. The total peak area of each modification site of interest was divided by that of all other reliably detected unmodified peptides within a given parental protein. The MS-verified sites for SIRT2 were obtained from the dataset provided in Zhang et al., 2022⁵⁶.
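Two steps of this workflow lend themselves to a short sketch: collecting primary and secondary interactors from an interaction edge list, and normalizing a modified-peptide peak area. Both functions are hypothetical illustrations written with the standard library (the study used Pandas, NetworkX, and Skyline), the protein names P1 to P4 are placeholders rather than real interactors, and summing the unmodified peak areas is an assumption about how "that of all other" peptides was aggregated.

```python
def interactor_shells(edges, seed):
    """Return (primary, secondary) interactor sets of `seed`
    from an undirected list of (protein_a, protein_b) edges."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    primary = set(neighbours.get(seed, ())) - {seed}
    secondary = set()
    for protein in primary:
        secondary |= neighbours[protein]
    secondary -= primary | {seed}  # keep only proteins two steps away
    return primary, secondary


def relative_methylation(modified_peak_area, unmodified_peak_areas):
    """Modified-site peak area divided by the summed peak area of the
    unmodified peptides detected for the same parent protein."""
    total = sum(unmodified_peak_areas)
    if total <= 0:
        raise ValueError("no unmodified peptide signal to normalize against")
    return modified_peak_area / total


toy_edges = [("SET8", "P1"), ("SET8", "P2"), ("P1", "P3"), ("P3", "P4"), ("P2", "P1")]
print(interactor_shells(toy_edges, "SET8"))
print(relative_methylation(2.0e5, [1.0e6, 3.0e6]))
```

Normalizing to unmodified peptides from the same protein helps separate a change in methylation from a change in protein abundance between control and overexpression samples.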

Application to the Cancer Proteome

Targeted screen mutant repositories were obtained for breast cancer from v96 of the COSMIC database (cancer.sanger.ac.uk)⁹². In Python 3, the data were cleaned to isolate missense mutations, which were then applied to the full human proteome obtained from UniProt^71,85. Mutations which occur within ±7 AAs of either (1) a lysine, or (2) a position mutated to a lysine, were accepted within the oncoproteome set. An accompanying dataset for the healthy proteome was generated. The ML-hybrid ensemble model for SET8 methylation prediction was applied to both the healthy and cancerous datasets. The resulting scores were compared, yielding several potential states for mutations: (1) mutations predicted to have a null effect on SET8 methylation (i.e., no change from the healthy population), (2) mutations predicted to remove SET8 methylation activity (i.e., loss of a SET8 substrate relative to the healthy population), and (3) mutations predicted to induce SET8 methylation (i.e., gain of a SET8 substrate relative to the healthy population). The human interactome from the STRING database was applied to the mutated proteins, and the probability scores from the SET8 ML-hybrid ensemble model were scaled to resemble STRING scores and included within the interaction list as well. The network was mapped with Cytoscape (v3.9.1), and molecular subcomplexes were isolated using the Molecular Complex Detection cluster generator from the clusterMaker (v2.0) toolset^93–95. The subcomplex which contained SET8 was further investigated for proteins implicated in pathways associated with the cancer type, and resulting pathways were represented in Adobe Illustrator.
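The comparison of healthy and mutant scores can be sketched as three small helpers. All names are hypothetical, the toy sequences are not real proteins, and the single shared `threshold` is an assumption; in practice the tuned decision threshold of the ensemble model would be used.

```python
def apply_missense(sequence, position, ref, alt):
    """Return the sequence with a missense substitution at a 1-based position."""
    if sequence[position - 1] != ref:
        raise ValueError("reference residue does not match the sequence")
    return sequence[: position - 1] + alt + sequence[position:]


def lysines_near(sequence, position, flank=7):
    """1-based positions of lysines within +/- flank residues of `position`.
    Run on the mutant sequence, this also captures a residue mutated to lysine."""
    lo = max(0, position - 1 - flank)
    hi = min(len(sequence), position + flank)
    return [i + 1 for i in range(lo, hi) if sequence[i] == "K"]


def classify_effect(healthy_score, mutant_score, threshold=0.5):
    """Label a site as 'gain', 'loss', or 'no change' of predicted methylation."""
    healthy = healthy_score >= threshold
    mutant = mutant_score >= threshold
    if healthy == mutant:
        return "no change"
    return "gain" if mutant else "loss"


toy = "K" + "A" * 9 + "K" + "A" * 9   # lysines at positions 1 and 11
print(apply_missense("AKQSLT", 4, "S", "A"))
print(lysines_near(toy, 18))
print(classify_effect(0.2, 0.8), classify_effect(0.8, 0.2), classify_effect(0.8, 0.9))
```

Each lysine returned by `lysines_near` would be scored twice, once in its healthy window and once in its mutant window, and the pair of scores passed to `classify_effect`.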

DATA AVAILABILITY

The data supporting all reported results may be obtained from the GitHub repository https://github.com/nashirag/ML-Hybrid_Ensemble_Method. Source data for all figures, including all SPOT peptide array densitometry figures, are accessible within the same repository. All mass spectrometry data are available through the PeptideAtlas repository (project ID: PASS05848).

CODE AVAILABILITY

The Python code implemented to produce all results may be accessed at https://github.com/nashirag/ML-Hybrid_Ensemble_Method_SET8.

REFERENCES

Brandi, J., Noberini, R., Bonaldi, T. & Cecconi, D. Advances in enrichment methods for mass spectrometry-based proteomics analysis of post-translational modifications. J. Chromatogr. A 1678, 463352 (2022).
Deribe, Y. L., Pawson, T. & Dikic, I. Post-translational modifications in signal integration. Nat. Struct. Mol. Biol. 17, 666–672 (2010).
Liu, J., Qian, C. & Cao, X. Post-Translational Modification Control of Innate Immunity. Immunity 45, 15–30 (2016).
Qian, M. et al. Targeting post-translational modification of transcription factors as cancer therapy. Drug Discov. Today 25, 1502–1512 (2020).
Rauh, D. et al. An acetylome peptide microarray reveals specificities and deacetylation substrates for all human sirtuin isoforms. Nat. Commun. 4, 2327 (2013).
Merbl, Y. & Kirschner, M. W. Large-scale detection of ubiquitination substrates using cell extracts and protein microarrays. Proc. Natl. Acad. Sci. 106, 2543–2548 (2009).
Moore, K. E. & Gozani, O. An unexpected journey: Lysine methylation across the proteome. Biochim. Biophys. Acta BBA - Gene Regul. Mech. 1839, 1395–1403 (2014).
Polo, S. et al. A single motif responsible for ubiquitin recognition and monoubiquitination in endocytic proteins. Nature 416, 451–455 (2002).
Mitchell, C. J. et al. Unbiased identification of substrates of protein tyrosine phosphatase ptp‐3 in C. elegans. Mol. Oncol. 10, 910–920 (2016).
Yu-Ying, Y., Markus, G. & Howard, H. C. Identification of lysine acetyltransferase p300 substrates using 4-pentynoyl-coenzyme A and bioorthogonal proteomics. Bioorg. Med. Chem. Lett. 21, 4976–4979 (2011).
Biggar, K. K. et al. Proteome-wide Prediction of Lysine Methylation Leads to Identification of H2BK43 Methylation and Outlines the Potential Methyllysine Proteome. Cell Rep. 32, 107896 (2020).
Jamal, S., Ali, W., Nagpal, P., Grover, A. & Grover, S. Predicting phosphorylation sites using machine learning by integrating the sequence, structure, and functional information of proteins. J. Transl. Med. 19, 218 (2021).
Kiemer, L., Bendtsen, J. D. & Blom, N. NetAcet: prediction of N-terminal acetylation sites. Bioinformatics 21, 1269–1270 (2005).
Neely, B. A. et al. Toward an Integrated Machine Learning Model of a Proteomics Experiment. J. Proteome Res. 22, 681–696 (2023).
Deng, W. et al. GPS-PAIL: prediction of lysine acetyltransferase-specific modification sites from protein sequences. Sci. Rep. 6, 39787 (2016).
Wu, Z., Lu, M. & Li, T. Prediction of substrate sites for protein phosphatases 1B, SHP-1, and SHP-2 based on sequence features. Amino Acids 46, 1919–1928 (2014).
Wang, X. et al. UbiBrowser 2.0: a comprehensive resource for proteome-wide known and predicted ubiquitin ligase/deubiquitinase–substrate interactions in eukaryotic species. Nucleic Acids Res. 50, D719–D728 (2022).
Ferrari, E. et al. Identification of New Substrates of the Protein-tyrosine Phosphatase PTP1B by Bayesian Integration of Proteome Evidence. J. Biol. Chem. 286, 4173–4185 (2011).
Smith, K., Rhoads, N. & Chandrasekaran, S. Protocol for CAROM: A machine learning tool to predict post-translational regulation from metabolic signatures. STAR Protoc. 3, 101799 (2022).
Lanouette, S. et al. Discovery of Substrates for a SET Domain Lysine Methyltransferase Predicted by Multistate Computational Protein Design. Structure 23, 206–215 (2015).
Kudithipudi, S., Dhayalan, A., Kebede, A. F. & Jeltsch, A. The SET8 H4K20 protein lysine methyltransferase has a long recognition sequence covering seven amino acid residues. Biochimie 94, 2212–2218 (2012).
Fang, J. et al. Purification and Functional Characterization of SET8, a Nucleosomal Histone H4-Lysine 20-Specific Methyltransferase. Curr. Biol. 12, 1086–1099 (2002).
Milite, C. et al. The emerging role of lysine methyltransferase SETD8 in human diseases. Clin. Epigenetics 8, 102 (2016).
Biggar, K. K., Wang, Z. & Li, S. S.-C. SnapShot: Lysine Methylation beyond Histones. Mol. Cell 68, 1016-1016.e1 (2017).
Zhang, H. et al. SET8 prevents excessive DNA methylation by methylation-mediated degradation of UHRF1 and DNMT1. Nucleic Acids Res. 47, 9053–9068 (2019).
Chin, H. G. et al. The microtubule-associated histone methyltransferase SET8, facilitated by transcription factor LSF, methylates α-tubulin. J. Biol. Chem. 295, 4748–4759 (2020).
Hornbeck, P. V. et al. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 43, D512–D520 (2015).
Yin, Y. et al. SET8 recognizes the sequence RHRK20VLRDN within the N terminus of histone H4 and mono-methylates lysine 20. J. Biol. Chem. 280, 30025–30031 (2005).
Topcu, E., Ridgeway, N. H. & Biggar, K. K. PeSA 2.0: A software tool for peptide specificity analysis implementing positive and negative motifs and motif-based peptide scoring. Comput. Biol. Chem. 101, 107753 (2022).
Burkov, A. The hundred-page machine learning book. (Andriy Burkov, 2019).
Brownlee, J. Imbalanced Classification with Python: Choose Better Metrics, Balance Skewed Classes, and Apply Cost-Sensitive Learning. (Machine Learning Mastery, 2021).
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Erjavac, I., Kalafatovic, D. & Mauša, G. Coupled encoding methods for antimicrobial peptide prediction: How sensitive is a highly accurate model? Artif. Intell. Life Sci. 2, 100034 (2022).
Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).
Ruiz-Blanco, Y. B., Paz, W., Green, J. & Marrero-Ponce, Y. ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins. BMC Bioinformatics 16, 162 (2015).
Romero‐Molina, S., Ruiz‐Blanco, Y. B., Green, J. R. & Sanchez‐Garcia, E. ProtDCal‐Suite: A web server for the numerical codification and functional analysis of proteins. Protein Sci. pro.3673 (2019) doi:10.1002/pro.3673.
Szeghalmy, S. & Fazekas, A. A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning. Sensors 23, 2333 (2023).
Izenman, A. J. Linear Discriminant Analysis. in Modern Multivariate Statistical Techniques 237–280 (Springer New York, 2013). doi:10.1007/978-0-387-78189-1_8.
Kamalov, F., Leung, H.-H. & Cherukuri, A. K. Keep it simple: random oversampling for imbalanced data. in 2023 Advances in Science and Engineering Technology International Conferences (ASET) 1–4 (IEEE, 2023). doi:10.1109/ASET56582.2023.10180891.
Cereto-Massagué, A. et al. Molecular fingerprint similarity search in virtual screening. Methods 71, 58–63 (2015).
Wright, R. E. Logistic regression. Read. Underst. Multivar. Stat. 217–244 (1995).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 16, 321–357 (2002).
Nguyen, H. M., Cooper, E. W. & Kamei, K. Borderline over-sampling for imbalanced data classification. Int J Knowl Eng Soft Data Paradig. 3, 4–21 (2009).
Baryshnikova, A. Spatial Analysis of Functional Enrichment (SAFE) in Large Biological Networks. in Computational Cell Biology (eds. von Stechow, L. & Santos Delgado, A.) vol. 1819 249–268 (Springer New York, 2018).
The Gene Ontology Consortium et al. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 49, D325–D334 (2021).
Luck, K. et al. A reference map of the human binary protein interactome. Nature 580, 402–408 (2020).
Couture, J.-F., Collazo, E., Brunzelle, J. S. & Trievel, R. C. Structural and functional analysis of SET8, a histone H4 Lys-20 methyltransferase. Genes Dev. 19, 1455–1465 (2005).
Kaczmarek Michaels, K., Mohd Mostafa, S., Ruiz Capella, J. & Moore, C. L. Regulation of alternative polyadenylation in the yeast Saccharomyces cerevisiae by histone H3K4 and H3K36 methyltransferases. Nucleic Acids Res. 48, 5407–5425 (2020).
Szklarczyk, D. et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452 (2015).
Liu, B. et al. A functional single nucleotide polymorphism of SET8 is prognostic for breast cancer. Oncotarget 7, 34277–34287 (2016).
Yang, C., Wang, K., Zhou, Y. & Zhang, S.-L. Histone lysine methyltransferase SET8 is a novel therapeutic target for cancer treatment. Drug Discov. Today 26, 2423–2430 (2021).
Bogliolo, M. et al. Mutations in ERCC4, Encoding the DNA-Repair Endonuclease XPF, Cause Fanconi Anemia. Am. J. Hum. Genet. 92, 800–806 (2013).
Faridounnia, M., Folkers, G. & Boelens, R. Function and Interactions of ERCC1-XPF in DNA Damage Response. Molecules 23, 3205 (2018).
Xu, L. et al. Roles for the methyltransferase SETD8 in DNA damage repair. Clin. Epigenetics 14, 34 (2022).
Levy, D. et al. A proteomic approach for the identification of novel lysine methyltransferase substrates. Epigenetics Chromatin 4, 19 (2011).
Meng, L. et al. Mini-review: Recent advances in post-translational modification site prediction based on deep learning. Comput. Struct. Biotechnol. J. 20, 3522–3532 (2022).
Schwartz, D. Prediction of lysine post-translational modifications using bioinformatic tools. Essays Biochem. 52, 165–177 (2012).
Shilatifard, A. The COMPASS Family of Histone H3K4 Methylases: Mechanisms of Regulation in Development and Disease Pathogenesis. Annu. Rev. Biochem. 81, 65–95 (2012).
Weber, L. M. et al. The histone acetyltransferase KAT6A is recruited to unmethylated CpG islands via a DNA binding winged helix domain. Nucleic Acids Res. 51, 574–594 (2023).
Shinsky, S. A., Monteith, K. E., Viggiano, S. & Cosgrove, M. S. Biochemical Reconstitution and Phylogenetic Comparison of Human SET1 Family Core Complexes Involved in Histone Methylation. J. Biol. Chem. 290, 6361–6375 (2015).
Rienzo, M. et al. PRDM12 in Health and Diseases. Int. J. Mol. Sci. 22, 12030 (2021).
Hashimoto, K., Wada, K., Matsumoto, K. & Moriya, M. Physical interaction between SLX4 (FANCP) and XPF (FANCQ) proteins and biological consequences of interaction-defective missense mutations. DNA Repair 35, 48–54 (2015).
Bakker, J. L. et al. Analysis of the Novel Fanconi Anemia Gene SLX4 / FANCP in Familial Breast Cancer Cases. Hum. Mutat. 34, 70–73 (2013).
Chopra, A. et al. A peptide array pipeline for the development of Spike-ACE2 interaction inhibitors. Peptides 158, 170898 (2022).
Hilpert, K., Winkler, D. F. & Hancock, R. E. Cellulose-bound Peptide Arrays: Preparation and Applications. Biotechnol. Genet. Eng. Rev. 24, 31–106 (2007).
Bradford, M. M. A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding. Anal. Biochem. 72, 248–254 (1976).
Hornbeck, P. V. et al. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 40, D261–D270 (2012).
Rossum, G. V. & Drake, F. L. Python 3 Reference Manual. (CreateSpace, 2009).
McKinney, W. Data Structures for Statistical Computing in Python. in 56–61 (2010). doi:10.25080/Majora-92bf1922-00a.
Rowe, E. M. & Biggar, K. K. An optimized method using peptide arrays for the identification of in vitro substrates of lysine methyltransferase enzymes. MethodsX5, 118–124 (2018).
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. ArXiv12010490 Cs (2018).
Lemaitre, G., Nogueira, F. & Aridas, C. K. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. (2016) doi:10.48550/ARXIV.1609.06570.
Harris, C. R. et al. Array programming with NumPy. Nature585, 357–362 (2020).
Hunter, J. D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng.9, 90–95 (2007).
Wang, H., Yan, L., Huang, H. & Ding, C. From Protein Sequence to Protein Function via Multi-Label Linear Discriminant Analysis. IEEE/ACM Trans. Comput. Biol. Bioinform.14, 503–513 (2017).
Álvarez, Ó., Fernández-Martínez, J. L., Corbeanu, A. C., Fernández-Muñiz, Z. & Kloczkowski, A. Predicting protein tertiary structure and its uncertainty analysis via particle swarm sampling. J. Mol. Model.25, 79 (2019).
Xu, Y., Ding, Y.-X., Deng, N.-Y. & Liu, L.-M. Prediction of sumoylation sites in proteins using linear discriminant analysis. Gene576, 99–104 (2016).
Bergstra, J. & Bengio, Y. Random Search for Hyper-Parameter Optimization. J Mach Learn Res13, 281–305 (2012).
Dietterich, T. G. Ensemble Methods in Machine Learning. in Multiple Classifier Systems vol. 1857 1–15 (Springer Berlin Heidelberg, 2000).
The UniProt Consortium et al. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res.51, D523–D531 (2023).
Klausen, M. S. et al. NetSurfP‐2.0: Improved prediction of protein structural features by integrated deep learning. Proteins Struct. Funct. Bioinforma.87, 520–527 (2019).
Rosario, F. J. et al. Placental Remote Control of Fetal Metabolism: Trophoblast mTOR Signaling Regulates Liver IGFBP-1 Phosphorylation and IGF-1 Bioavailability. Int. J. Mol. Sci.24, 7273 (2023).
MacLean, B. et al. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics26, 966–968 (2010).
Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring Network Structure, Dynamics, and Function using NetworkX. in Proceedings of the 7th Python in Science Conference (eds. Varoquaux, G., Vaught, T. & Millman, J.) 11–15 (2008).
Tate, J. G. et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res.47, D941–D947 (2019).
Shannon, P. et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res.13, 2498–2504 (2003).
Morris, J. H. et al. clusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics12, 436 (2011).
Bader, G. D. & Hogue, C. W. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics4, 2 (2003).

The authors declare no competing interests.

SupplementaryTables.xlsx
Supplementary Tables
SupplementaryFigures.pdf
