SET8 expression and conventional substrate prediction with permutation array analysis
A well-characterized and highly active SET8 (residues 193-352) construct28 was applied to the peptide array experiments. The purity of the construct (Figure 2A) and its methyltransferase activity toward the H4K20 peptide (GGAKRHRKVLRDNIQ) (Figure 2B) were assessed. There are several generalized methods of identifying novel substrates for PTM-inducing enzymes. To accurately assess the ability of the proposed SET8 ML-hybrid ensemble methodology to determine novel substrates, a comparable array-based permutation motif was generated to identify potential substrates7,21. The permutation array was created using the histone H4K20 sequence (±4 AAs, each sequence has 15 AAs in total) and was exposed to SET8 activity to identify peptide variants susceptible to methylation (Figure 2C, Supplementary Table 1). Densitometry results processed by PeSA2.0 yielded the following motif: [KPGCHIVD]XH[RVIKYSAHML]K[IVT]L[RDLGI]X (Supplementary Figure 1)29. A search of the known methyllysine proteome (Supplementary Table 2) was performed with the scoring matrix, and a normalized score cutoff of 0.5, relative to the unmodified H4K20 peptide (assigned a score of 1), yielded 346 hits (Supplementary Table 3). Of these candidate substrate hits, just 26 peptides were validated by peptide array as being methylated by SET8 in vitro, indicating a method precision of 7.5% in this enriched methyllysine proteome dataset (Figure 2D, Supplementary Table 3).
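As a rough illustration of how the reported motif can be read, the sketch below checks whether the ±4 window around the central lysine of a 15-AA peptide fits the motif. It is a simplification and rests on an assumption: a strict regular-expression match stands in for the PeSA2.0 scoring matrix and normalized cutoff used in the study. The function names are hypothetical.

```python
import re

# Motif from the text, with X (any residue) written as "." for regex use.
# Assumption: a strict regex match replaces the scoring matrix and 0.5 cutoff.
SET8_MOTIF = re.compile(r"[KPGCHIVD].H[RVIKYSAHML]K[IVT]L[RDLGI].")


def central_window(peptide: str, flank: int = 4) -> str:
    """Return the residues within +/- flank of the central position."""
    center = len(peptide) // 2
    return peptide[center - flank:center + flank + 1]


def matches_motif(peptide: str) -> bool:
    """True if the +/-4 window has a central lysine and fits the motif."""
    window = central_window(peptide)
    return window[4] == "K" and SET8_MOTIF.fullmatch(window) is not None


h4k20 = "GGAKRHRKVLRDNIQ"  # H4K20 peptide given in the text
print(central_window(h4k20), matches_motif(h4k20))
```

The H4K20 peptide itself satisfies the motif, as expected for the sequence from which the permutation array was derived.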
To accurately compare both the SET8 ML-hybrid ensemble model and permutation methods of substrate prediction, we next identified novel SET8 substrates from the dataset of surface-exposed lysine. This approach reveals the applicability of these approaches to the exploration of enzyme substrates beyond those that are currently known to be modified (e.g., missense cancer mutations). Using the scoring matrix from the permutation array (Figure 2C), a search of the surface-exposed lysine proteome was performed and yielded 15,961 sites contained within 2,424 proteins (Supplementary Table 4). A randomly selected subset of these positively predicted sites (n=100) identified two positive hits, indicating a precision rate of 2% (precision represents the quantity of validated positive predictions among all positive predictions) (Figure 2E).
Training set generation: SET8 substrates within the known methyllysine proteome
To apply an effective ML model, the initial dataset must provide sufficient samples of the positive case and the negative case, ideally in equal amounts30,31. A randomly sampled subset (n=100) of the approximately 600,000 lysine-centric sites within the proteome tested with peptide arrays identified no sites of SET8 methylation. This observation is indicative of SET8’s highly specific recognition of substrates for methylation, and emphasizes the need for an improved approach for generating training data21. To efficiently enhance the likelihood of detecting positive sites, a targeted subset of the known methyllysine proteome was obtained from PhosphoSitePlus27. This dataset contains modified lysines, including mono-, di-, and tri-methylation. Enzymes such as KMTs often contain conserved catalytic domains and act upon methylatable histone and non-histone substrates, meaning the methyllysine proteome should contain an enriched number of substrates for SET8; in contrast, the methylated status of the full proteome is unknown24.
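To make the notion of a lysine-centric site concrete, the following minimal sketch extracts a 15-AA window (±7 AAs) around every lysine in a sequence. The padding character used near the termini and the function name are assumptions made for illustration; the example sequence is only an illustrative string that contains the H4K20 peptide.

```python
def lysine_sites(sequence: str, flank: int = 7, pad: str = "-"):
    """Yield (1-based position, window) for every lysine in the sequence.

    Windows near the termini are padded so that each window has
    2 * flank + 1 characters (padding with "-" is an assumption).
    """
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue == "K":
            yield i + 1, padded[i:i + 2 * flank + 1]


# Illustrative sequence containing the H4K20 peptide from the text.
example = "MSGRGKGGKGLGKGGAKRHRKVLRDNIQGIT"
for position, window in lysine_sites(example):
    print(position, window)
```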
Analysis of the SET8-exposed peptide arrays comprising the methyllysine proteome obtained from PhosphoSitePlus27 confirmed that the targeted subset contained peptide substrates of SET8. Specifically, of the 4,593 peptides tested, 213 were deemed to be positive for SET8 methylation (Supplementary Table 3). The 213 positive sites were identified across 179 proteins, indicating that several proteins harbored multiple SET8 methylation sites. To date, the commonly accepted substrates of SET8 are P53-K382, PCNA-K248, Numb-K158, and Numb-K163 (ref. 24). The 213 sites identified in vitro with the targeted search alone expand upon these four substrates.
SET8 base model fitting and fine-tuning
The lysine methylome for SET8 was numerically encoded with the application of MACCS keys, one-hot sequential encoding, and ProtDCal molecular descriptors (Supplementary Table 5)32,33,34,35,36. The resulting set contains 483 features. With the stratified K-fold cross-validation method, the SET8 methyllysine proteome dataset was split into training and testing sets to assess model fitting and prevent overfitting37. The F-score was selected as the most appropriate measure of the model’s predictive performance on the imbalanced dataset30,31. A linear discriminant analysis, along with random oversampling of the positive class (i.e., sites positive for SET8 methylation), attained the highest F-score (0.13)38,39. An m-threshold analysis was performed (Figure 2F), and both precision-recall and receiver operating characteristic (ROC) curves were generated (Figure 2G). The default threshold of 0.5 resulted in an F-score of 0.13, a precision of 0.085, a recall of 0.24, and a specificity of 0.83. These metrics further demonstrate the benefit of the F-score, which is defined as the harmonic mean of precision and recall and therefore describes our positive identification rate, over specificity, which is inflated by the negative identification rate.
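The relationship between these metrics can be sketched as follows. The first call recomputes the F-score from the precision and recall reported above. The second uses the dataset size from the text (213 positives among 4,593 peptides) with a hypothetical classifier that labels every site negative, to show why specificity alone is uninformative on an imbalanced dataset. The function names are illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, specificity, and F-score from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score(precision, recall),
    }


# Base model at the default threshold: precision 0.085, recall 0.24.
print(round(f_score(0.085, 0.24), 2))

# Hypothetical classifier that predicts "negative" for all 4,593 peptides.
print(metrics(tp=0, fp=0, tn=4380, fn=213))
```

The all-negative classifier reaches a specificity of 1.0 while identifying no substrates, which the F-score of 0.0 makes explicit.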
Feature importance was analyzed for the selected model, hereafter referred to as the base model. Features deemed crucial by the model for identifying positive sites of SET8 methylation were specific one-hot-encoded AA/position combinations, including tryptophan, cysteine, and tyrosine at positions +3, +4, and –6 from the central lysine, respectively. Additionally, the MACCS key corresponding to the aromatic bond between carbon and nitrogen (key 65), found in histidine, proline, and tryptophan, was deemed to be of high importance40. Sulphur (key 88), found in cysteine and methionine, was also determined to be highly important to the model’s classification of positives40. For the classification of negatives, or sites not methylated by SET8, one-hot-encoded positions once again played a crucial role. Specifically, relative to the central lysine (position 0), cysteine at –6, phenylalanine and methionine at +6, and methionine at –2 scored highly for feature importance in negative classification. One MACCS key was included as well, key 132, which represents AO-CH2-A (where A represents any elemental symbol) and likely corresponds to the presence of aspartic acid, glutamic acid, serine, or threonine within the site40.
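The AA/position features discussed above can be illustrated with a minimal one-hot encoder for a 15-AA site. This is a sketch under assumptions: only the 20 standard amino acids are encoded, and the feature naming scheme (e.g., "W@+3" for tryptophan at position +3) is hypothetical. The MACCS and ProtDCal features of the full 483-feature set are not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumption)


def one_hot_site(window: str) -> dict:
    """One-hot encode a site as {"<AA>@<offset>": 0 or 1}.

    Offsets are relative to the central residue (position 0).
    """
    center = len(window) // 2
    features = {}
    for idx, residue in enumerate(window):
        offset = idx - center
        for aa in AMINO_ACIDS:
            features[f"{aa}@{offset:+d}"] = int(residue == aa)
    return features


encoded = one_hot_site("GGAKRHRKVLRDNIQ")  # H4K20 peptide from the text
active = [name for name, value in encoded.items() if value]
print(len(encoded), active)
```

For a 15-AA window this yields 300 binary features, of which exactly 15 are set, one per position.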
SET8 ML-hybrid ensemble model construction
The ability of a lysine residue to undergo methylation is a prerequisite for any newly predicted SET8 substrate. To enhance the performance of the SET8 substrate prediction, or base model, a composite or ensemble model was constructed using MethylSight, the current state-of-the-art generalized predictor of lysine methylation11. In the inaugural study, MethylSight identified 51 novel sites of histone methylation, and 89% of the sites were confirmed to physically exhibit methylated lysine11. Much like the SET8 substrate prediction model previously described, MethylSight uses ProtDCal to characterize the 15-AA-long site surrounding a central lysine11. Hence, it is well suited for integration with the SET8 substrate prediction model using stacked ensemble learning.
As with the initial model fitting, stratified K-fold cross-validation was applied to assess the performance of each model. Two features were applied: the SET8 substrate prediction score (described above) and the MethylSight score (i.e., the likelihood of methylation). The F-score was optimized with the application of a logistic regression model and SVM SMOTE oversampling41,42,43. The simplicity of logistic regression was reflected in the single hyperparameter, a maximum of 100 iterations, determined from the tuning process41. An F-score of 0.12 was determined for the ensemble model, along with values of 0.25 for precision, 0.08 for recall, and 0.98 for specificity; precision and specificity were thus improved over the base model (0.085 and 0.83, respectively), whereas recall was lower than that of the base model (0.24). A comparison of performance metrics with classification threshold is illustrated in Figure 2H. To optimize the performance of the ensemble model, a threshold cutoff of 0.82 was applied. Given the performance increase gained from the integration of methyllysine prediction into the ensemble model (hereafter referred to as the SET8 ML-hybrid ensemble model), the investigation proceeded with this hybrid model (Supplementary Tables 6 and 7).
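The stacking step can be pictured as a logistic function applied to the two input scores, followed by the 0.82 cutoff. In the minimal sketch below, the coefficients and intercept are hypothetical placeholders chosen only for illustration, since the fitted values are not given in the text. The oversampling and cross-validation steps are omitted.

```python
import math

# Hypothetical coefficients; the fitted values are not reported in the text.
W_SET8, W_METHYLSIGHT, BIAS = 3.0, 3.0, -4.0
THRESHOLD = 0.82  # classification cutoff stated in the text


def ensemble_probability(set8_score: float, methylsight_score: float) -> float:
    """Logistic combination of the base-model and MethylSight scores."""
    z = W_SET8 * set8_score + W_METHYLSIGHT * methylsight_score + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def is_predicted_substrate(set8_score: float, methylsight_score: float) -> bool:
    """Apply the 0.82 cutoff to the ensemble probability."""
    return ensemble_probability(set8_score, methylsight_score) >= THRESHOLD


print(is_predicted_substrate(1.0, 1.0))  # both scores high
print(is_predicted_substrate(0.2, 0.9))  # methylatable, but weak SET8 score
```

With these placeholder weights, a site is called positive only when both the SET8 score and the general methylation likelihood are high, which is the intent of combining the two predictors.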
Proteome-wide prediction of SET8 substrates
Using our SET8 ML-hybrid ensemble model, experimental validation of the 2,367 predicted positive sites of SET8 methylation was completed by testing each site for in vitro methylation. Of these predictions, 885 sites permitted in vitro SET8 methyltransferase activity, representing a validated precision of 37.4%. The precision of this method is much improved over the 0% validated precision of the random search method and the 2% validated precision determined with the permutation array within the surface-exposed lysine proteome (Figure 2I). An analysis of the sequence composition of the predicted sites of SET8 methylation by the ML-hybrid ensemble model demonstrates that the known SET8 substrates differ from the predicted sites, with substantial variation observed in the latter (Figure 2J). The SET8 ML-hybrid ensemble model proved to be 100% accurate in identifying a subset (n=362) of predicted negative, lowest-scoring sites, as verified by peptide array experiments with SET8 (Supplementary Table 8). Based on these findings, it is clear our SET8 ML-hybrid ensemble model improves on the traditional substrate identification approach.
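The precision values compared in this study follow directly from the validated and predicted counts reported in the text. The short sketch below recomputes them; the dictionary labels are illustrative.

```python
def precision(validated: int, predicted: int) -> float:
    """Fraction of positive predictions that were validated in vitro."""
    return validated / predicted


# (validated, tested) counts taken from the text.
methods = {
    "random sampling of the proteome": (0, 100),
    "permutation array, surface-exposed lysines": (2, 100),
    "permutation array, methyllysine proteome": (26, 346),
    "ML-hybrid ensemble model": (885, 2367),
}

for name, (validated, predicted) in methods.items():
    print(f"{name}: {100 * precision(validated, predicted):.1f}%")
```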
A total of 2,367 positive SET8 methylation sites were predicted by the SET8 ML-hybrid ensemble model within the surface-exposed lysine dataset, representing sites within 1,203 proteins. To investigate the enriched biological functions of the 1,203 proteins (i.e., predicted SET8 substrate network), clustering analysis with GO annotations was performed using the spatial analysis of functional enrichment (SAFE) approach (Figure 3)44,45. The HuRI proteome was selected for protein mapping because of its high-quality, high-confidence protein-protein interactions46.
A shared theme among the enriched biological processes is involvement in cell homeostasis, regulation, and control of the cell cycle (Figure 3A). Given the established involvement of SET8 with these cellular events, mediated through known substrates, the possibility that SET8 might participate in such processes through the methylation of other substrates identified by our SET8 ML-hybrid ensemble model is bolstered23,24,28,47. Other affiliated processes include mRNA and RNA polyadenylation. Regulation through the polyadenylation of mRNA has been associated with other SET-domain-containing methyltransferases, specifically SET1 and SET2, through histone methylation48. The involvement of SET8 in transcription modulation may also implicate it in polyadenylation regulation; however, a direct connection has not been reported23. In summary, the substrates generated by the ML-hybrid ensemble model provide the potential to unveil new functional narratives for SET8 and its role(s) in disease. The efficacy of the SET8 ML-hybrid ensemble model is further demonstrated by the progression from the proteome isolated for surface-exposed lysine residues that contain 145,379 sites to the 2,367 predictions (Supplementary Tables 9 and 10), which resulted in 885 in vitro validated sites, as shown in Figure 3B.
Cell-based validation of SET8 substrate candidates
To validate SET8-influenced cellular methyllysine events, we used parallel reaction monitoring MS to assess, in a targeted manner, the in vitro SET8 substrates newly identified by our SET8 ML-hybrid ensemble model. To restrict the number of methylation sites monitored with this approach, we generated an isolation list that was constrained to the primary and secondary interactors of SET8, as described by the STRING database49 (Supplementary Table 11). The 44 proteins within the network in Figure 4A contained 75 sites of predicted SET8 methylation that were verified by peptide array experiments. Of these 75 sites, 32 were predicted in silico to yield suitable digested peptides; these were targeted for MS monitoring in SET8-overexpressing HCT116 cells (Figure 4B; Supplementary Table 12). Of the 32 monitored sites, only nine were reliably detectable, and elevated levels of mono-methylation were observed in three (33%) of these substrates: SETD1B-K41, KAT6A-K314, and PRDM12-K269 (Figure 4C–4E; Supplementary Figure 3). In conclusion, the ML-hybrid ensemble model is able to identify novel substrates of possible SET8 methylation activity, as confirmed by site-targeted MS monitoring.
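An in silico digestion filter of the kind mentioned above might look like the following minimal sketch. Several assumptions are made that are not stated in the text: trypsin is taken as the protease (cleavage after K or R unless followed by P), no missed cleavages are modeled, and the peptide length bounds are hypothetical.

```python
import re


def tryptic_peptides(sequence: str) -> list:
    """Cleave after K or R, except when the next residue is P (assumed rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def suitable_peptides(sequence: str, min_len: int = 7, max_len: int = 25) -> list:
    """Keep peptides within a hypothetical MS-friendly length range."""
    return [p for p in tryptic_peptides(sequence) if min_len <= len(p) <= max_len]


h4k20 = "GGAKRHRKVLRDNIQ"  # H4K20 peptide from the text
print(tryptic_peptides(h4k20))
print(suitable_peptides(h4k20))
```

Under these assumptions, lysine- and arginine-rich regions fragment into very short peptides, which is one reason why only a subset of sites may be amenable to targeted monitoring.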
SET8 substrate discovery in cancer
Elevated expression of SET8 is linked to a high mortality rate in patients with breast cancer (Figure 5A)50,51. However, the behavior of SET8 in cancerous cells remains unclear, and further investigation is required to uncover the functional role(s) SET8 plays in tumorigenesis. As cancer-associated mutations continue to diversify, mutation datasets serve as a valuable resource with which to elucidate the effect mutations have on protein structure and function. In the case of missense mutations, they may cause the gain or loss of methylatable lysine or make changes to neighboring residues, which then dictate the suitability of these sites for SET8 methylation50.
To explore the possibility of gain or loss of SET8 substrates in breast cancer, missense mutations were downloaded from the COSMIC database (v.96) and applied to the human proteome. Of the initial mutations, 9,438 either occurred within seven AAs of a lysine residue (affecting any residue type) or resulted in the gain or loss of an individual lysine, directly impacting the creation or loss of a potential methylation site (Supplementary Table 13). The corresponding unmutated sites, except sites in which a lysine did not previously exist, were also assembled. Application of the SET8 ML-hybrid ensemble model to normal and breast cancer datasets predicted that most of the mutations (94.6%) would not affect SET8’s methylation behavior toward the site, likely because the structure of these sites is not changed dramatically by a single AA mutation. In contrast, 4.0% (376) of mutations resulted in a predicted gain of SET8 methylation, and 0.7% (62) resulted in a loss (Figure 5B). Of the 4.0% of sites predicted to gain SET8 substrate status, 46.8% (176) were the result of the mutation introducing a new lysine that is itself predicted to be methylated by SET8 (Figure 5C). MCODE clustering analysis of the total set of mutations revealed a directed subset of predicted substrate interactions that were highly interconnected with SET8 (Supplementary Figure 4). Mutations within the subset that resulted in a gain of predicted SET8 methylation were investigated for involvement in pathways implicated in breast cancer. Of particular interest was XPF (encoded by ERCC4), a protein associated with the vital cellular process of DNA damage repair52. Specifically, the XPF-S352A mutation led to the prediction of a new SET8 methylation site at XPF-K350. As detailed in Figure 5D, XPF is directly involved in DNA damage repair, including nucleotide excision, double-strand break, and interstrand cross-link repair pathways53.
Gap filling then proceeds with PCNA, a known substrate of SET8 (ref. 24), further implicating SET8 in the NER pathway54. Finally, ligation is performed with DNA ligase, and the NER pathway for DNA damage is complete53. Interestingly, SET8 has been implicated in DNA repair previously, specifically in 53BP1/BRCA1 double-stranded DNA repair through histone H4K20 mono-methylation54. ERCC4 gene mutations have also been determined to affect XPF function within the NER pathway52. Beyond breast cancer, our SET8 ML-hybrid ensemble model was also applied to missense mutations present in pancreatic cancer (COSMIC database, v.96) (Supplementary Figure 5A and 5B; Supplementary Table 14).
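The mutation screening described in this section can be sketched as three steps: apply a missense mutation to a sequence, find lysines within seven AAs of the mutated position, and label the mutation by how it relates to a lysine. The sketch below is illustrative only. The function names and category labels are hypothetical, and the example sequence is a toy string that merely mirrors the geometry of XPF-S352A relative to XPF-K350 (a serine two residues C-terminal to a lysine); it is not the XPF sequence.

```python
def apply_missense(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return the sequence with ref replaced by alt at a 1-based position."""
    if sequence[position - 1] != ref:
        raise ValueError("reference residue does not match the sequence")
    return sequence[:position - 1] + alt + sequence[position:]


def nearby_lysines(sequence: str, position: int, flank: int = 7) -> list:
    """1-based positions of lysines within +/- flank of the given position."""
    start = max(1, position - flank)
    end = min(len(sequence), position + flank)
    return [p for p in range(start, end + 1) if sequence[p - 1] == "K"]


def classify_mutation(ref: str, alt: str, lysines: list) -> str:
    """Label a mutation by its relationship to a lysine (hypothetical labels)."""
    if alt == "K":
        return "lysine gained"
    if ref == "K":
        return "lysine lost"
    return "neighboring residue changed" if lysines else "not near a lysine"


# Toy sequence: K at position 8, S at position 10 (not the real XPF sequence).
toy = "AAAAAAAKASAAAAAAAA"
mutant = apply_missense(toy, 10, "S", "A")
sites = nearby_lysines(mutant, 10)
print(mutant, sites, classify_mutation("S", "A", sites))
```

Each affected lysine, together with its unmutated counterpart where one exists, would then be encoded as a 15-AA site and scored by the model to call a predicted gain, loss, or no change.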