3.1. Performance of models trained on non-organism-specific phosphorylation sites to predict phosphorylation sites in C. reinhardtii
Performance assessment of models trained on non-organism-specific sites (MusiteDeep dataset9) was done with an independent test on phosphorylation sites in C. reinhardtii. For this, we used cross-learning where models are trained on the non-organism-specific phosphorylation dataset and tested on the C. reinhardtii dataset (combined phosphorylation site dataset of S and T). We also included prediction using models provided by DeepPhos11 (DeepPhos (provided)) and also the one where we trained it on MusiteDeep dataset (DeepPhos (Trained on MusiteDeep Dataset)). The results are shown in Table 4. The highest MCC of the compared methods is 0.52. For comparison purpose, an organism-specific predictor (Rice_Phospho 1.020), has been shown to have significantly better MCC (0.62) while predicting phosphorylation sites in rice. It can be observed from this table that the performance of the models trained on non-organism-specific phosphorylation sites do not perform as well compared to models trained on organism-specific phosphorylation sites. This serves as our premise to develop a C. reinhardtii specific phosphorylation site predictor.
Table 4
Performance metrics of different models in cross-learning using an independent test dataset for S and T. LSTM and CNN are our models trained on MusiteDeep dataset (non-organism-specific phosphorylation sites).
Models | Sensitivity | Specificity | Accuracy | AUC | MCC |
DeepPhos (Provided) | 0.63 | 0.86 | 0.75 | 0.82 | 0.51 |
DeepPhos (Trained on MusiteDeep Dataset) | 0.77 | 0.65 | 0.71 | 0.79 | 0.52 |
LSTM with embedding (Trained on MusiteDeep Dataset) | 0.75 | 0.74 | 0.74 | 0.83 | 0.49 |
CNN with embedding (Trained on MusiteDeep Dataset) | 0.79 | 0.72 | 0.75 | 0.83 | 0.52 |
3.2. Model evaluation using manually extracted features
Next, we also investigated the performance of a ML model (Random Forest) and DL model on manually extracted features from the C. reinhardtii dataset. We generated physicochemical-based features like Pseudo Amino Acid Composition (PAAC), K-Spaced Amino Acid Pairs (AAP) and Composition, Transition and Distribution (CTD) as well as autocorrelation features like Moreau-Broto Autocorrelation (MBA) and Entropy Features, such as Shannon Entropy (SE), Relative Entropy (RE), and Information Gain (IG). Using Random Forest (RF), we selected 178 optimized features. Using these features both RF and DL models were evaluated. The results are shown in Table 5. Our performance suffered using the manually extracted features, even when compared to non-organism-specific models.
Table 5
Performance metrics of different models applying manually extracted features using an independent test dataset for S and T.
Models | Sensitivity | Specificity | Accuracy | AUC | MCC |
Random Forest (RF) | 0.84 | 0.61 | 0.72 | 0.80 | 0.46 |
CNN | 0.88 | 0.58 | 0.73 | 0.81 | 0.48 |
3.3. Model development and 10-fold cross-validation
We performed 10-fold cross-validation using a base CNN with embedding model on different window sizes ranging from 9 to 61 on S, T and ST (Table S1-S3). Further window sizes were not analyzed due to the sheer size of the windows and the corresponding increase in the number of pseudo-residues ‘-‘ that was required at higher window sizes.
From Fig. 3, general trend for MCC of S, T and ST shows improvement with increasing window sizes up to around 45. Thereafter, it reaches a plateau with not much significant improvements in performance. Optimal window sizes of 57, 53 and 53 were chosen for S, T and ST respectively, for further study.
Figure 3. 10-fold cross-validation mean MCC of S, T, ST and Y for different window sizes.
For Y, the results of 10-fold cross-validation are shown in Table S4. The relatively high standard deviations observed for this dataset suggest that there is more variability in performance, which is not surprising given the smaller size of the Y dataset compared to the other datasets. From Fig. 3, MCC for prediction of Y phosphosites does not follow specific pattern. For these reasons, the Chlamy-EnPhosSite and Chlamy-MwPhosSite models were not applied to the prediction of Y phosphosites.
3.4. Assessment of Chlamy-EnPhosSite using Independent testing
Next, an independent test was carried out with different models for S, T, and ST using 20% of each dataset that had been set aside for independent testing. For these studies, the window sizes that exhibited the best performance when evaluated by 10-fold cross-validation were used respectively, as described above. For independent testing, LSTM, CNN, Chlamy-MwPhosSite, and Chlamy-EnPhosSite models were trained on the 80% of the dataset set aside for training.
The results of the independent test with the S dataset are shown in Table 6 and the ROC curve is shown in Fig. 4. Both Chlamy-MwPhosSite and Chlamy-EnPhosSite exhibited improved performance compared to base models of LSTM and CNN. For instance, the highest AUC and MCC 0.89 and 0.62 respectively, were observed for Chlamy-EnPhosSite, although these values were only marginally better than those observed for Chlamy-MwPhosSite.
Table 6
Performance metrics of different models using an independent test dataset for S.
Models | SN | SP | ACC | AUC | MCC |
LSTM with embedding | 0.87 | 0.72 | 0.79 | 0.87 | 0.59 |
CNN with embedding | 0.89 | 0.69 | 0.79 | 0.87 | 0.60 |
Chlamy-MwPhosSite | 0.89 | 0.71 | 0.80 | 0.88 | 0.61 |
Chlamy-EnPhosSite | 0.89 | 0.72 | 0.80 | 0.89 | 0.62 |
Figure 4. ROC curve for different DL models for S.
For the T dataset, the results of the independent test are shown in Table 7, and the ROC curve is shown in Fig. 5. Both Chlamy-EnPhosSite and Chlamy-MwPhosSite have improved performance on base models, LSTM and CNN. The best values for AUC, MCC, and SN (0.86, 0.56, and 0.92 respectively) were attained by Chlamy-EnPhosSite, whereas the best values for SP and ACC (0.79 and 0.78 respectively) were attained by Chlamy-MwPhosSite.
Table 7
Performance metrics of different models using an independent test dataset for T.
Models | SN | SP | ACC | AUC | MCC |
LSTM with embedding | 0.83 | 0.69 | 0.76 | 0.84 | 0.53 |
CNN with embedding | 0.86 | 0.66 | 0.76 | 0.84 | 0.53 |
Chlamy-MwPhosSite | 0.76 | 0.79 | 0.78 | 0.84 | 0.55 |
Chlamy-EnPhosSite | 0.92 | 0.61 | 0.77 | 0.86 | 0.56 |
Figure 5. ROC curve for different DL models for T.
For the ST dataset, both Chlamy-EnPhosSite and Chlamy-MwPhosSite exhibited improved performance compared to the LSTM- and CNN-based models (Table 8 and Fig. 6). Both Chlamy-MwPhosSite and Chlamy-EnPhosSite attained AUC, MCC and ACC of 0.90, 0.64 and 0.82 respectively. CNN with embedding got better SN (0.91), whereas Chlamy-MwPhosSite achieved better SP (0.78).
Table 8
Performance metrics of different models using an independent test dataset for S and T.
Models | SN | SP | ACC | AUC | MCC |
LSTM with embedding | 0.87 | 0.74 | 0.81 | 0.88 | 0.61 |
CNN with embedding | 0.91 | 0.69 | 0.81 | 0.88 | 0.61 |
Chlamy-MwPhosSite | 0.86 | 0.78 | 0.82 | 0.90 | 0.64 |
Chlamy-EnPhosSite | 0.90 | 0.73 | 0.82 | 0.90 | 0.64 |
Figure 6. ROC curve for different DL models for S and T combined.
3.5. Predicting phosphorylation sites in entire C. reinhardtii proteome using Chlamy-EnPhosSite
Our independent test results suggest that Chlamy-EnPhosSite (ensemble-based approach) is the best predictor (although marginally), thus we will use Chlamy-EnPhosSite for subsequent analysis. To explore the utility of Chlamy-EnPhosSite for predicting novel phosphosites, we applied Chlamy-EnPhosSite to predict S and T phosphorylation sites in the full C. reinhardtii proteome. Chlamy-EnPhosSite was applied to predict phosphorylation sites on a total of 1,809,304 S/T sites and it was able to perform these predictions in about an hour in GeForce RTX 2080 machine. With 0.5 as a probability cut-off, Chlamy-EnPhosSite predicted 499,411 phosphorylated sites and with cut-off value of 0.7, Chlamy-EnPhosSite predicted 237,949 phosphorylated sites.
In addition, we also validated the predictions made by Chlamy-EnPhosSite on entire C. reinhardtii proteome using a newly generated dataset of phosphosites from C. reinhardtii32. Like the independent test sets, our model was blind to this new dataset during training, therefore these studies serve as a second, completely independent test set of S/T residues. Within this new dataset, 2,663 novel C. reinhardtii S and T phosphorylation sites were included since they were not present in the previous dataset. Using 0.5 as a cut-off, Chlamy-EnPhosSite was able to predict 2,362 out of 2,663 (89.69%) phosphorylated sites correctly. Using a more stringent cut-off of 0.7, Chlamy-EnPhosSite still correctly predicted 2,046 out of 2,663 (76.83%) phosphorylated sites. By further increasing the cut-off, the probability of avoiding false positives increases, but there is a trade-off with a decrease in the number of true positives. Together, these data suggest that our DL-based model, Chlamy-EnPhosSite, could be used to predict novel phosphosites in C. reinhardtii.
These phosphorylation site predictions can elucidate protein modulation in important signaling cascades such as the target of rapamycin (TOR) signaling pathway. The TOR kinase is a conserved master regulator of cell growth whose activity is modulated in response to nutrients, energy, and stress40–42. This includes regulation of protein synthesis and degradation through the control of translation, ribosome biosynthesis, and autophagy43. In Arabidopsis thaliana, TOR directly phosphorylates ribosomal protein S6 kinase (S6K), which in turn phosphorylates ribosomal protein S6 (RPS6)44. A method to monitor TOR activity in C. reinhardtii through S6K phosphorylation has been difficult to obtain because S6K phosphopeptide identification has eluded MS detection and commercial anti-phosphoS6K antibodies have failed32,45. Instead, antibodies against the downstream C. reinhardtii RPS6 phosphosite S245, a conserved site that is phosphorylated by S6K in a TOR-dependent manner in yeast and humans46,47, has been used as a proxy to monitor TOR activity. Validation confirmed that this site is phosphorylated in a TOR-dependent manner and can be used to monitor TOR function in C. reinhardtii. Interestingly, typical quantitative LC-MS/MS-based phosphoproteomics methods using TiO2 enrichment failed to detect RPS6-S245 phosphorylation, and this site was only detected by orthogonal enrichment strategies and extensive fractionation. However, the model described herein, Chlamy-EnPhosSite was able to predict phosphorylation on RPS6 S245 with a probability of 0.65, displaying prediction accuracy and the ease of phosphorylation site identification compared to MS-based methods. This may be extended to other kinase/signaling pathway intermediates whereby sites predicted could then lead to viable routes for validation/activity readouts in subsequent biologically focused experiments.