Feature analysis and single feature classification
In the first part of our work, we analyzed the distributions of the main features used for the classification task. We focused on the six conservation features (PC, PP, fref, falt, fwt, fmut) comparing their distributions for the subsets of Pathogenic and Benign variants. The average, median and standard deviation of such distributions are reported in Table S2. As observed in previous works (Kircher et al., 2014; Capriotti and Fariselli, 2017) the distribution of the PhyloP100way score (PP) in mutated loci associated with Pathogenic and Benign variants are significantly different (Fig. 1). Indeed, the two distributions show median values of 7.5 and 1.5, respectively, with a Kolmogorov-Smirnov distance (KSD) of 0.57 (Fig. 1B and Table S2). This distance is greater than the KSD observed for the PhastCons100way score (PC).
A higher difference between the distributions of the conservation scores for the subset of Pathogenic and Benign variants is observed when the frequencies in sequence profile from genomic and proteins are considered. The most remarkable differences are generally detected when comparing the distributions of the frequency of the alternative allele (falt) and the mutant residue (fmut) for which the KSD is ~ 0.60. Analyzing the frequencies of the reference allele (fref) and wild-type residue (fwt) their KSD is 0.58 and 0.55, respectively (Table S2). The distributions of the four types of frequencies (fref, falt,, fwt, and fmut) for the subsets of Pathogenic and Benign variants are plotted in Fig. 2
This observation agrees with the results obtained in the prediction of Pathogenic variants using a classification threshold on a single feature. The classification threshold is optimized on the CommonClinvar dataset maximizing both the true positive and negative rates (Table S3). Applying the optimized thresholds on the prediction of the variants in the NewClinvar dataset, we found that a simple classifier based on the frequency of the mutant residue extracted from a protein sequence profile achieve 81% overall accuracy (Q2), 0.63 Matthews correlation coefficient (MC) and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.86 (Table 2).
Table 2
Performance of basic predictors based on a single feature on the NewClinvar dataset Prediction threshold are optimized on the CommonClinvar dataset. Q2: Overall Accuracy, TNR: True negative rate, NPV: Negative predicted value, TPR: True Positive Rate, PPV: Positive Predicted Value, MC: Matthews Correlation Coefficient, F1: harmonic mean of precision and sensitivity, AUC: Area Under the Receiver Operator Characteristic Curve, AUP Area under the Precision Recall Curve. All the performance measures are defined in Supplementary Materials.
Feature | Threshold | Q2 | TNR | NPV | TPR | PPV | MC | F1 | AUC | AUP |
PC | 1.000 | 0.737 | 0.611 | 0.816 | 0.862 | 0.689 | 0.489 | 0.766 | 0.755 | 0.815 |
PP | 4.704 | 0.769 | 0.796 | 0.756 | 0.743 | 0.784 | 0.539 | 0.763 | 0.841 | 0.828 |
fref | 0.977 | 0.779 | 0.815 | 0.760 | 0.742 | 0.801 | 0.559 | 0.770 | 0.836 | 0.843 |
falt | 0.000 | 0.794 | 0.750 | 0.821 | 0.837 | 0.770 | 0.589 | 0.802 | 0.828 | 0.863 |
fwt | 0.702 | 0.769 | 0.806 | 0.750 | 0.731 | 0.791 | 0.539 | 0.759 | 0.844 | 0.836 |
fmut | 0.005 | 0.815 | 0.819 | 0.812 | 0.810 | 0.817 | 0.629 | 0.814 | 0.857 | 0.856 |
According to the previous observation, the PhastCons100way score (PC) is the least discriminating feature. When using the optimized threshold on the classification of the NewClinvar variants, the method based on a single PC threshold achieves 74% overall accuracy, 0.49 MC and 0.75 AUC (Table 2). Slightly lower performances are obtained when the frequencies of the reference allele and the wild-type residue in the sequence profile are considered. In this case the method based on a single fref threshold results in 78% Q2, 0.56 MC and 0.84 AUC. These results can also be observed plotting the Receiving Operating Characteristic (ROC) and Precision-Recall (PR) curves reported in Fig. S2.
Assessment of the machine learning methods
Starting from the previous observations, we developed three machine learning approaches based on the different groups of conservation features. The PPScore method is based on the PhastCons100way, PhyloP100way scores representing unique conservation measures not describing the type of nucleotides observed in the mutated loci. The other two methods consider the frequencies of the nucleotides or residues in the original and new sequences that correspond to fref, falt and fwt, fmut for DNAProf and ProtProf, respectively. To these groups of measures we added a third feature representing the total number of sequences aligned in the mutated loci (Ng ,Np). We implemented three machine learning methods for predicting Pathogenic variants based on the gradient boosting algorithm with these groups of features. First, the performance of these methods is tested with a 10-fold cross-validation procedure on the CommonClinvar dataset. To avoid possible overfitting we clustered all the proteins based on the sequence identity and grouped all their variants in a unique subset. The average performance of PPScore, DNAProf and ProtProf on a balanced set of Pathogenic and Benign variants are reported in Table S4. The results show that among the three methods ProtProf, which is based on protein sequence profile, achieved the highest performance reaching 83% overall accuracy (Q2), 0.67 Matthews correlation coefficient and 0.91 Area Under the Receiver Operating Characteristic Curve (AUC). PPScore which is based on PhastCons100way, PhyloP100way show the lowest performance resulting in ~ 4% lower AUC and ~ 9% lower MC. An intermediate level of performance is achieved by DNAProf which results in ~ 2% lower AUC and ~ 3% lower MC with respect to ProtProf. Similar results are obtained when assessing the performance of the three methods on the NewClinvar dataset. Also in this case we predicted the impact of each variant removing from the training set all the variants in the CommonClinvar training set belonging to the same cluster of proteins. The performance of PPScore, DNAProf and ProtProf on a balanced set of variants from the NewClinvar dataset are summarized in Table 3.
Table 3
Prediction in cross-validation of the NewClinvar variant dataset. Q2: Overall Accuracy, TNR: True negative rate, NPV: Negative predicted value, TPR: True Positive Rate, PPV: Positive Predicted Value, MC: Matthews Correlation Coefficient, F1: harmonic mean of precision and sensitivity, AUC: Area Under the Receiver Operator Characteristic Curve, AUP Area Under the Precision Recall Curve. All the performance measures are defined in Supplementary Materials. For CADD a raw score classification threshold of 3.1 was considered.
Method | Q2 | TNR | NPV | TPR | PPV | MC | F1 | AUC | AUP |
CADD | 0.844 | 0.821 | 0.860 | 0.867 | 0.829 | 0.688 | 0.847 | 0.911 | 0.905 |
ProtProf | 0.831 | 0.865 | 0.809 | 0.796 | 0.855 | 0.662 | 0.824 | 0.910 | 0.905 |
DNAProf | 0.812 | 0.780 | 0.834 | 0.845 | 0.794 | 0.626 | 0.818 | 0.881 | 0.873 |
PPScore | 0.771 | 0.776 | 0.769 | 0.767 | 0.774 | 0.543 | 0.770 | 0.855 | 0.846 |
Comparison with CADD algorithm
In the final part of our analysis we compared the performance of our simple gradient boosting-based algorithms with those obtained with CADD (Rentzsch et al., 2019). CADD is one of the most accurate methods for predicting Pathogenic variants in coding and non-coding regions (Benevenuta et al., 2021). This method, which is based on more than hundreds of genomic features, was trained on more than 30 million variants. To use CADD as a binary classifier we considered the raw output of the program and we selected the threshold that maximizes the true positive and negative rates on the CommonClinvar dataset. The performance of CADD at the optimal raw score classification threshold of 3.1 is reported in Table S4. This threshold was used for the classification of the variants in the NewClinvar dataset. The performance of CADD on the NewClinvar dataset is summarized in Table 3. This analysis shows that CADD and ProtProf algorithms result in a similar performance in the classification of Pathogenic missense variants in terms of Area Under the Receiver Operating Characteristic (AUC) and Precision-Recall (AUP) curves on both CommonClinvar and NewClinvar datasets. We can also observe that DNAProf which is based on the sequence profile extracted from the multiz100way sequence alignments results only in ~ 3% lower AUC and AUP. The Receiver Operating Characteristic and Precision-Recall curves for CADD and the three methods presented in this manuscript are plotted in Fig. 3.