Evaluating the relevance of sequence conservation in the prediction of pathogenic missense variants

doi:10.21203/rs.3.rs-884099/v1

Research Article

This work is licensed under a CC BY 4.0 License

Journal Publication

published 30 Jan, 2022

Read the published version in Human Genetics →

You are reading this latest preprint version

Evolutionary information is the primary tool for detecting functional conservation in nucleic acids and proteins. This information has been extensively used to predict the structure, interactions and functions of macromolecules. Pathogenicity prediction models rely on multiple sequence alignment information at different levels. However, the most accurate genome-wide variant deleteriousness ranking algorithms consider many different features to assess the impact of variants. Here, we analyze three different ways of extracting evolutionary information from sequence alignments in the context of pathogenicity prediction at the DNA and protein levels. We show that protein sequence-based information is slightly more informative for the annotation of ClinVar missense variants than information obtained at the DNA level. Furthermore, to achieve the performance of state-of-the-art methods such as CADD, the conservation of both reference and variant, encoded as frequencies of reference/alternate alleles or wild-type/mutant residues, should be included. Our results on a large set of missense variants show that a basic method based on three input features derived from the protein sequence profile performs similarly to the CADD algorithm, which uses hundreds of genomic features. This observation indicates that for missense variants, evolutionary information, when properly encoded, plays the primary role in ranking pathogenicity.

Molecular Genetics

Clinical Pharmacology

Biotechnology and Bioengineering

Variant interpretation

Pathogenic missense variants

evolutionary information

conservation score

High-throughput sequencing technologies have changed our daily research by rapidly accumulating genomic data and helping to profile patient genomes (MacArthur et al., 2014; Claussnitzer et al., 2020). These studies make variant interpretation a fundamental challenge in precision medicine (Fernald et al., 2011; Capriotti et al., 2012; McInnes et al., 2021). Missense variants, by changing a single amino acid in a protein sequence, can be neutral or induce loss of function.

In the last two decades, several methods have been developed to prioritize functional missense variants relying on protein sequence/structure information (Ancien et al., 2018; Tennessen et al., 2012; Niroula and Vihinen, 2016; Petrosino et al., 2021) and on protein interaction networks (Rost et al., 2016; Capriotti et al., 2019; Ozturk and Carter, 2021).

It is widely accepted that evolutionary information encoded in multiple sequence alignments of DNAs and proteins is a major resource for scoring variant pathogenicity. This paper evaluates the relevance of this information for missense variant predictions by comparing simple scores and simple predictors with the widely used and well-performing Combined Annotation-Dependent Depletion (CADD) algorithm (Rentzsch et al., 2019).

We computed the conservation scores on DNA (PhastCons100way and PhyloP100way), the frequencies of the reference and alternative alleles in genome multiple alignments, and the frequencies of the wild-type and mutant residues in protein multiple alignments. Our analysis showed that a machine learning method trained on a few sequence conservation features at the DNA or protein level achieves performance similar to that of a state-of-the-art algorithm. In this work we compared the performance of CADD with that reached by three basic gradient boosting classifiers on a set of missense variants from the ClinVar database. Our results indicate that evolutionary information provides the main features for scoring the pathogenicity of missense variants.

Datasets

To evaluate the performance of different machine learning methods for predicting the pathogenicity of missense variants, we collected two datasets from the ClinVar database (Landrum et al., 2020). To build them we considered two versions of ClinVar, released in June 2020 and August 2021, respectively. The first dataset (CommonClinvar) consists of the missense variants annotated as Pathogenic or Benign in both versions of the database, while the second dataset (NewClinvar) collects the new missense variants added to ClinVar between the June 2020 and August 2021 releases (Fig. S1). Variants reported in the older version of ClinVar but not confirmed in the later version were discarded. CommonClinvar thus consists of 36,751 missense variants from 7,582 proteins, 53.5% of which are annotated as Benign and the remaining 46.5% as Pathogenic. NewClinvar, which includes only the newly annotated variants, is composed of 5,172 variants from 1,855 proteins, 43.4% of which are reported as Benign and 56.6% as Pathogenic. The composition of the two datasets is summarized in Table S1. Both the CommonClinvar and NewClinvar datasets are available as supplementary files.
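The split between the two datasets can be sketched as a comparison of two release snapshots. This is a minimal illustration, not the authors' pipeline: the function name, the variant identifiers and the rule that a variant must keep the same label in both releases to count as "common" are assumptions made for the example.

```python
def split_clinvar_releases(old_release, new_release):
    """Split two ClinVar snapshots into 'common' and 'new' variant sets.

    Both arguments map a (hypothetical) variant identifier to its label,
    "Pathogenic" or "Benign".
    """
    common, new = {}, {}
    for variant, label in new_release.items():
        if variant not in old_release:
            # First reported after the older release.
            new[variant] = label
        elif old_release[variant] == label:
            # Same annotation in both releases.
            common[variant] = label
        # Assumption: variants whose label changed are left out of both sets.
    # Variants found only in the older release are never visited,
    # so they are discarded.
    return common, new


old = {"v1": "Pathogenic", "v2": "Benign", "v3": "Benign", "v4": "Pathogenic"}
latest = {"v1": "Pathogenic", "v2": "Pathogenic", "v4": "Pathogenic", "v5": "Benign"}
common_set, new_set = split_clinvar_releases(old, latest)
```

Here `v3` is dropped because it is not confirmed in the later release, and `v2` is dropped under the assumed label-consistency rule.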

Conservation features

In this work we analyzed the performance in the prediction of pathogenic variants using three basic methods based on sequence conservation features. Each method considers only three input features, which are described below.

The first two methods are based on features obtained from genome-level multiple sequence alignments made available through the UCSC genome browser (Kent et al., 2002), which capture conservation along the DNA sequence. The conservation scores considered for the first method are calculated by the PhastCons (Siepel et al., 2005) and PhyloP (Pollard et al., 2010) algorithms, while the features of the second method are extracted from the multiz100way multiple alignments. The PhastCons100way and PhyloP100way scores, as well as the multiz100way alignments for the hg38 human reference genome, are available at https://hgdownload.cse.ucsc.edu/goldenpath/hg38/.

The three features of the last method are generated from a protein sequence profile calculated on the results of a BLAST (Altschul et al., 1997) search on the UniRef90 database (Suzek et al., 2007) released in June 2020. For the BLAST search we used an e-value cutoff of 10^-9, as suggested in previous works (Capriotti et al., 2006; Calabrese et al., 2009; Capriotti et al., 2013).

In detail, we consider the PhastCons100way and PhyloP100way scores for each mutated locus, the frequencies of the reference and alternative alleles from the multiz100way multiple sequence alignment, and the total number of genomic sequences aligned at the mutated locus. To map the missense variant at the protein level, we calculated the wild-type and mutant residue frequencies from the output of a BLAST search, together with the number of proteins aligned at the mutated site.
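As an illustration of how the frequency features can be read off an alignment, the sketch below computes the wild-type frequency, the mutant frequency and the number of aligned sequences for a single alignment column. It assumes plain relative frequencies with gaps excluded and no pseudocounts or sequence weighting; the source does not state these details, and the function name is hypothetical. The same logic applies to a multiz100way column with reference and alternative alleles.

```python
from collections import Counter


def column_features(column, wild_type, mutant, gap="-"):
    """Return (f_wt, f_mut, n_aligned) for one alignment column.

    column: the residues (or nucleotides) observed at the mutated position,
    one character per aligned sequence. Gap characters are ignored.
    """
    residues = [c for c in column if c != gap]
    n_aligned = len(residues)
    if n_aligned == 0:
        # No sequence covers this position.
        return 0.0, 0.0, 0
    counts = Counter(residues)
    f_wt = counts[wild_type] / n_aligned
    f_mut = counts[mutant] / n_aligned
    return f_wt, f_mut, n_aligned


# Toy column: 10 aligned sequences, 2 of them with a gap at this site.
features = column_features("AAAAVVG-A-", wild_type="A", mutant="V")
```

For the toy column, 8 sequences cover the site, 5 carry the wild-type residue and 2 the mutant one.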

Machine learning algorithms

Using the eight features described above, we developed three binary classifiers (PPScore, DNAProf, ProtProf), each based on one of the following groups of three features:

PPScore: PhastCons100way (PC) and PhyloP100way (PP) scores, and number of aligned genomic sequences (N_g) in multiz100way.
DNAProf: Frequencies of the reference (f_ref) and alternative (f_alt) alleles, and number of aligned genomic sequences (N_g) in multiz100way.
ProtProf: Frequencies of the wild-type (f_wt) and mutant (f_mut) residues, and number of aligned protein sequences (N_p) from a BLAST search on UniRef90.

For each group of features defined above we developed a binary classifier based on the gradient boosting algorithm as implemented in the scikit-learn package (Pedregosa et al., 2011). The proposed groups of features are summarized in Table 1.
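A three-feature gradient boosting classifier of this kind can be sketched with scikit-learn's `GradientBoostingClassifier`. The data below are synthetic and only mimic the shape of a ProtProf-like input (f_wt, f_mut, N_p); the value ranges, the sample size and the use of default hyperparameters are assumptions, since the source does not report the settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 200  # variants per class (arbitrary, for illustration only)

# Synthetic features, NOT real data: columns are f_wt, f_mut, N_p.
pathogenic = np.column_stack([
    rng.uniform(0.6, 1.0, n),    # wild-type residue highly conserved
    rng.uniform(0.0, 0.05, n),   # mutant residue rarely observed
    rng.integers(50, 500, n),    # number of aligned proteins
])
benign = np.column_stack([
    rng.uniform(0.1, 0.8, n),
    rng.uniform(0.0, 0.5, n),
    rng.integers(50, 500, n),
])
X = np.vstack([pathogenic, benign])
y = np.array([1] * n + [0] * n)  # 1 = Pathogenic, 0 = Benign

# Default hyperparameters are an assumption, not the authors' configuration.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y)

# Probability of the Pathogenic class, usable as a ranking score.
scores = clf.predict_proba(X)[:, 1]
```

The scores here are computed on the training data purely to show the call sequence; any performance estimate needs held-out data.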

Table 1

Three groups of features used for the development of the binary classifiers.
Group	Feature 1	Feature 2	Feature 3
PPScore	PC	PP	N_g
DNAProf	f_ref	f_alt	N_g
ProtProf	f_wt	f_mut	N_p

Training and testing procedure

We first evaluated the performance of each method on CommonClinvar using a 10-fold cross-validation procedure. To minimize possible overfitting, we mapped each missense variant to the corresponding protein sequence and clustered all the sequences using the blastclust algorithm (https://ftp.ncbi.nih.gov/blast/documents/blastclust.html) with a sequence identity threshold of 25% and a coverage of 50%. Using this sequence-similarity clustering, we performed the 10-fold cross-validation keeping all the variants belonging to the same cluster in the same subset. A second test was performed on the NewClinvar dataset. In this case the impact of the variants of a given protein was predicted after excluding from the training set (CommonClinvar) all the variants belonging to proteins of the same cluster. For each test we extracted a balanced set of Pathogenic and Benign variants from the CommonClinvar and NewClinvar datasets by randomly downsampling the more abundant class. The reported scoring measures for all the methods are averaged over ten randomly selected sets.
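The two ingredients of this procedure, cluster-aware folds and class balancing, can be sketched in plain Python. The function names and the round-robin assignment of clusters to folds are assumptions for illustration; the source does not describe how clusters were distributed across the ten subsets.

```python
import random
from collections import defaultdict


def grouped_folds(cluster_ids, n_folds=10, seed=0):
    """Assign each variant to a fold so that no cluster spans two folds.

    cluster_ids: one cluster label per variant (e.g. a blastclust cluster).
    Clusters are shuffled and dealt out round-robin, so fold sizes are
    only approximately equal.
    """
    clusters = sorted(set(cluster_ids))
    random.Random(seed).shuffle(clusters)
    fold_of_cluster = {c: i % n_folds for i, c in enumerate(clusters)}
    return [fold_of_cluster[c] for c in cluster_ids]


def balance(labels, seed=0):
    """Return sorted indices of a balanced subset (larger class downsampled)."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    size = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, size))
    return sorted(kept)


variant_clusters = ["c1", "c1", "c2", "c3", "c3", "c3", "c4"]
folds = grouped_folds(variant_clusters, n_folds=2)
balanced_idx = balance([1, 1, 1, 1, 0, 0])
```

Repeating `balance` with ten different seeds and averaging the resulting scores would correspond to the "ten randomly selected sets" described above. scikit-learn's `GroupKFold` offers a ready-made alternative for the grouping step.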

Benchmarking and performance measures

To characterize the predictive power of the main conservation features described in this work, for each of them we developed a single-feature binary classifier based on a single threshold. For each feature the classification threshold is optimized on the CommonClinvar dataset by maximizing both the true positive and true negative rates. The optimized threshold is then tested in the classification of the NewClinvar variant dataset. The same procedure is used to optimize the raw score output threshold of the CADD algorithm (Rentzsch et al., 2019) when it is used as a binary classifier.
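A single-threshold classifier of this kind can be sketched as an exhaustive scan over candidate cut-offs. The source says only that both the true positive and true negative rates are maximized; the sketch assumes this means maximizing their sum, which is one common reading. The function name and the direction flag are hypothetical.

```python
def best_threshold(scores, labels, higher_is_pathogenic=True):
    """Return (threshold, TPR + TNR) for the best cut-off on a training set.

    labels: 1 = Pathogenic, 0 = Benign; both classes must be present.
    With higher_is_pathogenic=False a variant is called Pathogenic when its
    score is <= threshold (useful for f_alt or f_mut, where rare means
    damaging).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_sum, best_t = -1.0, None
    for t in sorted(set(scores)):
        if higher_is_pathogenic:
            predicted = [s >= t for s in scores]
        else:
            predicted = [s <= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        tn = sum(1 for p, y in zip(predicted, labels) if not p and y == 0)
        current = tp / n_pos + tn / n_neg
        if current > best_sum:
            best_sum, best_t = current, t
    return best_t, best_sum


# Conservation-like score: higher values suggest pathogenicity.
cons_result = best_threshold([0.1, 0.4, 0.35, 0.8, 0.9, 0.7], [0, 0, 0, 1, 1, 1])
# Frequency-like score: lower values suggest pathogenicity.
freq_result = best_threshold([0.0, 0.001, 0.2, 0.3], [1, 1, 0, 0],
                             higher_is_pathogenic=False)
```

The threshold found on the training set is then applied unchanged to the test set, as described above.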

Finally, the performances of all the binary classifiers described above are compared with those achieved by the CADD algorithm. All the measures considered for scoring the performance of the methods are defined in Supplementary Materials.
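The supplementary definitions are not reproduced here. The sketch below uses the standard confusion-matrix definitions, which are assumed to match them. The counts are hypothetical: 1,000 variants per class, chosen so that the true positive and true negative rates equal those reported for f_mut in Table 2 (0.810 and 0.819).

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification measures from confusion-matrix counts.

    Assumes every denominator is non-zero except the MC one, which is guarded.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Q2": (tp + tn) / (tp + fn + tn + fp),   # overall accuracy
        "TPR": tp / (tp + fn),                   # sensitivity / recall
        "TNR": tn / (tn + fp),                   # specificity
        "PPV": tp / (tp + fp),                   # precision
        "NPV": tn / (tn + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "MC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical balanced set: 1,000 Pathogenic and 1,000 Benign variants.
metrics = binary_metrics(tp=810, fn=190, tn=819, fp=181)
```

With these counts Q2 is 0.8145 and the other values round to PPV 0.817, NPV 0.812, F1 0.814 and MC 0.629, in line with the f_mut row of Table 2, as expected for a balanced test set.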

Feature analysis and single feature classification

In the first part of our work, we analyzed the distributions of the main features used for the classification task. We focused on the six conservation features (PC, PP, f_ref, f_alt, f_wt, f_mut), comparing their distributions for the subsets of Pathogenic and Benign variants. The average, median and standard deviation of these distributions are reported in Table S2. As observed in previous works (Kircher et al., 2014; Capriotti and Fariselli, 2017), the distributions of the PhyloP100way score (PP) at mutated loci associated with Pathogenic and Benign variants are significantly different (Fig. 1). Indeed, the two distributions show median values of 7.5 and 1.5, respectively, with a Kolmogorov-Smirnov distance (KSD) of 0.57 (Fig. 1B and Table S2). This distance is greater than the KSD observed for the PhastCons100way score (PC).
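The Kolmogorov-Smirnov distance used here is the largest vertical gap between the empirical cumulative distributions of the two subsets. A minimal pure-Python sketch follows, with toy values rather than the paper's data; the function name is hypothetical.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Evaluates both empirical CDFs at every observed value and returns the
    largest absolute difference (0 = identical, 1 = fully separated).
    """
    a, b = sorted(sample_a), sorted(sample_b)
    distance = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)   # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)   # fraction of b that is <= x
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


ksd_overlap = ks_distance([1, 2, 3, 4], [3, 4, 5, 6])
ksd_separated = ks_distance([1, 2], [3, 4])
```

For larger datasets, `scipy.stats.ks_2samp` computes the same statistic together with a p-value.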

A larger difference between the distributions for the Pathogenic and Benign subsets is observed when the frequencies in the sequence profiles from genomic and protein alignments are considered. The most remarkable differences are detected for the frequency of the alternative allele (f_alt) and of the mutant residue (f_mut), for which the KSD is ~0.60. For the frequencies of the reference allele (f_ref) and wild-type residue (f_wt), the KSD is 0.58 and 0.55, respectively (Table S2). The distributions of the four types of frequencies (f_ref, f_alt, f_wt, and f_mut) for the subsets of Pathogenic and Benign variants are plotted in Fig. 2.

This observation agrees with the results obtained in the prediction of Pathogenic variants using a classification threshold on a single feature. The classification threshold is optimized on the CommonClinvar dataset by maximizing both the true positive and true negative rates (Table S3). Applying the optimized thresholds to the variants in the NewClinvar dataset, we found that a simple classifier based on the frequency of the mutant residue extracted from a protein sequence profile achieves 81% overall accuracy (Q2), a Matthews correlation coefficient (MC) of 0.63 and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.86 (Table 2).

Table 2

Performance of basic predictors based on a single feature on the *NewClinvar* dataset. Prediction thresholds are optimized on the *CommonClinvar* dataset. Q2: Overall Accuracy, TNR: True Negative Rate, NPV: Negative Predictive Value, TPR: True Positive Rate, PPV: Positive Predictive Value, MC: Matthews Correlation Coefficient, F1: harmonic mean of precision and sensitivity, AUC: Area Under the Receiver Operating Characteristic Curve, AUP: Area Under the Precision-Recall Curve. All the performance measures are defined in Supplementary Materials.
Feature	Threshold	Q2	TNR	NPV	TPR	PPV	MC	F1	AUC	AUP
PC	1.000	0.737	0.611	0.816	0.862	0.689	0.489	0.766	0.755	0.815
PP	4.704	0.769	0.796	0.756	0.743	0.784	0.539	0.763	0.841	0.828
f_ref	0.977	0.779	0.815	0.760	0.742	0.801	0.559	0.770	0.836	0.843
f_alt	0.000	0.794	0.750	0.821	0.837	0.770	0.589	0.802	0.828	0.863
f_wt	0.702	0.769	0.806	0.750	0.731	0.791	0.539	0.759	0.844	0.836
f_mut	0.005	0.815	0.819	0.812	0.810	0.817	0.629	0.814	0.857	0.856

Consistent with the previous observation, the PhastCons100way score (PC) is the least discriminating feature. Using the optimized threshold on the NewClinvar variants, the method based on a single PC threshold achieves 74% overall accuracy, 0.49 MC and 0.75 AUC (Table 2). Compared with f_mut, slightly lower performance is obtained when the frequencies of the reference allele and the wild-type residue in the sequence profile are considered. In this case the method based on a single f_ref threshold results in 78% Q2, 0.56 MC and 0.84 AUC. These results can also be observed in the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves reported in Fig. S2.

Assessment of the machine learning methods

Starting from the previous observations, we developed three machine learning approaches based on the different groups of conservation features. The PPScore method is based on the PhastCons100way and PhyloP100way scores, which are conservation measures that do not describe the type of nucleotides observed at the mutated locus. The other two methods consider the frequencies of the nucleotides or residues in the original and new sequences, which correspond to f_ref, f_alt for DNAProf and f_wt, f_mut for ProtProf. To each group of measures we added a third feature representing the total number of sequences aligned at the mutated locus (N_g, N_p). We implemented three machine learning methods for predicting Pathogenic variants based on the gradient boosting algorithm with these groups of features. First, the performance of these methods was tested with a 10-fold cross-validation procedure on the CommonClinvar dataset. To avoid possible overfitting we clustered all the proteins based on sequence identity and kept all the variants of a cluster in the same subset. The average performance of PPScore, DNAProf and ProtProf on a balanced set of Pathogenic and Benign variants is reported in Table S4. The results show that among the three methods ProtProf, which is based on the protein sequence profile, achieved the highest performance, reaching 83% overall accuracy (Q2), 0.67 Matthews correlation coefficient and 0.91 Area Under the Receiver Operating Characteristic Curve (AUC). PPScore, which is based on the PhastCons100way and PhyloP100way scores, shows the lowest performance, with ~4% lower AUC and ~9% lower MC. An intermediate level of performance is achieved by DNAProf, which results in ~2% lower AUC and ~3% lower MC with respect to ProtProf.

Similar results are obtained when assessing the performance of the three methods on the NewClinvar dataset. Also in this case we predicted the impact of each variant after removing from the CommonClinvar training set all the variants belonging to the same cluster of proteins. The performance of PPScore, DNAProf and ProtProf on a balanced set of variants from the NewClinvar dataset is summarized in Table 3.
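The exclusion of same-cluster variants from the training set can be sketched as follows. The function name and the per-cluster iteration are assumptions: the source states what is excluded, not how the loop over test proteins was organized.

```python
from collections import defaultdict


def leave_cluster_out(train_clusters, test_clusters):
    """Yield (cluster, train_indices, test_indices) triples.

    train_clusters: cluster label of each training variant (CommonClinvar).
    test_clusters: cluster label of each test variant (NewClinvar).
    For each cluster found in the test set, every training variant from
    that cluster is removed before a model would be fitted.
    """
    test_by_cluster = defaultdict(list)
    for i, cluster in enumerate(test_clusters):
        test_by_cluster[cluster].append(i)
    for cluster, test_idx in test_by_cluster.items():
        train_idx = [i for i, c in enumerate(train_clusters) if c != cluster]
        yield cluster, train_idx, test_idx


splits = list(leave_cluster_out(["a", "a", "b", "c"], ["a", "d", "a"]))
```

In the toy example, test variants from cluster `a` are predicted by a model trained without the two `a` variants, while cluster `d` has no relatives in the training set and keeps it whole.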

Table 3

Prediction in cross-validation of the *NewClinvar* variant dataset. Q2: Overall Accuracy, TNR: True Negative Rate, NPV: Negative Predictive Value, TPR: True Positive Rate, PPV: Positive Predictive Value, MC: Matthews Correlation Coefficient, F1: harmonic mean of precision and sensitivity, AUC: Area Under the Receiver Operating Characteristic Curve, AUP: Area Under the Precision-Recall Curve. All the performance measures are defined in Supplementary Materials. For CADD a raw score classification threshold of 3.1 was considered.
Method	Q2	TNR	NPV	TPR	PPV	MC	F1	AUC	AUP
CADD	0.844	0.821	0.860	0.867	0.829	0.688	0.847	0.911	0.905
ProtProf	0.831	0.865	0.809	0.796	0.855	0.662	0.824	0.910	0.905
DNAProf	0.812	0.780	0.834	0.845	0.794	0.626	0.818	0.881	0.873
PPScore	0.771	0.776	0.769	0.767	0.774	0.543	0.770	0.855	0.846

Comparison with CADD algorithm

In the final part of our analysis we compared the performance of our simple gradient boosting-based algorithms with that obtained with CADD (Rentzsch et al., 2019). CADD is one of the most accurate methods for predicting Pathogenic variants in coding and non-coding regions (Benevenuta et al., 2021). This method, which is based on hundreds of genomic features, was trained on more than 30 million variants. To use CADD as a binary classifier we considered the raw output of the program and selected the threshold that maximizes the true positive and true negative rates on the CommonClinvar dataset. The performance of CADD at the optimal raw score classification threshold of 3.1 is reported in Table S4. This threshold was used for the classification of the variants in the NewClinvar dataset. The performance of CADD on the NewClinvar dataset is summarized in Table 3. This analysis shows that the CADD and ProtProf algorithms achieve similar performance in the classification of Pathogenic missense variants in terms of the areas under the Receiver Operating Characteristic (AUC) and Precision-Recall (AUP) curves on both the CommonClinvar and NewClinvar datasets. We can also observe that DNAProf, which is based on the sequence profile extracted from the multiz100way sequence alignments, results in only ~3% lower AUC and AUP. The Receiver Operating Characteristic and Precision-Recall curves for CADD and the three methods presented in this manuscript are plotted in Fig. 3.
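Because AUC depends only on how a method ranks variants, it allows raw CADD scores and classifier probabilities to be compared without choosing a threshold. The sketch below computes AUC as the fraction of Pathogenic/Benign pairs that are ranked correctly; the function name and the toy scores are illustrative only.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen Pathogenic variant
    (label 1) receives a higher score than a randomly chosen Benign one
    (label 0); ties count as half. Quadratic in the number of variants,
    so suited to small examples only.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc_toy = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

For datasets of realistic size, `sklearn.metrics.roc_auc_score` gives the same quantity efficiently.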

Here we analyzed different encodings of evolutionary information for missense variant pathogenicity prediction. We compared the encoding at the DNA and protein levels, where different multiple alignment techniques apply. For many genes, multiple alignments of protein sequences include more sequences and more remote homologs than the pre-calculated genome alignments from the UCSC genome browser. This may be the reason why the performance of a method trained using the protein-based information is slightly better. With these simple inputs based on evolutionary information, a machine learning method can perform comparably to CADD, which uses more sophisticated inputs. This result shows that, at least for missense variants, an input based on the evolutionary information of the wild-type and mutated residues plays the most relevant role in scoring pathogenic variants.

Recently, it has been suggested that protein positions have a significant role and can act as Neutral, Toggle or Rheostat (Miller et al., 2019). Here we propose an alternative view in which the behavior of a protein position is described by a non-linear combination of the frequencies of wild-type/mutant residues at the protein level or reference/alternative alleles at the DNA level. The results of our analysis suggest that the performance of new and more sophisticated machine learning algorithms should always be compared with that achieved by simple conservation-based methods. As recently proposed (Walsh et al., 2021), the design of such benchmark tests should follow specific guidelines for avoiding bias in the training and testing sets. This procedure is important to exclude overfitting on context-dependent features (Grimm et al., 2015) and to identify new important features for improving the performance of variant scoring algorithms.

Funding

This work was supported by the PRIN project, “Integrative tools for defining the molecular basis of the diseases: Computational and Experimental methods for Protein Variant Interpretation” of the Ministero Istruzione, Università e Ricerca (PRIN201744NR8S).

Conflicts of interest/Competing interests

None

Availability of data and material

Additional information is available in Supplementary Materials. The CommonClinvar and NewClinvar datasets are provided as supplementary files.

Authors' contributions

EC developed and collected and analyzed the data. Both the authors designed the research and contributed to the writing of the manuscript.

Acknowledgments

PF thanks the Italian Ministry for Education, University and Research for the programme “Dipartimenti di Eccellenza 2018–2022 D15D18000410001”.

References

Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389–402.
Ancien,F. et al. (2018) Prediction and interpretation of deleterious coding variants in terms of protein structural stability. Sci. Rep., 8, 4480.
Benevenuta,S. et al. (2021) Calibrating variant-scoring methods for clinical decision making. Bioinformatics, (In press).
Calabrese,R. et al. (2009) Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat, 30, 1237–44.
Capriotti,E. et al. (2012) Bioinformatics for personal genome interpretation. Brief Bioinform, 13, 495–512.
Capriotti,E. et al. (2019) Integrating molecular networks with genetic variant interpretation for precision medicine. Wiley Interdiscip Rev Syst Biol Med, 11, e1443.
Capriotti,E. et al. (2006) Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics, 22, 2729–34.
Capriotti,E. et al. (2013) WS-SNPs&GO: a web server for predicting the deleterious effect of human protein variants using functional annotation. BMC Genomics, 14 Suppl 3, S6.
Capriotti,E. and Fariselli,P. (2017) PhD-SNPg: a webserver and lightweight tool for scoring single nucleotide variants. Nucleic Acids Res, 45, W247–W252.
Claussnitzer,M. et al. (2020) A brief history of human disease genetics. Nature, 577, 179–189.
Fernald,G.H. et al. (2011) Bioinformatics challenges for personalized medicine. Bioinformatics, 27, 1741–8.
Grimm,D.G. et al. (2015) The Evaluation of Tools Used to Predict the Impact of Missense Variants Is Hindered by Two Types of Circularity. Hum. Mutat., 36, 513–523.
Kent,W.J. et al. (2002) The human genome browser at UCSC. Genome Res., 12, 996–1006.
Kircher,M. et al. (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet, 46, 310–5.
Landrum,M.J. et al. (2020) ClinVar: improvements to accessing data. Nucleic Acids Res., 48, D835–D844.
MacArthur,D.G. et al. (2014) Guidelines for investigating causality of sequence variants in human disease. Nature, 508, 469–76.
McInnes,G. et al. (2021) Opportunities and challenges for the computational interpretation of rare variation in clinically important genes. Am. J. Hum. Genet., 108, 535–548.
Miller,M. et al. (2019) funtrp: identifying protein positions for variation driven functional tuning. Nucleic Acids Res., 47, e142.
Niroula,A. and Vihinen,M. (2016) Variation Interpretation Predictors: Principles, Types, Performance, and Choice. Hum. Mutat., 37, 579–597.
Ozturk,K. and Carter,H. (2021) Predicting functional consequences of mutations using molecular interaction network features. Hum. Genet.
Pedregosa,F. et al. (2011) Scikit-learn: Machine Learning in Python. JMLR, 12, 2825–2830.
Petrosino,M. et al. (2021) Analysis and Interpretation of the Impact of Missense Variants in Cancer. Int. J. Mol. Sci., 22, 5416.
Pollard,K.S. et al. (2010) Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res, 20, 110–21.
Rentzsch,P. et al. (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res., 47, D886–D894.
Rost,B. et al. (2016) Protein function in precision medicine: deep understanding with machine learning. FEBS Lett., 590, 2327–2341.
Siepel,A. et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res, 15, 1034–50.
Suzek,B.E. et al. (2007) UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23, 1282–8.
Tennessen,J.A. et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science, 337, 64–69.
Walsh,I. et al. (2021) DOME: recommendations for supervised machine learning validation in biology. Nat. Methods.

Editorial decision: Minor revisions
26 Oct, 2021
Reviewers invited by journal
06 Sep, 2021
Editor assigned by journal
06 Sep, 2021
First submitted to journal
06 Sep, 2021
