Study design
We developed a machine learning algorithm that predicts the impact of nonsynonymous single-nucleotide polymorphisms (nsSNPs) on the susceptibility of Mycobacterium tuberculosis to antituberculous medications, based on amino acid sequence and predicted protein secondary structure.
Databases:
We collected data from two databases: drug resistance mutations from the Tuberculosis Drug Resistance Mutation Database (TBDReaMDB) and drug-sensitive variants from the GMTV database (14,15).
TBDReaMDB is a comprehensive resource on drug resistance mutations in M. tuberculosis, compiled through a systematic review of the existing literature (14).
GMTV integrates data from different sources on M. tuberculosis molecular biology, epidemiology, TB clinical outcome, year and place of isolation, and drug resistance profiles, and displays the variants across the genome in a dedicated genome browser. The GMTV database, which includes 1084 genomes and over 69,000 SNP or indel variants, can be queried for M. tuberculosis genome variation and putative associations with drug resistance, geographical origin, and clinical stage and outcome (15).
Inclusion and Exclusion criteria:
We collected a list of 1488 nsSNPs associated with resistance to rifampicin, isoniazid, pyrazinamide, and ethambutol. The data included the gene ID, protein ID, codon number, wild-type amino acid, mutant amino acid, drug susceptibility, and variant impact on the protein.
Variants were divided into two groups (sensitive vs. resistant) according to drug sensitivity. Variants found in drug-sensitive organisms were labelled sensitive, whereas variants found in drug-resistant organisms were labelled resistant. We included only proteins that had variants in both groups. These proteins are Rv0667, Rv1908c, Rv2043c, Rv2428, Rv3793, and Rv3795. The final number of variants included was 1115.
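The inclusion rule above (keep only proteins that have variants in both groups) can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the study: the record layout, the function name keep_proteins_with_both_labels, and the toy records (protA, protB) are all hypothetical placeholders and are not entries from TBDReaMDB or GMTV.

```python
from collections import defaultdict


def keep_proteins_with_both_labels(variants):
    """Keep variants whose protein has at least one sensitive
    and at least one resistant variant."""
    labels_by_protein = defaultdict(set)
    for v in variants:
        labels_by_protein[v["protein"]].add(v["label"])

    required = {"sensitive", "resistant"}
    kept = {p for p, labels in labels_by_protein.items() if required <= labels}
    return [v for v in variants if v["protein"] in kept]


# Hypothetical toy records (placeholders, not real database entries).
toy_variants = [
    {"protein": "protA", "codon": 10, "wild": "S", "mutant": "L", "label": "resistant"},
    {"protein": "protA", "codon": 25, "wild": "A", "mutant": "V", "label": "sensitive"},
    {"protein": "protB", "codon": 7, "wild": "G", "mutant": "D", "label": "resistant"},
]

filtered = keep_proteins_with_both_labels(toy_variants)
print([(v["protein"], v["codon"]) for v in filtered])
```

In this toy input, protB is dropped because it has only resistant variants, which mirrors how the variant set shrinks once proteins represented in a single group are excluded.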
Feature generation:
We used the PMut online tool to generate the features used to train the classifiers.
The PMut web portal allows the user to perform pathology predictions, to access a complete repository of pre-calculated predictions, and to generate and validate new predictors. Its default predictor is reported to perform with good quality scores. The PMut portal is freely accessible at http://mmb.irbbarcelona.org/PMut (16).
The features computed by PMut and entered into the classifiers were the number of sequences in the alignment, the number of amino acids in the aligned position (no gaps), the total and relative number of aligned wild-type amino acids, the total and relative number of aligned mutated amino acids, the position weight matrix score, and the PMut overall score. Table (1) below shows the selected features for the PMut2017 predictor.
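To make the alignment-based features more concrete, the sketch below counts them for a single alignment column. It is only an illustration under stated assumptions: PMut computes these values internally and its exact definitions may differ, the function name column_features is hypothetical, the relative counts are assumed to be fractions of the non-gap residues, and the example column is made up.

```python
def column_features(column, wild, mutant, gap="-"):
    """Count-based features for one alignment column.

    column: residues observed at the aligned position, one per sequence,
            with `gap` marking sequences that have no residue there.
    """
    n_sequences = len(column)
    residues = [aa for aa in column if aa != gap]
    n_aligned = len(residues)          # amino acids at the position, gaps excluded
    n_wild = residues.count(wild)
    n_mutant = residues.count(mutant)

    # Assumption: "relative" means relative to the non-gap residues.
    rel_wild = n_wild / n_aligned if n_aligned else 0.0
    rel_mutant = n_mutant / n_aligned if n_aligned else 0.0

    return {
        "n_sequences": n_sequences,
        "n_aligned": n_aligned,
        "n_wild": n_wild,
        "rel_wild": rel_wild,
        "n_mutant": n_mutant,
        "rel_mutant": rel_mutant,
    }


# Made-up column for a hypothetical S -> L substitution.
features = column_features("SSLS-SSA-S", wild="S", mutant="L")
print(features)
```

A position where the wild-type residue dominates the column and the mutant residue is rare, as in this toy column, is the kind of conservation signal these features are meant to capture.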
Classifier training and testing
We trained different classifiers on the features generated by PMut and compared them using the confusion matrix, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC). The classifiers trained were random forest (rf), boosting (ada), naive Bayes (nb), neural networks (nnet), k-nearest neighbors (knn), logistic regression (LR), and linear discriminant analysis (lda).
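The two comparison measures named above can be illustrated with a short, dependency-free Python sketch. It is not the study's code (the analysis itself was run in R), and the labels, scores, and 0.5 threshold are invented for illustration. The AUC is computed through its rank interpretation: the probability that a randomly chosen resistant variant receives a higher score than a randomly chosen sensitive one, with ties counted as one half.

```python
def confusion_matrix(y_true, y_pred, positive="resistant"):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


def auc(y_true, scores, positive="resistant"):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented toy labels and classifier scores (probability of "resistant").
y_true = ["resistant", "resistant", "sensitive", "resistant", "sensitive", "sensitive"]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = ["resistant" if s >= 0.5 else "sensitive" for s in scores]

cm = confusion_matrix(y_true, y_pred)
auc_value = auc(y_true, scores)
print(cm, round(auc_value, 3))
```

The confusion matrix depends on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why the two are reported together when classifiers are compared.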
Software/Packages:
We used Microsoft Excel 2016 and SPSS v22 to clean the data. We used R software to train the classifiers. We used the “caret” and “nnet” packages to train and evaluate the classifiers, and the “ROCR” package to plot the ROC curves and calculate the AUC for each classifier.