Long Short Term Memory model for accurate predictions of Single Nucleotide Polymorphisms in Mycobacterium tuberculosis from timeseries genome analysis

doi:10.21203/rs.3.rs-1512018/v1

This work is licensed under a CC BY 4.0 License

Drug resistance in tuberculosis is on the World Health Organization's list of serious threats, with a critical focus on addressing the genomic variations in Mycobacterium tuberculosis. Studying these variations provides an opportunity to better understand the evolutionary progression leading to antimicrobial resistance, which affects the economic stability of the global healthcare sector. A timeline genomic analysis of 578 Mycobacterium tuberculosis genomes from 2003 to 2021 was performed to understand the pattern underlying the genomic variations. A total of 476,053 mutations with a transition/transversion (Ts/Tv) ratio of 0.448 were observed. A long short-term memory (LSTM) model, a recurrent neural network approach, was then optimized to predict the genome-wide mutations, with the data split in an 80:20 ratio of training to test set.

The genomic sequences were split into five batches. The error rate averaged 5.14% across four of the five batches and was 38.99% in the remaining batch (batch 4), with accurate position-specific predictions. These results can help counter antimicrobial resistance by identifying regions of high and low genomic variability, providing insights into prospective novel drug targets. Further scope lies in improving the model by enriching the datasets.

Keywords: Long Short Term Memory, Mycobacterium tuberculosis, Single Nucleotide Polymorphisms, Predictions

According to a WHO estimate, one third of the world’s population is infected with Mycobacterium tuberculosis (MTb), the bacterium that causes tuberculosis (TB) [1, 2]. Although the disease is generally curable and preventable, a number of first-line drugs have recently become ineffective due to mutations in the organism across its generations [3–5]. One of the main reasons the disease is so widespread is that it is airborne and hence easily passed from one person to another. Treatment is severely complicated by the fact that the genetic sequence of the organism has mutated over time, rendering a number of drugs ineffective [6–8]. Any form of tuberculosis that does not respond to the drugs designed for its treatment is referred to as drug-resistant TB [9, 10].

Drug resistance arises either from improper administration of TB drugs or from genetic mutations across generations. According to a WHO study conducted in 2018, treatment had a 54% chance of success against multidrug-resistant TB and only a 30% chance against extensively drug-resistant TB [11].

Typical virulence factors displayed by other bacterial pathogens, such as the toxins secreted by Escherichia coli O157:H7, Corynebacterium diphtheriae, Shigella dysenteriae and others, are not observed in Mycobacterium tuberculosis. Little is known about the mode of action of M. tuberculosis; however, its virulence can be measured quantitatively. These measurements are then used to establish how modifications in the bacterium affect its ability to cause disease [12]. Virulence is described using terms such as “mortality” and “morbidity”. Mortality can be defined as the percentage of individuals that die from infection, or alternatively as the time lag between infection and death. A second pertinent parameter is bacterial load, the count of bacteria found in the individual after the initial infection. This helps to identify how the bacteria respond in the host system and their ability to survive (fitness level) [14].

Understanding the pathogenesis of TB helps to better understand and measure morbidity and mortality, and thus virulence. Uncontrolled growth of the bacteria at the site of infection results in widespread lung damage that ultimately causes death by asphyxiation. The lack of oxygen arises from the destruction of lung parenchymal cells, which are responsible for oxygen uptake, the blockage of the bronchiolar passages by granulomatous growths, and the rupture of liquefied granulomas with resultant bleeding into adjacent lung tissue. Other types of tuberculosis, such as tubercular meningitis, which affects the meningeal membranes of the brain, result in death due to inflammation of brain tissue [15–18].

In the current work, we provide a holistic view of the mutational patterns in M. tuberculosis via an evolutionary time series analysis. This involves capturing the SNPs from 2003 to 2021 and using a long short-term memory (LSTM) model, a type of recurrent neural network (RNN), to predict the mutations.

2.1 Artificial Neural Networks

A neural network (NN) is a network of connected processing units that each perform a computation; together they act as a universal function approximator. The processing units, known as perceptrons, form the basic building blocks of an artificial neural network (ANN) [19]. Figure 1a shows an example of an ANN that takes inputs, processes them using weights and biases, and gives a prediction as output.

A shallow neural network has fewer hidden layers than a deep neural network, which can contain hundreds or even thousands of perceptrons arranged in several hidden layers. The weights and biases are updated after every training iteration through backpropagation.

2.2 Recurrent Neural Networks (RNNs)

Recurrent neural networks were first developed in the 1980s. Like an ANN, an RNN consists of an input layer, hidden layers and an output layer. The chain-like structure of the RNN stores memory from previous steps [19]. An RNN accepts a sequence of steps and keeps track of the processing that took place in the past, unlike a feedforward neural network, whose output does not depend on past processing. The output from step t − 1 is fed back into the network to influence the outcome of step t, and so on for each subsequent step. RNNs have therefore been successful in learning sequences. The sequential learning process is shown in Fig. 1b.

Figure 1b illustrates a simple RNN with one input unit, one output unit, and one recurrent hidden unit unrolled into a full network, where Xt is the input at time step t and ht is the output at time step t. Backpropagation through time (BPTT) is the process RNNs use to update the weights and biases while taking the feedback connections into account. It works backward, layer by layer, from the network’s final output, updating the weights based on the total output error. Because the same feedback loop is traversed at every time step, error gradients accumulate multiplicatively: they can grow into very large weight updates that destabilize the network (exploding gradients) or shrink towards zero (vanishing gradients). BPTT is therefore inefficient at learning patterns with long-term dependencies. This is one of the main difficulties in training recurrent neural networks, and it is addressed by Long Short-Term Memory networks (LSTMs).
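The recurrence and the gradient problem can be illustrated with a minimal sketch. It assumes a single scalar hidden unit with a tanh activation; the weights `w_x`, `w_h`, `b` and the input sequence are hypothetical values chosen for illustration and do not come from the study. The function also tracks the derivative of the final hidden state with respect to the initial one, which is the chain-rule product that BPTT has to propagate.

```python
import math


def rnn_forward(xs, w_x=0.5, w_h=0.9, b=0.0, h0=0.0):
    """Run a one-unit RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b).

    Returns the hidden states and d h_T / d h_0, the product of
    per-step derivatives used by backpropagation through time.
    """
    h = h0
    states = []
    grad = 1.0  # d h_0 / d h_0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h + b)
        # For tanh, d h_t / d h_{t-1} = w_h * (1 - h_t ** 2).
        grad *= w_h * (1.0 - h * h)
        states.append(h)
    return states, grad


xs = [1.0] * 30  # hypothetical constant input sequence
_, grad_short = rnn_forward(xs[:3])
_, grad_long = rnn_forward(xs)
print(f"d h_3 / d h_0  = {grad_short:.3e}")
print(f"d h_30 / d h_0 = {grad_long:.3e}")
```

With these values every per-step factor is below 1, so the product shrinks as the sequence gets longer, which is the vanishing gradient case. Factors above 1 would instead make the product grow, which is the exploding case.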

2.3 Long Short-Term Memory Networks (LSTM)

Long Short-Term Memory, an evolution of the RNN, was introduced by Hochreiter and Schmidhuber primarily to tackle the problems posed by RNNs, by adding additional interactions per module (or cell). LSTMs can be regarded as an advanced version of RNNs that can remember information for prolonged periods of time [21, 22].

A detailed architecture of an LSTM model is shown in Fig. 1c.
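For reference, the additional interactions inside an LSTM cell are commonly written as the gate equations below. This is the widely used formulation with a forget gate, given here as an assumption about the cell type, since the text does not spell out the equations. Here x_t is the input, h_t the hidden state (output), c_t the cell state, sigma the logistic sigmoid and \odot element-wise multiplication; W, U and b are learned weights and biases.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive update of c_t is what lets information, and gradients, persist over many time steps when the forget gate stays close to 1.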

3.1 Comprehensive overview of variations in M. tuberculosis

Genome analysis was performed for 578 assembled samples across 18 years (2003–2021). The distribution of samples per year is shown in Fig. 1a. Details of the 578 samples, including assembly name, assembly BioSample accession and strain, are provided in Supplementary file 1.

Variants were called from the bam files using iVar and the output was recorded in vcf format.

Across all 578 samples, a total of 476,053 variations were recorded, corresponding to about 1 variation in every 9 nucleotides. These comprise 430,660 single nucleotide polymorphisms (SNPs), 45,211 multiple nucleotide polymorphisms (MNPs), 147 insertions, 4 deletions and 31 mixed variants. No inversions or duplications were found. Fig. 2a gives an overview of the number of mutations occurring in every 100k base pairs of the Mtb genome.
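The reported totals can be checked with a few lines of arithmetic. The sketch below uses the category counts given above and assumes a genome length of about 4.4 million bp (the figure used in the data pre-processing description). Note that the resulting rate pools the variants from all 578 samples against a single genome length.

```python
# Variant counts by category, as reported in the text.
variant_counts = {
    "SNP": 430_660,
    "MNP": 45_211,
    "insertion": 147,
    "deletion": 4,
    "mixed": 31,
}
total_variants = sum(variant_counts.values())

# Assumption: genome length of roughly 4.4 million bp.
genome_length_bp = 4_400_000

nucleotides_per_variant = genome_length_bp / total_variants
snp_fraction = variant_counts["SNP"] / total_variants

print(f"total variants: {total_variants}")
print(f"one variant per {nucleotides_per_variant:.2f} nucleotides")
print(f"SNP share of all variants: {100 * snp_fraction:.1f}%")
```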

3.2 LSTM model accurately predicts SNPs for 2021

The LSTM models were trained successfully, and their accuracies and loss values were measured. For the year 2021 the SNPs are already known, which makes it the best case for comparing predictions against actual values. For batch 1, the number of SNPs predicted for 2021 is 2936, whereas the actual number is 3061, which gives a prediction accuracy of about 95.9%. The accuracies for all batches, calculated in the same way, are shown in Table 1. The mean squared error (MSE) loss function and the Adam optimizer with a learning rate of 0.001 were used. The LSTM architecture was implemented and trained in PyTorch, with 2 hidden layers used in addition to the LSTM layers. The model saturated at 420 epochs, and the same number of epochs was used for all five batches of SNP data.
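A training setup of this kind can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors' code: the MSE loss, the Adam optimizer with learning rate 0.001 and the 420 epochs come from the text, while the class name `SNPCountLSTM`, the hidden size of 32, the window length of 4, the arrangement of one LSTM layer followed by two fully connected hidden layers, and the random stand-in data are all assumptions.

```python
import torch
from torch import nn


class SNPCountLSTM(nn.Module):
    """Predict the next value of a scaled SNP-count series from a window."""

    def __init__(self, hidden_size: int = 32):
        super().__init__()
        # One LSTM layer reading one feature (the scaled count) per time step.
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        # Two fully connected hidden layers after the LSTM (one reading of the text).
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, window, 1); use the LSTM output at the last step.
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])


torch.manual_seed(0)
model = SNPCountLSTM()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Hypothetical stand-in data: 8 windows of 4 scaled values in [-1, 1].
windows = torch.rand(8, 4, 1) * 2 - 1
targets = torch.rand(8, 1) * 2 - 1

losses = []
for epoch in range(420):
    optimizer.zero_grad()
    loss = loss_fn(model(windows), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

In practice the windows and targets would come from the scaled yearly SNP counts of one batch, and the predictions would be mapped back to counts by inverting the scaling.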

For the 4th batch (3 million to 4 million base pairs), the MSE loss was 0.03%, yet the prediction error is 38.99%. This can be attributed to unpredictable variations in the SNP data for this region. The 5th batch, on the other hand, shows an MSE loss of 3.5%, but its predictions are very close to the actual results, with an error of 1.12%.

The line plot in Fig. 2b provides a visual comparison of the predicted versus actual SNPs for the 2021 M. tuberculosis data.

Table 1

A comparison table of predicted SNP from LSTM model with actual SNP from the genomic analysis along with the error rate.
Batch	Predicted SNPs	Actual SNP Values	Error Rate
1	2936	3061	4.08%
2	3495	3270	6.88%
3	4072	3754	8.47%
4	5311	3821	38.99%
5	1082	1070	1.12%
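The error rates in Table 1 are consistent with the absolute difference between predicted and actual counts, divided by the actual count. The sketch below recomputes them from the table values under that assumption; the function name `error_rate_percent` is illustrative. The results agree with the table up to rounding in the last digit.

```python
# (batch, predicted SNPs, actual SNPs) taken from Table 1.
table_1 = [
    (1, 2936, 3061),
    (2, 3495, 3270),
    (3, 4072, 3754),
    (4, 5311, 3821),
    (5, 1082, 1070),
]


def error_rate_percent(predicted: int, actual: int) -> float:
    """Absolute relative error of the prediction, in percent."""
    return 100.0 * abs(predicted - actual) / actual


error_rates = {batch: error_rate_percent(p, a) for batch, p, a in table_1}
for batch, rate in error_rates.items():
    print(f"batch {batch}: error {rate:.2f}%, accuracy {100 - rate:.2f}%")

# Mean error over the four batches other than batch 4.
others = [rate for batch, rate in error_rates.items() if batch != 4]
mean_error_without_batch_4 = sum(others) / len(others)
print(f"mean error excluding batch 4: {mean_error_without_batch_4:.2f}%")
```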

This model can be further optimized and used as a template for predicting mutations in other organisms on the WHO priority list for antimicrobial resistance.

4.1 Retrieving the data from public datasets

For the current study, 578 assembled whole-genome datasets from 2003 to 2021 were downloaded for M. tuberculosis from NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/genomes/?taxon=1773).

The genomic sequences were downloaded in fasta format for analysis. The reference genome used in the study is Mycobacterium tuberculosis H37Rv, NCBI reference sequence NC_000962 [23].

4.2 Mapping of samples with reference genomes

BBMap [24] is a splice-aware global aligner for DNA and RNA sequencing reads. It requires input in fasta or fastq format, either compressed or raw. The tool has been used for long RNA-seq reads [25], along with BBMerge [26]. Information on compiling, installing and running the tool is available at https://github.com/BioInfoTools/BBMap. The tool generates the coverage information using pileup, which avoids the need to run mpileup prior to variant calling. The output file is stored in bam format for further processing.

4.3 Variants calling

iVar [27] is a computational package for detecting variants in reference-aligned genomic sequences (in bam format). A detailed description of the dependencies, installation and usage of the package is available at https://andersen-lab.github.io/ivar/html/index.html.

The tool is also available on the Galaxy web server (https://usegalaxy.org/), which offers a user-friendly interface for uploading the bam files and running the iVar analysis pipeline. The output is stored in vcf format for further processing.

4.4 Data pre-processing

A year-wise database from 2003 to 2021 was created for the 4.4 million base pairs (bp) of the genome. The dataset was then separated into 5 batches, labelled 1 to 5: 0 to 1 million bp, 1 to 2 million bp, 2 to 3 million bp, 3 to 4 million bp, and 4 to 4.4 million bp. Figure 3 (a-e) shows the data visualized from batch 1, with the year on the x axis and the number of SNPs observed in that year on the y axis.
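The batching step can be sketched as follows. This is an illustrative sketch: the function names, the 1-based position convention, the assignment of boundary positions to the lower batch, and the example records are all assumptions, since the text only gives the batch ranges.

```python
from collections import Counter

BATCH_WIDTH_BP = 1_000_000  # batches 1 to 4 each span 1 million bp
LAST_BATCH = 5              # batch 5 covers 4 million bp to the genome end


def batch_label(position_bp: int) -> int:
    """Map a 1-based genome position to its batch label (1 to 5).

    Assumption: a position exactly on a boundary (e.g. 1,000,000)
    belongs to the lower batch.
    """
    return min((position_bp - 1) // BATCH_WIDTH_BP + 1, LAST_BATCH)


def yearly_snp_counts(snps):
    """Count SNPs per (batch, year) from (year, position) records."""
    counts = Counter()
    for year, position_bp in snps:
        counts[(batch_label(position_bp), year)] += 1
    return counts


# Hypothetical SNP records: (year, 1-based genome position).
example_snps = [
    (2003, 250_000),
    (2003, 2_500_000),
    (2021, 250_000),
    (2021, 1_000_000),
    (2021, 4_200_000),
]
counts = yearly_snp_counts(example_snps)
for (batch, year), n in sorted(counts.items()):
    print(f"batch {batch}, year {year}: {n} SNP(s)")
```

Each batch then yields one time series of yearly SNP counts, which is the input to the normalization and sequence-building steps.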

The SNP values are then normalized using a min/max scaler with minimum and maximum values of -1 and 1 respectively. The normalization is applied only to the training data and not to the test data, so that no information from the test set leaks into training. The dataset is converted to tensors that PyTorch [28] can accept as input, and the final pre-processing step converts the training data into sequences and labels.
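The scaling and sequence-building steps can be sketched in plain Python. The function names, the window length of 3 and the example counts are hypothetical; only the target range of -1 to 1 and the idea of pairing each input sequence with the following value as its label come from the description above.

```python
def fit_min_max(values, low=-1.0, high=1.0):
    """Fit a min/max scaler on `values`; return the scaler and its inverse."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        raise ValueError("cannot scale a constant series")

    def scale(v):
        return low + (v - v_min) * (high - low) / span

    def unscale(s):
        return v_min + (s - low) * span / (high - low)

    return scale, unscale


def make_sequences(series, window):
    """Pair each run of `window` values with the value that follows it."""
    return [
        (series[i:i + window], series[i + window])
        for i in range(len(series) - window)
    ]


# Hypothetical yearly SNP counts for one batch (training years only).
train_counts = [120, 340, 560, 410, 980, 1500, 2100, 2600]
scale, unscale = fit_min_max(train_counts)
scaled = [scale(c) for c in train_counts]
pairs = make_sequences(scaled, window=3)

print(f"{len(pairs)} (sequence, label) pairs")
print(f"scaled range: {min(scaled):.1f} to {max(scaled):.1f}")
```

Because the scaler is fitted on the training counts alone, the same `unscale` function can later map model outputs back to SNP counts without the test values having influenced the scaling.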

4.5 Model training with SNP data

The model was trained on the SNP data for each of the 5 SNP batches. For each of the above defined batches, the data is divided into training and test sets in the approximate ratio of 80% training and 20% testing.
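A split of this kind can be sketched as below. The sketch assumes the split is chronological over the yearly series, so that the most recent years, including 2021, form the test set; the text does not state how the split was made, and the function name is illustrative.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered series into a leading training part and a trailing test part."""
    n_train = round(len(series) * train_fraction)
    return series[:n_train], series[n_train:]


years = list(range(2003, 2022))  # 2003 to 2021 inclusive
train_years, test_years = chronological_split(years)
print(f"{len(train_years)} training years, {len(test_years)} test years")
```

With 19 yearly points the split cannot be exactly 80:20, which matches the "approximate ratio" wording above.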

Prediction of mutations from time series genomic data has long been regarded as a viable approach with promise for drug discovery. In recent years, LSTM has been optimized and applied in SARS-CoV-2 studies [29–31] focused on the prediction of transversions. The current work addresses the prediction of mutations, and its advantage lies in capturing the overall mutation patterns of M. tuberculosis. Regions of high and low variability can provide crucial insights: protein targets in low-variability regions can be considered prospective drug targets for further research. As the number of datasets increases, the model can be optimized to achieve better results.

Conflict of Interest

All the authors declare that there is no potential conflict of interest.

Author Contributions

VN was involved in ideation and conceptualization of the overall work. JK contributed by providing crucial insights and supervising the analysis workflow. AU was involved in genome analysis and SNP detection. SF contributed to the machine learning modelling and optimization. All the authors were involved in drafting the manuscript and reviewing the final submitted version.

Funding

Funding was received from the Bangalore Bioinnovation Centre, Karnataka Innovation and Technology Society, Department of Electronics, IT, BT and S&T, Government of Karnataka, India, towards paying the publication cost.

Acknowledgments

The authors acknowledge the Bangalore Bioinnovation Centre, Department of Electronics, IT, BT and S&T, Government of Karnataka, India, for funding acquisition towards paying the publication cost. The authors thank Dr. Shobha G from Department of Computer science and Engineering, R V College of Engineering, Bangalore for providing GPU (Quadro GV100) computational support. The authors also acknowledge the efforts of Ms. Anagha S Setlur for proof reading and crucial insights in improving the language and presentation of the manuscript. A warm heartfelt thanks to the staff and administration at R V College of Engineering for the support.

Data Availability Statement

The datasets used for analysis in the current study are available in NCBI Datasets and can be accessed at https://www.ncbi.nlm.nih.gov/datasets/genomes/?taxon=1773.

Tracking Universal Health Coverage: 2017 Global Monitoring Report. Washington, DC: World Health Organization; 2017.
The Sustainable Development Goals Report 2017. The Sustainable Development Goals Report: United Nations; 2017.
Ritz N, Curtis N. Novel concepts in the epidemiology, diagnosis and prevention of childhood tuberculosis. Swiss Medical Weekly. 2014.
Huebner RE, Schein MF, Bass JB. The Tuberculin Skin Test. Clinical Infectious Diseases. 1993;17(6):968–75.
Andersen P, Munk ME, Pollock JM, Doherty TM. Specific immune-based diagnosis of tuberculosis. The Lancet. 2000;356(9235):1099–104.
Lalvani A, Pareek M. A 100 year update on diagnosis of tuberculosis infection. British Medical Bulletin. 2009;93(1):69–84.
Lalvani A, Pathan AA, McShane H, Wilkinson RJ, Latif M, Conlon CP, et al. Rapid Detection of Mycobacterium tuberculosis Infection by Enumeration of Antigen-specific T Cells. American Journal of Respiratory and Critical Care Medicine. 2001;163(4):824–8.
Harboe M, Oettinger T, Wiker HG, Rosenkrands I, Andersen P. Evidence for occurrence of the ESAT-6 protein in Mycobacterium tuberculosis and virulent Mycobacterium bovis and for its absence in Mycobacterium bovis BCG. Infection and Immunity. 1996;64(1):16–22.
Guinn KM, Hickey MJ, Mathur SK, Zakel KL, Grotzke JE, Lewinsohn DM, et al. Individual RD1-region genes are required for export of ESAT-6/CFP-10 and for virulence of Mycobacterium tuberculosis. Molecular Microbiology. 2004;51(2):359–70.
Updated TB guidelines raise upper age limit for treating latent disease. The Pharmaceutical Journal. 2016.
Prasanna A, Niranjan V. Classification of Mycobacterium tuberculosis DR, MDR, XDR Isolates and Identification of Signature Mutation Pattern of Drug Resistance. Bioinformation. 2019 Apr 15;15(4):261–268
Lu P, Chen X, Zhu L-m, Yang H-t. Interferon-Gamma Release Assays for the Diagnosis of Tuberculosis: A Systematic Review and Meta-analysis. Lung. 2016;194(3):447–58.
Sollai S, Galli L, de Martino M, Chiappini E. Systematic review and meta-analysis on the utility of Interferon-gamma release assays for the diagnosis of Mycobacterium tuberculosis infection in children: a 2013 update. BMC Infectious Diseases. 2014;14(S1).
Pai M, Zwerling A, Menzies D. Systematic Review: T-Cell–based Assays for the Diagnosis of Latent Tuberculosis Infection: An Update. Annals of Internal Medicine. 2008;149(3):177.
Geluk A, van Meijgaarden KE, Joosten SA, Commandeur S, Ottenhoff THM. Innovative Strategies to Identify M. tuberculosis Antigens and Epitopes Using Genome-Wide Analyses. Frontiers in Immunology. 2014;5.
Zvi A, Ariel N, Fulkerson J, Sadoff JC, Shafferman A. Whole genome identification of Mycobacterium tuberculosis vaccine candidates by comprehensive data mining and bioinformatic analyses. BMC Medical Genomics. 2008;1(1).
Coppola M, van Meijgaarden KE, Franken KLMC, Commandeur S, Dolganov G, Kramnik I, et al. New Genome-Wide Algorithm Identifies Novel In-Vivo Expressed Mycobacterium Tuberculosis Antigens Inducing Human T-Cell Responses with Classical and Unconventional Cytokine Profiles. Sci Rep. 2016;6(1).
Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998;393(6685):537–44.
Schmidhuber J. Deep learning in neural networks: An overview. Neural Networks. 2015;61:85–117.
Staudemeyer RC. Applying long short-term memory recurrent neural networks to intrusion detection. South African Computer Journal. 2015;56.
Le, Ho, Lee, Jung. Application of Long Short-Term Memory (LSTM) Neural Network for Flood Forecasting. Water. 2019;11(7):1387.
Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Computation. 1997;9(8):1735–80.
Lew JM, Kapopoulou A, Jones LM, Cole ST. TubercuList – 10 years after. Tuberculosis. 2011;91(1):1–7.
Bushnell B. BBMap: A Fast, Accurate, Splice-Aware Aligner. 2014.
Križanović K, Echchiki A, Roux J, Šikić M. Evaluation of tools for long read RNA-seq splice-aware alignment. Bioinformatics. 2017;34(5):748–54.
Bushnell B, Rood J, Singer E. BBMerge – Accurate paired shotgun read merging via overlap. PLoS One. 2017;12(10):e0185056.
Grubaugh ND, Gangavarapu K, Quick J, Matteson NL, De Jesus JG, Main BJ, et al. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biology. 2019;20(1).
Neural Information Processing Systems. The Deep Learning Revolution: The MIT Press; 2018.
Pathan RK, Biswas M, Khandaker MU. Time series prediction of COVID-19 by mutation rate analysis using recurrent neural network-based LSTM model. Chaos, Solitons & Fractals. 2020;138:110018.
Mohamed T, Sayed S, Salah A, Houssein EH. Long Short-Term Memory Neural Networks for RNA Viruses Mutations Prediction. Mathematical Problems in Engineering. 2021;2021:1–9.
Saha I, Ghosh N, Maity D, Seal A, Plewczynski D. COVID-DeepPredictor: Recurrent Neural Network to Predict SARS-CoV-2 and Other Pathogenic Viruses. Front Genet. 2021;12.
