3.1 Comprehensive overview of variations in M. tuberculosis
Genome analysis was performed on 578 assembled samples spanning 18 years (2003–2021). The distribution of samples per year is shown in Fig. 1a. Details of the 578 samples, including assembly name, assembly BioSample accession and strain, are provided in Supplementary file 1.
The BAM files were annotated using iVar, and the output was recorded in VCF format.
Across all 578 samples, a total of 476,053 variations were recorded, corresponding to about one variation in every 9 nucleotides. These comprise 430,660 single nucleotide polymorphisms (SNPs), 45,211 multiple nucleotide polymorphisms (MNPs), 147 insertions, 4 deletions and 31 mixed variants. No inversions or duplications were found. Fig. 2a gives an overview of the number of mutations per 100,000 base pairs of the Mtb genome.
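The category counts and the variation rate above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' pipeline: the genome length is an assumption (the 4,411,532 bp H37Rv reference, NC_000962.3), since the text does not state which reference length was used, and the variable names are illustrative.

```python
# Variant counts by category, as reported in the text above.
variant_counts = {
    "SNP": 430_660,
    "MNP": 45_211,
    "insertion": 147,
    "deletion": 4,
    "mixed": 31,
}

# Assumption: length of the H37Rv reference genome (NC_000962.3) in base pairs.
# The text does not say which reference length the rate was computed against.
GENOME_LENGTH_BP = 4_411_532

# The five categories should add up to the reported total of 476,053.
total_variants = sum(variant_counts.values())

# Average number of reference nucleotides per recorded variation.
nucleotides_per_variant = GENOME_LENGTH_BP / total_variants

# Share of all recorded variations that are SNPs.
snp_fraction = variant_counts["SNP"] / total_variants

print(f"total variants: {total_variants:,}")
print(f"one variation per {nucleotides_per_variant:.1f} nucleotides")
print(f"SNP share of all variations: {snp_fraction:.1%}")
```

Under this assumption the categories sum to the reported total, and the rate falls between 9 and 10 nucleotides per variation, consistent with the figure quoted in the text.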
3.2 LSTM model accurately predicts SNPs for 2021
The LSTM models were trained successfully, and their accuracy and loss values were measured. Because the SNPs for 2021 are already known, that year is the best case for comparing predictions against actual values. For batch 1, the model predicted 2936 SNPs for 2021 against an actual count of 3061, a prediction accuracy of about 95.9% (error rate 4.08%). The results for all batches, calculated in the same way, are shown in Table 1. The mean squared error (MSE) loss function and the Adam optimizer with a learning rate of 0.001 were used, as this combination gave the best performance. The LSTM architecture was implemented and trained in PyTorch, with two hidden layers in addition to the LSTM layers. Training saturated at 420 epochs, so 420 epochs were used for all five batches of SNP data.
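A training setup of the kind described above might look like the following in PyTorch. This is a minimal sketch under stated assumptions, not the authors' model: the hidden size, the number of LSTM layers, the window length, the reading of "two hidden layers" as two fully connected layers after the LSTM, and the toy input series are all illustrative choices. Only the MSE loss, the Adam optimizer, the learning rate of 0.001 and the 420 epochs come from the text.

```python
import torch
from torch import nn


class SNPCountLSTM(nn.Module):
    """LSTM followed by two fully connected hidden layers and a scalar output.

    Assumption: layer sizes and the layout of the hidden layers are illustrative.
    """

    def __init__(self, hidden_size: int = 32, num_lstm_layers: int = 1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=num_lstm_layers, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, 1); predict from the last time step.
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])


def train(model, inputs, targets, epochs=420, lr=0.001):
    """Full-batch training with MSE loss and Adam; returns the loss per epoch."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


torch.manual_seed(0)

# Placeholder data, NOT from the paper: a scaled toy series with 18 points.
series = torch.linspace(0.1, 1.0, steps=18)
window = 4  # hypothetical number of past values used to predict the next one

# Sliding windows: each input is `window` consecutive values, the target is the next.
inputs = torch.stack(
    [series[i:i + window] for i in range(len(series) - window)]
).unsqueeze(-1)
targets = series[window:].unsqueeze(-1)

model = SNPCountLSTM()
losses = train(model, inputs, targets)
print(f"first-epoch loss {losses[0]:.4f}, last-epoch loss {losses[-1]:.4f}")
```

In practice the real per-year SNP counts for a genome batch would replace the toy series, scaled to a comparable range before training and rescaled after prediction.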
For the 4th batch (3 million to 4 million base pairs), the MSE loss was 0.03%, yet the prediction error was 38.99%. This can be attributed to unpredictable variation in the SNP data for this region. Conversely, the 5th batch had a higher MSE loss of 3.5%, but its prediction was very close to the actual value, with an error of 1.12%.
The line plot in Fig. 2b compares the predicted and actual SNPs for the 2021 M. tuberculosis data.
Table 1
Comparison of the SNP counts predicted by the LSTM model with the actual SNP counts from the genomic analysis, along with the error rate.
| Batch | Predicted SNPs | Actual SNPs | Error rate |
|---|---|---|---|
| 1 | 2936 | 3061 | 4.08% |
| 2 | 3495 | 3270 | 6.88% |
| 3 | 4072 | 3754 | 8.47% |
| 4 | 5311 | 3821 | 38.99% |
| 5 | 1082 | 1070 | 1.12% |
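The error rates in Table 1 are consistent with the absolute difference between predicted and actual counts, expressed as a percentage of the actual count. The sketch below recomputes them from the tabulated values. The formula is inferred from the numbers rather than stated in the text, so treat it as an assumption, and the function name is illustrative.

```python
# (predicted, actual) SNP counts per batch, copied from Table 1.
table_1 = {
    1: (2936, 3061),
    2: (3495, 3270),
    3: (4072, 3754),
    4: (5311, 3821),
    5: (1082, 1070),
}


def error_rate_percent(predicted: int, actual: int) -> float:
    """Absolute prediction error as a percentage of the actual count."""
    return abs(predicted - actual) / actual * 100


error_rates = {
    batch: error_rate_percent(predicted, actual)
    for batch, (predicted, actual) in table_1.items()
}

for batch, err in error_rates.items():
    print(f"batch {batch}: error {err:.2f}%, accuracy {100 - err:.2f}%")
```

The recomputed values agree with the tabulated error rates to within rounding in the second decimal place.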
This model can be optimized further and used as a template for predicting mutations in other organisms on the WHO priority list for antimicrobial resistance.