From 2015 to 2021 in Bangladesh, a diverse array of genetic variations characterises the emergence of distinct circulating lineages
To explore the evolutionary dynamics of V. cholerae linked to ongoing cholera cases in Bangladesh, a genomic analysis was done considering the years 2015 to 2021. We sequenced 129 V. cholerae O1 El Tor isolates taken from stool samples of patients between September 2015 to April 2021 admitted to hospitals in six districts (Barisal, Chittagong, Dhaka, Khulna, Rajshahi and Sylhet) of Bangladesh, Table S1. During the duration of this study, isolates belonging to serotypes Inaba and Ogawa were identified, Fig.1. Consistent with previous studies9,14, a serotype switch was observed, with Inaba predominantly present in 2016 and 2017, followed by a predominance of Ogawa samples in 2018 and 2019 (Fig. S1). Both serotypes were detected in 2015 and continued to coexist from 2020 onwards. Serotypes were significantly associated with collection years (chi-square test with p-value Bonferroni < 0.005) but not significantly associated with collection location (chi-square test with p-value Bonferroni > 0.005).
The maximum likelihood phylogeny of the 129 isolates was reconstructed based on the alignment of the core genome (3468 genes) and showed two distinctly evolved lineages, Fig. 1. Comparison with previous studies9,11, identified these lineages as BD-1.2 (n=84) and BD-2 (n=45), Fig. S2. Apart from the previously reported genetic variations4, we identified additional differences existing between the two lineages, in VSP (vibrio seventh pandemic; VSP-1 and VSP-2), VPI (vibrio pathogenicity islands, VPI-1 and VPI-2) and PLE (phage inducible chromosomal island-like elements), see Fig. 1. More precisely, in VSP-2, BD-2 isolates had a tryptophan at position 249, while BD-1.2 had a leucine at this position. In addition, in VSP-2, gene VC-514 (aer) was present in all BD-2 isolates but absent in BD-1.2. In VPI-2 a SNP led to an amino-acid variation at position 150, with BD-1.2 having an aspartic acid, and BD-2 an asparagine. BD-2 samples exclusively exhibited PLE2, while BD-1.2 samples had both PLE1 and PLE2 along with PLE2. Moreover, further differences were found in nonsynonymous SNPs on core genes and presence/absence of accessory genes, as described in the following section.
The distinct phylogeny patterns of BD-2 and BD-1.2, were also confirmed through a comparative study analysing 1134 isolates from V. cholerae El Tor O1 strains across 84 countries, including our isolates, (Tables S2 and S3, Fig. S3). BD-2 isolates clustered with Indian-1 (IND-1), while BD-1, BD-1.1, and BD-1.2 isolates from Bangladesh clustered with African (T9-T13)15, Latin America-3 (LAT-3)16, Asian-2 (AS-2), and Indian-2 (IND-2) lineages (Fig. S3), in agreement with previous results9.
Genetic and temporal differentiation of V. cholerae BD-1.2 and BD-2 lineages correlate with SNPs on coding and non-coding regions, and accessory genes
To assess the relatedness of V. cholerae isolates in our cohort, we measured the number of different core genome SNPs in a pairwise manner across all isolates. We created a network based on clusters of related isolates with less than 15 SNPs, as done previously17,18. Across the cohort the median SNP difference was 117 SNPs (ranging from 0 to 1710 SNPs with IQR of 1211). The resulting undirected graph (Fig. 2) revealed that BD-2 and BD-1.2 formed two disconnected graphs each composed of samples from a specific lineage, but with no distinct separations between the Ogawa and Inaba serotypes.
To identify additional potential involvement of genetic elements in shaping the differences between the BD-1.2 and BD-2 isolates in our cohort, beyond current annotations (ctxB allele, type of SXT/ICE, VSP-II, VIP-I, gyrA gene allele)9, we sought for patterns of similarities and differences, at a finer scale, searching for the number, type and position of accessory genes as well as mutations in the core genome and intergenic regions across all the isolates. A two-sided Fisher exact test, with Bonferroni correction, was performed to assess the relationship between the BD-2 and BD-1.2 lineages and each of the various genomic features (core and intergenic SNPs and accessory genes). Overall, we found a significantly larger proportion of core genome mutations (51.4%, 1224 core genome SNPs and 73.1%, 160 intergenic SNPs) and a small proportion of accessory genes (11.3%, 115 genes) that exhibited statistically significant differentiation between the two lineages, Table S4. Refer to Supplementary Note 1 and Fig. S4 for more details on the statistical analysis comparing the number of accessory genes, core genome SNPs and intergenic SNPs. The comparative analysis also indicated a temporal shift in the distribution of core genome and intergenic SNPs over the years, showing that BD-1.2 isolates accumulated different SNPs compared to BD-2 isolates as time progressed (Fig. S4E-F).
Out of the 115 accessory genes that differed between the two lineages, 12 were annotated while the remaining 101 were hypothetical. Among these 12 annotated genes, five – (lon_3, endA, adh, hdfR_4 and bcr_2) – were predominant (over 96% presence) in BD-1.2 and absent in BD-2, and seven (aer_3, hlyA_2, mcrC, mepM_3, mrr, tetA and tetR) were present (over 97% presence) in BD-2 and absent in BD-1.2. Of the twelve annotated genes, three are known to be antimicrobial resistance genes (bcr, tetA and tetR)19. TetA and tetR were mainly detected in BD-2 isolates (97.7%), confirmed as primarily tetracycline-resistant through susceptibility testing in both doxycycline and tetracycline antibiotics (Table S1). On the contrary, bcr, a multidrug efflux pump, was predominantly present in BD-1.2 isolates (96.4% of isolates) and completely absent in BD-2 isolates. Out of the 16 known antimicrobial resistant genes (ARGs) present in the pangenome of this cohort, only tetA, tetR and bcr were found to statistically separate both lineages. TetA and tetR were both located in a contig showing high similarity to the SXT-ICE element, SXT(HN1) in BD-2 isolates. Conversely, bcr was found in a mobile element in the BD-1.2 isolates with similarity to SXT ICE element, ICEVchBan5. The presence of these SXT elements in the BD-2 and BD-1.2 lineages was previously shown by Monir et al9. Both contigs contained two identical insertion sequences, mobile genetic elements MGEs, (ISShfr9 and ISVsa3), see Fig. S5. Also, among the 12 annotated genes, four (endA, hlyA, lon and mcrC) were previously found to be related to virulence19–24. More information about the function of these genes is given in the Supplementary Note 2.
To assess the extent of our results beyond our cohort, we investigated whether the 12 annotated accessory genes that we had found were also present in other Bangladeshi and Indian lineages. We performed a comparative genomic analysis of 219 Vibrio cholerae O1 reference isolates collected in Kolkata, India, and Dhaka, Bangladesh, between the years 2004 and 2022 (ENA public database http://www.ebi.ac.uk/ena, see Table S5). The results confirmed the presence/absence patterns of the 12 genes in the BD-1.2 and BD-2 lineages in the reference isolates, aligning with our initial findings, see Supplementary Note 2.
In addition to differences in accessory gene types and patterns, missense mutations associated to allelic variations were found in BD-1.2, when compared to BD-2 strains. We identified 1385 SNPs in the core genome, including 291 non-synonymous and 934 synonymous coding variants, both representing variants in their functional protein-coding form. In addition, 160 intergenic SNPs were found, representing variants in their regulatory form. Many SNPs showcased unique allelic distribution patterns between the two lineages. When mapped back, the non-synonymous SNPs identified 291 amino acid substitutions in 105 genes, including 50 known genes and 55 hypothetical ones (see Table S4). Table 1 shows core genes with allelic distribution between BD-1.2 and BD-2 significantly different (i.e., containing polymorphic sites found exclusively in one lineage but absent in the other lineage).
Table 1. Core genes with a significant different allelic distribution between BD-1.2 and BD-2. Core genes containing non-synonymous SNPs and showcasing the allelic variants that were found exclusively in one lineage but absent in the other lineage. For each SNP, the allelic frequency of the major allele and minor allele and the P-value for the allelic distribution between BD-1.2 and BD-2 have been calculated. The reference allelic variation refers to the nucleotide present in reference genome sequence of V. cholerae N16961 El Tor (NCBI Accession ID: NC_002505.1 and NC_002506.1) and the alteration allelic variation refers to the nucleotide absent in the reference genome, as previously done by Monir et al9,11.
Gene name
|
Vibrio cholerae namea
|
SNP genomic location
|
Alleles
REFb / ALTc
|
Frequency of REFb allele in BD-1.2 / BD-2
|
Frequency of ALTc allele in BD-1.2 / BD-2
|
Amino-acid change
|
P-value (Fisher exact test)
|
appC
|
VC_1571d
|
676
|
G/A
|
0/100
|
100/0
|
Ala226Thr
|
7.97E-36
|
argG
|
VC_2642d
|
847
|
A/G
|
100/0
|
0/100
|
Thr283Ala
|
7.97E-36
|
bluF
|
VC_1641
|
448
|
C/T
|
0/100
|
100/0
|
Lys149Asn
|
7.97E-36
|
bluF
|
VC_1641
|
449
|
A/G
|
0/100
|
100/0
|
Ser150Arg
|
7.97E-36
|
bluF
|
VC_1641
|
451
|
T/G
|
0/100
|
100/0
|
Leu151his
|
7.97E-36
|
bluF
|
VC_1641
|
455
|
T/C
|
0/100
|
100/0
|
Gly152Ser
|
7.97E-36
|
bluF
|
VC_1641
|
456
|
G/A
|
0/100
|
100/0
|
Phe153Ala
|
7.97E-36
|
bluF
|
VC_1641
|
461
|
C/T
|
0/100
|
100/0
|
Gln154Ala
|
7.97E-36
|
bluF
|
VC_1641
|
462
|
A/G
|
0/100
|
100/0
|
Thr155Asn
|
7.97E-36
|
bluF
|
VC_1641
|
466
|
C/T
|
0/100
|
100/0
|
Ala156Ser
|
7.97E-36
|
bluF
|
VC_1641
|
469
|
C/T
|
0/100
|
100/0
|
Ile157Arg
|
7.97E-36
|
bluF
|
VC_1641
|
472
|
G/A
|
0/100
|
100/0
|
Asp158Ile
|
7.97E-36
|
clcA
|
VC_A0526d
|
311
|
G/A
|
0/100
|
100/0
|
Gly104Glu
|
7.97E-36
|
cobB
|
VC_1509d
|
149
|
C/T
|
100/0
|
0/100
|
Pro50Leu
|
7.97E-36
|
ctxB
|
VC_A0009
|
58
|
C/A
|
0/100
|
100/0
|
His20Asn
|
7.97E-36
|
cysG_1
|
VC_1363d
|
113
|
T/C
|
100/0
|
0/100
|
Val38Ala
|
7.97E-36
|
dltA_1
|
VC_A0149
|
3145
|
T/C
|
0/100
|
100/0
|
Phe1049Leu
|
7.97E-36
|
dsbD
|
VC_2701d
|
1748
|
C/T
|
0/100
|
100/0
|
Thr583Ile
|
7.97E-36
|
ftsI
|
VC_2407d
|
1472
|
G/A
|
0/100
|
100/0
|
Arg491His
|
7.97E-36
|
glmM
|
VC_0639d
|
587
|
G/T
|
100/0
|
0/100
|
Arg196Leu
|
7.97E-36
|
gyrA
|
VC_1258
|
1980
|
G/T
|
100/0
|
0/100
|
Asp660Glu
|
7.97E-36
|
hrpB
|
VC_0601
|
2345
|
C/T
|
100/0
|
0/100
|
Ala782Val
|
7.97E-36
|
licH
|
VC_1284d
|
166
|
G/A
|
0/100
|
100/0
|
Ala56Thr
|
7.97E-36
|
mak
|
VC0270d
|
346
|
G/A
|
0/100
|
100/0
|
Gly116Arg
|
7.97E-36
|
murI
|
VC_0158d
|
409
|
G/T
|
100/0
|
0/100
|
Ala137Ser
|
7.97E-36
|
mutL
|
VC_0345
|
1048
|
T/C
|
0/100
|
100/0
|
Cys350Arg
|
7.97E-36
|
nudF_2
|
VC_2435
|
325
|
C/T
|
100/0
|
0/100
|
Arg109Cys
|
7.97E-36
|
pctB_4
|
VC_0514
|
746
|
T/G
|
100/0
|
0/100
|
Leu249Trp
|
7.97E-36
|
phhA
|
VC_A0828d
|
56
|
A/T
|
100/0
|
0/100
|
Gln19Leu
|
7.97E-36
|
putA
|
|
1799
|
C/T
|
100/0
|
0/100
|
Ala600Val
|
7.97E-36
|
recD
|
VC_2319
|
2039
|
A/G
|
0/100
|
100/0
|
Tyr680Cys
|
7.97E-36
|
rssB_3
|
VC_1652
|
235
|
C/T
|
0/100
|
100/0
|
Leu79Phe
|
7.97E-36
|
skp
|
VC_2251
|
146
|
T/G
|
0/100
|
100/0
|
Leu46Trp
|
7.97E-36
|
suhB
|
VC_0745d
|
650
|
A/G
|
100/0
|
0/100
|
Glu217Gly
|
7.97E-36
|
tamA
|
VC_2548
|
797
|
C/T
|
100/0
|
0/100
|
Thr266Ile
|
7.97E-36
|
trmL_1
|
VC_A0627
|
16
|
A/T
|
100/0
|
0/100
|
Thr6Ser
|
7.97E-36
|
tyrS_1
|
VC_0631
|
1177
|
A/G
|
100/0
|
0/100
|
Thr393Ala
|
7.97E-36
|
valS
|
VC_2503
|
1796
|
G/A
|
0/100
|
100/0
|
Arg599His
|
7.97E-36
|
ycbB
|
VC_1268
|
976
|
C/T
|
100/0
|
0/100
|
Pro327Ser
|
7.97E-36
|
a V. cholerae name: specific VC (Vibrio cholerae) gene names. b REF (Reference) allele: refers to the nucleotide present in reference genome sequence of V. cholerae N16961 El Tor (NCBI Accession ID: NC_002505.1 and NC_002506.1)
c ALT (Alteration) allele: refers to the nucleotide absent in the reference genome
d Metabolic genes found in the V. cholerae GSM model iAM-Vc960
|
Among the genes exhibiting lineage-specific allelic variation, some contribute to functions including growth, cell wall organization, colonization, toxigenicity and resistance, similar to what found previously9. Additionally, we found genes with a unique non-synonymous variant in BD-1.2, with roles in toxin transport and acid tolerance, shedding light on functions that may clarify their contribution to the recent prevalence of BD-1.2 over BD-2. See Supplementary Note 3 for more information about these genes. Notably, OmpU is another gene with a statistically significant mutation (G325D) underlying lineages’ separation. Amino acid D is predominant in BD-1.2, while the amino-acid G is prevalent in BD-2.
To understand the systemic relationships connecting the identified lineage-specific genetic signatures on a mechanistic level, we analysed the 30 core genes in Table 1 with allelic variants that were found exclusively in one lineage but absent in the other lineage using the V. cholerae GSM model iAM-Vc960 (Fig. 4). Thirteen of these genes (murI, ftsI, appC, suhB, glmM, dsbD, licH, cysG_1, cobB, clcA, argG, mak, phhA) are metabolic and have been identified as playing integral roles in amino acid metabolism, cell wall metabolism, carbon metabolism, amino sugar and nucleotide sugar metabolism, energy metabolism (see Table S6). Flux variability analysis (FVA) and flux balance analysis (FBA) were used to predict, through gene knock outs, the essentiality and the effects of the identified genetic determinants on the flow of metabolites through V. cholerae metabolic network (all known metabolic reactions) and on the growth rates of V. cholerae. The genes cysG, clcA, adh and mcrC, were found as essential for growth (i.e. knocking these genes out reduced the biomass growth to less than 0.0001h-1) in both rich and minimal media. Furthermore, murI, glmM, and dapF displayed auxotrophic behaviour in minimal media, whereas cysG, clcA, adh, and mcrC were essential in rich media with alternative carbon sources. Additionally, three genes, murI, glmM and dapF, were found to be essential for growth in minimal media only. Next, flux variability analysis (FVA) was used to identify biochemical reactions whose flux span was significantly (greater than 10% change) changed by knocking out these genes. In total ten genes murI, glmM, cysG, clcA, argG, mak, adh, dapF, add, and mcrC when knocked out significantly changed the flux span in at least one reaction through the model by FVA analysis, Table S6. Finally, FBA analysis was used to determine the effect of gene knockouts on metabolite yield. Five genes, murI, glmM, cycG, mak, and dapF were found to reduce at least one metabolite yield to zero in the model when knocked out (given the wildtype yield was greater than 0), Table S6.
Lastly, when mapping the 160 intergenic SNPs back to genomes, we found their location in the upstream/downstream regions of 35 known genes and 34 hypotheticals genes (see Table S4). These intergenic SNPs exhibited allelic distribution, with the minor variant prevalent in the BD-2 isolates (68% to 100%), while the major variant dominated in the BD-1.2 isolates (over 98%), only one SNP in BD-1.2 had a major allelic variant at of 47% (Fisher exact test, Bonferroni correction p-value< 2.31e-08). Many of these SNPs were located within transcriptional factor binding sites (TFBs) (Table S4). Intergenic SNPs, exhibiting significantly different allelic distributions between BD-1.2 and BD-2, mapped across the TFBs of 11 TFs (ToxT, Fur, AmpR, OmpR, LuxR, LexA, ArgR, PhoP, CRP, ArcA) (Fig. S6-S16). More information about the function of these transcriptional factor binding motifs is provided in Supplementary Note 4.
Machine learning unravels correlations between genomic determinants and clinical symptoms in humans
Beyond identifying the potential involvement of new genetic traits in differentiating the BD-1.2 and BD-2 lineages, we hypothesized that the same or additional genetic features might play a significant role in the manifestation of clinical symptoms in patients when infected with Vibrio. We focused on the lineage BD-1.2, which caused the most recent outbreak in Bangladesh. To identify if and which coding and non-coding mutations and/or presence/absence of accessory genes would correlate with the different clinical symptoms, we employed a bespoke, supervised machine learning pipeline.
The pipeline is aimed at mining sequencing data to identify the genetic elements that more strongly correlate with observed clinical symptoms differences, which in this case are vomit, dehydration, number of stools, duration of diarrhoea and abdominal pain (see Methods section). The pipeline is a bespoke adaptation of ML-based data-mining methods previously developed within our team to identify correlations between genomic features with phenotypes18,19,25,26. In the pipeline, information about different genetic features (SNPs -both from coding and non-coding regions- and presence/absence of accessory genes) can be encoded as input to ML-powered predictive models designed to estimate the likelihood of observing the selected phenotypes under each specific pattern of input values18. As long as trained with sufficient observational data, the ML-powered predictive models are able to replicate experimental evidence, in addition to providing information on what inputs correlated most strongly with each phenotypic manifestation. Through such introspective power, the pipeline is able to unravel co-occurrent, multiple mechanisms (mutations, horizontal gene transfer - HGT), variants in their functional protein-coding and regulatory forms, as well as their additive effect on the targeted phenotypes, which in this work, were different clinical symptoms.
To ensure the best performance and avoid any bias in the pipeline, we experimented with multiple, different technologies powering the pipeline, namely five classification technologies (Linear SVM, RBF SVM, Random Forest, Extra tree classifier and Logistic regression) and two meta-methods (Adaboost and XGBoost). A nested cross validation approach was adopted to compare the technologies and select the best performing one, using the Friedman and Nemenyi tests to assess final classification performance (see Methods section).
The following clinical symptoms were selected, namely: vomit, abdominal pain, diarrhoea duration, 24-hour stool count and dehydration. Each clinical symptom was handled by building a dedicated predictor pipeline, with the goal to produce separate results in terms of correlation with genetic elements. Two symptoms (vomit and abdominal pain) were encoded as binary (presence vs absence). The other three symptoms – diarrhoea duration, 24-hour stool count, and dehydration – were encoded as multi-class: dehydration as “None”, “Moderate” and “Severe”; diarrhoea duration as < 1 day, 1-3 days, 4-6 days, and 7-9 days; and stool count in 24 hours as 3-5 times, 6-10 times, 11-15 times, 16-20 times, and 21+ times. We handled the prediction of multi-class phenotypes via the implementation of concurrent binary predictors, each addressing a pairwise combination of outcomes. In the end, we were able to successfully develop six adequately performing binary prediction models for the phenotypical outcomes: i) stools 11-15 times vs. 16-20 times; ii) stools 11-15 times vs. 21+ times; iii) moderate vs. severe dehydration; iv) diarrhoea duration <1 day vs. 1-3 days; v) presence vs absence of vomit; and vi) presence vs absence of abdominal pain (Table S7). The remaining binary predictors were discarded for not performing adequately, either because of unbalanced available sets of observations (needed for training the supervised ML models), or because of more challenging separability of the phenotypes given the selected inputs (no features were statistically significant based on the Fisher exact test). Among the tested pipeline technologies mentioned earlier, logistic regression was identified by the Friedman F-test and the Nemenyi post-hoc analysis as the best performing one (Fig. S17). Of the six binary prediction models, four had an AUC greater than 0.9, Fig. 4. Table S8 indicates the performance metrics obtained by all binary predictors for each clinical symptom. Fig. 4 and S18 show the performance results for the Logistic regression classifier.
The analysis of the best-performing predictors allowed to identify the inputs features (core genome coding and intergenic SNPs and accessory genes) most strongly correlated to each phenotype (Table S9). Ninety-two different features in total were selected as relevant by extraction from the six predictor models, with 45% being present in 3 or more models (Fig. 5). No features were selected for all symptoms. Fifty-eight accessory genes (10 known genes, tufB_2, blc, yiaC, pckA, luxR_2, hcpA_1, rpoS, dcuA, oppA, luxR, and 48 hypothetical genes) and 28 core SNPs over 23 genes (14 known, clpS, gshB, dapF, fabV_1, add, tufB, lpoA, phrB, yjcS, fabH1, cysG_2, padC, pepN, tadA_2, and 9 hypothetical genes) were identified as strongly associated to the symptoms. Four known genes (add, dapF, fabV, and tufB) were all associated to three predictor models (abdominal pain, duration of diarrhoea and number of stools), with add also associated with dehydration and the other three genes also associated with vomit. The genes gshB, clpS and lopA were all associated with three predictor models (diarrhoea duration <1 day vs. 1-3 days, stools 11-15 times vs. 16-20 times, stools 11-15 times vs. 21+ times), with lpoA also associated with abdominal pain and the other two genes with vomit.
Among the 58 accessory genes linked to clinical symptoms, four hypothetical genes were also statistically significant in distinguishing the two lineages. Among the other accessory genes selected, four (blc, pckA, luxR and rpoS) have important biological functions. In particular, Blc, also known as VlpA, is a lipocalin, that is correlated to acquisition of drug resistance in V. cholera27. PckA (phosphoenolpyruvate carboxykinase) is important for gluconeogenesis, a highly conserved pathway in bacteria and humans. Interfering with gluconeogenesis pathway impacts V. cholerae colonization in mouse models, highlighting its crucial role in sustaining V. cholerae growth and viability within the intestines28. LuxR plays a key role in regulating biofilm production and secretion in V. cholerae29. RpoS is a sigma factor that facilitates physiological adaptation to general starvation and stationary phase growth in different species. V. cholerae strains lacking the gene rpoS are impaired in the ability to survive in different environmental stresses. RpoS was also shown to be important in V. cholerae for efficient intestinal colonization30.
Out of the 28 core SNPs associated to the clinical symptoms, 11 were also found previously as statistically significant in differentiating the BD-2 and BD-1.2 lineages (see above), Table S9. These 11 SNPs mapped to 11 genes (clpS, gshB, dapF, fabV_1, add, and six hypothetical). Among the SNPs mapping to known genes (clpS, gshB, dapF, fabV_1, add), three are non-synonymous SNPs mapping to clpS, gshB and fabV. In V. cholerae ClpS regulation involves cAMP receptor protein (CRP)31. CRP is important in intestinal colonization31. GshB, encodes a glutathione synthetase (GSH), a gene associated to resistance to oxidative stress. V. cholerae fabV is one of the several triclosan-resistant ENR encoding genes32.
Nine symptoms-related genes were identified as metabolic genes in the iAM-Vc960GSM model (Fig. 6). Eight of these genes were associated to five metabolic systems (Table S10). FabH1 and gshB associated with cofactor and prosthetic group metabolism; pckA is associated with carbohydrate metabolism; dcuA plays a crucial role in C4-dicarboxylate transport; dapF, pepN and gshB are significant in amino acid metabolism; add and pckA are relevant to nucleotide metabolism; oppA and fabH1 are involved in cell wall metabolism, with fabH1 relevant for fatty acid biosynthesis (Table S10).
Using FBA and FVA analysis, the knockouts of the genes dapF and gshB were found to halt production of several metabolites. The genes pckA, add, dapF, oppA, gshB were found to significantly change the reaction flux span, Table S10. Both FBA and FVA analysis can infer if potential metabolic adaptation mechanisms for V. cholerae can lead to alterations in bacterial virulence, potentially leading to worst symptoms, if genes significantly affect pathways which are associated to important functions such as colonization, biofilm production and cell wall synthesis. For example, the gshB gene, a glutathione reductase, contributes to V. cholerae intestinal colonization33 and have a role in acid tolerance response34. Similarly, dapF was found as an essential gene in minimal media and led to auxotrophic behaviour to the amino-acid lysine. As Pearcy et al.35 indicated, an auxotrophic behaviour of a gene connected to amino-acid biosynthesis is important because it can provide competitive fitness advantage against commensal bacteria. During the infection stage V. cholerae engage and compete with commensal bacteria for nutrient acquisition to support rapid growth and multiplication36. Moreover, the lysine pathway plays a central role in eubacteria cell wall biosynthesis, since meso-diaminopimelate is the immediate precursor for the biosynthesis of its main component, peptidoglycan, with dapF responsible for the creation of meso-diaminopimelate in the lysine pathway37,38. The proper synthesis and maintenance of peptidoglycan is essential for bacterial virulence and its viability39.
To delve deeper into understanding the functional mechanisms underlying clinical symptoms, we explored the interactome of the proteins associated to the clinical symptoms. The protein-protein interaction network (PPI) analysis revealed the interactome of 36 proteins, selected by the machine learning pipeline, with 109 other proteins, Fig. S19. The KEGG analysis indicated enrichment in ribosome proteins (e.g., RpoS) and fatty acid biosynthesis (e.g., FabH1, FabV) (Fig. S20). The colonization in the human intestine and virulence of V. cholerae is intricately connected to both fatty acid metabolism40 and the ribosome pathway41. The GO analysis highlighted enrichment in translation, peptide biosynthetic processes, and gene expression, featuring TufA, TufB, RpoS, GshB (Table S11-S12). The peptide biosynthetic pathway plays a vital role in V. cholerae biofilm formation and colonization42.
None of the six intergenic SNPs selected by the machine learning pipeline were in TFBs or promoters. These SNPs were located in a region without any functional annotations within 2 kbps upstream or 0.5 kbps downstream of a gene, adhering to the standard database dbSNP cutoffs for SNP-to-gene mapping43,44. See Table S9 for additional information about the location of these SNPs.
Structural analysis suggests evolutionary drivers of selection and mechanistic bases for BD-2 and BD-1.2 lineages evolution and associations to clinical symptoms
To further understand whether the identified alleles play a causal role in the evolution of lineages and clinical symptoms, we selected two of the top-ranked non-synonymous SNP candidates, prioritizing the following aspects in relation to the associated genes: (i) have significant allelic distribution between BD1-1.2 and BD-2; (ii) have a significant correlation, as detected by the ML pipeline, with the selected clinical symptoms; (iii) characterised as functionally important for V. cholerae metabolisms (i.e. significantly impacting reaction flux when knocked out, as highlighted by the GSM model) and/or interactome (i.e. enrichment of the functions and mechanisms related to pathogenesis); (iv) 3D structural mutation analysis could be benchmarked with experimental evidence. This resulted in three genes, all top-ranked by both the Fisher Exact test for BD-1.2 and BD-2 lineage evolution and the ML analysis for the underlying clinical symptoms, namely: fabV, gshB and clpS. We mapped the alleles of fabV, gshB and clpS to their protein structures using both experimental crystal structures and predicted homology models. However, the 3D-structure could be utilised to infer the mechanistic basis only for fabV and gshB.
In all BD-2 isolates fabV had a proline at position 149 (Pro149) whereas, in BD-1.2 isolates, the Pro149 was found in only 40.5% of cases, with the remaining 59.5% isolates exhibiting histidine at position 149 (His149). The BD-1.2 isolates with His149 showed a higher duration of diarrhoea (1-3 days) and a higher number of stool score (16-20 times and 21+ in 24 hours) compared to the BD-1.2 isolates with Pro149, featuring a lower diarrhoea duration (<1 day) and lower number of stools score (11-15 times). The amino acid 149 was located in the trans-2-enoyl-CoA reductase catalytic domain (Fig. 7A-E), when Pro149 is present, it interacts with Lys148, Ser151, Trp159 through Van der Waals (VDW) interactions, whereas His149 not only forms the aforementioned interactions but also creates an extra VDW interaction with Lys148. Furthermore, His149 interacts with an additional amino acid, Arg150, through a VDW interaction. These additional interactions in the presence of the His149 cause an increase in the stability of the structure (ΔΔG = 0.101 kcal/mol >0) and a decrease of the molecule flexibility (ΔΔSVib ENCoM: -0.053 kcal.mol-1K-1), which is usually linked to a stronger binding affinity45,46. Moreover, the presence of His149 increased the positive charge of the surrounding area (Lys148, His149, Arg150) (Fig. S21), with an overall electrostatic energy increasing from 7.3E+03 kJ/mol (Pro149) to 7.48E+03 kJ/mol (His149) within the 5Å region and with an overall protein total electrostatic energy rising from 2.1E+05 kJ/mol (Pro149) to 2.52E+05 kJ/mol (His149). Exposed, positively charged amino acids are suggested to promote interactions with negatively charged cellular systems47. The enhanced positive charge of FabV in the presence of His 149 might support its role in participating in the breakdown of the negatively charged fatty acids.
GshB, a glutathione reductase, has been shown to contribute to V. cholerae intestinal colonization33 and to have a role in the ability of V. cholerae to mount an acid tolerance response34. In all BD-2 isolates GshB had a threonine at position 93 (Thr93), whereas in the BD-1.2, the Thr93 was only found in 21.5% of the cases, with most (78.5%) of the BD-1.2 isolates exhibiting an isoleucine (Ile93) at this position. The BD-1.2 isolates with Ile93 are associated to a higher duration of diarrhoea (1-3 days) and a higher number of stool score (16-20 times and 21+ in 24 hours) compared to the BD-1.2 isolates with Thr93. Thr93 interacts with Asp92, Ile96, Tyr97 through 13 VDW interactions and 1 H-bond; whereas Ile93 not only forms the aforementioned interactions but also creates extra VDW interactions with Tyr97 (Fig. 8A-E). These additional bonds in the presence of Ile93 cause an increase in the stability of the structure (ΔΔG = 0.384 kcal/mol >0) and a decrease of the molecule flexibility (ΔΔSVib ENCoM: -0.055 kcal.mol-1.K-1), which is usually linked to a stronger binding affinity45,46. Moreover, the presence of Ile93 increased the negative charge of the surrounding area (<5Å) (Fig. S22A-B), with an overall electrostatic energy decreasing from 7.93E+03 kJ/mol (Thr93) to 7.4E+03 kJ/mol (Ile93) within the 5Å region and with an overall protein total electrostatic energy varying from 2.1E+05 kJ/mol (Thr93) to 1.8E+05 kJ/mol (Ile93). A decrease in total electrostatic energy is often associated to folding48, protein folding stability is largely dependent on the hydrophobic interactions of nonpolar residues49. The surface, on average, has become more hydrophobic, indicating a possible reorientation of residues or a change in the surface's exposure to the solvent (Fig. S22C-D).