Signature genomic traits of the core and accessory genome of Vibrio Cholerae O1 drive lineage transmission and disease severity

doi:10.21203/rs.3.rs-4184222/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 23 Sep, 2024

Read the published version in Nature Communications →

Version 1

In Bangladesh, Vibrio cholerae lineages are undergoing genomic evolution, causing pandemics and outbreaks with increased virulence, resistance, spreading ability and disease severity. However, our understanding of the genomic determinants influencing transmission and disease severity patterns, as well as their interplay, remains incomplete. Here, we developed a computational framework based on machine learning, genome-scale metabolic modelling (GSMM) and 3D structural analysis to identify V. cholerae signature genomic traits linked to lineage transmission dynamics and disease severity. We analysed isolates collected from in-patients across six regions in Bangladesh from 2015 to 2021, and uncovered a core set of accessory genes and coding and intergenic SNPs uniquely present in the most recent dominant lineage and underlying lineage transmission, with virulence, motility, colonization, biofilm formation, acid tolerance and bacteriophage resistance functions. Furthermore, we uncovered a strong correlation between a core set of V. cholerae genomic determinants and disease severity patterns (diarrhoeal duration, number of stools, abdominal pain, vomit, and dehydration). A subset of these determinants overlapped with those driving lineage transmission dynamics. Through GSMM and 3D structure analysis, we inferred the mechanistic bases underlying their selection and unveiled a complex interplay between transcription regulation, protein interactions and stability, and metabolic networks leading to severe symptoms. These connections influence lifestyle adaptation, intestinal colonization, oxidative stress and acid tolerance through modulation of ribosome, fatty acid and peptide biosynthesis, bacterial efflux systems, virulence, and resistance genes. Our computational framework uncovers signature traits that can provide insights for advancing therapeutics and developing targeted interventions to mitigate cholera spread.

Biological sciences/Computational biology and bioinformatics/Machine learning

Biological sciences/Computational biology and bioinformatics/Data mining

Biological sciences/Genetics/Genomics

Biological sciences/Microbiology/Bacteria/Bacterial genomics

Cholera is an acute diarrheal disease. Worldwide, 1.3 billion people are estimated to be at risk and approximately 1.3 to 4 million cases occur annually, with 21,000 to 143,000 resulting in death^1,3. In Bangladesh alone, where cholera is endemic, an estimated 66 million people are at risk of cholera with at least 100,000 cases and 4,500 deaths per year ^1,2. Globally the O1 serogroup remains the primary cause of cholera^1,3. The O1 serogroup is divided into the main serotypes Ogawa and Inaba, and subdivided into two biotypes, classical and El Tor (7th pandemic), which are genotypically and phenotypically distinct^4–6. V. cholerae has shown an extraordinary capacity to undergo genetic and phenotypic changes over time, giving rise to successive waves of genetically and phenotypically diverse pandemic clones. These variants exhibit increased virulence, pathogenicity, resistance and spreading capability^7,8.

Recently, distinctive lineages belonging to the 7th pandemic El Tor (7PET) wave-3 have been observed circulating in Bangladesh^9–11. The two most prominent circulating lineages identified over the last 20 years are BD-1 and BD-2^9,11, and more recently BD-1.2, responsible for the latest 2022 massive outbreak in the country⁹. Genomic analysis revealed variations between BD-1.2 and BD-2 in the Vibrio seventh pandemic island II (VSP-II), Vibrio pathogenic island 1 (VPI-1), mobile genetic elements, phage-inducible chromosomal island-like element (PLE), and SXT-related integrating conjugative elements (SXT ICE)⁹. Despite the advances of genomic analysis, the complete genomic repertoire and the mechanisms causing the greater transmission of BD-1.2 remain unknown. Gaps persist in our knowledge regarding whether coding or non-coding SNPs, or accessory genes, drive the evolutionary shifts. It remains unclear whether gene regulation, metabolic or molecular networks, or folding events play a role. There is even less knowledge about the genomic determinants responsible for the severity of cholera resulting from these lineages. About 1 in 5 people with cholera will experience severe symptoms (diarrhoea, vomiting, dehydration)¹². Amongst the major symptoms, watery diarrhoea characteristic of cholera is caused by the cholera toxin (CT)^4-6. The V. cholerae El Tor responsible for the current cholera pandemic has become more virulent by undergoing several changes in CTX genotype¹⁶ and acquiring virulence-related gene islands¹³.

In this study, we developed a reference-agnostic machine learning method, coupled with genome-scale metabolic modelling (GSMM) and protein structural analysis, to achieve two objectives. The first objective was to identify the genetic variations and signatures of BD-1.2 lineage evolution beyond what has been found so far⁹. Our analysis considered 129 V. cholerae isolates from diarrhoea samples collected between 2015 and 2021 from patients admitted to the icddr,b hospital in Bangladesh. Several genomic studies have investigated the evolution of lineages from 1991 to 2017, as well as in 2022^9–11. However, the intervening period remains a gap in research. In our analysis, we discovered a set of 77 SNPs within the coding genome (mapped to 50 known genes), along with 12 annotated accessory genes, including some associated with antibiotic resistance, virulence, motility, colonization, biofilm formation, acid tolerance and bacteriophage resistance, identified as correlated with BD-1.2 transmission. Our findings go beyond what was recently discovered^9,11 for the lineage.

The second objective was to investigate whether correlations exist between the genomic determinants of BD-1.2 strains and clinical manifestations among the hospitalised patients from whom the isolates were collected. Machine learning revealed correlations between genetic determinants in V. cholerae and clinical symptoms (diarrhoeal duration, number of stools, abdominal pain, vomit, and dehydration). Overall, the analysis revealed an overlap of 11 mutations, four accessory genes, and one intergenic SNP between the unique genomic determinants associated with BD-1.2 transmission and the clinical symptoms linked to this lineage. Additionally, a distinct set of 17 mutations, 54 accessory genes, and six intergenic SNPs was found to be exclusively linked to the manifestation of severe clinical symptoms. Through detailed GSMMs and 3D structure analysis of these genes, we inferred the mechanistic basis behind the selection of these genomic drivers in BD-1.2 and their link to severe symptoms.

From 2015 to 2021 in Bangladesh, a diverse array of genetic variations characterises the emergence of distinct circulating lineages

To explore the evolutionary dynamics of V. cholerae linked to ongoing cholera cases in Bangladesh, we performed a genomic analysis covering the years 2015 to 2021. We sequenced 129 V. cholerae O1 El Tor isolates taken from stool samples of patients admitted between September 2015 and April 2021 to hospitals in six districts of Bangladesh (Barisal, Chittagong, Dhaka, Khulna, Rajshahi and Sylhet), Table S1. Over the course of this study, isolates belonging to serotypes Inaba and Ogawa were identified (Fig. 1). Consistent with previous studies^9,14, a serotype switch was observed, with Inaba predominantly present in 2016 and 2017, followed by a predominance of Ogawa samples in 2018 and 2019 (Fig. S1). Both serotypes were detected in 2015 and continued to coexist from 2020 onwards. Serotypes were significantly associated with collection years (chi-square test with p-value Bonferroni < 0.005) but not significantly associated with collection location (chi-square test with p-value Bonferroni > 0.005).

The maximum likelihood phylogeny of the 129 isolates was reconstructed based on the alignment of the core genome (3468 genes) and showed two distinctly evolved lineages (Fig. 1). Comparison with previous studies^9,11 identified these lineages as BD-1.2 (n=84) and BD-2 (n=45), Fig. S2. Apart from the previously reported genetic variations⁴, we identified additional differences between the two lineages in VSP (vibrio seventh pandemic; VSP-1 and VSP-2), VPI (vibrio pathogenicity islands; VPI-1 and VPI-2) and PLE (phage-inducible chromosomal island-like elements), see Fig. 1. More precisely, in VSP-2, BD-2 isolates had a tryptophan at position 249, while BD-1.2 had a leucine at this position. In addition, in VSP-2, gene VC-514 (aer) was present in all BD-2 isolates but absent in BD-1.2. In VPI-2, a SNP led to an amino-acid variation at position 150, with BD-1.2 having an aspartic acid and BD-2 an asparagine. BD-2 samples exclusively exhibited PLE2, while BD-1.2 samples had both PLE1 and PLE2. Moreover, further differences were found in non-synonymous SNPs on core genes and in the presence/absence of accessory genes, as described in the following section.

The distinct phylogeny patterns of BD-2 and BD-1.2 were also confirmed through a comparative study analysing 1134 isolates of V. cholerae El Tor O1 strains across 84 countries, including our isolates (Tables S2 and S3, Fig. S3). BD-2 isolates clustered with Indian-1 (IND-1), while BD-1, BD-1.1, and BD-1.2 isolates from Bangladesh clustered with African (T9-T13)¹⁵, Latin America-3 (LAT-3)¹⁶, Asian-2 (AS-2), and Indian-2 (IND-2) lineages (Fig. S3), in agreement with previous results⁹.

Genetic and temporal differentiation of V. cholerae BD-1.2 and BD-2 lineages correlate with SNPs on coding and non-coding regions, and accessory genes

To assess the relatedness of V. cholerae isolates in our cohort, we measured the number of differing core genome SNPs in a pairwise manner across all isolates. We created a network based on clusters of related isolates differing by fewer than 15 SNPs, as done previously^17,18. Across the cohort, the median SNP difference was 117 SNPs (ranging from 0 to 1710 SNPs, with an IQR of 1211). The resulting undirected graph (Fig. 2) revealed that BD-2 and BD-1.2 formed two disconnected subgraphs, each composed of samples from a single lineage, but with no distinct separation between the Ogawa and Inaba serotypes.
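The thresholded relatedness network described above can be sketched as follows. This is a minimal illustration, assuming aligned core-genome sequences of equal length; the isolate names, toy sequences and the choice to ignore gaps and ambiguous bases are assumptions of this sketch, not details taken from the study.

```python
from itertools import combinations

def snp_distance(seq_a, seq_b):
    """Count aligned positions where two sequences differ.

    Gaps and ambiguous bases are skipped (an assumption of this sketch).
    """
    valid = set("ACGT")
    return sum(1 for x, y in zip(seq_a, seq_b)
               if x in valid and y in valid and x != y)

def relatedness_clusters(alignment, threshold=15):
    """Link isolates differing by fewer than `threshold` SNPs and return
    the connected components of the resulting undirected graph."""
    neighbours = {name: set() for name in alignment}
    for a, b in combinations(alignment, 2):
        if snp_distance(alignment[a], alignment[b]) < threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in alignment:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Toy alignment (hypothetical isolates; threshold lowered to suit 10-bp sequences).
toy = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTACGTAT",  # 1 SNP away from iso1
    "iso3": "TGCATGCATG",  # differs from iso1 and iso2 at every position
    "iso4": "TGCATGCATA",  # 1 SNP away from iso3
}
clusters = relatedness_clusters(toy, threshold=3)
```

With real data the default threshold of 15 SNPs would apply; two components with no edges between them correspond to the disconnected lineage subgraphs reported for BD-1.2 and BD-2.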

To identify additional genetic elements potentially shaping the differences between the BD-1.2 and BD-2 isolates in our cohort, beyond current annotations (ctxB allele, type of SXT/ICE, VSP-II, VPI-1, gyrA gene allele)⁹, we searched for patterns of similarities and differences at a finer scale, examining the number, type and position of accessory genes as well as mutations in the core genome and intergenic regions across all the isolates. A two-sided Fisher exact test, with Bonferroni correction, was performed to assess the relationship between the BD-2 and BD-1.2 lineages and each of the genomic features (core and intergenic SNPs and accessory genes). Overall, we found that a larger proportion of core genome mutations (51.4%, 1224 core genome SNPs; 73.1%, 160 intergenic SNPs) and a smaller proportion of accessory genes (11.3%, 115 genes) exhibited statistically significant differentiation between the two lineages (Table S4). Refer to Supplementary Note 1 and Fig. S4 for more details on the statistical analysis comparing the number of accessory genes, core genome SNPs and intergenic SNPs. The comparative analysis also indicated a temporal shift in the distribution of core genome and intergenic SNPs over the years, showing that BD-1.2 isolates accumulated different SNPs compared to BD-2 isolates as time progressed (Fig. S4E-F).
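A minimal sketch of the per-feature test described above, assuming each feature is summarised as a 2x2 table of lineage (BD-1.2 vs BD-2) against allele or gene presence. The function names and the number of tests used in the example are illustrative assumptions, not the authors' implementation. For a feature that perfectly separates 84 BD-1.2 from 45 BD-2 isolates, the uncorrected two-sided p-value reduces to 1/C(129, 45), roughly 7.97e-36, which agrees with the value listed for the lineage-exclusive SNPs in Table 1.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Rows: BD-1.2 (84 isolates), BD-2 (45 isolates); columns: ALT, REF allele.
p_exclusive = fisher_exact_two_sided(84, 0, 0, 45)

# Hypothetical second feature with no lineage signal, to show the correction.
adjusted = bonferroni([p_exclusive, fisher_exact_two_sided(42, 42, 22, 23)])
```

In practice the correction factor would be the total number of features tested in each category (core SNPs, intergenic SNPs, accessory genes), which is why only very small raw p-values survive.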

Out of the 115 accessory genes that differed between the two lineages, 12 were annotated while the remaining 101 were hypothetical. Among these 12 annotated genes, five (lon_3, endA, adh, hdfR_4 and bcr_2) were predominant (over 96% presence) in BD-1.2 and absent in BD-2, and seven (aer_3, hlyA_2, mcrC, mepM_3, mrr, tetA and tetR) were present (over 97% presence) in BD-2 and absent in BD-1.2. Of the twelve annotated genes, three are known antimicrobial resistance genes (bcr, tetA and tetR)¹⁹. TetA and tetR were mainly detected in BD-2 isolates (97.7%), which were confirmed as primarily tetracycline-resistant through susceptibility testing against both doxycycline and tetracycline (Table S1). In contrast, bcr, a multidrug efflux pump, was predominantly present in BD-1.2 isolates (96.4% of isolates) and completely absent in BD-2 isolates. Out of the 16 known antimicrobial resistance genes (ARGs) present in the pangenome of this cohort, only tetA, tetR and bcr statistically separated the two lineages. TetA and tetR were both located in a contig showing high similarity to the SXT-ICE element SXT(HN1) in BD-2 isolates. Conversely, bcr was found in a mobile element in the BD-1.2 isolates with similarity to the SXT ICE element ICEVchBan5. The presence of these SXT elements in the BD-2 and BD-1.2 lineages was previously shown by Monir et al⁹. Both contigs contained two identical insertion sequences (mobile genetic elements, MGEs), ISShfr9 and ISVsa3, see Fig. S5. Also, among the 12 annotated genes, four (endA, hlyA, lon and mcrC) were previously found to be related to virulence^19–24. More information about the function of these genes is given in Supplementary Note 2.

To assess the extent of our results beyond our cohort, we investigated whether the 12 annotated accessory genes that we had found were also present in other Bangladeshi and Indian lineages. We performed a comparative genomic analysis of 219 Vibrio cholerae O1 reference isolates collected in Kolkata, India, and Dhaka, Bangladesh, between the years 2004 and 2022 (ENA public database http://www.ebi.ac.uk/ena, see Table S5). The results confirmed the presence/absence patterns of the 12 genes in the BD-1.2 and BD-2 lineages in the reference isolates, aligning with our initial findings, see Supplementary Note 2.

In addition to differences in accessory gene types and patterns, missense mutations associated with allelic variations were found in BD-1.2 when compared to BD-2 strains. We identified 1385 SNPs in the core genome, including 291 non-synonymous and 934 synonymous coding variants, both representing variants in their functional protein-coding form. In addition, 160 intergenic SNPs were found, representing variants in their regulatory form. Many SNPs showed unique allelic distribution patterns between the two lineages. When mapped back, the non-synonymous SNPs identified 291 amino acid substitutions in 105 genes, including 50 known genes and 55 hypothetical ones (see Table S4). Table 1 shows core genes whose allelic distribution differed significantly between BD-1.2 and BD-2 (i.e., containing polymorphic sites found exclusively in one lineage but absent in the other).

Table 1. Core genes with a significantly different allelic distribution between BD-1.2 and BD-2. Core genes containing non-synonymous SNPs whose allelic variants were found exclusively in one lineage but absent in the other. For each SNP, the allelic frequencies of the major and minor alleles and the P-value for the allelic distribution between BD-1.2 and BD-2 have been calculated. The reference allele refers to the nucleotide present in the reference genome sequence of V. cholerae N16961 El Tor (NCBI Accession ID: NC_002505.1 and NC_002506.1) and the alteration allele refers to the nucleotide absent in the reference genome, as previously done by Monir et al^9,11.

Gene name	Vibrio cholerae name^a	SNP genomic location	Alleles REF^b / ALT^c	Frequency of REF^b allele in BD-1.2 / BD-2	Frequency of ALT^c allele in BD-1.2 / BD-2	Amino-acid change	P-value (Fisher exact test)
appC	VC_1571^d	676	G/A	0/100	100/0	Ala226Thr	7.97E-36
argG	VC_2642^d	847	A/G	100/0	0/100	Thr283Ala	7.97E-36
bluF	VC_1641	448	C/T	0/100	100/0	Lys149Asn	7.97E-36
bluF	VC_1641	449	A/G	0/100	100/0	Ser150Arg	7.97E-36
bluF	VC_1641	451	T/G	0/100	100/0	Leu151His	7.97E-36
bluF	VC_1641	455	T/C	0/100	100/0	Gly152Ser	7.97E-36
bluF	VC_1641	456	G/A	0/100	100/0	Phe153Ala	7.97E-36
bluF	VC_1641	461	C/T	0/100	100/0	Gln154Ala	7.97E-36
bluF	VC_1641	462	A/G	0/100	100/0	Thr155Asn	7.97E-36
bluF	VC_1641	466	C/T	0/100	100/0	Ala156Ser	7.97E-36
bluF	VC_1641	469	C/T	0/100	100/0	Ile157Arg	7.97E-36
bluF	VC_1641	472	G/A	0/100	100/0	Asp158Ile	7.97E-36
clcA	VC_A0526^d	311	G/A	0/100	100/0	Gly104Glu	7.97E-36
cobB	VC_1509^d	149	C/T	100/0	0/100	Pro50Leu	7.97E-36
ctxB	VC_A0009	58	C/A	0/100	100/0	His20Asn	7.97E-36
cysG_1	VC_1363^d	113	T/C	100/0	0/100	Val38Ala	7.97E-36
dltA_1	VC_A0149	3145	T/C	0/100	100/0	Phe1049Leu	7.97E-36
dsbD	VC_2701^d	1748	C/T	0/100	100/0	Thr583Ile	7.97E-36
ftsI	VC_2407^d	1472	G/A	0/100	100/0	Arg491His	7.97E-36
glmM	VC_0639^d	587	G/T	100/0	0/100	Arg196Leu	7.97E-36
gyrA	VC_1258	1980	G/T	100/0	0/100	Asp660Glu	7.97E-36
hrpB	VC_0601	2345	C/T	100/0	0/100	Ala782Val	7.97E-36
licH	VC_1284^d	166	G/A	0/100	100/0	Ala56Thr	7.97E-36
mak	VC_0270^d	346	G/A	0/100	100/0	Gly116Arg	7.97E-36
murI	VC_0158^d	409	G/T	100/0	0/100	Ala137Ser	7.97E-36
mutL	VC_0345	1048	T/C	0/100	100/0	Cys350Arg	7.97E-36
nudF_2	VC_2435	325	C/T	100/0	0/100	Arg109Cys	7.97E-36
pctB_4	VC_0514	746	T/G	100/0	0/100	Leu249Trp	7.97E-36
phhA	VC_A0828^d	56	A/T	100/0	0/100	Gln19Leu	7.97E-36
putA		1799	C/T	100/0	0/100	Ala600Val	7.97E-36
recD	VC_2319	2039	A/G	0/100	100/0	Tyr680Cys	7.97E-36
rssB_3	VC_1652	235	C/T	0/100	100/0	Leu79Phe	7.97E-36
skp	VC_2251	146	T/G	0/100	100/0	Leu46Trp	7.97E-36
suhB	VC_0745^d	650	A/G	100/0	0/100	Glu217Gly	7.97E-36
tamA	VC_2548	797	C/T	100/0	0/100	Thr266Ile	7.97E-36
trmL_1	VC_A0627	16	A/T	100/0	0/100	Thr6Ser	7.97E-36
tyrS_1	VC_0631	1177	A/G	100/0	0/100	Thr393Ala	7.97E-36
valS	VC_2503	1796	G/A	0/100	100/0	Arg599His	7.97E-36
ycbB	VC_1268	976	C/T	100/0	0/100	Pro327Ser	7.97E-36
^a V. cholerae name: specific VC (Vibrio cholerae) gene names.
^b REF (Reference) allele: refers to the nucleotide present in the reference genome sequence of V. cholerae N16961 El Tor (NCBI Accession ID: NC_002505.1 and NC_002506.1).
^c ALT (Alteration) allele: refers to the nucleotide absent in the reference genome.
^d Metabolic genes found in the V. cholerae GSM model iAM-Vc960.

Among the genes exhibiting lineage-specific allelic variation, some contribute to functions including growth, cell wall organization, colonization, toxigenicity and resistance, similar to what was found previously⁹. Additionally, we found genes with a unique non-synonymous variant in BD-1.2, with roles in toxin transport and acid tolerance, shedding light on functions that may explain the recent prevalence of BD-1.2 over BD-2. See Supplementary Note 3 for more information about these genes. Notably, OmpU is another gene with a statistically significant mutation (G325D) underlying the separation of the lineages: amino acid D is predominant in BD-1.2, while amino acid G is prevalent in BD-2.

To understand the systemic relationships connecting the identified lineage-specific genetic signatures on a mechanistic level, we analysed the 30 core genes in Table 1, whose allelic variants were found exclusively in one lineage, using the V. cholerae GSM model iAM-Vc960 (Fig. 4). Thirteen of these genes (murI, ftsI, appC, suhB, glmM, dsbD, licH, cysG_1, cobB, clcA, argG, mak, phhA) are metabolic and play integral roles in amino acid metabolism, cell wall metabolism, carbon metabolism, amino sugar and nucleotide sugar metabolism, and energy metabolism (see Table S6). Flux variability analysis (FVA) and flux balance analysis (FBA) were used to predict, through gene knockouts, the essentiality of the identified genetic determinants and their effects on the flow of metabolites through the V. cholerae metabolic network (all known metabolic reactions) and on V. cholerae growth rates. The genes cysG, clcA, adh and mcrC were found to be essential for growth (i.e. knocking these genes out reduced the biomass growth rate to less than 0.0001 h^-1) in both rich and minimal media, including rich media with alternative carbon sources. Three further genes, murI, glmM and dapF, were essential for growth in minimal media only, displaying auxotrophic behaviour. Next, FVA was used to identify biochemical reactions whose flux span changed significantly (by more than 10%) when these genes were knocked out. In total, ten genes (murI, glmM, cysG, clcA, argG, mak, adh, dapF, add and mcrC) significantly changed the flux span of at least one reaction in the model when knocked out (Table S6). Finally, FBA was used to determine the effect of gene knockouts on metabolite yield. Five genes (murI, glmM, cysG, mak and dapF) reduced at least one metabolite yield to zero in the model when knocked out (given a wild-type yield greater than 0), Table S6.
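For readers less familiar with constraint-based modelling, the knockout analysis above can be written compactly as follows. This is a generic textbook-style formulation, not taken from the study: the symbols S (stoichiometric matrix), v (flux vector), c (biomass objective), l and u (flux bounds), R(g) (reactions disabled by deleting gene g) and the optimality fraction γ are notation introduced here, and the relative definition of the 10% flux-span change is an assumption.

```latex
% FBA: maximise the biomass flux subject to steady state and flux bounds
\begin{aligned}
\mu^{\ast} = \max_{v}\; & c^{\top} v \\
\text{s.t.}\; & S v = 0, \\
              & l_j \le v_j \le u_j \quad \text{for every reaction } j .
\end{aligned}

% Knockout of gene g: close every reaction that cannot proceed without g
l_j = u_j = 0 \quad \text{for } j \in R(g), \qquad
g \text{ is called essential if } \mu^{\ast}_{\Delta g} < 10^{-4}\,\mathrm{h}^{-1} .

% FVA: flux range of reaction i at (near-)optimal growth
v_i^{\min} = \min_{v} v_i, \quad v_i^{\max} = \max_{v} v_i
\quad \text{s.t. } S v = 0,\; l \le v \le u,\; c^{\top} v \ge \gamma\,\mu^{\ast} .

% Relative change in flux span between knockout and wild type (assumed definition)
\Delta_i =
\frac{\bigl|\,(v_i^{\max}-v_i^{\min})_{\Delta g} - (v_i^{\max}-v_i^{\min})_{\mathrm{WT}}\bigr|}
     {(v_i^{\max}-v_i^{\min})_{\mathrm{WT}}} > 0.10 .
```

Under this reading, a gene is essential in a given medium when the medium-specific bounds l and u leave no flux distribution that sustains growth once R(g) is closed, which is how a gene can be essential in minimal but not in rich media.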

Lastly, when mapping the 160 intergenic SNPs back to the genomes, we found them located in the upstream/downstream regions of 35 known genes and 34 hypothetical genes (see Table S4). These intergenic SNPs exhibited lineage-specific allelic distributions, with the minor variant prevalent in the BD-2 isolates (68% to 100%) and the major variant dominant in the BD-1.2 isolates (over 98%); only one SNP in BD-1.2 had a major allelic variant at a frequency of 47% (Fisher exact test, Bonferroni correction p-value < 2.31e-08). Many of these SNPs were located within transcription factor binding sites (TFBs) (Table S4). Intergenic SNPs exhibiting significantly different allelic distributions between BD-1.2 and BD-2 mapped across the TFBs of 11 TFs (ToxT, Fur, AmpR, OmpR, LuxR, LexA, ArgR, PhoP, CRP, ArcA) (Fig. S6-S16). More information about the function of these transcription factor binding motifs is provided in Supplementary Note 4.

Machine learning unravels correlations between genomic determinants and clinical symptoms in humans

Beyond identifying the potential involvement of new genetic traits in differentiating the BD-1.2 and BD-2 lineages, we hypothesized that the same or additional genetic features might play a significant role in the manifestation of clinical symptoms in patients when infected with Vibrio. We focused on the lineage BD-1.2, which caused the most recent outbreak in Bangladesh. To identify if and which coding and non-coding mutations and/or presence/absence of accessory genes would correlate with the different clinical symptoms, we employed a bespoke, supervised machine learning pipeline.

The pipeline mines sequencing data to identify the genetic elements that correlate most strongly with observed differences in clinical symptoms, which in this case are vomit, dehydration, number of stools, duration of diarrhoea and abdominal pain (see Methods section). The pipeline is a bespoke adaptation of ML-based data-mining methods previously developed within our team to identify correlations between genomic features and phenotypes^18,19,25,26. In the pipeline, information about different genetic features (SNPs from both coding and non-coding regions, and presence/absence of accessory genes) is encoded as input to ML-powered predictive models designed to estimate the likelihood of observing the selected phenotypes under each specific pattern of input values¹⁸. Provided they are trained with sufficient observational data, the ML-powered predictive models are able to replicate experimental evidence, in addition to providing information on which inputs correlate most strongly with each phenotypic manifestation. Through this introspective power, the pipeline is able to unravel co-occurring, multiple mechanisms (mutations, horizontal gene transfer, HGT), variants in their functional protein-coding and regulatory forms, as well as their additive effect on the targeted phenotypes, which in this work were different clinical symptoms.
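As an illustration of how such heterogeneous features might be encoded as model input, the sketch below builds a binary matrix with one row per isolate and one column per SNP site or accessory gene. The record layout, the site names and the toy isolates are hypothetical; the actual encoding used by the pipeline is described in the Methods and may differ.

```python
def encode_features(isolates, snp_sites, accessory_genes):
    """Return (feature_names, matrix) with one 0/1 row per isolate.

    SNP columns are 1 when the isolate carries the alternative allele;
    gene columns are 1 when the accessory gene is present.
    """
    names = ([f"snp:{site}" for site in snp_sites]
             + [f"gene:{gene}" for gene in accessory_genes])
    matrix = []
    for iso in isolates:
        snp_part = [int(site in iso["alt_snps"]) for site in snp_sites]
        gene_part = [int(gene in iso["genes"]) for gene in accessory_genes]
        matrix.append(snp_part + gene_part)
    return names, matrix

# Hypothetical isolates; gene names are borrowed from the text for readability only.
isolates = [
    {"id": "iso1", "alt_snps": {"siteA"}, "genes": {"bcr"}},
    {"id": "iso2", "alt_snps": {"siteA", "siteB"}, "genes": {"tetA", "tetR"}},
]
names, matrix = encode_features(
    isolates, snp_sites=["siteA", "siteB"],
    accessory_genes=["bcr", "tetA", "tetR"],
)
```

Keeping coding SNPs, intergenic SNPs and accessory genes in a single matrix is what lets one model weigh them jointly and report which inputs matter most for a given symptom.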

To ensure the best performance and avoid any bias in the pipeline, we experimented with multiple, different technologies powering the pipeline, namely five classification technologies (Linear SVM, RBF SVM, Random Forest, Extra tree classifier and Logistic regression) and two meta-methods (Adaboost and XGBoost). A nested cross validation approach was adopted to compare the technologies and select the best performing one, using the Friedman and Nemenyi tests to assess final classification performance (see Methods section).
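The nested cross-validation scheme mentioned above can be sketched at the level of index splitting. The fold counts, the seed handling and the absence of stratification are assumptions of this sketch (the study's settings are given in its Methods); 84 is used as the sample count only because it is the number of BD-1.2 isolates reported earlier.

```python
import random

def k_fold_indices(indices, k, seed=0):
    """Shuffle `indices` and split them into k roughly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    inner_splits is a list of (inner_train, inner_val) pairs built only
    from outer_train, so model selection never sees the outer test fold.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [idx for j, fold in enumerate(outer_folds)
                       if j != i for idx in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, seed + i + 1)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [idx for q, fold in enumerate(inner_folds)
                           if q != m for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits

splits = list(nested_cv_splits(n_samples=84, outer_k=5, inner_k=3))
```

The inner loop would be used to tune and compare the candidate classifiers, and the outer test folds to obtain the per-fold scores on which the Friedman and Nemenyi tests are then run.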

Five clinical symptoms were selected: vomit, abdominal pain, diarrhoea duration, 24-hour stool count and dehydration. Each clinical symptom was handled by a dedicated predictor pipeline, so as to produce separate results in terms of correlation with genetic elements. Two symptoms (vomit and abdominal pain) were encoded as binary (presence vs absence). The other three symptoms (diarrhoea duration, 24-hour stool count, and dehydration) were encoded as multi-class: dehydration as “None”, “Moderate” and “Severe”; diarrhoea duration as <1 day, 1-3 days, 4-6 days, and 7-9 days; and stool count in 24 hours as 3-5 times, 6-10 times, 11-15 times, 16-20 times, and 21+ times. We handled the prediction of multi-class phenotypes by implementing concurrent binary predictors, each addressing a pairwise combination of outcomes. In the end, we successfully developed six adequately performing binary prediction models for the phenotypic outcomes: i) stools 11-15 times vs. 16-20 times; ii) stools 11-15 times vs. 21+ times; iii) moderate vs. severe dehydration; iv) diarrhoea duration <1 day vs. 1-3 days; v) presence vs absence of vomit; and vi) presence vs absence of abdominal pain (Table S7). The remaining binary predictors were discarded for not performing adequately, either because the available sets of observations (needed for training the supervised ML models) were unbalanced, or because the phenotypes were harder to separate given the selected inputs (no features were statistically significant based on the Fisher exact test). Among the tested pipeline technologies mentioned earlier, logistic regression was identified by the Friedman F-test and the Nemenyi post-hoc analysis as the best performing one (Fig. S17). Of the six binary prediction models, four had an AUC greater than 0.9 (Fig. 4). Table S8 reports the performance metrics obtained by all binary predictors for each clinical symptom. Fig. 4 and Fig. S18 show the performance results for the logistic regression classifier.
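The pairwise decomposition of a multi-class symptom into binary problems can be sketched as below. The function name and the eight patient labels are hypothetical and serve only to show the mechanics. With five stool-count classes there are 10 candidate pairs, with four diarrhoea-duration classes 6, and with three dehydration classes 3, which is the pool from which the adequately performing models above were retained.

```python
from itertools import combinations

def pairwise_tasks(labels, classes):
    """For each pair of classes, return the indices of the samples that
    belong to either class and their 0/1 targets (1 = second class)."""
    tasks = {}
    for neg, pos in combinations(classes, 2):
        idx = [i for i, y in enumerate(labels) if y in (neg, pos)]
        tasks[(neg, pos)] = (idx, [int(labels[i] == pos) for i in idx])
    return tasks

stool_classes = ["3-5", "6-10", "11-15", "16-20", "21+"]
# Hypothetical labels for eight patients (not taken from the study data).
labels = ["11-15", "16-20", "21+", "11-15", "6-10", "16-20", "21+", "3-5"]

tasks = pairwise_tasks(labels, stool_classes)
idx, y = tasks[("11-15", "16-20")]
```

Each binary task only sees the patients in its two classes, so sparsely populated classes quickly lead to small or unbalanced training sets, one of the reasons given above for discarding predictors.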

The analysis of the best-performing predictors allowed us to identify the input features (core genome coding and intergenic SNPs and accessory genes) most strongly correlated with each phenotype (Table S9). Ninety-two different features in total were selected as relevant across the six predictor models, with 45% being present in three or more models (Fig. 5). No features were selected for all symptoms. Fifty-eight accessory genes (10 known genes, tufB_2, blc, yiaC, pckA, luxR_2, hcpA_1, rpoS, dcuA, oppA, luxR, and 48 hypothetical genes) and 28 core SNPs over 23 genes (14 known, clpS, gshB, dapF, fabV_1, add, tufB, lpoA, phrB, yjcS, fabH1, cysG_2, padC, pepN, tadA_2, and 9 hypothetical genes) were identified as strongly associated with the symptoms. Four known genes (add, dapF, fabV, and tufB) were all associated with three predictor models (abdominal pain, duration of diarrhoea and number of stools), with add also associated with dehydration and the other three genes also associated with vomit. The genes gshB, clpS and lpoA were all associated with three predictor models (diarrhoea duration <1 day vs. 1-3 days, stools 11-15 times vs. 16-20 times, stools 11-15 times vs. 21+ times), with lpoA also associated with abdominal pain and the other two genes with vomit.

Among the 58 accessory genes linked to clinical symptoms, four hypothetical genes were also statistically significant in distinguishing the two lineages. Among the other accessory genes selected, four (blc, pckA, luxR and rpoS) have important biological functions. In particular, Blc, also known as VlpA, is a lipocalin that is correlated with the acquisition of drug resistance in V. cholerae²⁷. PckA (phosphoenolpyruvate carboxykinase) is important for gluconeogenesis, a highly conserved pathway in bacteria and humans. Interfering with the gluconeogenesis pathway impairs V. cholerae colonization in mouse models, highlighting its crucial role in sustaining V. cholerae growth and viability within the intestines²⁸. LuxR plays a key role in regulating biofilm production and secretion in V. cholerae²⁹. RpoS is a sigma factor that facilitates physiological adaptation to general starvation and stationary phase growth in different species. V. cholerae strains lacking the rpoS gene are impaired in their ability to survive different environmental stresses. RpoS was also shown to be important in V. cholerae for efficient intestinal colonization³⁰.

Out of the 28 core SNPs associated with the clinical symptoms, 11 were also found to be statistically significant in differentiating the BD-2 and BD-1.2 lineages (see above), Table S9. These 11 SNPs mapped to 11 genes (clpS, gshB, dapF, fabV_1, add, and six hypothetical). Among the SNPs mapping to known genes (clpS, gshB, dapF, fabV_1, add), three are non-synonymous SNPs mapping to clpS, gshB and fabV. In V. cholerae, ClpS regulation involves the cAMP receptor protein (CRP)³¹, which is important in intestinal colonization³¹. gshB encodes glutathione (GSH) synthetase and is associated with resistance to oxidative stress. V. cholerae fabV is one of several triclosan-resistant ENR-encoding genes³².

Nine symptom-related genes were identified as metabolic genes in the iAM-Vc960 GSM model (Fig. 6). Eight of these genes were associated with five metabolic systems (Table S10). FabH1 and gshB are associated with cofactor and prosthetic group metabolism; pckA is associated with carbohydrate metabolism; dcuA plays a crucial role in C4-dicarboxylate transport; dapF, pepN and gshB are significant in amino acid metabolism; add and pckA are relevant to nucleotide metabolism; oppA and fabH1 are involved in cell wall metabolism, with fabH1 relevant for fatty acid biosynthesis (Table S10).

Using FBA and FVA, knockouts of the genes dapF and gshB were found to halt production of several metabolites. The genes pckA, add, dapF, oppA and gshB were found to significantly change the reaction flux span (Table S10). Both FBA and FVA can indicate whether metabolic adaptation mechanisms in V. cholerae may alter bacterial virulence, potentially leading to worse symptoms, when genes significantly affect pathways associated with important functions such as colonization, biofilm production and cell wall synthesis. For example, the gshB gene, encoding a glutathione synthetase, contributes to V. cholerae intestinal colonization³³ and has a role in the acid tolerance response³⁴. Similarly, dapF was found to be an essential gene in minimal media, and its knockout led to auxotrophy for the amino acid lysine. As Pearcy et al.³⁵ indicated, auxotrophic behaviour of a gene connected to amino-acid biosynthesis is important because it can provide a competitive fitness advantage against commensal bacteria. During the infection stage, V. cholerae engages and competes with commensal bacteria for nutrient acquisition to support rapid growth and multiplication³⁶. Moreover, the lysine pathway plays a central role in eubacterial cell wall biosynthesis, since meso-diaminopimelate is the immediate precursor for the biosynthesis of its main component, peptidoglycan, with dapF responsible for the creation of meso-diaminopimelate in the lysine pathway^37,38. The proper synthesis and maintenance of peptidoglycan is essential for bacterial virulence and viability³⁹.

To delve deeper into the functional mechanisms underlying clinical symptoms, we explored the interactome of the proteins associated with the clinical symptoms. The protein-protein interaction (PPI) network analysis revealed the interactome of 36 proteins, selected by the machine learning pipeline, with 109 other proteins (Fig. S19). The KEGG analysis indicated enrichment in ribosome proteins (e.g., RpoS) and fatty acid biosynthesis (e.g., FabH1, FabV) (Fig. S20). Colonization of the human intestine and virulence of V. cholerae are intricately connected to both fatty acid metabolism⁴⁰ and the ribosome pathway⁴¹. The GO analysis highlighted enrichment in translation, peptide biosynthetic processes, and gene expression, featuring TufA, TufB, RpoS and GshB (Tables S11-S12). The peptide biosynthetic pathway plays a vital role in V. cholerae biofilm formation and colonization⁴².

None of the six intergenic SNPs selected by the machine learning pipeline were in TFBs or promoters. These SNPs were located in regions without any functional annotation within 2 kb upstream or 0.5 kb downstream of a gene, following the standard dbSNP cutoffs for SNP-to-gene mapping^43,44. See Table S9 for additional information about the location of these SNPs.
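The SNP-to-gene mapping window described above can be sketched as follows. This is an illustrative implementation under stated assumptions: gene coordinates are 1-based and inclusive, upstream is interpreted relative to the strand, and the gene names and positions are hypothetical.

```python
from dataclasses import dataclass
from typing import List

UPSTREAM_BP = 2000   # 2 kb upstream window, as stated in the text
DOWNSTREAM_BP = 500  # 0.5 kb downstream window, as stated in the text


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # 1-based inclusive, start < end regardless of strand
    end: int
    strand: str  # "+" or "-"


def genes_near_snp(pos: int, genes: List[Gene]) -> List[str]:
    """Return genes whose flanking window contains an intergenic SNP position."""
    hits = []
    for g in genes:
        if g.start <= pos <= g.end:
            continue  # inside the gene body, so not intergenic for this gene
        if g.strand == "+":
            lo, hi = g.start - UPSTREAM_BP, g.end + DOWNSTREAM_BP
        else:
            # on the reverse strand, upstream lies at higher coordinates
            lo, hi = g.start - DOWNSTREAM_BP, g.end + UPSTREAM_BP
        if lo <= pos <= hi:
            hits.append(g.name)
    return hits


# Hypothetical annotation with one gene on each strand
genes = [Gene("geneA", 5000, 6000, "+"), Gene("geneB", 9000, 9900, "-")]
print(genes_near_snp(3500, genes))
```

A SNP that returns an empty list under this rule would fall outside every mapping window and would remain unassigned to any gene.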

Structural analysis suggests evolutionary drivers of selection and mechanistic bases for BD-2 and BD-1.2 lineages evolution and associations to clinical symptoms

To further understand whether the identified alleles play a causal role in the evolution of lineages and clinical symptoms, we selected top-ranked non-synonymous SNP candidates, prioritizing the following aspects in relation to the associated genes: (i) a significant allelic distribution between BD-1.2 and BD-2; (ii) a significant correlation, as detected by the ML pipeline, with the selected clinical symptoms; (iii) functional importance for V. cholerae metabolism (i.e. significantly impacting reaction flux when knocked out, as highlighted by the GSM model) and/or interactome (i.e. enrichment of the functions and mechanisms related to pathogenesis); (iv) 3D structural mutation analysis that could be benchmarked with experimental evidence. This resulted in three genes, all top-ranked by both the Fisher exact test for BD-1.2 and BD-2 lineage evolution and the ML analysis for the underlying clinical symptoms, namely fabV, gshB and clpS. We mapped the alleles of fabV, gshB and clpS to their protein structures using both experimental crystal structures and predicted homology models. However, the 3D structure could be used to infer the mechanistic basis only for fabV and gshB.

In all BD-2 isolates FabV had a proline at position 149 (Pro149), whereas in BD-1.2 isolates Pro149 was found in only 40.5% of cases, with the remaining 59.5% of isolates exhibiting histidine at position 149 (His149). The BD-1.2 isolates with His149 showed a longer duration of diarrhoea (1-3 days) and a higher number-of-stools score (16-20 times and 21+ in 24 hours) compared to the BD-1.2 isolates with Pro149, which featured a shorter diarrhoea duration (<1 day) and a lower number-of-stools score (11-15 times). Amino acid 149 is located in the trans-2-enoyl-CoA reductase catalytic domain (Fig. 7A-E). When Pro149 is present, it interacts with Lys148, Ser151 and Trp159 through van der Waals (VDW) interactions, whereas His149 not only forms these interactions but also creates an extra VDW interaction with Lys148. Furthermore, His149 interacts with an additional amino acid, Arg150, through a VDW interaction. These additional interactions in the presence of His149 cause an increase in the stability of the structure (ΔΔG = 0.101 kcal/mol > 0) and a decrease in molecular flexibility (ΔΔS_Vib ENCoM: -0.053 kcal.mol^-1.K^-1), which is usually linked to a stronger binding affinity^45,46. Moreover, the presence of His149 increased the positive charge of the surrounding area (Lys148, His149, Arg150) (Fig. S21), with the overall electrostatic energy increasing from 7.3E+03 kJ/mol (Pro149) to 7.48E+03 kJ/mol (His149) within the 5Å region and the total electrostatic energy of the protein rising from 2.1E+05 kJ/mol (Pro149) to 2.52E+05 kJ/mol (His149). Exposed, positively charged amino acids are suggested to promote interactions with negatively charged cellular systems⁴⁷. The enhanced positive charge of FabV in the presence of His149 might support its role in the breakdown of negatively charged fatty acids.

GshB, a glutathione synthetase, has been shown to contribute to V. cholerae intestinal colonization³³ and to have a role in the ability of V. cholerae to mount an acid tolerance response³⁴. In all BD-2 isolates GshB had a threonine at position 93 (Thr93), whereas in BD-1.2 isolates Thr93 was found in only 21.5% of cases, with most (78.5%) of the BD-1.2 isolates exhibiting an isoleucine (Ile93) at this position. The BD-1.2 isolates with Ile93 are associated with a longer duration of diarrhoea (1-3 days) and a higher number-of-stools score (16-20 times and 21+ in 24 hours) compared to the BD-1.2 isolates with Thr93. Thr93 interacts with Asp92, Ile96 and Tyr97 through 13 VDW interactions and 1 H-bond, whereas Ile93 not only forms these interactions but also creates extra VDW interactions with Tyr97 (Fig. 8A-E). These additional interactions in the presence of Ile93 cause an increase in the stability of the structure (ΔΔG = 0.384 kcal/mol > 0) and a decrease in molecular flexibility (ΔΔS_Vib ENCoM: -0.055 kcal.mol^-1.K^-1), which is usually linked to a stronger binding affinity^45,46. Moreover, the presence of Ile93 increased the negative charge of the surrounding area (<5Å) (Fig. S22A-B), with the overall electrostatic energy decreasing from 7.93E+03 kJ/mol (Thr93) to 7.4E+03 kJ/mol (Ile93) within the 5Å region and the total electrostatic energy of the protein decreasing from 2.1E+05 kJ/mol (Thr93) to 1.8E+05 kJ/mol (Ile93). A decrease in total electrostatic energy is often associated with folding⁴⁸, and protein folding stability is largely dependent on the hydrophobic interactions of nonpolar residues⁴⁹. The surface has, on average, become more hydrophobic, indicating a possible reorientation of residues or a change in the exposure of the surface to the solvent (Fig. S22C-D).

Bangladesh has witnessed the continual genomic evolution of V. cholerae lineages, with increased virulence, resistance, global spreading ability and disease severity. The potential of a V. cholerae isolate to spread globally and cause disease is mostly studied through bioinformatics analysis of its genome. Two recent studies^9,11 explored the genomic attributes of the lineage BD-2, predominant between 2004 and 2018, and the emergent lineage BD-1.2, appearing from 2016 onwards and responsible for the 2022 outbreak. By comparing these lineages, the authors revealed mutations in the ctxB allele, SXT/ICE, VSP-II, VPI-1 and the gyrA allele⁹, potentially explaining the recent shift in lineage predominance. Despite these advances, gaps persist in understanding the entire genomic repertoire associated with transmission ability and different disease severity patterns.

Here, we developed an analysis approach that combines ML-powered data mining, whole-genome sequencing, genome-scale metabolic modelling and 3D structural analysis to uncover, on a finer scale, unknown associations between lineage transmission dynamics, disease severity and the genomic make-up of V. cholerae isolates. Machine learning offers a powerful opportunity to analyse entire genomes efficiently against selected phenotypes (lineages, clinical symptoms), allowing for the identification of genomic features ranked by strength of correlation with the phenotype. This provides a significant advantage over conventional genomics-only methods based on checking for presence/absence or on similarity searches of known, manually chosen determinants. Moreover, our approach allowed various genetic determinants (accessory genes, and core coding and intergenic SNPs) to be analysed simultaneously to capture the co-occurrence, synergism and additive effect of multiple mechanisms and determinants (mutations, accessory genes, horizontal gene transfer, functional, metabolic, and regulatory variants). Determinants identified by ML may contain genes with a known functional relationship with the phenotype as well as genes with no previously known association with that specific phenotype. Altogether, our reference-agnostic approach overcomes limitations of previous genomics studies that considered only one feature type (SNPs, accessory genes) at a time and only known genetic elements associated with Vibrio transmission.

Using our method, in addition to confirming the aforementioned mutations identified in recent genomics studies⁹, we found further mutations in VSP, VPI, and PLE, exclusive to one lineage and absent in the other, supplementing those previously found by Monir et al.⁹. Moreover, our findings expand the known mutations to a wider range of genomic determinants, including 115 accessory genes, 1225 core coding SNPs, and 160 intergenic SNPs, which are crucial for explaining in greater depth the recent shift between BD-1.2 and BD-2 and which supplement previous knowledge on the type, number and functions of the genomic determinants differentiating BD-1.2 and BD-2⁹.

For example, five core genes (skp, tamA, clcA, cysG, and valS) that carry a unique non-synonymous variant in BD-1.2 and play key roles in toxin transport and acid tolerance shed new light on functions and may help clarify their contribution to the recent prevalence of BD-1.2 over BD-2. In addition, non-synonymous SNPs found uniquely in BD-1.2 were mapped to genes with functions such as colonization, toxin export, virulence, growth, response to pH and temperature, and phage resistance. For example, the mutation G325D in ompU, which confers bacteriophage resistance²⁵, was found in this work to be statistically important for differentiating the two lineages. OmpU, a pore-forming protein of the outer membrane of V. cholerae, has adhesive properties which may play a role in the pathogenesis of cholera⁵⁰ and is critical for Vibrio fitness^51,52, for dissemination⁵¹, and for protection against the bactericidal effect of bile salts⁵³, cationic peptides⁵⁴ and intestinal organic acids⁵⁵. The G325D mutation is located within the L8 loop, which has been reported to be crucial for neutralizing infection and conferring resistance against phages^56,57. Seed et al.⁵⁷ showed that in the presence of the bacteriophage ICP2 (a bacteriophage that preys on Vibrio cholerae and was first isolated from cholera patient stool samples⁵⁸) the OmpU virulent mutant (G325D) had a 10,000-fold enrichment over the wild-type, indicating that strong selective pressure is imposed by phage predation during V. cholerae infection.

Out of the twelve accessory genes found statistically significant to differentiate the two lineages, five (lon_3, endA, adh, hdfR_4 and bcr_2) were present uniquely in BD-1.2 with functions such as antibiotic resistance and biofilm formation. Increasing evidence indicates that V. cholerae has the capability to develop biofilm-like aggregates during infection, potentially serving as a function in pathogenesis and disease transmission. Nonetheless, the composition, control mechanisms governing the formation of these biofilms during infection, and their significance in intestinal colonization and virulence remain yet to be elucidated⁵⁹.

In addition to the coding genome, we found that regulatory networks are associated with lineage differentiation. Among the most relevant intergenic SNPs exhibiting significant allelic distribution between the two lineages is the one mapping to the TFBs of ToxT. This TF plays a crucial role in the development of V. cholerae-related symptoms⁶⁰ and selectively regulates the expression of virulence genes found in toxin-coregulated pilus (TCP) and cholera toxin (CT)^60,61. Environmental conditions within the intestinal tract, such as the presence of bile, bicarbonate, reduced oxygen levels, and unsaturated fatty acids, play a significant role in promoting the simultaneous expression of genes responsible for the production of TCP, CT, and various other genes linked to colonization^12,62. The activation of the ToxT regulon is also influenced by metabolic cues and quorum sensing^12,62. Although transcription factor binding site prediction algorithms tend to over-predict sites, the correlation of experimentally determined SNPs with the predicted sites and their different nucleotide frequency provides reasonable confidence that the observation reflects a real phenomenon. The fact that we found significant intergenic SNPs in the TFBs of 11 TFs and not in promoters suggests that TFBs may play an important role in this scenario. A higher frequency of SNPs close to transcriptional start sites is related to subtle alteration of gene expression, which might result in lineage diversity. In addition to a wider range of genomic determinants found in this study, we also found 23 genes with mapped SNPs (tyrA, gyrA, ctxB, glmM, tamA, valS, czcA, licH, mutL, kbl, cobB, mak, znuC, phhA, nagA_1, argG, cysG_1, murI, appC, putA, suhB, fadJ and recD) in common between our analysis and Monir’s comparison of BD-1 vs BD-2¹¹ and nine genes with SNPs (rstA, ubiA, dsbD, clcA, thiG, rtxA, mltD, fadJ and recD) in common between our analysis and Monir’s comparison of BD-1.1 vs BD-1.2⁹.

Roughly 20% of people who contract toxigenic V. cholerae show cholera symptoms¹². Among symptomatic cases, approximately 5% are mild, 35% are moderate, and about 60% are severe. The severity of the disease depends on pathogenic factors of the bacterium and on the host, including age, nutrition, and immune system¹². Here, we revealed the existence of correlations between a core set of genetic determinants in V. cholerae and clinical symptoms (diarrhoeal duration, number of stools, abdominal pain, vomit, and dehydration). A recent study⁶³ investigated these correlations using machine learning, by analysing gene families in the gut microbiome of household members of cholera patients to predict disease severity. In that study, gene families such as ribosomal proteins, RNA polymerases, and the sugar phosphotransferase system were found to be associated with symptomatic disease. However, the computational pipeline adopted in that work⁶³ did not produce high-performance metrics for predictive models. Our pipeline, in contrast to that of Levade et al.⁶³, achieved superior performance metrics and encompassed accessory genes, core genome SNPs, and intergenic SNPs. It considered variants in both functional protein-coding and regulatory forms, revealing their additive effect on diverse clinical symptoms.

Moreover, mechanistic insights were derived through GSMMs and protein-protein interaction networks. Notably, we identified genes crucial for pH homeostasis, host adaptability, colonization, virulence, motility, acid tolerance, toxin transport, biofilm formation, and bacteriophage resistance. Important pathways were found underlying these roles, such as the fatty acids biosynthesis which is important for V. cholerae since unsaturated fatty acids present in bile inhibit the expression of virulence factors and both cholesterol and unsaturated fatty acids can enhance the motility of V. cholerae⁶⁴; and biofilm production which plays a crucial role in the cholera pathogenesis and dissemination of disease⁶⁵. Furthermore, our ML analysis identified genes associated to abdominal pain that were also found important for colonization in V. cholerae. It is known that colonization of pathogenic bacteria can present clinical symptoms such as abdominal pain⁶⁶.

Three non-synonymous SNPs associated with the clinical symptoms were also found to be statistically significant in differentiating the BD-1.2 and BD-2 lineages. These SNPs mapped to clpS, gshB and fabV. In V. cholerae, ClpS regulation involves the cAMP receptor protein (CRP)³¹. CRP is important in V. cholerae gene regulatory network lifestyle switching, adapting gene expression for quorum sensing, intestinal colonization, and toxin production to its environment³¹. gshB encodes a glutathione synthetase and is a gene associated with resistance to oxidative stress. It is part of the σ32 regulon, contributing to V. cholerae intestinal colonization³³. Glutathione controls the potassium efflux system, Kef, and pH homeostasis involved in Na+ and K+ transport⁶⁷. Impaired glutathione production may affect the stress response⁶⁷. GshB was additionally shown to have a role in the ability of V. cholerae to mount an acid tolerance response³⁴. V. cholerae fabV is one of several triclosan-resistant ENR-encoding genes³². Resistance to triclosan also affects resistance to other antibiotics, showing cross-resistance to a wide range of antibiotics (including chloramphenicol and tetracycline)⁶⁸. Moreover, fabV exhibits pleiotropic effects controlling pathogenicity in P. aeruginosa via modulation of fatty acid synthesis, production of virulence factors and motility⁶⁹.

Analysing the 3D structure based on non-synonymous mutations can provide insights into the mechanisms by which these mutations can cause disease^70–73. Changes in the stability of proteins can lead to manifestation of diseases⁷² or symptom variations^70,73. Among all types of mutations, non-synonymous SNPs have the greatest impact on protein structure and function⁷⁴. In this work we found that BD-1.2 isolates accumulated more core coding SNPs compared to BD-2 isolates that instead tended to accumulate more intergenic SNPs, suggesting different evolutionary dynamics possibly explaining the temporal shift of the two lineages. Our analysis of top-ranked non-synonymous SNPs in protein-coding regions, identified by machine learning as linked to both BD-1.2 lineage evolution and clinical symptoms, specifically FabV and GshB, unveiled that SNPs present in BD-1.2, associated with more severe cholera, led to increased protein stability. That protein stability might be relevant for disease severity is also supported by the fact that no SNPs associated to clinical symptoms were found in any TFBs or promoter signature but only in protein-coding sequences.

We are aware of the limitations of our current study. Several host factors (retinol deficiency, blood group, genetic factors, innate immune system) confer susceptibility to cholera, with a higher risk of symptomatic disease⁷⁵. These factors have not been considered in this study due to lack of data. This study should be considered a proof of principle to be further investigated and validated with larger sample sizes and different geographical areas. With the advent of modern technologies, by strengthening bespoke analytical methods and by performing wider comparisons (asymptomatic vs. symptomatic, patients vs. households, environmental vs. stool vibrio), we can potentially disentangle the intricate network of correlations between the genetic underpinnings of cholera symptoms and epidemiological transmission risk, uncovering the interconnectivity of regulatory, metabolic and signalling networks that might help to inform future interventions.

Ethics Statement

For the icddr,b isolates, the study protocol was approved by the Institutional Review Board of icddr,b (PR-15127). Informed written consent was taken from adult patients, or guardians on behalf of children. For the IEDCR isolates, the study was performed in accordance with protocols approved by the Institutional review board of IEDCR (IEDCR/IRB/09).

Experimental Design

For the study we used 129 V. cholerae bacterial isolates obtained from distinct stool samples of patients between 2014 and 2021 from the ongoing Nationwide Cholera Surveillance⁷⁶, jointly conducted by IEDCR and icddr,b. The isolates were collected from admitted patients from six divisions of Bangladesh (Barisal n=11, Chittagong n=6, Dhaka n=99, Khulna n=2, Rajshahi n=4, and Sylhet n=7). The isolates included in the study were gathered from patients meeting the case definition of diarrhoea and consenting to be included in the surveillance study. Stool samples were processed by either the IEDCR or the icddr,b research institute. For the identification of V. cholerae, specimens were streaked onto taurocholate-tellurite gelatin agar (TTGA) and incubated overnight at 37°C. Specimens were also inoculated in alkaline peptone water for enrichment, incubated for an additional 18–24 hours⁷⁷ and plated on TTGA. Suspected colonies were serotyped with monoclonal antibodies specific to V. cholerae O1 (Ogawa and Inaba) and O139 serogroups⁷⁸ for the icddr,b isolates, while for the IEDCR isolates serotyping and biotyping were carried out by slide agglutination and PCR using the primers in Table S13. Confirmed isolates were tested for antimicrobial susceptibility using disk diffusion methods in accordance with CLSI protocols⁷⁹ to the following antibiotics: ampicillin, azithromycin, ciprofloxacin, ceftriaxone, cefixime, doxycycline, erythromycin and meropenem, using commercially available antibiotic discs (Oxoid, Basingstoke, United Kingdom). Escherichia coli American Type Culture Collection 25922, susceptible to all antimicrobials, was used as a control strain for susceptibility studies.

Clinical metadata were collected from the patients corresponding to 104 of the 129 isolates in our cohort. Clinical data covered five categories (duration of diarrhoea, number of stools, abdominal pain, vomiting, and dehydration); in addition, the age, sex and location of the patient were recorded. Clinical symptom data were binned into categories for data analysis (Table S7).

• Duration of diarrhoea: number of days the diarrhoea persisted was recorded. Data were binned as a duration score ranging from 1-3, with 1 = <1 day; 2 = 1-3 days; 3 = 4-6 days.

• Number of stools in 24 hours: the number of stools in a 24-hour period during the hospital admission was recorded. Data were binned as a number of stools score ranging from 1-5, with 1 = 3-5 times; 2 = 6-10 times; 3 = 11-15 times; 4 = 16-20 times; 5 = 21+ times.

• Abdominal pain: the presence or absence of abdominal pain was recorded as a 0 for absence and 1 for present.

• Vomit: the presence or absence of any vomiting in the 24 hours prior to admission was recorded, with 0 denoting no vomiting and 1 denoting the occurrence of vomiting.

• Dehydration: clinical assessment of dehydration was recorded as none, moderate or severe by the clinician.
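The binning rules listed above can be expressed as a short sketch. It assumes durations are recorded as whole days (0 meaning less than one day) and returns `None` for values outside the listed bins, since the text does not say how such values were handled; the function names are illustrative.

```python
from typing import Optional


def duration_score(days: int) -> Optional[int]:
    """Duration of diarrhoea score: 1 = <1 day, 2 = 1-3 days, 3 = 4-6 days."""
    if days < 0:
        return None
    if days < 1:
        return 1
    if days <= 3:
        return 2
    if days <= 6:
        return 3
    return None  # longer durations are not covered by the listed bins


def stool_score(stools_per_24h: int) -> Optional[int]:
    """Number of stools score: 1 = 3-5, 2 = 6-10, 3 = 11-15, 4 = 16-20, 5 = 21+."""
    if stools_per_24h < 3:
        return None  # below the lowest listed bin
    if stools_per_24h <= 5:
        return 1
    if stools_per_24h <= 10:
        return 2
    if stools_per_24h <= 15:
        return 3
    if stools_per_24h <= 20:
        return 4
    return 5


print(duration_score(2), stool_score(18))
```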

DNA purification and extraction

DNA extraction was performed at North South University. All the V. cholerae isolates were subjected to genomic DNA extraction in accordance with the manufacturer's protocol for the QIAamp DNA Mini Kit (Qiagen).

Library construction and whole-genome sequencing

The library preparation and sequencing of the 129 selected strains were carried out at NGRI (NSU Genomics Research Institute, North South University). To prepare the Illumina libraries, approximately 1 μg of high molecular weight V. cholerae genomic DNA was used. Barcoded libraries were prepared using the Illumina DNA Prep Kit (product code 20060059, NEB, USA) following the manufacturer's protocol. Nextera DNA CD index codes were added to attribute sequences to each sample. Paired-end sequencing with 2 × 151 cycles was then performed on the Illumina MiSeq platform at NGRI.

Genome assembly and annotation

All sequences were pre-processed using the Illumina BaseSpace sequencing hub. To clean the data, adapters were trimmed and unidentified bases were removed. Genomes were assembled using SPAdes (v3.12)⁸⁰ with default parameters and a coverage cutoff value of 20. Genomic contamination was assessed using ContEst16S⁸¹, with only genomes identified as V. cholerae retained for further analysis. Contigs shorter than 500 nucleotides were filtered out of the final assemblies. Genomes were annotated with Prokka (v1.14.6)⁸², using default settings with --addgenes and --usegenus.
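The contig length filter described above can be sketched in a few lines. This is a minimal illustration, assuming the assembly is available as FASTA text; the parser, function names and example contigs are hypothetical and do not reproduce the tooling used in the study.

```python
from typing import Dict

MIN_CONTIG_LEN = 500  # contigs shorter than this are discarded, as stated in the text


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a mapping of contig name to sequence."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the identifier, drop the description
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def filter_contigs(records: Dict[str, str], min_len: int = MIN_CONTIG_LEN) -> Dict[str, str]:
    """Keep only contigs with at least min_len nucleotides."""
    return {name: seq for name, seq in records.items() if len(seq) >= min_len}


# Hypothetical assembly with one long and one short contig
example = ">contig_1 len=600\n" + "A" * 600 + "\n>contig_2\n" + "ACGT" * 10 + "\n"
kept = filter_contigs(read_fasta(example))
print(list(kept))
```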

Screening of annotated genes against ABR databases, virulence and plasmid databases and in silico subtyping.

The whole-genome sequences were screened against the CARD⁸³ database (accessed 05-06-2022) using Abricate⁸⁴ with a minimum coverage of 70% and a minimum identity of 90% to identify known AMR-associated genes in the isolate cohort. Genomes were also screened against the VFDB⁸⁵ database (accessed 05-06-2022) using Abricate⁸⁴ to find virulence-associated genes, with 70% coverage and 90% identity. Plasmid screening was conducted using the PlasmidFinder⁸⁶ database (accessed 05-06-2022) in Abricate⁸⁴, with 70% coverage and 90% identity; no plasmids were identified in the genome sequences. Sequence types were identified through MLST⁸⁷, which mapped the sequences to the PubMLST⁸⁸ database.

Pan genome analysis and Generation of genetic features input files

All annotated genomes were used as input for pan-genome analysis using Roary v3.13⁸⁹. The core genome alignment was taken as input to produce a file of core gene SNPs present in the cohort using SNP-sites 2.5.1⁹⁰. SNPs within intergenic regions (IGRs) were extracted using Piggy v1.5⁹¹ to generate an alignment of core intergenic clusters. Variants in this alignment were then called using SNP-sites 2.5.1. The presence or absence of accessory genes was obtained from the output of Roary.

In addition, a further pan genome alignment was created consisting of the 129 isolates in our cohort together with 218 isolates collected in Bangladesh from 2004 to 2022 (The European Nucleotide Archive-ENA (http://www.ebi.ac.uk/ena), accession codes: PRJDB8664, PRJDB12727, PRJDB13928, PRJNA723557).

Phylogenetic analysis of V. cholerae isolates in our cohort in Bangladesh

For both our cohort alone and our cohort together with publicly available Bangladeshi isolates (as detailed above) maximum likelihood phylogenies were reconstructed. Using the core genome alignments generated in Roary v3.13⁶³, the phylogenies were reconstructed in IQ Tree (v2.2.0.3)⁹² with 10000 ultrafast bootstrap replicates and best fitted evolutionary model (HKY+F+I for our cohort only and K3Pu+F+I for the combined Bangladesh alignment) was selected using ModelFinder⁷³. The alignment length of the core genome of our cohort was 3459819 nucleotide sites of which 1486 were informative. For the core genome of the combined Bangladeshi isolates, the alignment length was 2086397 nucleotide sites with 844 informative sites. The resulting consensus trees were visualised using iTol v6⁹³, and branches with less than 95% ultrafast bootstrap support were deleted.

Phylogenetic relations between V. cholerae isolates worldwide

We used WGS data from 1140 V. cholerae isolates collected from India, Africa, Haiti, Yemen together with our Bangladesh samples (see Tables S2 and S3). To generate the input for a phylogenetic tree, SNP variants were called from each isolate against the reference genome VC N16961 (NC_002505.1; NC_002506.1) using Snippy v4.6.0 (https://github.com/tseemann/snippy). The cleaned alignment files from Snippy were concatenated via the SeqIO function of biopython⁹⁴ v1.83 then recombination was masked using Gubbins (v.2.3.4)⁹⁵. The filtered polymorphic sites output from Gubbins was further filtered using SNP-sites⁹⁶. The final SNP input contained 4033464 nucleotide sites with 26995 informative sites. This recombination-free SNP output was then used as input to reconstruct the phylogeny using IQtree (v2.2.0.3)⁹² with 1000 ultrafast bootstrap replicates and best fitted model (K3Pu+F+I+G4) was selected by modelfinder⁹⁷. The sequence ERR025382 (Indonesia-1957) was used as an outgroup, and the tree was rooted here. The resulting consensus tree was visualised using iTol v6⁹³, and branches with less than 95% ultrafast bootstrap support were deleted.

Transcriptional binding motifs

Motif searches were conducted using FIMO (Find Individual Motif Occurrences)⁹⁸ within the MEME (Multiple Em for Motif Elicitation)⁹⁹ suite (https://meme-suite.org/meme/tools/fimo). Reference sequences of intergenic regions of DNA from our isolates were generated in Piggy as described above; these were used as input for FIMO. To predict the TFBs the following databases were used: CollecTF (Bacterial TF Motifs); Prokaryotes (Prodoric Release 8.9); Prokaryotes (RegTransBase v4); Combined Prokaryotes. Intergenic regions where motifs were found were variant called using SNP-sites⁹⁶ and then aligned to the motif sequences using Clustal Omega v1.2.4¹⁰⁰. For visualisation of intergenic regions, alignment maps of the intergenic regions were created using Jalview 2.11.3.2 with easyfig python genome figure package¹⁰¹.

Promoter analysis for Intergenic SNPs

BPROM/softberry¹⁰² was used to predict promoter region and oligonucleotides from known TF binding sites close to the promoter region.

Genome-scale metabolic model

All simulations were performed using the Python cobra toolkit v0.26.2. The model, iAM-Vc960 of V. cholerae O1 N16961, was taken from Abdel-Haleem et al.²⁰. Because the model does not include reaction subsystems, these were manually added to the iAM-Vc960 model using the models iJO1366¹⁰³ and iML1515¹⁰⁴ of E. coli K-12 MG1655; STM_v1_0¹⁰⁵ of S. enterica serovar Typhimurium LT2; iYL1228¹⁰⁶ of Klebsiella pneumoniae MGH 78578; and iJN1463 of Pseudomonas putida KT2440, all downloaded from the BiGG database¹⁰⁷ via the cameo Python toolbox v0.13.6¹⁰⁸. In total, 73 subsystems were included in the model. Flux variability analysis (FVA) was applied to the wild-type model and each knockout model using the cobra toolbox in Python¹⁰⁹. FVA calculates the minimum and maximum flux through each reaction in the model, given a set of constraints, resulting in the range of possible fluxes for each reaction (flux span). FVA was simulated using glucose as the only carbon source in aerobic minimal M9 medium conditions. Note that reaction loops in the solution were not allowed. NetworkX’s greedy modularity algorithm¹¹⁰ was applied to assign genes and reactions to a cluster in order to identify groups of genes that have a similar impact on the metabolic fluxes. We identified metabolic pathways that were enriched in each cluster using hypergeometric enrichment tests with the SciPy function hypergeom. We considered a pathway significantly enriched in a cluster if the false discovery rate (FDR) was <1%, using the Benjamini-Hochberg method to correct for multiple testing. We considered two sets of pathway lists for the enrichment. The first used the 40 subsystems as defined in the GSM model, described above.
A second list of pathways was downloaded from the BioCyc database using the SMART tables for Vibrio cholerae O1 biovar El Tor strain N16961, which provided a more extensive list of specific metabolic pathways. To create metabolic system diagrams the KEGG (Kyoto Encyclopaedia of Genes and Genomes) was used to create the pathway for each species.
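The enrichment step described above (hypergeometric test followed by Benjamini-Hochberg correction at FDR < 1%) can be sketched with the standard library alone. This is an illustrative re-implementation, not the SciPy-based code used in the study; the gene and pathway names and the background size are hypothetical.

```python
from math import comb
from typing import Dict, List, Set


def hypergeom_sf(k: int, M: int, n: int, N: int) -> float:
    """P(X >= k) for a population of M genes, n in the pathway, N in the cluster."""
    upper = min(n, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(max(k, 0), upper + 1))
    return hits / comb(M, N)


def benjamini_hochberg(pvals: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enriched_pathways(
    cluster: Set[str],
    pathways: Dict[str, Set[str]],
    background_size: int,
    fdr: float = 0.01,
) -> Dict[str, float]:
    """Return pathways whose adjusted p-value is below the FDR threshold."""
    names = sorted(pathways)
    pvals = [
        hypergeom_sf(len(cluster & pathways[p]), background_size, len(pathways[p]), len(cluster))
        for p in names
    ]
    qvals = benjamini_hochberg(pvals)
    return {p: q for p, q in zip(names, qvals) if q < fdr}


# Hypothetical cluster of 10 genes tested against two pathways
cluster = {f"g{i}" for i in range(10)}
pathways = {"P1": {f"g{i}" for i in range(10)}, "P2": {f"x{i}" for i in range(10)}}
print(sorted(enriched_pathways(cluster, pathways, background_size=1000)))
```

The one-sided test asks whether the overlap between cluster and pathway is larger than expected by chance, which is why the survival function P(X >= k) is used rather than a two-sided probability.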

Network analysis based on core genome SNPs

A network of our cohort of 129 V. cholerae isolates was created using a pairwise Hamming distance comparison based on core genome SNPs in Python (NetworkX¹¹⁰ v2.8.4 and Matplotlib¹¹¹ v3.6.2). Each node represents an isolate, while each edge represents the Hamming distance between two isolates multiplied by the total number of SNPs found in our cohort (2,382 SNPs). A threshold of 15 or fewer SNP differences was used to filter the edges in the network, as suggested by Ludden et al. (2019)¹¹² and used by us previously^18,19.
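The network construction described above can be sketched without external libraries. The sketch counts differing SNP sites directly (equivalent to a normalised Hamming distance multiplied by the number of SNP sites) and keeps edges at or below the threshold; the isolate names and SNP profiles are hypothetical, and the connected-component step is an added illustration rather than a stated part of the method.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

SNP_THRESHOLD = 15  # maximum number of SNP differences for an edge, from the text

Edge = Tuple[str, str, int]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length SNP profiles differ."""
    if len(a) != len(b):
        raise ValueError("SNP profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


def build_edges(profiles: Dict[str, str], threshold: int = SNP_THRESHOLD) -> List[Edge]:
    """Return (isolate, isolate, distance) for pairs within the SNP threshold."""
    edges = []
    for u, v in combinations(sorted(profiles), 2):
        d = hamming(profiles[u], profiles[v])
        if d <= threshold:
            edges.append((u, v, d))
    return edges


def components(nodes: Iterable[str], edges: List[Edge]) -> List[List[str]]:
    """Group isolates into connected components of the filtered network."""
    nodes = sorted(nodes)
    adj = {n: set() for n in nodes}
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, result = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            cur = stack.pop()
            if cur not in comp:
                comp.add(cur)
                stack.extend(adj[cur] - comp)
        seen |= comp
        result.append(sorted(comp))
    return result


# Hypothetical four-site profiles; a threshold of 1 is used only for this toy example
profiles = {"iso1": "AAAA", "iso2": "AAAT", "iso3": "TTTT"}
edges = build_edges(profiles, threshold=1)
print(edges, components(profiles, edges))
```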

Statistical analysis and machine learning of genomic features correlated to a specific lineage or clinical symptoms

To assess whether the genomic features were associated with a lineage or with a clinical symptom, we employed a Fisher exact test^9,11. Furthermore, to analyse the relationship between genomic features of the BD-1.2 lineage and clinical symptoms, a machine learning pipeline was employed. Clinical data were collected for 104 of the 129 V. cholerae isolates, of which 63 belonged to the BD-1.2 lineage. The clinical symptoms were divided into two groups: binary (vomit and abdominal pain) and multi-class (dehydration, number of stools and duration of diarrhoea), with the binning within each group described above. In the multi-class group, we applied a one-vs-one approach, i.e., each class is compared individually to another class. For example, the dehydration class “moderate” is compared against the class “severe”. For both binary and multi-class groupings, as the classes were unbalanced, we oversampled the minority class as a pre-processing step using the Synthetic Minority Over-sampling Technique (SMOTE)¹¹³. The Python package Scikit-learn version 1.2.1¹¹⁴ was used for classification and the package SciPy version 1.9.3¹¹⁵ was used to select the most important features based on a Fisher exact test. The pipeline first oversamples the minority class using SMOTE. Then, based on the oversampled data, it selects the most important features using a Fisher exact test (p-value < 0.1). Next, a panel of machine learning methods (logistic regression (LR), linear support vector machine (L-SVM), radial basis function support vector machine (RBF-SVM), extra tree classifier, random forest, AdaBoost and XGBoost) was used to predict the clinical symptom classes based on the pre-selected features described above.
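The feature pre-selection and the one-vs-one class pairing described above can be sketched as follows. This is a standard-library illustration of a two-sided Fisher exact test on presence/absence features with the stated p < 0.1 cutoff; it is not the SciPy-based code used in the study, the SMOTE oversampling step is omitted, and the feature names and data are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import Dict, List


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # sum the probabilities of all tables at least as extreme as the observed one
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def select_features(features: Dict[str, List[int]], labels: List[int], alpha: float = 0.1) -> List[str]:
    """Keep binary features associated with a binary label at p < alpha."""
    selected = []
    for name, values in features.items():
        a = sum(1 for v, y in zip(values, labels) if v == 1 and y == 1)
        b = sum(1 for v, y in zip(values, labels) if v == 1 and y == 0)
        c = sum(1 for v, y in zip(values, labels) if v == 0 and y == 1)
        d = sum(1 for v, y in zip(values, labels) if v == 0 and y == 0)
        if fisher_exact_two_sided(a, b, c, d) < alpha:
            selected.append(name)
    return selected


# One-vs-one pairs for a multi-class symptom such as dehydration
pairs = list(combinations(["none", "moderate", "severe"], 2))

# Hypothetical presence/absence features for eight isolates
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {"geneX": [1, 1, 1, 1, 0, 0, 0, 0], "geneY": [1, 0, 1, 0, 1, 0, 1, 0]}
print(pairs, select_features(features, labels))
```

Each one-vs-one pair defines its own binary problem, so feature selection and classification would be repeated per pair.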
As per previous works^18,19,25,26: (i) nested cross-validation^116,117 was employed to assess the performance and select the hyper-parameters of the proposed classifiers and to compare the results obtained by the seven different classifiers used; (ii) a Friedman statistical F-test (F_F) with Iman-Davenport correction was used for statistical comparison of multiple classifiers across multiple analyses¹¹⁸; (iii) a post-hoc Nemenyi test was employed to determine whether a single classifier or a group of classifiers performs statistically better in terms of average AUC rank, once the F_F test has rejected the null hypothesis that the individual classifiers perform similarly over the different datasets¹¹⁸; (iv) an undirected graph was created using NetworkX¹¹⁰ to visualise how the features (accessory genes, core genome SNPs and intergenic SNPs) correlate between the different clinical symptom models.
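The post-hoc Nemenyi comparison in step (iii) can be sketched as follows, using the critical-difference formulation CD = q_α · sqrt(k(k+1)/(6N)) for k classifiers and N datasets. This is an illustrative sketch, not the study's code: the value q_α = 2.949 (k = 7, α = 0.05) is assumed from the standard Nemenyi critical-value table, the classifier labels are placeholders, and the mean ranks are made up.

```python
from itertools import combinations
from math import sqrt

# Assumption: Nemenyi critical value for 7 classifiers at alpha = 0.05,
# taken from the standard published table rather than from the study.
Q_ALPHA_005_K7 = 2.949


def nemenyi_critical_difference(k, n, q_alpha):
    """Critical difference in mean rank for k classifiers over n datasets."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))


def significantly_different_pairs(mean_ranks, cd):
    """Return classifier pairs whose mean ranks differ by at least cd."""
    ordered = sorted(mean_ranks.items())
    return [
        (name_a, name_b)
        for (name_a, rank_a), (name_b, rank_b) in combinations(ordered, 2)
        if abs(rank_a - rank_b) >= cd
    ]


# Made-up mean AUC ranks for 7 placeholder classifiers (they sum to 28).
toy_mean_ranks = {
    "clf_1": 1.5, "clf_2": 2.5, "clf_3": 3.0, "clf_4": 4.0,
    "clf_5": 4.5, "clf_6": 6.0, "clf_7": 6.5,
}
cd = nemenyi_critical_difference(k=7, n=6, q_alpha=Q_ALPHA_005_K7)
pairs = significantly_different_pairs(toy_mean_ranks, cd)
print(round(cd, 3), pairs)
```

With 7 classifiers and 6 datasets the critical difference is about 3.68 rank positions, so only classifiers at opposite ends of the ranking are separated in this toy example.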

Protein-protein interaction network and building protein 3D structures

Protein-protein interaction (PPI) networks of the proteins encoded by the genes associated with clinical symptoms were obtained using the STRING database v12.0 (using reference genome V. cholerae O1 biovar El Tor str. N16961) and analysed in Cytoscape 3.10.1¹¹⁹. Eighty-one accessory and core genes selected by machine learning were used as input for the PPI network; of these, only 60 could be mapped to the STRING database. The interactome was constructed using first and second neighbour proteins. Disconnected nodes and interactions with scores below the medium confidence level (interaction scores < 0.400) were filtered out. Functions of the proteins in the network were annotated with Gene Ontology terms (biological process, molecular function and cellular component) and KEGG pathways in StringDB¹²⁰. Three-dimensional AlphaFold predicted models were obtained by aligning the protein FASTA sequence to reference sequences from the UniProt database¹²¹ to find a 3D protein structure. 3D protein structures were then visualised using UCSF Chimera¹²² and UCSF ChimeraX¹²³. Protein stability analysis and the effect of each mutation were assessed with DUET¹²⁴, DynaMut¹²⁵ and SIFT¹²⁶. The electrostatic potential was analysed and visualised using PDB2PQR and APBS accessed online¹²⁷, UCSF ChimeraX¹²³ and APBS Coloring¹²⁷.
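The confidence filter applied to the interactome can be sketched in a few lines. This is a hedged illustration rather than the Cytoscape workflow itself: the protein names and scores are invented, scores are assumed to be on a 0 to 1 scale, and `filter_interactions` is a hypothetical helper.

```python
MEDIUM_CONFIDENCE = 0.400  # minimum interaction score retained (from the text)


def filter_interactions(interactions, min_score=MEDIUM_CONFIDENCE):
    """Drop low-confidence interactions and report the proteins still connected."""
    kept = [(a, b, score) for a, b, score in interactions if score >= min_score]
    connected = {protein for a, b, _ in kept for protein in (a, b)}
    return kept, connected


# Made-up interactions between placeholder proteins (not STRING output).
toy_interactions = [
    ("protA", "protB", 0.92),
    ("protB", "protC", 0.41),
    ("protC", "protD", 0.15),  # below medium confidence, so it is removed
]
kept, connected = filter_interactions(toy_interactions)
print(kept)
print(sorted(connected))
```

Here protD loses its only interaction and therefore drops out of the network as a disconnected node.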

Statistical Analysis

Statistical comparisons were made using the SciPy package, implementing: (i) a two-sided chi-squared test with Bonferroni correction to evaluate the association between serotype and the collection year and location of the isolates (p-value < 0.005); (ii) a two-sided Mann-Whitney U test to evaluate the distribution of the counts of accessory genes, coding and non-coding SNPs in the BD-1.2 and BD-2 lineages and across the different collection years (p-value < 0.005); (iii) a two-sided Fisher exact test with Bonferroni correction to assess the relationship between the BD-2 and BD-1.2 lineages and different genomic features (core and intergenic SNPs and accessory genes) (p-value < 0.005); (iv) a two-sided hypergeometric enrichment test with false discovery rate (FDR) correction for the genome-scale metabolic model (GSMM) analysis (p-value < 0.01); and (v) a two-sided Friedman statistical F-test (F_F) with Iman-Davenport correction for statistical comparison of multiple datasets over the seven different classifiers used (p-value < 0.05). With 7 classifiers and 6 clinical symptom models, the F_F statistic is distributed according to the F distribution with 7−1 = 6 and (7−1)×(6−1) = 30 degrees of freedom. The critical value of F(6,30) at p-value = 0.05 is therefore 2.42052319. The post-hoc Nemenyi test was used to determine whether a single classifier or a group of classifiers performs statistically better in terms of average rank, once the F_F test has rejected the null hypothesis that the individual classifiers perform similarly over the different datasets.
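The F_F statistic and its degrees of freedom can be made concrete with a short sketch based on the standard formulation: the Friedman statistic is χ²_F = 12N/(k(k+1)) · [Σ R_j² − k(k+1)²/4], and the Iman-Davenport correction gives F_F = (N−1)χ²_F / (N(k−1) − χ²_F), with k−1 and (k−1)(N−1) degrees of freedom. The function names and the toy AUC table are hypothetical, and the code is an illustration rather than the study's implementation.

```python
def average_ranks_desc(scores):
    """Rank scores so the highest gets rank 1; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for index in order[i:j + 1]:
            ranks[index] = mean_rank
        i = j + 1
    return ranks


def friedman_iman_davenport(auc_table):
    """auc_table[i][j] is the AUC of classifier j on dataset i."""
    n = len(auc_table)      # number of datasets (N)
    k = len(auc_table[0])   # number of classifiers (k)
    rank_rows = [average_ranks_desc(row) for row in auc_table]
    mean_ranks = [sum(row[j] for row in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    denominator = n * (k - 1) - chi2
    # Identical rankings on every dataset drive the denominator to zero.
    f_f = float("inf") if denominator <= 0 else (n - 1) * chi2 / denominator
    return f_f, (k - 1, (k - 1) * (n - 1)), mean_ranks


# Made-up AUCs: 3 datasets (rows) by 3 classifiers (columns).
toy_auc = [
    [0.90, 0.85, 0.70],
    [0.88, 0.80, 0.75],
    [0.81, 0.86, 0.60],
]
f_f, dof, mean_ranks = friedman_iman_davenport(toy_auc)
print(f_f, dof, mean_ranks)
```

For a 6 × 7 table (6 clinical symptom models, 7 classifiers) the same function returns degrees of freedom (6, 30), and the resulting F_F would be compared against the critical value 2.42052319 quoted above.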

Data Availability

Short-read sequence data for all 129 isolates used in this study are deposited in the NCBI SRA and can be found associated with BioProject number PRJNA1021874 publicly available on: https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA1021874

Code Availability

The code used in this study is available in the following GitHub repository: https://github.com/tan0101/VibrioCARE.

Baddam, R. et al. Genome dynamics of Vibrio cholerae isolates linked to seasonal outbreaks of cholera in Dhaka, Bangladesh. mBio 11, (2020).
Ali, M., Nelson, A. R., Lopez, A. L. & Sack, D. A. Updated global burden of cholera in endemic countries. PLoS Negl Trop Dis 9, (2015).
Banerjee, R., Das, B., Balakrish Nair, G. & Basak, S. Dynamics in genome evolution of Vibrio cholerae. Infection, Genetics and Evolution 23, 32–41 (2014) doi:10.1016/j.meegid.2014.01.006.
Kaper, J. B., Morris, J. G. & Levine, M. M. Cholera. Clin Microbiol Rev 8, 48–86 (1995).
Karaolis, D. K. R. et al. A Vibrio cholerae pathogenicity island associated with epidemic and pandemic strains. Proceedings of the National Academy of Sciences 95, 3134–3139 (1998).
Son, M. S., Megli, C. J., Kovacikova, G., Qadri, F. & Taylor, R. K. Characterization of Vibrio cholerae O1 El Tor Biotype Variant Clinical Isolates from Bangladesh and Haiti, Including a Molecular Genetic Analysis of Virulence Genes. J Clin Microbiol 49, 3739 (2011).
Wozniak, R. A. F. et al. Comparative ICE genomics: Insights into the evolution of the SXT/R391 family of ICEs. PLoS Genet 5, (2009).
Faruque, S. M. & Mekalanos, J. J. Pathogenicity islands and phages in Vibrio cholerae evolution. Trends Microbiol 11, 505–510 (2003).
Monir, M. M. et al. Genomic attributes of Vibrio cholerae O1 responsible for 2022 massive cholera outbreak in Bangladesh. Nat Commun 14, 1154 (2023).
Morita, D. et al. Whole-Genome Analysis of Clinical Vibrio cholerae O1 in Kolkata, India, and Dhaka, Bangladesh, Reveals Two Lineages of Circulating Strains, Indicating Variation in Genomic Attributes. (2020) doi:10.1128/mBio.
Monir, M. M. et al. Genomic Characteristics of Recently Recognized Vibrio cholerae El Tor Lineages Associated with Cholera in Bangladesh, 1991 to 2017. Microbiol Spectr 10, (2022).
Baker-Austin, C. et al. Vibrio spp. infections. Nature Reviews Disease Primers 4, 1–19 (2018).
Rashid, M. U. et al. CtxB1 outcompetes CtxB7 in Vibrio cholerae O1, Bangladesh. J Med Microbiol 65, 101–103 (2016).
Jubyda, F. T. et al. Vibrio cholerae O1 associated with recent endemic cholera shows temporal changes in serotype, genotype, and drug-resistance patterns in Bangladesh. Gut Pathog 15, 17 (2023).
Weill, F. X. et al. Genomic history of the seventh pandemic of cholera in Africa. Science 358, 785–789 (2017).
Domman, D. et al. Integrated view of Vibrio cholerae in the Americas. Science 358, 789–793 (2017).
Peng, Z. et al. Whole-genome sequencing and gene sharing network analysis powered by machine learning identifies antibiotic resistance sharing between animals, humans and environment in livestock farming. PLoS Comput Biol 18, (2022).
Baker, M. et al. Convergence of resistance and evolutionary responses in Escherichia coli and Salmonella enterica co-inhabiting chicken farms in China. Nat Commun 15, 206 (2024).
Peng, Z. et al. Whole-genome sequencing and gene sharing network analysis powered by machine learning identifies antibiotic resistance sharing between animals, humans and environment in livestock farming. PLoS Comput Biol 18, (2022).
Abdel-Haleem, A. M. et al. Integrated Metabolic Modeling, Culturing, and Transcriptomics Explain Enhanced Virulence of Vibrio cholerae during Coinfection with Enterotoxigenic Escherichia coli. mSystems 5, (2020).
Karp, P. D. et al. The EcoCyc Database. EcoSal Plus 8, (2018).
Zhang, H., Luo, Q., Gao, H. & Feng, Y. A new regulatory mechanism for bacterial lipoic acid synthesis. Microbiologyopen 4, 282 (2015).
Ramamurthy, T. et al. Virulence Regulation and Innate Host Response in the Pathogenicity of Vibrio cholerae. Front Cell Infect Microbiol 10, 520 (2020).
Jugder, B. E. et al. Vibrio cholerae high cell density quorum sensing activates the host intestinal innate immune response. Cell Rep 40, 111368 (2022).
Baker, M. et al. Machine learning and metagenomics reveal shared antimicrobial resistance profiles across multiple chicken farms and abattoirs in China. Nat Food 4, 707–720 (2023).
Maciel-Guerra, A. et al. Dissecting microbial communities and resistomes for interconnected humans, soil, and livestock. ISME Journal 1–15 (2022) doi:10.1038/s41396-022-01315-7.
Bishop, R. E. The bacterial lipocalins. Biochimica et Biophysica Acta (BBA) - Protein Structure and Molecular Enzymology 1482, 73–83 (2000).
Wang, J. et al. Gluconeogenic growth of Vibrio cholerae is important for competing with host gut microbiota. J Med Microbiol 67, 1628 (2018).
Ball, A. S., Chaparian, R. R. & van Kessel, J. C. Quorum sensing gene regulation by LuxR/HapR master regulators in vibrios. J Bacteriol 199, (2017).
Merrell, D. S., Tischler, A. D., Lee, S. H. & Camilli, A. Vibrio cholerae requires rpoS for efficient intestinal colonization. Infect Immun 68, 6691–6696 (2000).
Manneh-Roussel, J. et al. cAMP Receptor Protein Controls Vibrio cholerae Gene Expression in Response to Host Colonization. mBio 9, (2018).
Massengo-Tiassé, R. P. & Cronan, J. E. Vibrio cholerae FabV defines a new class of enoyl-acyl carrier protein reductase. Journal of Biological Chemistry 283, 1308–1316 (2008).
Slamti, L., Livny, J. & Waldor, M. K. Global gene expression and phenotypic analysis of a Vibrio cholerae rpoH deletion mutant. J Bacteriol 189, 351–362 (2007).
Merrell, D. S. et al. Host-induced epidemic spread of the cholera bacterium. Nature 417, 642–645 (2002).
Pearcy, N. et al. Genome-Scale Metabolic Models and Machine Learning Reveal Genetic Determinants of Antibiotic Resistance in Escherichia coli and Unravel the Underlying Metabolic Adaptation Mechanisms. mSystems 6, e00913-20 (2021).
Pukatzki, S. & Provenzano, D. Vibrio cholerae as a predator: Lessons from evolutionary principles. Front Microbiol 4, 70337 (2013).
Velasco, A. M., Leguina, J. I. & Lazcano, A. Molecular evolution of the lysine biosynthetic pathways. J Mol Evol 55, 445–459 (2002).
Alvarez, L., Hernandez, S. B. & Cava, F. Cell Wall Biology of Vibrio cholerae. Annu Rev Microbiol 75, 151–174 (2021) doi:10.1146/annurev-micro-040621-122027.
Juan, C., Torrens, G., Barceló, I. M. & Oliver, A. Interplay between Peptidoglycan Biology and Virulence in Gram-Negative Pathogens. Microbiology and Molecular Biology Reviews 82, (2018).
Huber, M., Fröhlich, K. S., Radmer, J. & Papenfort, K. Switching fatty acid metabolism by an RNA-controlled feed forward loop. Proc Natl Acad Sci U S A 117, 8044–8054 (2020).
Bekaert, M., Goffin, N., McMillan, S. & Desbois, A. P. Essential Genes of Vibrio anguillarum and Other Vibrio spp. Guide the Development of New Drugs and Vaccines. Front Microbiol 12, 755801 (2021).
Jugder, B. E. et al. Vibrio cholerae high cell density quorum sensing activates the host intestinal innate immune response. Cell Rep 40, 111368 (2022).
Brodie, A., Azaria, J. R. & Ofran, Y. How far from the SNP may the causative genes be? Nucleic Acids Res 44, 6046 (2016).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, 308–311 (2001).
Kastritis, P. L. & Bonvin, A. M. J. J. On the binding affinity of macromolecular interactions: daring to ask why proteins interact. J R Soc Interface 10, (2013).
Du, X. et al. Insights into Protein–Ligand Interactions: Mechanisms, Models, and Methods. Int J Mol Sci 17, 144 (2016).
Cotten, M. & Phan, M. V. T. Evolution of increased positive charge on the SARS-CoV-2 spike protein may be adaptation to human transmission. iScience 26, (2023).
Musil, M., Konegger, H., Hon, J., Bednar, D. & Damborsky, J. Computational Design of Stable and Soluble Biocatalysts. ACS Catal 9, 1033–1054 (2019).
Zhou, H. X. & Pang, X. Electrostatic Interactions in Protein Structure, Folding, Binding, and Condensation. Chem Rev 118, 1691–1741 (2018).
Sperandio, V., Giron, J. A., Silveira, W. D. & Kaper, J. B. The OmpU outer membrane protein, a potential adherence factor of Vibrio cholerae. Infect Immun 63, 4433–4438 (1995).
Kamp, H. D., Patimalla-Dipali, B., Lazinski, D. W., Wallace-Gadsden, F. & Camilli, A. Gene Fitness Landscapes of Vibrio cholerae at Important Stages of Its Life Cycle. PLoS Pathog 9, e1003800 (2013).
Fu, Y., Waldor, M. K. & Mekalanos, J. J. Tn-seq analysis of Vibrio cholerae intestinal colonization reveals a role for T6SS-mediated antibacterial activity in the host. Cell Host Microbe 14, 652–663 (2013).
Provenzano, D., Lauriano, C. M. & Klose, K. E. Characterization of the role of the ToxR-modulated outer membrane porins OmpU and OmpT in Vibrio cholerae virulence. J Bacteriol 183, 3652–3662 (2001).
Mathur, J. & Waldor, M. K. The Vibrio cholerae ToxR-regulated porin OmpU confers resistance to antimicrobial peptides. Infect Immun 72, 3577–3583 (2004).
Scott Merrell, D., Bailey, C., Kaper, J. B. & Camilli, A. The ToxR-mediated organic acid tolerance response of Vibrio cholerae requires OmpU. J Bacteriol 183, 2746–2754 (2001).
Li, H., Zhang, W. & Dong, C. Crystal structure of the outer membrane protein OmpU from Vibrio cholerae at 2.2 Å resolution. Acta Crystallogr D Struct Biol 74, 21–29 (2018).
Seed, K. D. et al. Evolutionary consequences of intra-patient phage predation on microbial populations. Elife 3, 1–10 (2014).
Lim, A. N. W., Yen, M., Seed, K. D., Lazinski, D. W. & Camilli, A. A tail fiber protein and a receptor-binding protein mediate ICP2 bacteriophage interactions with Vibrio cholerae OmpU. J Bacteriol 203, (2021).
Silva, A. J. & Benitez, J. A. Vibrio cholerae Biofilms and Cholera Pathogenesis. PLoS Negl Trop Dis 10, (2016).
Weber, G. G. & Klose, K. E. The complexity of ToxT-dependent transcription in Vibrio cholerae. Indian J Med Res 133, 201 (2011).
Lowden, M. J. et al. Structure of Vibrio cholerae ToxT reveals a mechanism for fatty acid regulation of virulence genes. Proc Natl Acad Sci U S A 107, 2860–2865 (2010).
Weber, G. G. & Klose, K. E. The complexity of ToxT-dependent transcription in Vibrio cholerae. Indian J Med Res 133, 201 (2011). https://journals.lww.com/ijmr/fulltext/2011/33020/The_complexity_of_ToxT_dependent_transcription_in.12.aspx.
Levade, I. et al. Predicting Vibrio cholerae Infection and Disease Severity Using Metagenomics in a Prospective Cohort Study. J Infect Dis 223, 342 (2021).
Ravcheev, D. A., Gelfand, M. S., Mironov, A. A. & Rakhmaninova, A. B. Purine Regulon of Gamma-Proteobacteria: A Detailed Description. Russ J Genet 38, 1015–1025 (2002).
Silva, A. J. & Benitez, J. A. Vibrio cholerae Biofilms and Cholera Pathogenesis. PLoS Negl Trop Dis 10, e0004330 (2016).
Lopez, C. M., Kovler, M. L. & Jelin, E. B. Case report of extreme gastric distention and perforation with pathologic Sarcina ventriculi colonization and Rett syndrome. Int J Surg Case Rep 73, 210 (2020).
Conner, J. G., Teschler, J. K., Jones, C. J. & Yildiz, F. H. Staying Alive: Vibrio cholerae’s Cycle of Environmental Survival, Transmission, and Dissemination. Microbiol Spectr 4, (2016).
Carey, D. E. & McNamara, P. J. The impact of triclosan on the spread of antibiotic resistance in the environment. Front Microbiol 5, 123128 (2014).
Huang, Y. H., Lin, J. S., Ma, J. C. & Wang, H. H. Functional characterization of triclosan-resistant enoyl-acyl-carrier protein reductase (fabV) in Pseudomonas aeruginosa. Front Microbiol 7, 227929 (2016).
Singh, S. M., Kongari, N., Cabello-Villegas, J. & Mallela, K. M. G. Missense mutations in dystrophin that trigger muscular dystrophy decrease protein stability and lead to cross-β aggregates. Proceedings of the National Academy of Sciences 107, 15069–15074 (2010).
Wang, Z. & Moult, J. SNPs, protein structure, and disease. Hum Mutat 17, 263–270 (2001).
Scheller, R. et al. Toward mechanistic models for genotype–phenotype correlations in phenylketonuria using protein stability calculations. Hum Mutat 40, 444–457 (2019).
Rakoczy, E. P., Kiel, C., McKeone, R., Stricher, F. & Serrano, L. Analysis of Disease-Linked Rhodopsin Mutations Based on Structure, Function, and Protein Stability Calculations. J Mol Biol 405, 584–606 (2011).
Wall, S. M. The Renal Physiology of Pendrin (SLC26A4) and Its Role in Hypertension. Epithelial Anion Transport in Health and Disease: The Role of the SLC26 Transporters Family 231–243 (2008) doi:10.1002/0470029579.CH15.
Harris, J. B., LaRocque, R. C., Qadri, F., Ryan, E. T. & Calderwood, S. B. Cholera. The Lancet 379, 2466–2476 (2012).
Khan, A. I. et al. Epidemiology of cholera in Bangladesh: Findings from nationwide hospital-based surveillance, 2014-2018. Clinical Infectious Diseases 71, 1635–1642 (2020).
Bwire, G. et al. Alkaline peptone water enrichment with a dipstick test to quickly detect and monitor cholera outbreaks. BMC Infect Dis 17, 1–8 (2017).
Rahman, M., Sack, D. A., Mahmood, S. & Hossain, A. Rapid diagnosis of cholera by coagglutination test using 4-h fecal enrichment cultures. J Clin Microbiol 25, 2204–2206 (1987).
Clinical and Laboratory Standards Institute. M100 Performance standards for antimicrobial susceptibility testing. (Clinical and Laboratory Standards Institute, 2018).
Bankevich, A. et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19, 455–477 (2012).
Lee, I. et al. ContEst16S: An algorithm that identifies contaminated prokaryotic genomes using 16S RNA gene sequences. Int J Syst Evol Microbiol 67, 2053–2057 (2017).
Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
Alcock, B. P. et al. CARD 2020: Antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res 48, D517–D525 (2020).
Seemann, T. ABRicate: Mass screening of contigs for antimicrobial resistance or virulence genes. https://github.com/tseemann/abricate (2020).
Liu, B., Zheng, D., Jin, Q., Chen, L. & Yang, J. VFDB 2019: A comparative pathogenomic platform with an interactive web interface. Nucleic Acids Res 47, D687–D692 (2019).
Carattoli, A. et al. In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother 58, 3895–3903 (2014).
Seemann, T. MLST. https://github.com/tseemann/mlst (2022).
Jolley, K. A. & Maiden, M. C. J. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics 11, 1–11 (2010).
Page, A. J. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693 (2015).
Page, A. J. et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. bioRxiv 038190 (2016) doi:10.1101/038190.
Thorpe, H. A., Bayliss, S. C., Sheppard, S. K. & Feil, E. J. Piggy: a rapid, large-scale pan-genome analysis tool for intergenic regions in bacteria. Gigascience 7, 1–11 (2018).
Minh, B. Q. et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol 37, 1530–1534 (2020).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res 49, W293–W296 (2021).
Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422 (2009).
Croucher, N. J. et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res 43, e15–e15 (2015).
Page, A. J. et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom 2, e000056 (2016).
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., Von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods 14, 587–589 (2017).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
Bailey, T. L., Johnson, J., Grant, C. E. & Noble, W. S. The MEME Suite. Nucleic Acids Res 43, W39–W49 (2015).
Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7, 539 (2011).
Sullivan, M. J., Petty, N. K. & Beatson, S. A. Easyfig: a genome comparison visualizer. Bioinformatics 27, 1009–1010 (2011).
Solovyev, V. & Salamov, A. Automatic annotation of microbial genomes and metagenomic sequences. Metagenomics and its applications in agriculture, biomedicine and environmental studies 61–78 (2011).
Orth, J. D. et al. A comprehensive genome-scale reconstruction of Escherichia coli metabolism—2011. Mol Syst Biol 7, 535 (2011).
Monk, J. M. et al. iML1515, a knowledgebase that computes Escherichia coli traits. Nature Biotechnology 35, 904–908 (2017).
Thiele, I. et al. A community effort towards a knowledge-base and mathematical model of the human pathogen Salmonella Typhimurium LT2. BMC Syst Biol 5, 1–9 (2011).
Liao, Y. C. et al. An experimentally validated genome-scale metabolic reconstruction of Klebsiella pneumoniae MGH 78578, iYL1228. J Bacteriol 193, 1710–1717 (2011).
Norsigian, C. J. et al. BiGG Models 2020: multi-strain genome-scale models and expansion across the phylogenetic tree. Nucleic Acids Res 48, D402–D406 (2020).
Cardoso, J. G. R. et al. Cameo: A Python Library for Computer Aided Metabolic Engineering and Optimization of Cell Factories. ACS Synth Biol 7, 1163–1166 (2018).
Heirendt, L. et al. Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v.3.0. Nature Protocols 14, 639–702 (2019).
Hagberg, A., Swart, P. & S Chult, D. Exploring network structure, dynamics, and function using NetworkX. https://www.osti.gov/biblio/960616 (2008).
Hunter, J. D. Matplotlib: A 2D Graphics Environment. Comput Sci Eng 9, 90–95 (2007).
Ludden, C. et al. One health genomic surveillance of Escherichia coli demonstrates distinct lineages and mobile genetic elements in isolates from humans versus livestock. mBio 10, 2618–2693 (2019).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002).
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
Oliphant, T. E. Python for scientific computing. Comput Sci Eng 9, 10–20 (2007).
Wainer, J. & Cawley, G. Empirical evaluation of resampling procedures for optimising SVM hyperparameters. Journal of Machine Learning Research 18, 475–509 (2017).
Cawley, G. C. & Talbot, N. L. C. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11, 2079–2107 (2010).
Demsar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7, 1–30 (2006).
Shannon, P. et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res 13, 2498–2504 (2003).
Szklarczyk, D. et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res 51, D638–D646 (2023).
Bateman, A. et al. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res 51, D523–D531 (2023).
Pettersen, E. F. et al. UCSF Chimera—A visualization system for exploratory research and analysis. J Comput Chem 25, 1605–1612 (2004).
Pettersen, E. F. et al. UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Science 30, 70–82 (2021).
Pires, D. E. V., Ascher, D. B. & Blundell, T. L. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res 42, W314–W319 (2014).
Rodrigues, C. H. et al. DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability. Nucleic Acids Res 46, W350–W355 (2018).
Ng, P. C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 31, 3812–3814 (2003).
Jurrus, E. et al. Improvements to the APBS biomolecular solvation software suite. Protein Science 27, 112–128 (2018).

There is NO Competing Interest.

TableS1.xlsx
Table S1
TableS2.xlsx
Table S2
TableS3.xlsx
Table S3
TableS4.xlsx
Table S4
TableS5.xlsx
Table S5
TableS6.xlsx
Table S6
TableS7.csv
Table S7
TableS8.xlsx
Table S8
TableS9.xlsx
Table S9
TableS10.xlsx
Table S10
TableS11.xlsx
Table S11
TableS12.xlsx
Table S12
TableS13.xlsx
Table S13
SupplementaryMaterial.pdf
