Actionable prediction of Klebsiella phage-host specificity at the subspecies level

doi:10.21203/rs.3.rs-3101607/v1


This work is licensed under a CC BY 4.0 License

Journal publication: published 22 May 2024 in Nature Communications.

Version 1


Phages are increasingly considered promising alternatives for targeting drug-resistant bacterial pathogens. However, their often-narrow host range can make it challenging to find matching phages against bacteria of interest. Current computational tools do not yet accurately predict interactions at the subspecies level in a way that is relevant and properly evaluated for practical use. We present PhageHostLearn, a machine learning system that predicts subspecies-level interactions between receptor-binding proteins and bacterial receptors for Klebsiella phage-bacteria pairs. We evaluate this system both in silico and in the laboratory, in the clinically relevant setting of finding matching phages against bacterial strains. PhageHostLearn reaches a cross-validated ROC AUC of 83.0% in silico and maintains this performance in laboratory validation. Our approach provides a framework for developing and evaluating phage-host prediction methods that are useful in practice, which we believe to be a meaningful contribution to machine-learning-guided development of phage therapeutics and diagnostics.

Biological sciences/Computational biology and bioinformatics/Machine learning

Biological sciences/Computational biology and bioinformatics/Computational models

Biological sciences/Microbiology/Phage biology

Biological sciences/Computational biology and bioinformatics/Software

Biological sciences/Computational biology and bioinformatics/Predictive medicine

Keywords: phage-host interactions, machine learning, Klebsiella, receptor-binding proteins, K-locus proteins

Phages, bacterial viruses, are among Earth’s most abundant viruses [1]. They typically have a limited host range at the subspecies level [2, 3], although broad host range phages infecting multiple species have also been described [4, 5]. For this reason, phages and the proteins they encode have the potential to become precise therapeutics and diagnostics that can target (multidrug-resistant) bacteria [6]. However, it can be challenging to find matching phages against specific hosts of interest, both in ecological and therapeutic settings [7, 8].

In recent years, novel computational tools that predict interactions between phages and their potential hosts have tried to overcome this bottleneck [8, 9]. Most of these tools either measure the similarity between a query phage genome and potential host genomes or exploit the similarity (e.g., in codon usage) between the query phage genome and other phage genomes with known hosts [10]. Two such approaches are iPHoP and CHERRY. iPHoP is a two-stage machine learning framework developed by Roux et al. (2023). It integrates multiple existing methods to make host predictions at the genus level for a broad range of phages, with the goal of maximizing correct host predictions for metagenome-assembled viruses. This framework attains a low false discovery rate (< 10%). CHERRY uses a graph-based deep learning model that predicts hosts at the species level by incorporating multiple types of interaction information (e.g., genome sequence similarity, CRISPR signals and others) in a multimodal graph [11]. Interestingly, CHERRY can predict new interactions starting from a virus query as well as from a prokaryote query. As Roux et al. (2023) argue, predictions at the genus or species level are essential within the context of viral ecology. In clinical applications, however, knowing a phage’s specificity at the subspecies level is typically desired [12–14], and the lack of such predictions remains a bottleneck for the development of phage therapeutics and diagnostics. Practically, this could be overcome by implementing tools that predict phage-host interactions and make these predictions actionable. For example, prediction scores can be used to prioritize which phage-host combinations should be tested in the laboratory, reducing labor-intensive work to a minimal set of predicted top candidates to be validated. Correspondingly, the overall tool should also be evaluated in a manner that is representative of practical use.

Protein language models are increasingly popular frameworks for machine learning applications at the protein level [15–17]. These are state-of-the-art deep learning models that are trained on large amounts of protein sequence data in a self-supervised manner. Specifically, the models learn to predict the occurrence of amino acids in the context of other amino acids. By training at an enormous scale, these models effectively learn the underlying distribution of naturally occurring proteins. A trained protein language model can be used to generate accurate numerical representations for proteins or be further tweaked for specific problems using a much smaller dataset, a process called fine-tuning [15]. As a result, these large models remove the need for explicit feature engineering and allow for an end-to-end approach.

We have previously proposed a general, biology-informed multilayer machine learning approach to elucidate phage-host interactions at the subspecies level [18]. In that approach, the first layer represents the initial interaction between the phage receptor-binding proteins (RBPs) and the bacterial surface receptors. Typically, RBPs constitute the primary determinant of host specificity [19]. The study by Sørensen et al. (2021) aligns well with this first layer [2]. The authors computationally analyzed tail spike protein diversity in 99 Ackermannviridae phages to determine phage host specificity at the level of the interacting O-antigen receptors, effectively at the subspecies level. However, this approach is not geared towards applications in a clinical context, nor is it built into a tool that other researchers can easily use.

In the present study, we develop and validate a new machine learning approach called PhageHostLearn, which predicts the initial interactions between RBPs and bacterial receptors for Klebsiella phage-bacteria pairs at the subspecies level (Fig. 1a). Klebsiella pneumoniae is among the most prominent multidrug-resistant pathogens worldwide [20]. The unique public availability of interaction data for Klebsiella phage-host interactions enables the development of prediction methods at the subspecies level, a remaining bottleneck for most other bacterial species [21]. For a majority of Klebsiella phage-host interactions, this first interaction between RBPs and the capsular polysaccharide is the primary determinant of host specificity [3, 22]. Therefore, PhageHostLearn processes phage and bacterial genomes into phage RBPs and bacterial K-locus proteins, respectively. We use the ESM-2 protein language model to encode protein sequences as numerical vector representations, which are used as input for an Extreme Gradient Boosting (XGBoost) model, a widely used and broadly applicable method [23]. PhageHostLearn can make interaction predictions in both directions (i.e., from phage to bacterium and vice versa), which is a typical example of pairwise learning [24]. Furthermore, our approach outputs a ranking of potential phage candidates for in vitro validation against a given bacterium. We thoroughly evaluate this approach both in silico and in vitro, in the clinically relevant setting of finding matching phages against a new bacterial strain (Fig. 1b,c). We show that PhageHostLearn reaches a cross-validated ROC AUC of 83.0% in silico and maintains this performance in the laboratory. Our approach is made publicly available as a tool that can be further improved over time. We believe this to be a meaningful first step in machine-learning-guided development of phage therapeutics and diagnostics.

Sequence data collection and processing

Phage genome sequence data, bacterial genome sequence data and their in vitro verified phage-bacteria interactions were collected from the Institute for Integrative Systems Biology (I²SysBio) in Valencia, Spain as described by Beamud et al. (2023), supplemented by an additional, unpublished collection of phage-host interaction data for phages isolated on Klebsiella spp. reference strains. In total, 105 phage genome sequences and 200 bacterial genome sequences were collected. Spot tests were performed to test for phage-bacteria interactions. Out of 10,006 spot tests performed in total, 274 are confirmed interactions (2.74%). Interactions are considered positive if a spot was visible using a 1:10 phage dilution, reflecting an initial interaction between phage RBPs and host receptors (but not necessarily a productive replication).

Phage genome sequences were processed in three consecutive steps (Fig. 1a): (1) PHANOTATE was used to identify genes in each of the phage genomes (McNair et al., 2019); (2) phage RBPs were detected among the translated protein sequences of the identified genes, following our method outlined in Boeckaerts et al. (2022) [25]; and (3) detected phage RBPs shorter than 200 amino acids or longer than 1500 amino acids were discarded, matching the length range in which we expect RBPs [26]. In total, 9,727 genes were detected with PHANOTATE, and subsequently 274 phage RBPs were detected among those identified genes. We detected at least one RBP in each phage genome, and up to eight RBPs in a single phage genome (Supplementary material S1).

Bacterial genome sequences were processed with Kaptive [27, 28] to identify the capsule synthesis locus (K-locus) in each of the bacterial genomes (Fig. 1a). On average, each K-locus consisted of 19 proteins that constitute the K-antigen (the number of proteins was between 10 and 25). A total of 89 different K-types were identified using Kaptive (K13 was most often represented, while 45 different K-types were only represented once, Supplementary material S2).

Multi-instance feature representations

We transformed phage RBPs and bacterial K-locus proteins into combined numerical vector representations (so-called joint features) to serve as input to the machine learning model (Fig. 1a). These are multi-instance representations [29], combining one or multiple RBPs per phage and multiple K-locus proteins per bacterium.

We used the pretrained ESM-2 protein language model (t33_650M_UR50D configuration) to transform each of the RBPs and K-locus proteins into a numerical vector [30]. The vectors corresponding to the RBPs of the same phage or the K-locus proteins of the same bacterium were averaged into multi-instance representations for each phage or bacterium. Finally, for each known interaction in the dataset, the multi-instance representations of the phage and the bacterium were concatenated into a final combined numerical vector that represents a known phage-host pair.

A classification model that predicts interactions

We trained a binary XGBoost classifier to output prediction scores reflecting how likely a phage-host pair is to interact, based on the combined ESM-2 numerical vector representations described above (Fig. 1a). The maximum depth of each tree, the learning rate and the number of estimators were tuned using stratified five-fold cross-validation (Table 1). The optimal maximum depth was 7, the optimal learning rate was 0.3 and the optimal number of estimators was 250.

Table 1

**Hyperparameters and their tested values in the PhageHostLearn model.** The optimal values of the hyperparameters for the model are indicated in bold.
Hyperparameter	Tested values
Maximum depth	3, 5, **7**, 9
Learning rate	0.2, **0.3**, 0.4
Number of estimators	**250**, 500, 750

In silico evaluation of the model

We evaluated our model both in silico and in the laboratory, in the practical setting of finding which phages in a collection are most active against a given bacterial strain. A predictive model is useful if it can effectively suggest the most appropriate phages to test, thereby minimizing manual analysis and laborious experimental work. We simulated this setting in silico by iteratively holding out one bacterial genome, together with its phage interactions, from the training set. In each iteration, the model predicted the held-out interactions, and the prediction scores were used to rank the phages. The hit ratio was computed over the top-k ranked phages by comparing the ranked predictions to the ground truth labels, quantifying how well the model finds matching phages. This was done for values of k ranging from 1 to the total number of phages and for each bacterial genome in the dataset, and the results were averaged over all bacterial genomes (Fig. 2a). The resulting mean hit ratio @ k visualizes the average probability of finding at least one hit among the top-k candidates suggested by the model. For example, with our model we expect to find at least one hit in the top-10 in around 84% of cases (dark blue curve). We also simulated an informed microbiologist approach, which selects from the subset of phages known to infect the same K-type as the bacterial strain at hand (red curve). Our model slightly outperforms this approach in suggesting positive interactions near the top. Additionally, we visualize the receiver operating characteristic (ROC) curve (Fig. 2b) and measure its area under the curve (AUC) as a general performance metric. This ROC AUC can be interpreted as the probability that the model will score a randomly chosen interacting phage-host pair higher than a randomly chosen phage-host pair that does not interact. Our model reaches a ROC AUC of 83.0%. As expected, the mean hit ratio differs across K-types, and there is a strong contrast between the top-10 mean hit ratio for the best and worst predicted K-types (Fig. 2c). Therefore, we constructed histograms of the number of confirmed interactions per bacterial strain belonging to the best and worst predicted K-types as well as the group in between. We observed that the performance across K-types can be related to the number of confirmed interactions in those K-types (Fig. 2d,e,f), highlighting the need for an extensive training dataset with sufficient confirmed interactions for each K-type for optimal performance.
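The mean hit ratio @ k described above can be illustrated with a minimal sketch. It assumes that prediction scores and ground truth labels are already available per held-out bacterium, and that bacteria without any confirmed interaction are skipped (an assumption of this sketch, in line with how the top-five hit ratio is reported for the in vitro validation). The function names and toy values are hypothetical and are not taken from the PhageHostLearn code.

```python
from typing import Dict, List, Tuple

def hit_at_k(scores: List[float], labels: List[int], k: int) -> int:
    """Return 1 if at least one true interaction is among the k top-scored phages."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return int(any(label == 1 for _, label in ranked[:k]))

def mean_hit_ratio_at_k(
    per_bacterium: Dict[str, Tuple[List[float], List[int]]], k: int
) -> float:
    """Average hit@k over all held-out bacteria with at least one confirmed hit."""
    hits = [
        hit_at_k(scores, labels, k)
        for scores, labels in per_bacterium.values()
        if any(labels)  # assumption: strains without confirmed interactions are skipped
    ]
    if not hits:
        raise ValueError("no bacterium with a confirmed interaction")
    return sum(hits) / len(hits)

# Toy predictions: one (scores, labels) pair per held-out bacterial genome.
toy = {
    "strain_1": ([0.9, 0.2, 0.6, 0.1], [0, 0, 1, 0]),  # first hit at rank 2
    "strain_2": ([0.3, 0.8, 0.5, 0.4], [0, 1, 0, 0]),  # first hit at rank 1
    "strain_3": ([0.7, 0.1, 0.2, 0.3], [0, 0, 0, 0]),  # no confirmed interaction
}
curve = {k: mean_hit_ratio_at_k(toy, k) for k in range(1, 5)}
```

Plotting `curve` against k gives the kind of curve shown in Fig. 2a; the curve is non-decreasing in k because a larger top-k can only add hits.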

In vitro validation of the model with spot tests

A total of 28 carbapenem-resistant K. pneumoniae clinical isolates were collected and sequenced in collaboration with the National Microbiology Center (CNM) in Madrid, Spain. These K. pneumoniae clinical isolates comprised high-risk clones that are currently circulating in Spain and included a total of eight different K-types (KL17, KL24, KL27, KL64, KL102, KL107, KL112 and KL151). Each of these K-types was also present (at least once) in the training data. For each of these K. pneumoniae clinical isolates, PhageHostLearn was used to predict interactions and construct a ranking for the unpublished collection (I²SysBio) of 59 phages that were isolated on Klebsiella spp. reference strains and for which the full genome was available. As these phages were isolated on Klebsiella spp. reference strains, they had not been tested on all the K-types present in the test set of clinical isolates. Moreover, none of the phages had been tested before on these specific clinical isolates. The top-five ranked phages for each K. pneumoniae clinical isolate were validated in the laboratory using spot tests at a 1:10 phage dilution in duplicate or triplicate (for discrepant results). Spot tests were used for consistency across model training and in vitro validation, and because of our focus on the initial interaction between phage RBPs and host receptors (not necessarily reflecting a productive replication). In an additional effort, all 17 unique phages that were identified across the different top-five lists were tested against all 28 clinical isolates to examine potential false negatives.

One or more interactions were confirmed with spot tests for 16 out of the 28 bacterial strains. PhageHostLearn was able to correctly predict hits in the top-five phage candidates for 15 of these isolates, corresponding to a top-five hit ratio of 93.8% (Fig. 3a). Comparing the different K-types, PhageHostLearn only missed 7 hits in total, for strains of KL17, KL24 and KL27 (Supplementary material S3). Overall, the PhageHostLearn system retains its in silico performance, reaching a ROC AUC of 79.3% in this in vitro validation, compared to 83.0% in silico (Fig. 3b).

Overall, the top candidates predicted by the XGBoost model are often phages that have a broader host range, such as K65PH164, K30λ2.2, K2064PH2 and K7PH164C4 (Supplementary material S3). These phages appear across the true positives, false positives, and false negatives (Table 2). Considering that these phages are not K-locus specific, this result was to be expected based on our focus on RBPs and K-locus proteins. Interestingly, the model does suggest a strategy that a microbiologist would think of as well: testing all the broad host range phages by default. The model also correctly suggests some K-type specific phages, such as K54λ1.1.1 and K17α62, and misses only a few of these (i.e., most false negatives involve broad host range phages such as K30λ2.2, K2064PH2 and K7PH164C4). One K-type specific phage (K17α61) was a false positive in combination with some bacterial strains but a false negative in combination with other bacterial strains. These wrong predictions are more challenging to explain from a biological perspective and could equally be explained by a lack of sufficiently similar data from which the model can learn.

Table 2

**Concordance of the predictions by our model with laboratory confirmations by means of a confusion table.** Counts of true positives, false positives and false negatives with the most prevalent phages in each category, across all the top-five interactions tested in the laboratory (140 in total), supplemented by the interactions tested for all 17 unique phages across the different top-five lists for counting the false negatives. The true positives were the predictions in the top-five recommendations that were confirmed in the *in vitro* validation. The false positives were the predictions in the top-five recommendations that could not be confirmed in the *in vitro* validation. Finally, the false negatives were the interactions that could be confirmed in the lab across the 17 tested phages that were not predicted in the top-five recommendations.
Case	Model prediction	In vitro result	Count	Prevalent phages
True positive	Top-five	Interaction	26	K65PH164, K30λ2.2, K2064PH2, K7PH164C4
False positive	Top-five	No interaction	114	K65PH164, K2064PH2, K29PH164C1, K17α61
False negative	Outside of top-five	Interaction	7	K2064PH2, K7PH164C4, K17α61, K30λ2.2
True negative	Outside of top-five	No interaction	Not considered	-

Moreover, the model correctly suggested 78.8% (= 26 / [26 + 7]) of all the confirmed interactions in the top-five. However, the seven false negatives may be an underestimate, as we have not comprehensively tested all 59 phages against the 28 clinical isolates. In addition, the model also suggested 114 interactions that turned out to be negative. This is intrinsic to a ranking approach: the top suggestions are tested regardless of the prediction scores assigned by the model, and can therefore include phages that do not adsorb to the host strain. Concretely, when all interactions for a given bacterium receive low scores, the five with the highest scores are still tested.
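The arithmetic behind these figures can be reproduced directly from the counts in Table 2. The sketch below also derives the fraction of tested top-five suggestions that were confirmed, a value that follows from the table but is not reported in the text; the variable names are chosen for illustration.

```python
# Counts taken from Table 2.
true_positives = 26
false_positives = 114
false_negatives = 7

# Every top-five suggestion was tested: 28 isolates x 5 phages.
tested_in_top_five = true_positives + false_positives

# Fraction of confirmed interactions that appeared in a top-five list.
recall = true_positives / (true_positives + false_negatives)

# Fraction of tested top-five suggestions that were confirmed (derived here,
# not stated in the text).
precision = true_positives / tested_in_top_five

print(f"tested: {tested_in_top_five}, recall: {recall:.1%}, precision: {precision:.1%}")
```

As noted above, the recall is an upper estimate because the false negatives were only counted among the 17 phages tested against all isolates.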

In this work, we developed PhageHostLearn, a machine learning system that overcomes three current bottlenecks for phage-host interaction prediction in the context of phage therapeutics and diagnostics. First, the system predicts phage-host interactions at the subspecies level. Second, it outputs prediction scores that can be used to recommend top candidates, resulting in a more effective laboratory validation. Third, we evaluated this system in the practical setting that it will be used for: predicting matching phages for new bacteria.

We specifically trained and evaluated our system to make actionable predictions for Klebsiella phage-host pairs. The unique data availability for Klebsiella allowed us to construct a machine learning system capable of making predictions at the subspecies level. Moreover, we have deliberately chosen to focus on phage RBPs and bacterial K-locus proteins, as these proteins are involved in the first step of the phage infection cycle and are known to be a major determinant of phage-host specificity for many Klebsiella phages [3, 22]. We suggest that the same approach could be extended to predict phage-host interactions with similar biological characteristics at the level of phage RBPs and host receptors. For example, Escherichia coli, Salmonella enterica and Acinetobacter baumannii all have characteristic O-antigens that many phages bind to using their RBPs [31, 32]. In addition, PhageHostLearn can be extended to include other typical phage receptors such as outer membrane proteins, flagella and others, given that they can be annotated in the genome.

A combined, multi-instance feature representation was computed using the ESM-2 protein language model (Fig. 1a). This way of computing features is inspired by how state-of-the-art deep learning architectures process natural language into numerical representations. The advantage of this type of approach is that it bypasses the need for explicit feature engineering, such as computing codon usage or k-mer frequencies, which is seen in many earlier approaches. The result is a combined multi-instance representation that represents the phage and the host together. The XGBoost model on top of these combined multi-instance representations outputs prediction scores that can be used to propose top phage-host candidates to test. We show that our machine learning system suggests top candidates slightly better than an informed microbiologist approach. However, we hypothesize that there may be better ways of aggregating individual ESM-2 protein representations into multi-instance representations than our simple approach of computing a column-wise mean.

PhageHostLearn produces prediction scores that can be used to construct a ranking of top phage candidates for a given bacterium, a practical output format that is directly actionable and can guide effective in vitro validations (Fig. 4). At the same time, a ranking removes the need to set an arbitrary cutoff on the prediction score above which predictions are interpreted as interacting phage-host pairs. The ranking is closely linked to the way we evaluate our model in silico and in vitro: by instructing the model to predict interactions of given phages against a new bacterium and evaluating to what extent the model is useful in assigning higher prediction scores to matching phages (effectively resulting in a ranking that is useful in practice). Here, we notice that such a ranking can include (many) false positives, an inherent trait of a ranking approach. For Klebsiella phage-host interactions, we expect few interacting phage-host pairs overall. As a result, we argue that avoiding false negatives is more important than avoiding false positives, up to a certain extent. Two strategies can be further explored to balance false positives and false negatives: (1) setting an additional (albeit low) threshold on the prediction score or (2) considering a flexible top-k ranking that would depend on the K-type. For K-types for which the model is very accurate, a smaller top-k can be considered, while for K-types that are more difficult to accurately predict, a larger top-k can be tested in the laboratory.
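The two strategies mentioned above could be combined along the following lines. This is a hypothetical sketch rather than part of PhageHostLearn: the function name, the K-type labels, the per-K-type list sizes and the threshold value are all illustrative assumptions.

```python
from typing import Dict, List

# Hypothetical list sizes: smaller for K-types the model predicts well,
# larger for K-types that are harder to predict.
TOP_K_BY_KTYPE: Dict[str, int] = {"well_predicted_KL": 3, "poorly_predicted_KL": 10}
DEFAULT_TOP_K = 5  # the top-five used in the in vitro validation

def select_candidates(
    scores: Dict[str, float], k_type: str, min_score: float = 0.0
) -> List[str]:
    """Rank phages by score, keep a K-type dependent top-k, drop very low scores."""
    top_k = TOP_K_BY_KTYPE.get(k_type, DEFAULT_TOP_K)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [phage for phage, score in ranked[:top_k] if score >= min_score]

# Toy prediction scores for one bacterium (hypothetical phage names).
toy_scores = {"phage_a": 0.81, "phage_b": 0.40, "phage_c": 0.07, "phage_d": 0.02}
shortlist = select_candidates(toy_scores, "well_predicted_KL", min_score=0.1)
```

With `min_score=0.0` the function reduces to a plain top-k ranking, so the threshold only removes suggestions and never adds any, which trades fewer false positives against a higher risk of false negatives.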

Importantly, we have not explicitly evaluated PhageHostLearn at the level of individual RBPs and their ability to bind to a specific K-antigen, as RBP-level validated interactions are unavailable. While some phages in the dataset consist of only a single RBP, and we would expect the model to learn these direct relationships between RBPs and their interacting K-antigen, most phages consist of two or more RBPs. For this reason, we do not know how accurate the model is in predicting K-type specificity at the level of individual RBPs, nor can we assess how useful the model would be in assisting RBP engineering efforts to adjust host range. These application settings can be further explored to broaden the model’s usefulness.

More generally, PhageHostLearn represents a specific approach to predict phage-host interactions at the level of the initial recognition of bacterial receptors by phage RBPs. Several other approaches exist for predictions at the species, genus or higher levels, which are useful within different contexts [8, 9]. For Klebsiella phage-host interactions, we argue that a focus on RBPs and their interacting K-antigens is appropriate and useful. However, phage-host pairs that do not interact with the K-antigen are currently missed. This limitation exists both for phage-host pairs within Klebsiella, as detailed by Beamud et al. (2023), and for phage-host pairs involving other bacterial species [3]. This limitation could be overcome by considering surface receptors beyond the K-antigen. To further include essential steps of the infection process beyond the initial interaction (e.g., phage defense systems) [33], more elaborate approaches are needed. For example, different models focusing on each step in an infection process can be combined hierarchically, which is the multi-layer approach for digital phagograms we have proposed earlier [18]. Alternatively, high-capacity deep learning models might provide another way of modeling the infection process in its entirety, provided that sufficient amounts of data are available to train these complex models.

In summary, this work represents a first-of-its-kind approach that demonstrates the feasibility of predicting phage-host interactions at the subspecies level, given a comprehensive dataset of interacting phage-host pairs and their genomes. Moreover, the PhageHostLearn system is actionable and is evaluated in a practical setting. In that way, we believe PhageHostLearn meaningfully contributes to ongoing efforts in machine-learning-guided development of phage therapeutics and diagnostics.

Sequence data collection and processing

Phage genome sequence data, bacterial genome sequence data and their in vitro verified interactions were collected from the Institute for Integrative Systems Biology (I²SysBio) in Valencia, Spain as described by Beamud et al. (2023), supplemented by an additional, unpublished collection of phage-host interaction data for phages isolated on Klebsiella spp. reference strains that had previously been sequenced with Illumina sequencing. For both sets of data, spot tests had previously been carried out in triplicate to verify Klebsiella phage-host interactions at a tenfold phage dilution, reflecting an initial interaction between phage RBPs and host receptors (but not necessarily a productive replication). In addition to the spot tests, Beamud et al. (2023) further confirmed phage-host interactions with positive spot tests using a planktonic killing assay, measuring bacterial growth inhibition by optical density at 600 nm (OD600) for at least 16 h.

Phage genome sequences were processed in three consecutive steps. In the first step, PHANOTATE was used to identify the genes in each of the phage genomes [34]. Genes were identified without the use of tRNAscan-SE [35]. The second step involved translating the phage genes into proteins and detecting phage RBPs among them, for which we followed the method outlined in Boeckaerts et al. (2022) [25]. Briefly, this detection involves (1) computing HMM bit scores for each of the phage proteins against a manually curated set of RBP-related HMMs, (2) computing ProtBert-BFD embeddings for each of the proteins and (3) using both the bit scores and embeddings together as numerical representations in an XGBoost classifier that discriminates phage RBPs from other phage proteins. The code for this method was made publicly available. Finally, the third step in processing phage genomes involved discarding detected phage RBPs shorter than 200 amino acids or longer than 1500 amino acids, which is the length range in which we expect RBPs, based on Latka et al. (2019) [26].
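The third step, the length filter, is simple enough to sketch. The example below assumes that detected RBPs are available as a mapping from protein identifier to amino acid sequence; the function name, identifiers and toy sequences are hypothetical. Sequences of exactly 200 or 1500 amino acids are kept, following the wording "shorter than" and "longer than".

```python
from typing import Dict

MIN_RBP_LENGTH = 200   # amino acids, lower bound stated in the text
MAX_RBP_LENGTH = 1500  # amino acids, upper bound stated in the text

def filter_rbps_by_length(rbps: Dict[str, str]) -> Dict[str, str]:
    """Keep detected RBPs whose length lies within 200 to 1500 amino acids."""
    return {
        protein_id: sequence
        for protein_id, sequence in rbps.items()
        if MIN_RBP_LENGTH <= len(sequence) <= MAX_RBP_LENGTH
    }

# Toy sequences with hypothetical identifiers (not real RBPs).
detected = {
    "phageA_gp12": "M" * 150,   # too short, discarded
    "phageA_gp13": "M" * 650,   # within range, kept
    "phageB_gp07": "M" * 1800,  # too long, discarded
}
kept = filter_rbps_by_length(detected)
```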

The bacterial genome sequences were processed with Kaptive [27, 28]. More specifically, Kaptive was used to identify the capsule synthesis locus (K-locus) in each of the bacterial genomes using BLASTN against published K-locus reference sequences. The coding genes in each detected K-locus were translated into protein sequences and stored for further transformation into numerical features. When Kaptive detected missing genes, the corresponding reference gene of the best-matching K-type was used for further processing. All the code for these analyses and processed data are made available through GitHub (https://github.com/dimiboeckaerts/PhageHostLearn) and Zenodo (https://doi.org/10.5281/zenodo.8052911).

Multi-instance feature representations

Phage RBPs and bacterial K-locus proteins were transformed into combined numerical vector representations (so-called features), representing both the phage and the bacterium together. We computed multi-instance representations using the pretrained ESM-2 protein language model (t33_650M_UR50D configuration), which takes a single protein sequence as input and outputs a 1280-dimensional real vector that represents the protein [30]. Using ESM-2, each of the RBPs and K-locus proteins was transformed into a numerical vector. Next, the vectors of the RBPs corresponding to the same phage were averaged into a multi-instance representation for each phage. In the same way, the vectors of the K-locus proteins corresponding to the same bacterium were averaged into a multi-instance representation for that bacterium. Finally, these two multi-instance representations were concatenated into a combined, 2560-dimensional vector representing a phage-host pair. These combined multi-instance feature representations then served as input for our machine learning model to learn interactions between known phage-host pairs and predict new interactions.
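The averaging and concatenation steps can be sketched as follows, assuming the per-protein ESM-2 vectors have already been computed. To stay dependency-free, the sketch uses plain Python lists and 4-dimensional toy vectors in place of the 1280-dimensional embeddings; the function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]

def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Column-wise mean of equally sized per-protein embedding vectors."""
    if not vectors:
        raise ValueError("need at least one protein embedding")
    if len({len(vector) for vector in vectors}) != 1:
        raise ValueError("all embeddings must have the same dimension")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def pair_features(rbp_embeddings: Sequence[Vector],
                  klocus_embeddings: Sequence[Vector]) -> Vector:
    """Concatenate the phage-level and bacterium-level mean embeddings."""
    return mean_pool(rbp_embeddings) + mean_pool(klocus_embeddings)

# Toy 4-dimensional stand-ins for ESM-2 vectors.
rbps = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]          # two RBPs of one phage
klocus = [[0.0] * 4, [2.0] * 4, [4.0] * 4]                     # three K-locus proteins
features = pair_features(rbps, klocus)
```

With real 1280-dimensional ESM-2 vectors, the same two functions yield the 2560-dimensional pair representation described above, independent of how many RBPs or K-locus proteins each genome contains.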

A classification model that predicts interactions

The ESM-2-based feature representation was used as an input to train a binary XGBoost classifier. XGBoost is a nonlinear machine learning method that sequentially fits decision trees to improve the overall performance of the ensemble model. It is widely used for its broad applicability and performance on unstructured data [23]. Three hyperparameters were tuned using a stratified five-fold cross-validation: the maximum depth of each tree (which influences the complexity of the model), the learning rate (which controls the optimization process) and the number of estimators (which refers to the number of boosting rounds that are done).
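As an illustration of the tuning setup, the sketch below enumerates the grid from Table 1 and builds stratified folds in plain Python. It does not train a model: fitting XGBoost on each candidate and fold is left out, and the parameter keys simply follow the names used by XGBoost's scikit-learn-style interface. The fold assignment is deterministic for readability, whereas a real run would typically shuffle; all function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence

# Candidate values from Table 1.
PARAM_GRID: Dict[str, list] = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.2, 0.3, 0.4],
    "n_estimators": [250, 500, 750],
}

def expand_grid(grid: Dict[str, list]) -> List[dict]:
    """List every hyperparameter combination in the grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

def stratified_folds(labels: Sequence[int], n_folds: int = 5) -> List[List[int]]:
    """Assign sample indices to folds so each fold keeps roughly the class ratio."""
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == cls]
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)  # round-robin within each class
    return folds

candidates = expand_grid(PARAM_GRID)        # 4 x 3 x 3 combinations
toy_labels = [1] * 10 + [0] * 90            # toy imbalanced labels, 10% positives
folds = stratified_folds(toy_labels)
```

Stratification matters here because confirmed interactions are rare in the dataset: without it, a fold could end up with almost no positive pairs, making its validation score uninformative.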

In silico evaluation of the model

We simulated in silico the practical setting of finding candidate phages to test against a host of interest by using our model to rank phages by their prediction score. We then computed two metrics to evaluate model performance in this setting: the mean hit ratio @ k and the ROC AUC. The mean hit ratio @ k summarizes the average probability of finding at least one hit among the top-k candidates suggested by the model.

More specifically, we iteratively held out one bacterial genome and its interactions from the training set. After training, the model predicted the held-out interactions (different phages against one bacterial strain), and the prediction scores were used to construct a ranking. The hit ratio was computed by comparing the top-k ranked predictions to their ground truth labels, for values of k ranging from 1 to the total number of phages. This process was repeated for each bacterial genome in the dataset, each time training the model on all remaining data. The hit ratios were then averaged across bacterial genomes to produce a final mean hit ratio @ k for each value of k. In practice, we implemented this evaluation as a leave-one-group-out cross-validation (LOGOCV) scheme in which each group represents a bacterial genome and its associated interactions.

We also simulated an ‘informed microbiologist’ approach, in which phages known to infect the same K-type were prioritized when constructing the ranking. If such phages were found for a given bacterium, they were further ordered by the number of other K-types they additionally infect, prioritizing narrow host range phages, as these typically exhibit a higher fitness [36]. Conversely, if no phages were found that infect the same K-type, the broadest host range phages were prioritized.
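The leave-one-group-out computation of the mean hit ratio @ k can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the published code: the data layout (a dictionary mapping each bacterium to its phage labels), the function names, and the toy `frequency_baseline` predictor standing in for the trained model are all hypothetical. Bacteria without any positive interaction are kept in the average here, which is also an assumption.

```python
def hit_ratio_at_k(scores, labels, k):
    """Return 1.0 if any of the k highest-scoring phages is a true hit."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return 1.0 if any(label == 1 for _, label in ranked[:k]) else 0.0


def mean_hit_ratio_curve(interactions, fit_predict):
    """Leave-one-bacterium-out mean hit ratio for k = 1..number of phages.

    interactions: {bacterium: {phage: 0 or 1}}
    fit_predict(train, held_out, phages) -> list of scores, one per phage;
    it stands in for training the model on `train` and scoring the
    held-out bacterium.
    """
    bacteria = sorted(interactions)
    n_phages = max(len(pairs) for pairs in interactions.values())
    totals = [0.0] * n_phages
    for held_out in bacteria:
        train = {b: interactions[b] for b in bacteria if b != held_out}
        phages = sorted(interactions[held_out])
        labels = [interactions[held_out][p] for p in phages]
        scores = fit_predict(train, held_out, phages)
        for k in range(1, n_phages + 1):
            totals[k - 1] += hit_ratio_at_k(scores, labels, k)
    return [total / len(bacteria) for total in totals]


def frequency_baseline(train, held_out, phages):
    """Toy predictor: score a phage by the fraction of training hosts it infects."""
    return [
        sum(pairs.get(phage, 0) for pairs in train.values()) / len(train)
        for phage in phages
    ]


toy = {
    "B1": {"P1": 1, "P2": 0, "P3": 0},
    "B2": {"P1": 1, "P2": 0, "P3": 0},
    "B3": {"P1": 0, "P2": 0, "P3": 1},
}
curve = mean_hit_ratio_curve(toy, frequency_baseline)
```

Entry `curve[k - 1]` is the fraction of held-out bacteria for which at least one true interaction appears among the top k suggestions.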

In addition, we computed and visualized the ROC curve and computed its area-under-the-curve (AUC) in the same LOGOCV, without constructing a ranking. The ROC AUC can be interpreted as the probability that the model will score a randomly chosen interacting phage-host pair higher than a randomly chosen phage-host pair that does not interact.
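The probabilistic interpretation of the ROC AUC given above can be made concrete with a direct pairwise computation. The sketch below is a plain-Python illustration (the function name `pairwise_roc_auc` is an assumption, and ties are counted as half a win, which is the usual convention); it is quadratic in the number of pairs and is meant for clarity, not as a replacement for a library implementation.

```python
def pairwise_roc_auc(scores, labels):
    """ROC AUC as the probability that a positive outscores a negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Four phage-host pairs: two interacting (label 1), two non-interacting (label 0).
example_auc = pairwise_roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In this example, three of the four positive-negative pairs are ordered correctly, so `example_auc` equals 0.75.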

In vitro validation of the model with spot tests

A total of 28 currently circulating, carbapenem-resistant Klebsiella pneumoniae clinical isolates were collected in collaboration with the National Microbiology Center (CNM) in Madrid, Spain. The bacteria were isolated in several Spanish hospitals from urine samples, blood samples, and abscess, wound, ulcer and rectal exudates. The bacterial genomes were sequenced with Illumina sequencing, and the resulting genome sequences were processed with Kaptive as before. Each genome was used as input to PhageHostLearn to predict interactions and construct a ranking for an unpublished collection (I²SysBio) of 59 phages that were isolated on Klebsiella spp. reference strains and for which the full genome was available. The top-five ranked phages for each K. pneumoniae clinical isolate were validated in the laboratory using spot tests in semi-solidified media at a 1:10 phage dilution (in liquid broth), in duplicate, or in triplicate for discrepant results. First, bacterial cultures were inoculated from glycerol stocks and grown overnight at 37 °C in liquid broth. Phage stocks were aliquots of an amplification in liquid broth, stored at -80 °C. Spots were made by adding 1 µL drops of a 1:10 phage dilution to bacterial lawns of 200 µL of each of the 28 K. pneumoniae isolates and 3.5 mL of 0.3% LB top agar in Petri plates. To ensure the quality of the phage stocks, each phage was also tested on its isolation strain at 1:10 and 1:10³ phage dilutions as a positive control. Plates were incubated for 24 h at 37 °C.

A spot was considered positive if it was observed at the 1:10 dilution. This included both clear plaques and cases in which clear plaques could not be distinguished (potentially indicating lysis from without, but still a positive RBP-receptor interaction), for phages tested against the clinical isolates as well as for the positive controls. Absence of spots was considered a negative result. At least two replicates of the experiment were performed, and a third replicate was performed if discrepancies were observed. In those cases, the final result was scored negative if spots were absent in at least two replicates, and positive if spots were confirmed in at least two replicates at a 1:10 phage dilution. Finally, the laboratory confirmations were used to visualize both the hit ratio @ k and the ROC AUC in the same way as before.
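The replicate scoring rule described above amounts to a simple majority decision, which can be written out as a short sketch. The function name and the use of `None` to flag a discrepant duplicate that still needs a third replicate are illustrative assumptions.

```python
def score_spot_test(replicates):
    """Score a phage-isolate pair from spot observations at the 1:10 dilution.

    replicates: list of two or three booleans (True = spot observed).
    Returns True (positive), False (negative), or None when two replicates
    disagree and a third replicate is still required.
    """
    if len(replicates) not in (2, 3):
        raise ValueError("expected two or three replicates")
    n_positive = sum(1 for observed in replicates if observed)
    if len(replicates) == 2:
        if n_positive == 1:
            return None  # discrepant duplicate: run a third replicate
        return n_positive == 2
    # With three replicates, at least two concordant results decide the call.
    return n_positive >= 2
```

For example, `score_spot_test([True, False])` returns `None`, and adding a third positive replicate, `score_spot_test([True, False, True])`, yields a positive call.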

The biological materials used in the in vitro validation of this study are available from the Institute for Integrative Systems Biology (I2SysBio, contact: [email protected]) and the Spanish Microbiology Center (CNM, contact: [email protected]) under a data use agreement upon request.

Data availability

We make fully available (1) the raw sequence data in FASTA format, collected from Beamud et al. (Cell Reports, 2023) and from an unpublished collection of the Institute for Integrative Systems Biology (I2SysBio) in Spain; (2) the processed data that were used in the analyses and to train and evaluate the machine learning model; and (3) the phage-host interaction data in .csv format. These data are available through Zenodo (https://doi.org/10.5281/zenodo.8095914).

Code availability

Our code is fully available through GitHub (https://github.com/dimiboeckaerts/PhageHostLearn) and Zenodo (https://zenodo.org/record/8059735). Sequence data were processed using PHANOTATE v1.5.0 (https://github.com/deprekate/PHANOTATE), PhageRBPdetection v2.1.3 (https://github.com/dimiboeckaerts/PhageRBPdetection) and Kaptive v2.0.0 (https://github.com/klebgenomics/Kaptive). Feature representations of processed sequences were computed using ESM-2 v1.0.3 (https://github.com/facebookresearch/esm). The machine learning model used XGBoost v1.5.0 (https://github.com/dmlc/xgboost), and we evaluated the model using cross-validation and metrics implemented in scikit-learn v0.24.2 (https://scikit-learn.org/stable/). Furthermore, our code pipeline uses Python v3.9.7, biopython v1.79, joblib v1.1.0, json v4.2.1, matplotlib v3.4.3, numpy v1.20.3, pandas v1.3.4, pickle v0.7.5 and seaborn v0.11.2.

Author statements

Acknowledgement and funding information

DB is supported by the Research Foundation – Flanders (FWO), grant number 1S69520N. MS and BDB received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” program. P.D-C. received funding through project PID2020-112835RA-I00, funded by MCIN/AEI/10.13039/501100011033, and project SEJIGENT/2021/014, funded by Conselleria d’Innovació, Universitats, Ciència i Societat Digital (Generalitat Valenciana). P.D-C. was also financially supported by a Ramón y Cajal contract RYC2019-028015-I funded by MCIN/AEI/10.13039/501100011033 and by ESF Invest in your future.

Author contributions

Conceptualization & Methodology: DB, MS, BDB, RS, PD-C, YB; Data Curation, Software, Formal Analysis, Validation & Original Draft preparation: DB; Experimental design: DB, MS, RS, PD-C, BDB, YB; Experimental validation: CF-G; Molecular characterization of K. pneumoniae high-risk clones: JO-I; Manuscript Review & Editing: DB, MS, BDB, RS, PD-C, YB; Supervision: MS, BDB, RS, PD-C, YB.

Competing interests statement

The authors declare that there are no conflicts of interest.

Inclusion and ethics statement

This research is a collaboration between Belgian and Spanish researchers from five different research groups. The research is especially relevant in both countries, as it focuses on Klebsiella pneumoniae, a highly relevant pathogen in clinical settings. By extension, Klebsiella is also globally relevant, including in low- and middle-income countries. The research made use of currently circulating, carbapenem-resistant Klebsiella pneumoniae clinical isolates, in collaboration with the National Microbiology Center (CNM) in Madrid, Spain. Responsibilities for this research were agreed amongst collaborators ahead of the research. Where possible, locally relevant research was taken into account in citations.

References

1. Clokie MRJ, Millard AD, Letarov AV, Heaphy S. Phages in nature. Bacteriophage. 2011, 1(1), 31-45.
2. Sørensen AN, Woudstra C, Sørensen MCH, Brøndsted L. Subtypes of tail spike proteins predicts the host range of Ackermannviridae phages. Comput Struct Biotechnol J. 2021, 19, 4854-4867. doi:10.1016/j.csbj.2021.08.030
3. Beamud B, García-González N, Gómez-Ortega M, González-Candelas F, Domingo-Calap P, Sanjuan R. Genetic determinants of host tropism in Klebsiella phages. Cell Rep. 2023, 42(2), 112048. doi:10.1016/j.celrep.2023.112048
4. Schwarzer D, Buettner FF, Browning C, et al. A multivalent adsorption apparatus explains the broad host range of phage phi92: a comprehensive genomic and structural analysis. J Virol. 2012, 86(19), 10384-10398. doi:10.1128/JVI.00801-12
5. Hanson CA, Marston MF, Martiny JB. Biogeographic Variation in Host Range Phenotypes and Taxonomic Composition of Marine Cyanophage Isolates. Front Microbiol. 2016, 7, 983. doi:10.3389/fmicb.2016.00983
6. Klumpp J, Dunne M, Loessner MJ. A perfect fit: Bacteriophage receptor-binding proteins for diagnostic and therapeutic applications. Curr Opin Microbiol. 2023, 71, 102240. doi:10.1016/j.mib.2022.102240
7. Keen EC. Tradeoffs in bacteriophage life histories. Bacteriophage. 2014, 4(1), e28365. doi:10.4161/bact.28365
8. Coclet C, Roux S. Global overview and major challenges of host prediction methods for uncultivated phages. Curr Opin Virol. 2021, 49, 117-126. doi:10.1016/j.coviro.2021.05.003
9. Versoza CJ, Pfeifer SP. Computational Prediction of Bacteriophage Host Ranges. Microorganisms. 2022, 10(1), 149. doi:10.3390/microorganisms10010149
10. Roux S, Camargo AP, Coutinho FH, et al. iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol. 2023, 21(4), e3002083. doi:10.1371/journal.pbio.3002083
11. Shang J, Sun Y. CHERRY: a computational method for accurate prediction of virus-prokaryotic interactions using a graph encoder-decoder model. Briefings in Bioinformatics. 2022, 23(5), bbac182. doi:10.1093/bib/bbac182
12. Schooley RT, Biswas B, Gill JJ, et al. Development and Use of Personalized Bacteriophage-Based Therapeutic Cocktails To Treat a Patient with a Disseminated Resistant Acinetobacter baumannii Infection. Antimicrob Agents Chemother. 2017, 61(10), e00954-17. doi:10.1128/AAC.00954-17
13. Dedrick RM, Guerrero-Bustamante CA, Garlena RA, et al. Engineered bacteriophages for treatment of a patient with a disseminated drug-resistant Mycobacterium abscessus. Nat Med. 2019, 25(5), 730-733. doi:10.1038/s41591-019-0437-z
14. Eskenazi A, Lood C, Wubbolts J, et al. Combination of pre-adapted bacteriophage therapy and antibiotics for treatment of fracture-related infection due to pandrug-resistant Klebsiella pneumoniae. Nat Commun. 2022, 13(1), 302. doi:10.1038/s41467-021-27656-z
15. Ofer D, Brandes N, Linial M. The language of proteins: NLP, machine learning & protein sequences. Comput Struct Biotechnol J. 2021, 19, 1750-1758. doi:10.1016/j.csbj.2021.03.022
16. Rives A, Meier J, Sercu T, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021, 118(15), e2016239118. doi:10.1073/pnas.2016239118
17. Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022, 38(8), 2102-2110. doi:10.1093/bioinformatics/btac020
18. Lood C, Boeckaerts D, Stock M, et al. Digital phagograms: predicting phage infectivity through a multilayer machine learning approach. Curr Opin Virol. 2022, 52, 174-181. doi:10.1016/j.coviro.2021.12.004
19. Nobrega FL, Vlot M, de Jonge PA, et al. Targeting mechanisms of tailed bacteriophages. Nat Rev Microbiol. 2018, 16(12), 760-773. doi:10.1038/s41579-018-0070-8
20. Antimicrobial Resistance Collaborators. Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet. 2022, 399(10325), 629-655. doi:10.1016/S0140-6736(21)02724-0
21. Leite DMC, et al. Computational prediction of inter-species relationships through omics data analysis and machine learning. BMC Bioinform. 2018, 19(420), 151-159. doi:10.1186/s12859-018-2388-7
22. Squeglia F, Maciejewska B, Łątka A, et al. Structural and Functional Studies of a Klebsiella Phage Capsule Depolymerase Tailspike: Mechanistic Insights into Capsular Degradation. Structure. 2020, 28(6), 613-624.e4. doi:10.1016/j.str.2020.04.015
23. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the KDD ’16: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, 785-794.
24. Stock M, Piot N, Vanbesien S, Meys J, Smagghe G, De Baets B. Pairwise learning for predicting pollination interactions based on traits and phylogeny. Ecological Modelling. 2021, 451, 109508. doi:10.1016/j.ecolmodel.2021.109508
25. Boeckaerts D, Stock M, De Baets B, Briers Y. Identification of Phage Receptor-Binding Protein Sequences with Hidden Markov Models and an Extreme Gradient Boosting Classifier. Viruses. 2022, 14(6), 1329. doi:10.3390/v14061329
26. Latka A, Leiman PG, Drulis-Kawa Z, Briers Y. Modeling the Architecture of Depolymerase-Containing Receptor Binding Proteins in Klebsiella Phages. Front Microbiol. 2019, 10, 2649. doi:10.3389/fmicb.2019.02649
27. Wyres KL, Wick RR, Gorrie C, et al. Identification of Klebsiella capsule synthesis loci from whole genome data. Microb Genom. 2016, 2(12), e000102. doi:10.1099/mgen.0.000102
28. Lam MMC, Wick RR, Judd LM, Holt KE, Wyres KL. Kaptive 2.0: updated capsule and lipopolysaccharide locus typing for the Klebsiella pneumoniae species complex. Microb Genom. 2022, 8(3), 000800. doi:10.1099/mgen.0.000800
29. Babenko B. Multiple Instance Learning: Algorithms and Applications. Dept. of Comp. Sci. & Eng., University of California, San Diego. 2008.
30. Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023, 379(6637), 1123-1130. doi:10.1126/science.ade2574
31. Pires DP, Oliveira H, Melo LD, Sillankorva S, Azeredo J. Bacteriophage-encoded depolymerases: their diversity and biotechnological applications. Appl Microbiol Biotechnol. 2016, 100(5), 2141-2151. doi:10.1007/s00253-015-7247-0
32. Oliveira H, Costa AR, Konstantinides N, et al. Ability of phages to infect Acinetobacter calcoaceticus-Acinetobacter baumannii complex species through acquisition of different pectate lyase depolymerase domains. Environ Microbiol. 2017, 19(12), 5060-5077. doi:10.1111/1462-2920.13970
33. de Jonge PA, Nobrega FL, Brouns SJJ, Dutilh BE. Molecular and Evolutionary Determinants of Bacteriophage Host Range. Trends Microbiol. 2019, 27(1), 51-63. doi:10.1016/j.tim.2018.08.006
34. McNair K, Zhou C, Dinsdale EA, Souza B, Edwards RA. PHANOTATE: a novel approach to gene identification in phage genomes. Bioinformatics. 2019, 35(22), 4537-4542. doi:10.1093/bioinformatics/btz265
35. Lowe TM, Chan PP. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Res. 2016, 44(W1), W54-W57. doi:10.1093/nar/gkw413
36. Sant DG, Woods LC, Barr JJ, McDonald MJ. Host diversity slows bacteriophage adaptation by selecting generalists over specialists. Nat Ecol Evol. 2021, 5, 350-359. doi:10.1038/s41559-020-01364-1

Supplementary Files

BoeckaertsPhageHostLearnS3pdf.pdf: Supplementary Tables
BoeckaertsPhageHostLearnsupplementsS12.docx: Supplementary Figures
