Sequence data collection and processing
Phage genome sequence data, bacterial genome sequence data and their in vitro verified phage-bacteria interactions were collected from the Institute for Integrative Systems Biology (I2SysBio) in Valencia, Spain as described by Beamud et al. (2023), supplemented by an additional, unpublished collection of phage-host interaction data for phages isolated on Klebsiella spp. reference strains. In total, 105 phage genome sequences and 200 bacterial genome sequences were collected. Spot tests were performed to test for phage-bacteria interactions. Out of 10,006 spot tests performed in total, 274 are confirmed interactions (2.74%). Interactions are considered positive if a spot was visible using a 1:10 phage dilution, reflecting an initial interaction between phage RBPs and host receptors (but not necessarily a productive replication).
Phage genome sequences were processed in three consecutive steps (Fig. 1a): (1) PHANOTATE was used to identify genes in each of the phage genomes (McNair et al., 2019); (2) phage RBPs were detected among the translated protein sequences of the identified genes, following our method outlined in Boeckaerts et al. (2022) [25]; and (3) detected phage RBPs shorter than 200 amino acids and longer than 1500 amino acids were discarded, according to the range in length in which we expect RBPs [26]. In total, 9,727 genes were detected with PHANOTATE, and subsequently 274 phage RBPs were detected among those identified genes. We detected at least one RBP in each phage genome, and up to eight RBPs in a single phage genome (Supplementary material S1).
Bacterial genome sequences were processed with Kaptive [27, 28] to identify the capsule synthesis locus (K-locus) in each of the bacterial genomes (Fig. 1a). On average, each K-locus consisted of 19 proteins that constitute the K-antigen (the number of proteins was between 10 and 25). A total of 89 different K-types were identified using Kaptive (K13 was most often represented, while 45 different K-types were only represented once, Supplementary material S2).
Multi-instance feature representations
We transformed phage RBPs and bacterial loci proteins into combined numerical vector representations (so-called joint features), to serve as input to the machine learning model (Fig. 1a). These representations are so-called multi-instance representations [29], combining one or multiple RBPs per phage and multiple K-locus proteins per bacterium.
We used the pretrained ESM-2 protein language model (t33_650M_UR50D configuration) to transform each of the RBPs and loci proteins into a unique numerical vector [30]. The vectors corresponding to the RBPs of the same phage or the K-locus proteins of the same bacterium were averaged into multi-instance representations for each phage or bacterium. Finally, for each known interaction in the dataset, the multi-instance representations of each phage and each bacterium were concatenated into a final combined numerical vector that represents a known phage-host pair.
A classification model that predicts interactions
We trained a binary XGBoost classifier to output prediction scores reflecting how likely a phage-host pair will interact, based on the combined ESM-2 numerical vector representations described above (Fig. 1a). The maximum depth of each tree, the learning rate and the number of estimators were tuned using stratified five-fold cross-validation (Table 1). The optimal maximum depth was 7, the optimal learning rate was 0.3 and the optimal number of estimators was 250.
Table 1
Hyperparameters and their tested values in the PhageHostLearn model. The optimal values of the hyperparameters for the model are indicated in bold.
Hyperparameters
|
Tested values
|
Maximum depth
|
3, 5, 7, 9
|
Learning rate
|
0.2, 0.3, 0.4
|
Number of estimators
|
250, 500, 750
|
In silico evaluation of the model
We have evaluated our model both in silico and in the laboratory in the practical setting of finding which phages in the collection are the most active against a given bacterial strain. A predictive model is useful if it can effectively suggest the most appropriate phages to test, in that way minimizing manual analysis and laborious experimental work. We have simulated this representative setting in silico by iteratively holding out one bacterial genome with its phage interactions at a time from the training set. In each iteration, the held-out interactions were predicted by the model and their prediction scores were used to construct a ranking of the predicted phages. The hit ratio was computed across the top-k ranked phages by comparing the ranked predictions to the ground truth labels to quantify how well our model does in finding matching phages. This process was repeated for values of k ranging from 1 to the total number of phages, was repeated for each of the bacterial genomes in the dataset and finally averaged across all the iterations over the bacterial genomes (Fig. 2a). This mean hit ratio @ k provides a meaningful visualization of the average probability of finding at least one hit in the top-k candidates suggested by the model. For example, with our model we expect to find at least one hit in the top-10 in around 84% of the cases on average (dark blue curve). We have also simulated an informed microbiologist approach by manually selecting from a subset of phages that are known to infect the same K-type as the bacterial strain at hand (red curve). Our model slightly outperforms this approach in suggesting positive interactions near the top. Additionally, we visualize the receiver operating characteristic (ROC) curve (Fig. 2b) and measure its area-under-the-curve (AUC) as a general performance metric of our model. This ROC AUC can be interpreted as the probability that the model will score a randomly chosen interacting phage-host pair higher than a randomly chosen phage-host pair that does not interact. Our model reaches a ROC AUC of 83.0%. Expectedly, the mean hit ratio differs across different K-types, and there is a strong contrast between the top-10 mean hit ratio for the best and worst predicted K-types (Fig. 2c). Therefore, we constructed histograms of the number of confirmed interactions per bacterial strain belonging to the best and worst predicted K-types as well as the group in between. We observed that the performance across K-types can be related to the number of confirmed interactions in those K-types (Fig. 2d,e,f), highlighting the need for an extensive training dataset with sufficient confirmed interactions for each K-type for optimal performance.
In vitro validation of the model with spot tests
A total of 28 carbapenem-resistant K. pneumoniae clinical isolates were collected and sequenced in collaboration with the National Microbiology Center (CNM) in Madrid, Spain. These K. pneumoniae clinical isolates comprised high-risk clones that are currently circulating in Spain and included a total of eight different K-types (KL17, KL24, KL27, KL64, KL102, KL107, KL112 and KL151). Each of these K-types was also present (at least once) in the training data. For each of these K. pneumoniae clinical isolates, PhageHostLearn was used to predict interactions and construct a ranking for the unpublished collection (I2SysBio) of 59 phages isolated on Klebsiella spp. reference strains and for which the full genome was available. As these phages were isolated on Klebsiella spp. reference strains, they were not tested on all the K-types present in the test set of clinical isolates. Moreover, none of the phages was tested before on these specific clinical isolates. The top-five ranked phages for each K. pneumoniae clinical isolate were validated in the laboratory using spot tests at a 1:10 phage dilution in duplicate or triplicate (for discrepant results). Spot tests were used for consistency across model training and in vitro validation, and because of our focus on the initial interaction between phage RBPs and host receptors (not necessarily reflecting a productive replication). In an additional effort, all 17 unique phages that were identified across the different top-five lists, were tested against all the 28 clinical isolates to examine potential false negatives.
One or more interactions were confirmed with spot tests for 16 out of the 28 bacterial strains. PhageHostLearn was able to correctly predict hits in the top-five phage candidates for 15 of these isolates, corresponding to a top-five hit ratio of 93.8% (Fig. 3a). Comparing the different K-types, PhageHostLearn only missed 7 hits in total, for strains of KL17, KL24 and KL27 (Supplementary material S3). Overall, the PhageHostLearn system retains its in silico performance, reaching a ROC AUC of 79.3% in this in vitro validation, compared to 83.0% in silico (Fig. 3b).
Overall, the top candidates predicted by the XGBoost model are often phages that have a broader host range, such as K65PH164, K30λ2.2, K2064PH2 and K7PH164C4 (Supplementary material S3). These phages appear across the true positives, false positives, and false negatives (Table 2). Considering that these phages are not K-locus specific, this result was to be expected based on our focus on RBPs and K-locus proteins. Interestingly, the model does suggest a strategy that a microbiologist would think of as well: testing all the broad host range phages by default. The model also suggests some K-type specific phages correctly, such as K54λ1.1.1 and K17α62 and misses only few of these (i.e., most false negatives involve broad host range phages such as K30λ2.2, K2064PH2 and K7PH164C4). One K-type specific phage (K17α61) was wrongly predicted as false positive in combination with some bacterial strains but was a false negative prediction in combination with other bacterial strains. These wrong predictions are more challenging to explain from a biological perspective and could equally be explained because of a lack of sufficiently similar data from which the models can learn.
Table 2
Concordance of the predictions by our model with laboratory confirmations by means of a confusion table. Counts of true positives, false positives and false negatives with the most prevalent phages in each category, across all the top-five laboratory-confirmed interactions (140 in total), supplemented by the interactions tested in all 17 unique phages across the different top-five lists for counting the false negatives. The true positives were the predictions in the top-five recommendations that were confirmed in the in vitro validation. The false positives were the predictions in the top-five recommendations that could not be confirmed in the in vitro validation. Finally, the false negatives were the interactions that could be confirmed in the lab across the 17 tested phages that were not predicted in the top-five recommendations.
Case
|
Model prediction
|
In vitro result
|
Count
|
Prevalent phages
|
True positive
|
Top-five
|
Interaction
|
26
|
K65PH164, K30λ2.2, K2064PH2, K7PH164C4
|
False positive
|
Top-five
|
No interaction
|
114
|
K65PH164, K2064PH2, K29PH164C1,
K17α61
|
False negative
|
Outside of top-five
|
Interaction
|
7
|
K2064PH2, K7PH164C4, K17α61, K30λ2.2
|
True negative
|
Outside of top-five
|
No interaction
|
Not considered
|
-
|
Moreover, the model correctly suggested 78.8% (= 26 / [26 + 7]) of all the confirmed interactions in the top-five. However, these seven false negatives can be an underestimation, as we have not comprehensively tested all 59 phages against the 28 clinical isolates. In addition, the model also suggested 114 interactions that turned out to be negative. This is intrinsic to using a ranking approach because top suggestions are tested regardless of the prediction scores that are assigned by the model, thus could also comprise phages that do not adsorb to the host strain. Concretely, when all interactions for a given bacterium receive low scores, the five with the highest scores were still tested.