We will describe the construction of the data set in Section 2.1 and describe the classification method in Sections 2.2 and 2.3.
2.1. Dataset
2.1.1. Extracting data
To the best of our knowledge, there are currently no publicly available standard data on human pathogenic microorganisms.We extracted metadata from the Pathosystems Resource Integration Center (PATRIC) and the Integrated Microbial Genomes (IMG) . Both databases provide researchers with a variety of metadata on microbial genome projects.
We downloaded the IMG and PATRIC database on 05/07/2019, IMG contained ~71 thousand sequenced bacterial genomes, PATRIC contained ~220 thousand sequenced bacterial genomes.
We further identified the human colonization and whole genome sequences (WGS) data by finding ‘Humans sapiens’ or ‘Homo sapiens’ in the ‘host name’ column and finding ‘WGS’ in the ‘Genome Status’ column, then merged and deduplicated both databases by NCBI Taxon ID column, and finally downloaded fasta format files of sequences according to taxon ids from NCBI website.
2.1.2. Labelling method
To infer the pathogen label, we created heuristic rules to label data based on metadata of the IMG and PATRIC databases, comparing with BacPaCS (Cosentino et al., 2013) and PaPrBaG (Deneke et al., 2017). The heuristic method was described as follows:
- We labelled genomes as opportunity human pathogens (OHP) if they satisfied any of the following criteria: One of the fields ‘Isolation Source’, ‘Host Health’, or ‘Comments’ includes an OHP term, as defined in Table 1.
- Excluding the generated OHP list, we labelled genomes as human pathogens (HP) if they satisfied any of the following criteria:
- The ‘Disease’ field is not empty and does not contain an OHP/Commensal term, as defined in Table 1.
- One of the fields ‘Isolation Source’, ‘Host Health’, or ‘Comments’ includes an HP term. In addition, the same fields cannot include any of the non-human pathogens (NHP) terms, as defined in Table 1.
- Excluding the generated OHP and HP list, we labelled genomes as NHP if they satisfied any of the following criteria: One of the fields ‘Isolation Source’, ‘Host Health’, or ‘Comments’ includes an NHP term.
The term groups were used for the criteria, as defined in Table 1:
We labelled 87,574 genomes as human pathogens (HP), 107 genomes as opportunity human pathogens (OHP), 670 genomes as non-human pathogens (NHP) and 954,763 as inconclusive.
2.1.3. Balancing dataset
Since most bacterial samples are collected from clinically sick individuals, most of the sequencing bacteria are HP bacteria. In our data set, the ratio of HP:NHP:OHP bacteria is 200:6:1, and the data is imbalanced.
However, machine learning algorithms can be affected by imbalanced data set, and when the sample size ratio varies greatly, it may seriously affect the learning process of the algorithm, resulting in poor classification results.
These methods are often used to imbalanced learning: data-level methods (over-sampling, under-sampling and hybrid-sampling.) and algorithm-level methods (cost-sensitive learning, ensemble methods etc). Since the labeled training samples are difficult to obtain, and the generated synthetic samples have no definite meaning and are difficult to explain, so the data-level methods are not suitable, we choose cost-sensitive learning method to improve the importance of the minority class samples, set the weights of HP/NHP/OHP class to 1/200,1/6,1 respectively.
2.2 Training the Model
The workflow of our method is shown in Figure 1.
2.2.1 Extracting features
The effective features are very important for building a high performance classifier.
Bacterial genomes are densely packed with proteins (Patthy,1999), and since the protein sequences are more conserved evolutionarily than the DNA sequences, wherein the peptide can thus provide valuable information reliably.
We extracted protein physicochemical properties features and outer membrane proteins features to capture the information implied in sequencing reads.
- physicochemical properties
Amino acids are the basic unit of protein. The physicochemical properties of some amino acids have been determined by extensive experiments and theoretical studies.
AAindex is a numerical index database that assigns a fraction of each property (usually based on the peptide's secondary structure) to each residue, representing various physicochemical and biochemical properties of amino acid and amino acid pairs (Kawashima et al.,2008). The latest version 9.2 contained 566 properties when we accessed the website on 05/07/2019 (AAindex, 2019). AAindex is widely used to analyze the physicochemical and biochemical properties of protein sequences.
We selected the 32 of 566 properties as features of the model according to Deneke et al., 2017, as follows in Table 2.
The values of the feature were obtained by multiplying the amino acid frequency and its associated index score.
- outer membrane proteins features
Membrane proteins are proteins that are part of the interaction with biological membranes (Johnson and Cornell,1999; Alenghat and Golan,2013), that are the targets of over 50% of all modern medicinal drugs (Overington et al.,2006). It is estimated that 20–30% of all genes in most genomes encode membrane proteins (Krogh et al.,2001; Liszewski, 2015). Bacterial outer membrane vesicle (OMV)-mediated delivery of proteins to host cells is an important mechanism for host-pathogen communication (Koeppen et al.,2016). For example, the highly conserved Gram-negative outer membrane protein murein lipoprotein (MLP) has been detected in the plasma of septic patients (Hellman et al, 2000; Suzuki et al, 1978) and is the primary antigen that induces systemic infection. Both IgG in mice and humans have a homeostasis (Melody Y. Zeng et al., 2016) .Bertoletti and Gehring have shown that the removal of 26-30 AA from preS1, an outer membrane protein on the surface of HBV, cannot infect hepatocytes, confirming its significance in HBV infection (Bertoletti and Gehring,2009).
Protein secondary structure is the three-dimensional form of local segments of proteins, it’s simpler than tertiary structure. Membrane proteins can be predicted based on the protein secondary structure. Many methods have been developed to help predict protein secondary structure definitions and biochemical functions of proteins from its amino acid sequence (Cheng et al.,2005; Morten et al.,2012; Magnan and Baldi,2014; Mills et al.,2015;Yang et al.,2015;Zheng et al.,2019).
Read sequence was first translated into a peptide sequence, then we used standalone TMHMM version 2.0 (Krogh et al.,2001) to predict the transmembrane helices in proteins, and then used the normalized frequencies of transmembrane groups and helices as features of the model.
2.2.2 Generating Model
The Gradient boosting decision trees (GBDT), also known as Gradient Boosting Machine (GBM ) is a boosting integrated learning model with the advantages of highly customizable flexibility, fast training and parallelization, easy to adjust and interpretable (Friedman,2001).
The GBDT methods have been widely adopted in academia, industry and competitive data science due to their most advanced performance in many machine learning tasks, being almost the gold standard for structured data sets.
Researchers have done some comparative study of the three most popular GBDT packages: XGBoost, LightGBM and Catboost.
XGBoost is an efficient, flexible and portable distributed gradient enhancement library. It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree enhancements to quickly and accurately solve many data science problems such as regression, classification, and ranking (Chen et al., 2016).
LightGBM is also a tree-based gradient boosting framework that uses a novel Gradient-based One-Side Sampling (GOSS) technique to filter out the data instances for finding a split value, while XGBoost uses the pre-sorting algorithms and histogram-based algorithms to calculate the optimal split (Ke et al.,2017).
Catboost is an open-source gradient boosting algorithm developed by Yandex team in 2017. It is a machine learning algorithm which allows users to handle categorical features for a large data set quickly and this differentiates it from XGBoost & LightGBM. Catboost can be used to solve regression, classification and ranking problems (Prokhorenkova et al.,2018).
The results show that Catboost outperformed the other two in terms of both speed and accuracy results for datasets with categorical features (Prokhorenkova et al.,2018; Anghel et al.,2018; Nguyen et al.,2018;Pulicherla et al.,2019). In addition, for datasets with a large number of features, XGBoost may not work due to memory limitations, and Catboost will converge into a good solution in the shortest time.
Therefore, this paper chooses to implement the IMLA model based on CatBoost, using the above features and pathogenic labels to train a classifier for each genome in the training data set.
2.2.3 Evaluation metrics
Different metrics are used to measure the accuracy of machine learning algorithms. Sensitivity and specificity are metrics which are well known for balanced datasets. Sensitivity (also known as recall or true positive rate TPR) is the proportion of well-classified positive samples, while specificity (also known as true negative rate TNR) is the proportion of well-classified negative samples.
However, the above both metrics would result in misleading scores due to the imbalance of the data. We choose the Precision-Recall curve (PR AUC) and Matthews correlation coefficient (MCC) metrics to deal with imbalanced datasets:
Precision-Recall curve is the area under the precision-recall curve, is a more reliable and informative indicator than the receiver operating characteristic curve, especially for imbalanced datasets (Hand and Till, 2001; Saito and Rehmsmeier,2015).
MCC was a method introduced by biochemist Brian W. Matthews in 1975 to measure the quality of binary classification, it was more informative than other confusion matrix measures (such as F1 score and accuracy) , it has also been generalized to the multiclass case (Gorodkin,2004; Chicco,2017).
2.3 Interpreting Model
Doshi et al. define the interpretability of a machine learning system as an ability to explain to humans or present in understandable terms (Doshi et al., 2017). Interpretability is critical for model debugging, detection bias, interpersonal collaboration, compliance, and mission-critical applications such as healthcare, where understanding, verification, editing, and trust models are important (Caruana et al., 2015). .
There are three methods for solving this problem: interpretable model, model-specific interpretation method and model-agnostic interpretation method (Molnar et al.,2019). Interpretable models often have a big disadvantage: predictive performance is reduced compared to other machine learning models. The downside of the model-specific interpretation method is that it also binds you to a model type and makes it difficult to switch to other models. Model-agnostic interpretation methods have great flexibility because they separate interpretation from machine learning models, and machine learning developers are free to use any machine learning model when applying interpretation methods to any model (Ribeiro et al., 2016).
We use the following model-agnostic interpretation methods to interpret model:
2.3.1 Features Importance
The displacement feature importance metric was introduced by Breiman for random forests (Breiman, 2001). Based on this idea, Fisher et al. proposed a model-agnostic version of feature importance and called it model dependency (Fisher et al., 2018).
Feature importance provides a highly compressed global insight into the behavior of the model, automatically considering all interactions with other features. When the value of the feature is modified, the prediction error of the model increases, which destroys the relationship between the feature and the true outcome.
2.3.2 Accumulated local effects
Accumulated local effects (D.W. Apley. 2016) describe how the feature affects the prediction of the machine learning model on average, even if the features are correlated, it still works. ALE plots are faster to compute and scale with O(n), and are a faster, more unbiased alternative to partial dependence plots (Friedman, 2001). The ALE plots are centered at zero, which makes the value of each point of the ALE curve the difference from the mean prediction, so the interpretability is very good.
The interpretation of the ALE diagram is clear. Given a value, the relative impact of the changed feature on the prediction can be derived from the ALE plots.
2.3.3 Shapley value
The Shapley value is a method in coalitional game theory that allocates payout to players based on their contribution to total payout, which is the average marginal contribution of the value of the feature in all possible coalitions (Shapley, 1953).
The Shapley value is the only interpretation based on solid theory (efficiency, symmetry, dummy and additivity axioms), and the difference between the predicted value and the average predicted value is fairly distributed between the feature values of the instance. Furthermore, the Shapley value allows for comparative interpretation without comparing the prediction to the average prediction of the entire data set, but rather comparing the Shapley value to a subset or even a single data point.