In this study, we utilized a subtractive genomics approach to prioritize drug targets against Y. pestis. We selected 28 strains of Y. pestis to identify potential drug targets. Various databases and computational tools, as depicted in Fig. 1, were employed to determine therapeutic targets.
The subtractive genomics approach is a robust method for identifying proteins that are unique to a pathogen and essential for its survival, yet absent in the host. This approach involves a comparative analysis of the proteomes and metabolic pathways of both the host and the pathogen. By subtracting proteins common to both, subtractive genomics reveals pathogen-specific proteins crucial for its survival, thereby ensuring these targets do not interfere with the host's metabolic processes. This methodological framework is instrumental in pinpointing potent drug targets for pathogens.
Retrieval of core proteomes of Y. pestis
The core proteomes of 28 strains of Y. pestis (Fig. 2) were retrieved from the PANX database [9]. PANX is a comprehensive platform for pan-genome analysis and exploration, allowing the extraction of core proteomes specific to pathogens. This streamlined process enables more efficient subsequent data analysis by eliminating irrelevant accessory protein sequences.
Removal of paralogous protein sequences
To ensure a non-redundant dataset, paralogous protein sequences were removed using the Cluster Database at High Identity with Tolerance (CD-HIT) at a 75% identity threshold. Proteins with sequence identities greater than 75% were identified as paralogous, and their complete sequences were excluded from the dataset [10]. Additionally, proteins with fewer than 100 amino acids were removed. This process yielded a refined list of non-paralogous protein sequences for further analysis.
Identification of non-homologous protein sequences
To avoid cross-reactivity with the human proteome and prevent potential therapeutic molecules from binding to host homologous proteins, a BLASTp analysis was conducted. The non-paralogous proteins of Y. pestis were compared against the human (Homo sapiens) RefSeq proteome in the Ensembl genome database [11]. Proteins were considered homologous to human proteins if any significant hits above a threshold value of 0.005 were found, with sequence identity less than 50%.
Selection of essential non-homologous proteins
The non-homologous protein sequences identified previously were further analyzed using the Database of Essential Genes (DEG, Version 11) [12]. The DEG database provides crucial information on genes and proteins essential for the survival and growth of pathogens, which is instrumental for identifying potential drug targets. Using a threshold E-value of 10− 4 and an alignment length cutoff of 1%, we screened for proteins in Y. pestis that are both essential and non-homologous. These essential proteins play significant roles in synthetic biology and offer potential as high-value therapeutic targets.
Druggability of essential proteins
To assess the druggability of the essential non-homologous proteins, we conducted a BLASTp analysis against proteins that are therapeutic targets and approved by the Food and Drug Administration (FDA). These target proteins were obtained from the DrugBank Database (Version 5) [13] and Therapeutic Target Database (TTD) [14]. This analysis aimed to identify essential proteins with drug-target-like characteristics, prioritizing those that are novel and unique therapeutic targets. An E-value cut-off of 10− 4 was used for this analysis.
Removal of proteins homologous to human ‘anti-targets’
To mitigate the risk of adverse side effects caused by drugs interacting with essential human proteins, we identified and removed proteins homologous to human "anti-targets." Anti-targets are human proteins that, if inhibited by pathogen-targeting drugs, could cause hazardous side effects. Using data compiled from literature [15], which identified a total of 451 high-confidence anti-targets, along with their accession numbers. The corresponding protein sequences were obtained from the NCBI Protein database.
To ensure that our identified drug targets do not share significant similarity with these human anti-targets, a BLASTp analysis was performed using the NCBI BLAST program. The parameters for this analysis were an E-value threshold of < 0.005, a query length > 30%, and an identity > 50%.
Conservancy analysis of druggable essential proteins
To determine the potential broad-spectrum applicability of the predicted drug targets, conservancy analysis was performed. The predicted drug target sequences were analyzed for homology with other common pathogenic strains using the BLASTp suite on the NCBI server. In this analysis, protein-protein BLAST was conducted against 222 pathogenic bacteria [16, 17]. Parameters for this analysis included an E-value of < 10− 4 and sequence identity > 20%. Proteins identified in more than 50 distinct pathogenic strains were classified as broad-spectrum targets.
Host pathogen interaction analysis
To refine our selection of drug targets, we analyzed the sequence similarity of these targets with microbial proteins known to interact with the human host. Sequence information for host-pathogen interacting proteins was obtained from online databases such as HPIDB (version 2.0) [18], PHIbase (version 4.2) [19], and PHISTO (version 2) [20]. Using the BLAST algorithm, we calculated sequence similarity with an E-value threshold of 10− 4 and an alignment length cutoff of 1%. This step ensured that the identified targets have potential interactions relevant to human pathogenic conditions.
Identification of gut microflora non-homologous proteins
The next selection criteria involved screening out proteins with high sequence similarity to human gut microbiota, to avoid potential side effects on gastrointestinal function. The gut microbiota plays a pivotal role in maintaining homeostasis and overall health. To compile a relevant database, sequence information was sourced from published literature [21, 22]. The BLAST algorithm in NCBI was used to compute protein sequence similarity. Parameters for this BLAST analysis included an E-value threshold of < 0.005, query length > 30%, and identity > 50%. This step ensured that potential drug targets would not adversely affect beneficial gut flora.
Protein resistance analysis
To identify potential drug targets that can effectively counteract antibiotic resistance, we included only proteins with significant antibiotic resistance for subsequent structure-based investigations. Targeting these specific proteins could potentially inhibit the mechanisms through which Y. pestis evades existing treatments. For this analysis, the proteins that had successfully passed through all previous screening steps were subjected to a BLASTp analysis against the Comprehensive Antibiotic Resistance Database (CARD) [23, 24]. The resultant sequences with high antibiotic resistance would be selected for further structure-based studies.
Prediction of subcellular localizations
Knowing the subcellular localization of a protein is crucial for understanding its function and potential as a drug target. Y. pestis being a Gram-negative bacterium, has five possible subcellular locations: cytoplasm, inner membrane, periplasm, outer membrane, and extracellular space. PSORTb version 3.0.2 was employed to predict the subcellular localizations of proteins that passed through the antibiotic resistance screening [25]. PSORTb works by conducting a BLAST search of non-homologous proteins against proteins with known subcellular localizations.
Drug target prioritization
To further refine the selection of screen ideal drug target candidates for Y. pestis, several crucial protein properties were considered: molecular weight, presence of transmembrane helices, stability, and involvement in biological processes. Protparam tool in the ExPASy server was used to predict the molecule weight and stability of the proteins [26]. Ideal drug targets were those with stable physicochemical properties and a molecular weight of less than 100 kDa, making them easier to handle experimentally. Then, transmembrane helix analysis was conducted using TMHMM-2.0 [27]. Proteins without transmembrane helices were preferred, as these proteins are generally easier to express and clone, facilitating further laboratory studies and drug development processes. The Interpro Database (version 90.0) was utilized to identify the potential function of protein targets, as well as the biological processes they are involved in [28]. This step ensured that the selected proteins play critical roles in the pathogen's lifecycle, thereby validating their relevance as drug targets.
Structure prediction and homology modelling
To advance our understanding of the shortlisted protein targets, we examined their structures in the Protein Data Bank (PDB). The BLASTp method was employed to identify a suitable template for protein structure modeling. When a three-dimensional structure was unavailable, we employed AlphaFold2 to model the protein structure [29, 30].
Validation of protein structure
The quality of the modeled structures was rigorously validated using several computational tools, including PROCHECK [31], PROSA [32], and ERRAT [33], to ensure their suitability for docking experiments. PROCHECK was used to evaluate the stereochemistry composition of a protein structure. It analyzed the residue-by-residue geometry as well as the overall structural geometries of the protein. PROSA was employed to evaluate the quality of the 3-dimensional structural model against the available structures supplied from PDB based on Z-score. ERRAT analyzed the statistics of protein model atom interactions, identifying potential errors in the protein structure.
Active site prediction
Upon confirmation of the protein structures, it was essential to identify the active sites where ligands could bind to exert their effects. For this purpose, the online tool POCASA was utilized. POCASA is an automated program that uses the Roll algorithm to predict binding sites by detecting pockets and cavities within proteins of known 3D structure [34].
Ligand extraction
Ligands were extracted from Traditional Chinese Medicine (TCM) Bank database and DrugBank database. The TCMBank (https://tcmbank.cn) is the largest comprehensive traditional Chinese medicine database, providing standardized information on traditional Chinese herbs, ingredients, diseases, and their corresponding gene targets. It contains data on 9,191 herbs, 61,965 ingredients and 32529 diseases currently (released 2024-6-10) [35]. We extracted 61,965 ingredients from the TCM Bank to ensure a broad spectrum of natural compounds for potential therapeutic applications. DrugBank is an open-source of archives of drugs, consisting of 3820 approved molecules, 11,212 experimental drugs, and 5,371 investigational drug molecules at present (version 5.1.10, released 2024-6-20) [36]. From DrugBank, we collected molecules that are approved, under experimental evaluation, and in clinical trials, expanding our pool of potential drug candidates. The combination of traditional and modern pharmacological compounds allowed us to explore a diverse range of bioactive molecules, enhancing the likelihood of identifying effective inhibitors against the selected drug targets in Y. pestis.
Molecular docking and ADMET analysis
Molecular docking simulations were performed using the GLIDE module of Schrödinger [37]. The identified ligands from previous steps were docked with the active sites of the curated protein targets. The protein structures were preprocessed using the Protein Preparation Wizard, which involved adding missing hydrogen atoms, assigning bond orders, and optimizing hydroxyl groups and hydrogen bonds. Energy minimization was performed using the OPLS4 force field, with non-hydrogen atoms minimized to an RMSD value of 0.3 Å. The receptor grid was generated based on the centroid of residues identified by POCASA. Ligand structures were prepared using LigPrep, optimizing their geometries with the OPLS4 force field, and generating structures within the default pH range using Epik.
Docking was executed using the Ligand Docking module, employing the standard precision (SP) method. The top 100 molecules were predominantly selected according to binding affinity and run three docking times to ensure consistency. From these, 10 molecules were shortlisted based on the average binding affinities. These molecules were then screened using Lipinski's rule of five with the ChemAxon tool [38].
To evaluate the pharmacokinetic properties of the selected compounds, ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) analyses were conducted using the HelixADMET and ADMETlab (version 2.0) web servers [39, 40]. The absorption of the molecules was predicted using human intestinal absorption (HIA), Caco-2 permeability, and human oral bioavailability (HOB). Distribution of drug molecules was assessed by considering blood-brain barrier (BBB) penetration, plasma protein binding (PPB), and volume distribution. Metabolism was evaluated using models for inhibition by the detoxification enzyme cytochrome P450, including CYP2C9, CYP2D6, CYP3A4, and CYP2C19. Excretion of drugs was predicted using half-life, while toxicity was assessed based on the AMES toxicity, hepatotoxicity, and carcinogenicity.
In conjunction with the outcomes derived from the previously described screening process, the most promising molecular candidates will be subsequently subjected to molecular dynamics (MD) simulations, with the aim of elucidating the structural stability, integrity, flexibility and compactness of the protein-ligand complex.
Molecular dynamics simulation
To assess the dynamic behavior of the docked complexes following ADMET analysis, they were subjected to Molecular Dynamics (MD) simulations using the Schrödinger Biosuite's Desmond software. The aim was to scrutinize the stability of these complexes, specifically the interactions between the target proteins and the most promising ligands identified through virtual screening, over an extended 100-nanosecond (ns) simulation period.
The initial stages of MD simulation involved configuring the solvent model, force field, and boundary conditions through the system builder. The OPLS4 force field was employed for these simulations, with the complexes centered in an orthorhombic box and a 10 Å distance imposed as boundary conditions. The solvent molecules were modeled using the simple point charge (SPC) model, and counter ions were added to neutralize the system's charge. A 0.15mol/L NaCl solution was included to simulate the concentration of physiological saline in the human body. Subsequently, the energy of the complex structures was minimized using the steepest descent algorithm, forming part of the equilibration protocol within the Desmond framework.
The MD simulations were conducted for a 100 ns trajectory within the NPT ensemble class. Temperature control was maintained at a constant 310 K using the Nose-Hoover chain method, and pressure was stabilized at 1 bar using the Martyna-Tobias-Klein method. Comprehensive analyses of the simulation events included assessing the Root Mean Square Deviation (RMSD) of protein and ligand bonds, Root Mean Square Fluctuation (RMSF) of the complex, Radius of Gyration (RoG), Solvent Accessible Surface Area (SASA), and intermolecular hydrogen bonds. These analyses provided valuable insights into the stability of the ligands within the binding sites of the protein complexes. Additionally, the solvent-accessible properties of the protein-ligand complexes were examined to understand their interaction with the surrounding environment.