The flowchart shown in Fig. 1 illustrates the methods used in this research. The six major steps were as follows: (i) collection and selection of spike glycoproteins, (ii) clustering analysis of spike glycoproteins, (iii) identification of binding molecules from the BindingDB, (iv) identification of SMs from the KNApSAcK database, (v) docking of secondary metabolites to corresponding protein clusters, and (vi) bioavailability analysis of docked secondary metabolites.
i. Collection and Selection of Spike Glycoproteins
We searched the PDBJ database using the keyword ‘COVID-19’and found 1277 journals (Table S1 in Supplementary Material I), referring to 1202 FASTA files that were curated and divided into different protein sequences of SARS-CoV-2 (Table S2 in Supplementary Material I). The structural proteins of SARS-CoV-2 include a range of different proteins, the most important of which are the membrane (M) glycoprotein, the envelope (E) protein, the nucleocapsid (N) protein, and the spike (S) glycoprotein [22], [23], [24]. Spike glycoproteins on the surface of a virus are responsible for forming peplomers that facilitate the entry of the virus into host cells for their intended purposes [12], [25]. As the objective of this study was to discover natural compounds that could obstruct the entry of the SARS-CoV-2 S protein into host cells, only 204 S glycoproteins from 1202 FASTA files were considered (Table S3 in Supplementary Material I).
ii. Clustering Analysis of Spike Glycoproteins
This study employed a basic clustering algorithm from DPClusSBO [26], which is a unified software that implements the DPClusO [27], [28], [29], and BiClusO [30] algorithms. This tool provides a graphical user interface for performing simple and bipartite graph clustering, as well as filtering and merging clusters, hierarchical node analysis, and visualization of cluster sets. Using these clustering techniques, we identified potential subclasses of spike proteins that might exhibit common characteristics of varying exposure to different natural compounds. The results of cluster analysis could serve as a guide for downstream molecular docking studies and aid in the identification of potential natural compounds for targeted therapy.
iii. Binding Molecule Identification from the BindingDB
We retrieved protein sequences similar to spike glycoproteins and associated binding molecules from BindingDB, a publicly available database that stores experimental data on protein–binding molecule interactions [31]. We searched for binding molecules in the BindingDB database using 204 spike protein sequences. This search incorporates the BLASTp algorithm to identify matching protein sequences and corresponding binding molecules. Similarity between protein sequences is measured utilizing sequence identity (SI) value that quantifies the exact matches between two different sequences [32]. The sequence identity (SI) between two proteins A and B was calculated as follows:
$$\:SI=\frac{L\left(i\right)}{\text{m}\text{i}\text{n}({length}_{{S}_{A}},{length}_{{S}_{B}})}$$
where, \(\:L\left(i\right)\) denotes the number of aligned residues with identical or equivalent properties, \(\:{S}_{A}\) and \(\:{S}_{B}\) sequences of proteins A and B. While searching the BindingDB database, we utilized 0.3 as the minimum threshold SI value. Each identified binding molecule was assigned an identifier in the form of a DrugID.
iv. Identification of Secondary Metabolites from the KNApSAcK Database
We utilized SMs from the KNApSAcK database [33], [34] owned by our laboratory, which contains a broad collection of secondary metabolite data from plants. The database is used for plant metabolomics research, and each metabolite in the database is assigned a unique Metabolite ID (MetaID) associated with the corresponding data, such as its molecular structure, physicochemical properties, and species. The KNApSAcK database contains more than 53,000 SMs derived from diverse sources, such as plants, bacteria, protozoa, fungi, and animalia.
We determined SMs similar to binding molecules using a fingerprint-based molecular similarity calculation measure known as the Tanimoto coefficient (TC). The Tanimoto coefficient is calculated as follows:
$$\:TC\left(A,B\right)=\frac{(A\cap\:B)}{(A\cup\:B)}$$
where, \(\:A\) denotes the set of fingerprints of the BMs and \(\:B\) is the set of fingerprints of the KNApSAcK SMs. The TC value ranges from 0 (indicating no similarity) to 1 (indicating similarity). We narrowed the selection of SMs by eliminating low-TC SMs, focusing on those with a TC value of at least 0.85.
v. Docking the Secondary Metabolites to Corresponding Protein Cluster
Prior to the docking process, the previously identified SMs were mapped to the corresponding cluster according to the result of similarity matching between binding molecule (DrugID) and metabolite (MetaID). By docking analysis, we can verify whether the identified SMs are likely to bind with the spike proteins. Therefore, to identify potential NPs that could bind strongly to SARS-CoV-2 spike proteins, selected SMs were subjected to docking analysis. Docking analysis provides valuable insights into the interactions between secondary metabolites and their targets [35]. Specifically, it provides an indication of the binding affinity between molecules, which is crucial for assessing their therapeutic potential [36]. The protein structures of each cluster were prepared by removing water molecules, ions, and small ligands and by adding hydrogen using the LePro module of the LeDock program (www.lephar.com). After the protein was sanitized, the predicted binding pocket was determined using Fpocket [37]. The SMs found in each spike protein cluster were also prepared by converting the MOL files to MOL2 files. All prepared protein structures and SMs were docked using SMINA [38], [39].
vi. Bioavailability Analysis of the Docked Secondary Metabolites
Previously analyzed SMs were assessed for their drug-like properties and physicochemical characteristics using SwissADME (www.swissadme.ch). The SMILES strings of SMs docked to each protein cluster were subjected to SwissADME analysis. Lipinski’s rules of five was used to evaluate the drug–likeness of the secondary metabolites [40]. Moreover, the bioavailability radar was also used to represent the drug-likeness properties of compounds in a rapid appraisal format, offering a quick visualization of a molecule's drug-likeness. The pink area in the bioavailability radar represents the optimal range for each property, such as lipophilicity (XLOGP3 between − 0.7 and + 5.0), size (MW between 150 and 500 g/mol), polarity (TPSA between 20 and 130 A2), solubility (log S not higher than 6), saturation (fraction of carbons in sp3 hybridization not less than 0.25), and flexibility (no more than nine rotatable bonds) [41]. The effectiveness of a substance as a drug or therapeutic agent was assessed using the bioavailability score (BAS), a numerical measure that reflects its potential for easy absorption into biological systems [42], [43]. The BAS is determined by a semi-quantitative rule-based score that considers factors such as total charge, TPSA, and violation with the Lipinski rule, which divides compounds into four classes with probabilities of 11%, 17%, 56%, or 85%. A compound should have an oral bioavailability probability of at least 10% [42]. While Lipinski's rule can be utilized to choose secondary metabolites, we also employ bioavailability radar analysis to conduct an additional screening process. This two-step screening approach maximizes the likelihood of identifying viable drug candidates. This approach guarantees that only secondary metabolites with favorable profiles across all important factors are chosen for further development.
vii. Final Drug Candidate Selection Criteria
Secondary metabolites that pass all the above analyses will be assessed using selection criteria that consider: (a) those that do not violate Lipinski's rule, (b) those with physicochemical properties within the pink area on the radar, and (c) those with high-probability BAS values. The secondary metabolites that meet these criteria will be sorted by their binding affinity values, with the top five selected for further consideration.