Workflow construction and molecular data base constitutions
This integrated screening workflow is divided into four parts (Fig. 1). First, a molecular library was constructed, and the affinity ranking was obtained through molecular docking. Then, cluster analysis was performed to reveal the molecular characteristics of high-affinity clusters and extract the top clusters. Furthermore, combined with high-throughput methods applied to determine the binding affinity and predict the level of activity, the molecules with both excellent properties were finally verified experimentally to determine effective compounds.
In the foremost step, we retrieved 49 Chinese medicinal materials involved in the 6 compound herbal formula in TMTP, and ultimately acquired a total of 3272 NPs to constructed the TMTP molecular library eventually. This library include the Chinese herbal compound prescriptions and the representative Chinese medicines from TMTP as well as the main chemical compositions. On the other hand, we collected 301 of SARS-CoV and 84 of SARS-CoV-2 3CLpro inhibitors, which was performed as comprehensively as possible. The former was used to build DL models, and the latter were treated as a test set. A complete list of the molecules and related information for 3272 TMTP compounds library, compound libraries and SARS-CoV and SARS-CoV-2 3CLpro inhibitors is given in the Supplementary Data 1.
20 clusters divided from the TMTP compound library by cluster analysis
To classify the structural similarity of high-affinity molecules to further narrow the range of lead compounds, a total of 8 combinations between similarities of fingerprint maps and different cluster agglomeration methods were individually used for cluster analysis. Based on the agglomerative coefficient from agnes, we found that the combination of Euclidean and Ward2 exhibited the highest value (agglomerative coefficient = 0.975) compared with that of the other groups, and the agglomerative coefficients of the 8 groups are listed in Supplementary Table 1. Thus, we adopted the Euclidean and Ward2 combination to plot a clustering that contained 20 clusters (k = 20) (Supplementary Fig. 1, Supplementary Table 2).
Dominant clusters determined by means of molecular docking and ECR ranking
Our molecular docking approach was used to obtain the binding ability of the TMTP molecular library with SARS-CoV-2 3CLpro, as well as the affinity score between positive inhibitors with SARS-CoV or SARS-CoV-2 3CLpro for DL modeling. Docking analysis was carried out independently using the programs Autodock Vina, Glide, and MOE. Then, the exponential consensus ranking (ECR) strategy was implemented to reduce the number of false positives. This approach transformed docking scores of a single compound into a decimal number to indicate the comprehensive binding level for the target-ligand complex. Subsequent analyses were performed using ECR values instead of docking scores (Supplementary Data 1).
To compare the binding capacity to 3CLpro among clusters, we calculated and ranked the median, mean and quantile value, etc. of the ECR in each cluster. Then, the dominant clusters were defined as those with a mean ECR value greater than 0.6 (Supplementary Table 3), and 9 dominant clusters were ultimately acquired. Among the 9 dominant clusters, the average ECR value of the royalblue cluster and brown cluster was greater than 0.7, indicating that these two clusters have higher target affinity, while the large number of compounds in the brown cluster suggested that it may have very promising 3CL inhibitors.
Combining binding affinity of SPR and inhibitory activity prediction by DL analysis to narrow the range of hit compounds
In current study, surface plasmon resonance (SPR) was used to rapidly identify molecules in the dominant clusters that have the ability to bind to SARS-CoV-2 3CLpro. As a result, 21 molecules demonstrated high affinity for 3CLpro (Supplementary Table 4). DL was applied in parallel with SPR analysis to predict the 3CLpro inhibitory efficiency of compounds in the dominant clusters and further eliminate the molecules that would be nonspecifically bound in the SPR analysis. As previously described, the collected information on 3CLpro inhibitors (301 SARS-CoV and 84 SARS-CoV-2 inhibitors), including IC50, pIC50, SMILES, and CID, is shown in Supplementary Data 2. After the molecular docking process, we acquired the docking scores and ECR ranking between the SARS-CoV 3CLpro inhibitors and 3CLpro of SARS-CoV and SARS-CoV-2 using three different software programs. Overall, no significant difference between docking scores or ECR of SARS-CoV and SARS-CoV-2 was observed, which was ascribed to the high homology of the SARS-CoV and SARS-CoV-2 3CL proteins [30]. We calculated the similarity index (0.710) of two proteins binding or activity based on the docking matrix. Then, the predicted IC50 of SARS-CoV-2 3CLpro was computed based on the Eq. (1). By means of Rcpi, the molecular descriptors of 301 compounds were extracted as the quantitative structure (Supplementary Data 2). Thereafter, we constructed the quantitative relationship between structure and activity by random forest (RF) and support vector machine (SVM) training classification models. The activity of the 84 compounds for SARS-CoV-2 3CLpro was tested using the training model. We found that the area under curve (AUC) of receiver operating characteristic curve (ROC) in RF was higher than that of SVM (RF: 0.629, SVM: 0.493). Suggesting that the predicted inhibitory value calculated by the RF method was closer to the experimental value than that calculated by the SVM method. Finally, we predicted the activity of 9 dominant clusters; here, a predicted value greater than 0.5 was considered to have an inhibitory effect, and vice versa. A total of 156 compounds were predicted to be active based on statistical analysis. A complete list of the predicted values can be found in Supplementary Data 3.
The molecules that showed affinity with 3CLpro in the SPR analysis and were predicted to be effective via DL were combined and further comprehensively analyzed. Finally, 11 compounds considered promising inhibitor candidates were obtained from the resulting intersection. Interestingly, these high-activity compounds were enriched in the brown, midnightblue and red clusters (Supplementary Table 5). In the above clusters, the brown cluster mainly contains flavonoids and their glycosides. The midnightblue cluster is composed of dammarane and oleanane or their derivative parent nucleus and corresponding glycosides. The compounds in the red cluster are composed of polyhydroxy conjugated systems such as hydroxytyrosol and caffeic acid to connected with sugar units. These types of compounds often exhibit a wide range of biological activities and have also been used in the field of anti-virus [31–32].
5 NPs identifying as potent inhibitors of SARS-CoV-2 3CLpro in vitro
Eleven compounds selected by the virtual screening and DL analysis were subsequently tested using the inhibition assay against SARS-Cov-2 3CLpro. After the initial screening, only five compounds at a concentration of 100 µM demonstrate over 50% inhibitory active against the enzyme. These compounds were able to achieve inhibition at lower concentrations. According to results shown in Fig. 2, narcissoside (MOL003686), kaempferol-3-O-gentiobioside (MOL012143), rutin (MOL000415), vicenin-2 (MOL001543) and isoschaftoside (MOL004958) presented IC50 values of 38.142, 35.892, 31.259, 38.856 and 30.220 µM, respectively. Remarkably, they are all flavonoids. The results of affinity screening by SPR showed that flavonoids accounted for 10 of the 21 compounds and their KD values ranged from 1.525 to 12.46, exhibited strong affinity with 3CLpro (Fig. 3). As demonstrated in Table 1, the KD values are well correlated with IC50 values. Notably, kaempferol-3-o-gentiobioside, vicenin-2 and isoschaftoside were reported as anticoronavirus candidates for the first time due to their inhibition of 3CLpro of SARS-CoV-2. They are very promising for further research to develop compounds with high inhibition efficiency.
Table 1
Summary of ECR scores, equilibrium dissociation constants (KD) and IC50 values for SARS-CoV-2 3CLpro inhibitors
Compound | Molecular Weight (Da) | ECR | KD (µM) | IC50 (µM) |
Narcissoside | 624.544 | 0.744 | 9.995 | 38.142 |
Kaempferol-3-O-gentiobioside | 610.518 | 0.861 | 2.694 | 35.892 |
Rutin | 610.518 | 0.801 | 1.525 | 31.259 |
Vicenin-2 | 594.518 | 0.771 | 8.583 | 38.856 |
Isoschaftoside | 564.492 | 0.765 | 11.370 | 30.220 |
Molecular dynamics simulation revealed the stable binding mode of the 5 selected drugs with SARS-CoV-2 3CLpro
The dynamic binding interactions of the five compounds with inhibitory activity were analyzed, and 100 ns molecular dynamics (MD) simulations of ligand-protein complexes were performed. The root mean square deviation (RMSD) of the ligand trajectory was analyzed, revealing that the compound rapidly reached equilibrium within the first 5 ns of the simulation (Fig. 4a), with each value lying between 1.5 and 3.5 Å. Narcissoside and vicenin-2 fluctuated greatly, indicating a flexible bingding to the active site of 3CLpro. In contrast, compounds kaempferol-3-O-gentiobioside, rutin, and isoschaftoside are more fixed.
To explore the binding affinity of each ligand to 3CLpro, the binding free energy was calculated based on MM/PBSA (Table 2). Van der Waals (ΔEvdW) and electrostatic (ΔEele) interactions make major contributions to the binding free energy (Fig. 4e). We observed that rutin exhibited the highest binding affinity to 3CLpro, followed by kaempferol-3-O-gentiobioside, isoschaftoside, vicenin-2 and narcissoside. Analysis of the energy decomposition results of the five compounds suggested that the residues Thr25, Thr26, Ley27, His41, Cys44, Tgr45, Ser46, Met49, Asn142, Gly143, Cys145, His163, His164, Met165, Asp187 and Gln189 mainly contributed to hydrophobic and electrostatic interactions in the 3CLpro-ligand complex (Supplementary Table 6).
Table 2
The results of molecular MM/PBSA free energy calculation (kcal/mol) and relevant ECR scores
Compound | ΔEvdw | ΔEele | ΔEPB | ΔESA | ΔGTot | ECR |
Narcissoside | -58.32 | -21.59 | 73.90 | -6.30 | -16.31 | 0.744 |
Kaempferol-3-O-gentiobioside | -47.22 | -58.54 | 90.81 | -5.89 | -20.84 | 0.861 |
Rutin | -45.20 | -43.78 | 78.27 | -5.59 | -22.31 | 0.801 |
Vicenin-2 | -42.68 | -53.40 | 80.17 | -5.80 | -18.72 | 0.771 |
Isoschaftoside | -50.61 | -66.58 | 95.47 | -5.95 | -19.67 | 0.765 |
Specifically, from the analysis of the binding interactions, narcissoside showed the least electrostatic interaction (-21.59 kcal/mol), forming hydrogen bonds with Ser46, Gly143, His164, and Glu166. Kaempferol-3-O-gentiobioside forms multiple hydrogen bonds with Phe140, Leu141, Asn142, Gly143, Ser144, Cys145, Glu166, Pro168, and Arg188. Rutin forms hydrogen bond interactions with Thr26, Tyr54, Phe140, Asn142, Gly143, Glu166, and Gln189. Vicenin-2 demonstrated the highest number of H-bonds, forming hydrogen bonds with Thr26, Phe140, Asn142, Gly143 and Glu166. In the analysis of binding energy with isoschaftoside, the contribution of electrostatic interactions to the total binding energy was − 66.58 kJ/mol, which was highest among the 5 compounds, forming H-bond interactions with Thr26, Tyr54, Phe140, Asn142, Gly143 and Glu166, Gln189. The above analysis suggested that flavonoid glycosides provided higher flexibility after forming chains with sugars because of their rotatable bonds, which can bind into pockets and form abundant hydrogen bonds with some key residues. From the perspective of amino acid energy decomposition (Fig. 4d), the compound has a strong interaction with His41, Met49, and Cys145. His41 and Met49 are also the active site residues of 3CLpro [33]. To facilitate the analysis, we first colored the region of the residues His41 and Met49, and then divided the five flavonoids into two categories according to their structural similarity. The active cavity of 3CLpro presented strong hydrophobicity, while the aromatic ring of the flavonoid aglycone provided the main hydrophobic energy contribution in the site. For type A (Fig. 4b), narcissoside, as the only inhibitor with methoxy group. The group has the function of enhancing hydrophobic action of ligand (-58.32 kcal/mol), making the benzene ring easily inserting into the hydrophobic region of the cavity (Fig. 5a, f), resulting in the overall structure extending outside of the cavity and reducing the interaction with residues, eventually reducing the contribution of the binding free energy. In contrast, the flavonoid skeleton of kaempferol-3-O-gentiobioside is close to the cavity (Fig. 5b, g). Furthermore, rutin is inserted into the cavity (Fig. 5c, h), which makes the binding tighter and presents the lowest binding free energy (Table 2). For type B (Fig. 4c), the overall structure shifted in the active pocket due to prolongation of the rigid flavonoid part in vicenin-2 (Fig. 5d, i) and isoschaftoside (Fig. 5e, j), resulting in the distance from the active site being farther than that for kaempferol-3-O-gentiobioside and rutin. However, they did not demonstrate much difference in their total binding free energies.
Notably, from the analysis of the binding interaction, with the key residues, we observed that the interaction strength between His41 and Met49 with the ligand was positively correlated with the affinity of the ligand and 3CLpro binding. In addition, the total energy (Supplementary Table 7) of residues based on the region of His41 to Met49 also exhibited this rule. On the other hand, ΔGTot calculated by MM/PBSA also matched the ECR rank of the molecule (Table 2). The clear binding pattern and significant inhibitory activity of these five flavonoids against 3CLpro indicated that they are promising candidates for anti-SARS-CoV-2 activity. These results prove the correctness of our screening strategy.