We evaluated which parameters give the highest accuracy of CNER with the NBC, using IA values as the criterion. Depending on the class, the highest IA values were achieved with context windows of one or two, i.e., one (two) tokens before and one (two) tokens after the target token (see Fig. 2). For the Systematic class, the highest IA values were obtained for a context window of one with multi-n-grams of n = 5 (IA = 0.988) and for a context window of two with multi-n-grams of n = 7 (IA = 0.992); the difference between these accuracy values is insignificant. For the Trivial class, the highest IA value was achieved with a context window of one and multi-n-grams of n = 7 (IA = 0.984). Figure 2 shows that for the Systematic class the IA values increase as the maximal context window grows to two tokens before and two tokens after the target token; this is not the case for the Trivial class, for which the highest IA values were obtained with a context window of one. The highest average IA values are achieved with a context window of one and multi-n-grams of n = 5 (IA = 0.986); almost the same IA values are obtained with a context window of two and n = 7 (IA = 0.988). Thus, the IA values do not grow substantially when the context window exceeds one token or the multi-n-grams exceed five symbols. These observations allow us to conclude that a context window of one token before and after the target token, combined with multi-n-grams of at most five symbols, achieves the highest accuracy of CNER based on the naïve Bayes approach at a reasonable computational cost.
The IA and balanced accuracy (BA) values of CNER for the various classes, obtained by leave-one-out cross-validation, are given in Table 2. The IA, sensitivity (recall), precision, specificity, and BA values for various thresholds of B-statistics are provided in the Supplementary Materials.
We surmise that the decrease in accuracy for n-grams of seven or more symbols may be associated with the high uniqueness of such long n-grams in the training set.
These results may also reflect a peculiarity of text fragment formation: the larger the number of context tokens assigned to the type, the harder it may be to recognize the features of the target token itself. At the same time, a minimal context of one token before and after the target token can capture whole words or parts of terms that point to a chemical named entity, such as “inhibitor”, “drug”, “chemical”, and “substance”.
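As an illustration, the multi-n-gram representation of a target token and its one-token context can be sketched as follows (a minimal sketch under our assumptions about tokenization; the function names are ours, not those of the actual PASS implementation):

```python
def multi_ngrams(token, max_n=5):
    """All character n-grams of the token for n = 1..max_n (multi-n-grams)."""
    token = token.upper()
    return [token[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(token) - n + 1)]

def token_features(tokens, idx, window=1, max_n=5):
    """Multi-n-gram descriptors of the target token plus `window` tokens
    before and after it (the context window)."""
    feats = []
    lo, hi = max(0, idx - window), min(len(tokens), idx + window + 1)
    for j in range(lo, hi):
        feats.extend(multi_ngrams(tokens[j], max_n))
    return feats
```

For a token of length L, the representation contains (L − n + 1) n-grams for each n; e.g., “cyclohexane” (11 letters) yields 11 + 10 + 9 + 8 + 7 = 45 multi-n-grams for n ≤ 5.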
Table 2
Accuracy of chemical named entity recognition using the PASS approach, based on the representation of texts using n-grams of up to five symbols and a context window of one token before and after the analyzed token.

| Type         | N*      | R**  | IA*** (LOO CV) |
|--------------|---------|------|----------------|
| Abbreviation | 12506   | 118  | 0.99           |
| Formula      | 13466   | 110  | 0.99           |
| Family       | 19017   | 78   | 0.97           |
| Systematic   | 32510   | 46   | 0.99           |
| Trivial      | 25140   | 59   | 0.98           |
| CNE          | 102639  | 14   | 0.98           |
| Non-CNE      | 1480509 | 1.01 | 0.98           |

* N is the number of fragments of texts used for training.
** R is the ratio of the number of all tokens to the number of tokens belonging to a certain type, indicating the degree of dataset imbalance.
*** IA, invariant accuracy (leave-one-out cross-validation).
We should emphasize that the number of tokens of the “Non-CNE” type is approximately 50 times higher than the number of tokens of the “Systematic” and “Trivial” types. The results in Table 2 show that this data imbalance does not affect the IA accuracy values. We obtained similar results earlier for the Bayes-based approach applied to the prediction of HIV resistance [19]. Based on the accuracy obtained for the merged “CNE” class and for the individual substance NE types, we propose that CNEs can be extracted from the texts of abstracts for further analysis using the prediction results.
Utilizing PASS estimates for chemical named entities, we investigated the relationship between recall (sensitivity), precision, specificity, and balanced accuracy for a context window of one and multi-n-grams with n from one to five for the CHEMDNER corpus.
Figure 3 shows the relationships between the values of accuracy metrics (precision, recall, specificity, BA) and B-statistics for the “Systematic” and “Trivial” types (the most represented classes in the training set), CNE, and non-CNE type. The relationship between the accuracy metrics and B-statistics for all other types is provided in the Supplementary Materials.
As shown in Fig. 3, the recall and BA curves have similar patterns of growth and decline, while the recall and precision curves behave differently: the precision curve increases as the recall curve decreases, and vice versa. This is expected, since the number of false positives decreases while the number of false negatives increases. The small number of positive samples (the Systematic and Trivial types in Fig. 3a, b) in the training set may explain the flatter shape of the precision curve. Comparison of the growth and decline patterns of the curves reveals that they are not specific to any particular chemical named entity type, which makes it possible to compare the accuracy metrics. For instance, for the Trivial and Systematic types the precision curve changes markedly with the threshold, whereas for the non-CNE type it does not. This allows us to conclude that precision is rather sensitive to data imbalance, while sensitivity (recall), specificity, and balanced accuracy are not. Another feature of precision is its sensitivity to the choice of threshold (see Fig. 3). When a method is designed to extract information on chemical named entities from texts, the specificity and sensitivity (recall) values are essential for validation because they help estimate the proportion of false positives. A method with high specificity and sensitivity (recall) makes it possible to extract the correct chemical named entity based on the estimated probabilities of belonging to the CNE and non-CNE classes as a consensus result.
Our CNER algorithm also allows us to evaluate each symbol in an FoT, using the set of n-grams that includes a particular position in the FoT. In Fig. 4, for the letter "o" in the token "cyclohexane", the set of n-grams {O, LO, OH, CLO, LOH, OHE, YCLO, CLOH, LOHE, OHEX, CYCLO, YCLOH, CLOHE, LOHEX, OHEXA} with n = 5 is used for the estimation. For the "SYSTEMATIC" class, the values Pc = 0.915 and Pnc = 0.002 are calculated for this letter. On this basis, our PASS software colours FoTs: a letter is coloured light green for Pc = 1 (Pnc = 0), light red for Pnc = 1 (Pc = 0), and blue when Pc and Pnc are both close to zero.
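The set of n-grams covering a given position in a token can be generated as follows (a sketch, assuming 0-based character indices and uppercase n-grams as in Fig. 4; the function name is our assumption):

```python
def ngrams_covering(token, pos, max_n=5):
    """All character n-grams (n = 1..max_n) of the token that include
    the character at index `pos`."""
    token = token.upper()
    grams = []
    for n in range(1, max_n + 1):
        # the n-gram starting at `start` covers positions start..start+n-1
        for start in range(max(0, pos - n + 1), min(pos, len(token) - n) + 1):
            grams.append(token[start:start + n])
    return grams
```

For the letter "o" (index 4) in "cyclohexane" with n = 5, this yields exactly the 15 n-grams {O, LO, OH, ..., OHEXA} used for the estimation.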
Chemical named entities can be extracted after tokenization of texts and making a prediction for each token based on the values of Pc and Pnc. Extraction of a chemical named entity can be performed by concatenating the tokens predicted to belong to a CNE class.
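The extraction step can be sketched as follows (a minimal sketch, assuming per-token (Pc, Pnc) estimates and a decision threshold on Pc − Pnc; the data representation is our assumption):

```python
def extract_cnes(tokens, scores, threshold=0.30):
    """Concatenate runs of consecutive tokens predicted to belong to the
    CNE class, i.e. tokens with Pc - Pnc above the threshold."""
    entities, current = [], []
    for token, (pc, pnc) in zip(tokens, scores):
        if pc - pnc > threshold:
            current.append(token)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:  # flush a run that ends at the last token
        entities.append(" ".join(current))
    return entities
```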
Validation of the PASS-based approach in the task of extracting chemical named entities
Extracting chemical named entities based on the CHEMDNER corpus
We checked the applicability of our approach to extracting chemical named entities and tested it in a case study using the CHEMDNER corpus.
To extract chemical named entities, we first had to determine the best extraction strategy for the naïve Bayes-based approach.
First, we evaluated the threshold for extracting chemical named entities based on the prediction results. We calculated the (Pc − Pnc) values that correspond to the highest accuracy of distinguishing tokens that belong to a CNE from those that do not. The threshold was obtained empirically from the maximum of the sensitivity (recall), precision, specificity, and balanced accuracy values normalized by their standard deviations (Fig. 5). A threshold value of T = 0.30 was chosen for CNER.
Then, we extracted named entities as concatenated sequences of tokens with (Pc − Pnc) above the threshold T. To improve the extraction procedure, we applied several filters aimed at excluding tokens that obtain high (Pc − Pnc) values only because they are overrepresented in the training set (for instance, numerical values, single brackets, etc.). In addition, named entities with incorrect encoding were removed from the set of extracted CNEs, regardless of the prediction results. The full set of filters is provided in the Supplementary Materials.
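Such filters can be illustrated by a sketch like the following (these particular rules are illustrative assumptions; the actual set of filters is given in the Supplementary Materials):

```python
import re

def passes_filters(entity):
    """Reject candidate entities that score highly only because they are
    overrepresented in the training set, or that are broken strings."""
    # purely numeric/punctuation strings (e.g. "3.14", "10%")
    if re.fullmatch(r"[\d.,:%\-]+", entity):
        return False
    # unpaired brackets (e.g. a lone "(")
    if entity.count("(") != entity.count(")") or entity.count("[") != entity.count("]"):
        return False
    # incorrect encoding (Unicode replacement character)
    if "\ufffd" in entity:
        return False
    return True
```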
The values of precision, sensitivity (recall), specificity, and balanced accuracy for CHEMDNER were evaluated using five-fold cross-validation. For the CHEMDNER dataset, the sensitivity (recall) was 0.95, the precision was 0.74, the specificity was 0.88, and the balanced accuracy was 0.92. These values represent the approximate performance of recognition for whole chemical named entities, not their parts.
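The reported metrics follow the standard definitions for a binary confusion matrix, which can be written as (a sketch; the counts in the usage example are illustrative, not the actual CHEMDNER results):

```python
def accuracy_metrics(tp, fp, tn, fn):
    """Sensitivity (recall), precision, specificity, and balanced accuracy
    from the entries of a binary confusion matrix."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    balanced_accuracy = (recall + specificity) / 2
    return recall, precision, specificity, balanced_accuracy
```

Balanced accuracy is the mean of sensitivity and specificity, which is why it is robust to class imbalance: recall 0.95 and specificity 0.88 give BA ≈ 0.92 regardless of how many negative samples the test set contains.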
Extracting named entities of potential anti-SARS-CoV-2 agents
To evaluate the applicability of our approach for solving practical tasks that may have a high clinical and biological impact, we investigated the possibility of extracting named entities of chemicals that can inhibit SARS-CoV-2 main protease (SARS-CoV-2 Mpro) and slow down COVID-19 progression. We chose to extract inhibitors of SARS-CoV-2/COVID-19 as a case study because of the availability of large collections of texts relevant to SARS-CoV-2 studies.
We extracted a total of 8071 named entities, corresponding to 2649 unique CNEs. The extracted CNEs are provided in the Supplementary Materials. We then calculated the precision and recall values for the extracted examples.
First, we performed automated queries of the PubChem [34] and ChEMBL [35] databases. Such queries allowed us to estimate the number of true-positive samples automatically; the results of the automated queries were then checked manually. In total, 4374 extracted named entities (1201 CNEs without duplicates) were correctly found in the databases. Manual examination of the retrieved CNEs allowed us to identify 1407 correctly extracted NEs (507 CNEs without duplicates). The number of false negatives was obtained from a manual analysis of the test set. Based on the results of manual and automated validation, we calculated the accuracy metrics for the SARS-CoV-2 Mpro set: the sensitivity (recall) is 0.94, the precision is 0.72, the specificity is 0.87, and the balanced accuracy is 0.92.
During a manual inspection of the entities recognized as CNEs according to the PASS prediction, we noticed that some entities were identified correctly but were not found in the PubChem and ChEMBL databases. Some of them (1%) were identified as CNEs by the PASS algorithm but were absent from the databases because of misprints (for instance, "hydroxybenzoagte" (correct name: hydroxybenzoate) and "dithiazone" (correct name: dithizone)). Another 1% of the found entities were codes of chemical compounds provided in the publication and therefore had context indicating that the entity is a CNE. Approximately 6% were recognized but not found in PubChem because they belong to chemical families: the PASS model was based on the merged CNE class, which includes chemical families, so such entities were recognized by PASS but, naturally, were not found in PubChem. Examples of such named entities include “ginsenosides”, “flavonoids”, and “triterpenoids”. Some chemical named entities that are natural compounds were also not found via automated queries of the PubChem database. The names of bioactive peptides and incomplete chemical named entities, as well as all other terms, were regarded as false positives.
Manual analysis of the true-positive chemical named entity mentions allowed us to identify several names of chemical compounds that were evaluated for inhibition of SARS-CoV-2 (for instance, hydroxychloroquine, chloroquine, quercetin, rutin, curcumin, darunavir, saquinavir, and flavonoids).
Although chloroquine and hydroxychloroquine are the most thoroughly investigated drugs and therefore appeared in the set of chemical named entities extracted from the texts collected by a query associated with SARS-CoV-2 Mpro, they were considered ineffective after a series of studies [36]. Quercetin was experimentally tested for activity against SARS-CoV-2 Mpro and demonstrated inhibitory activity [37]. Flavonoids represent a group of natural compounds (secondary plant metabolites) that are widely discussed in the scientific literature and are considered to have anti-inflammatory effects and the ability to modulate cytokines [38]. The inhibitory effect of some flavonoids (tangeretin, gardenin B) on SARS-CoV-2 has been demonstrated [38]. The anti-inflammatory activity of dihydromyricetin and its inhibitory activity against SARS-CoV-2 Mpro were evaluated in a FRET (fluorescence resonance energy transfer) assay [39]; the half-maximal inhibitory concentration of dihydromyricetin against SARS-CoV-2 Mpro reached 1.76 µM. Additionally, the authors of [40] confirmed the activity of dihydromyricetin on the proteins of the TGF-β1/Smad pathway, which are responsible for the development of pulmonary fibrosis.
These results demonstrate the applicability of the Bayes-based CNER approach to extracting CNEs from the texts of abstracts relevant to a particular task. It thus allows the scientific community to enrich its knowledge about chemical compounds that are potentially effective against particular targets and can be used for the treatment of specific diseases, including novel threats to humanity such as COVID-19.
The place of the naïve Bayes-based CNER among other methods
Texts of publications represent low-formalized data, and their classification may be difficult even for experts in the field. In contrast to approaches that rely on semantic or grammatical features of a token, our method takes the text data as input without any additional processing into parts of speech or other grammatical or semantic features.
Many artificial intelligence (AI) approaches aimed at chemical and biological named entity recognition have been developed [15, 18, 20]. Most approaches developed in recent years are based on neural networks with variants of the long short-term memory (LSTM) architecture or on conditional random fields (CRF) [16, 40]. Most AI-based approaches initially convert text into vectors or use a sparse word representation created by preprocessing a text corpus and then building a fixed-size vocabulary that contains only the features present in the corpus used for vector preparation (such approaches include word embeddings and one-hot encoding). Compared to methods based on vector or sparse word representations of texts, the proposed naïve Bayes-based CNER approach does not require a fixed-size vocabulary to produce a text representation. This feature makes our method versatile in its application to very different text styles and language peculiarities. The performance of CNER using the naïve Bayes-based approach obtained on the test set is comparable with earlier published methods, including machine learning and deep neural network approaches [16, 18, 21–23, 35]. It should be noted that we evaluated the accuracy of CNE extraction in addition to the accuracy of assigning a particular token to a specific class. Therefore, texts relevant to various queries can be processed efficiently using the developed naïve Bayes-based approach. Extraction of novel chemical entities may be rather helpful for the purposes of novel drug design, including both experimental studies and cheminformatics approaches such as virtual screening, which represents a group of powerful approaches for exploring large chemical space [4, 41].