In this work, we design a workflow for mining plague-related literature, identifying plague-related entities, and extracting the relations between them. We split the retrieved abstracts into 26,695 independent sentences, annotate them with 33 entity types (e.g., gene, protein, drug, treatment, species, and geographical location), and construct a plague entity dictionary. We then apply several methods to extract and curate the relations between entities, and finally store the resulting plague knowledge graph in Neo4j.
The workflow is shown in Figure 1.
Materials and tools
We construct the plague knowledge graph with an approach that combines automatic extraction and manual curation. First, we use PubTator to recognize the entities in the abstracts, and then we extract association relations (subject-relation-object triples) with OpenIE. This section briefly introduces the datasets and tools used in this study.
PubMed:
PubMed is a free literature search engine. Although access to full text is restricted, abstracts generally state the purpose, methods, results and conclusions of a paper and therefore capture its main ideas. We retrieve the relevant abstracts automatically and preprocess the text with natural language processing techniques. We also remove irrelevant fields such as author details and publisher information, retaining only the PubMed Unique Identifier (PMID), title and abstract for subsequent knowledge extraction.
PubTator:
PubTator Central (PTC) [24] is a Web-based system that can be used interactively through a Web browser, and its annotations can be downloaded in bulk via File Transfer Protocol (FTP). We select PubTator as the entity recognition tool for three reasons: (i) it requires only the PMID of an article to retrieve the title and abstract; (ii) it integrates multiple named entity recognition tools covering five entity types, namely GeneTUKit for genes [25], tmVar for variants [26], DNorm for diseases [27], a dictionary-based method for chemicals, and SR4GN for species [28]; (iii) the identified genes and variants are labeled with gene IDs and variant SNP numbers from the National Center for Biotechnology Information (NCBI) Entrez Gene database, so that detailed information on genes and variants can be easily retrieved.
OpenIE:
We use OpenIE for semi-automatic relation extraction and filtering. OpenIE is a relation extraction tool that extracts structured triples from plain text without requiring the relation types to be specified in advance [29], and it does not require a large annotated training corpus.
Neo4j:
Neo4j is a native graph database engine. It stores and processes data in a native graph structure with index-free adjacency, in which each node directly references its neighbors, and it provides corresponding graph traversal algorithms, which speeds up retrieval. Because a traversal follows these direct references, query cost depends mainly on the part of the graph that is visited rather than on the total amount of stored data, so query performance remains high as the data grow. Neo4j is also scalable and flexible. As an open-source database, its Community Edition has been widely used and promoted by third parties [30].
Data source
We use the PubMed database as the data source for the plague knowledge graph. We search PubMed with the keyword "pestis" and collect the PMIDs of all relevant articles. We then automatically retrieve the 5,388 corresponding abstracts by PMID and store them in plain-text (.txt) format. To enrich the knowledge base, we also obtain all aliases of each entity from the NCBI databases, which facilitates later retrieval from the knowledge graph.
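The retrieval step can be sketched with the NCBI E-utilities, which expose PubMed search (esearch) and record download (efetch) over HTTP. The sketch below only builds the request URLs and sends nothing; the function names and the batch size are illustrative assumptions, not the script used in this study.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=10000):
    """URL that returns the PMIDs matching `term` in PubMed as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_urls(pmids, batch_size=200):
    """URLs that download plain-text abstracts for `pmids`, one URL per batch."""
    urls = []
    for i in range(0, len(pmids), batch_size):
        batch = ",".join(str(p) for p in pmids[i:i + batch_size])
        params = {"db": "pubmed", "id": batch,
                  "rettype": "abstract", "retmode": "text"}
        urls.append(f"{EUTILS}/efetch.fcgi?{urlencode(params)}")
    return urls


# Search URL for the keyword used in this study.
print(esearch_url("pestis"))
```

Fetching in batches keeps each request small; an actual script would also need to respect the request-rate limits that NCBI imposes.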
Data Pre-processing
Since the records exported from PubTator contain information that is unrelated to this research (e.g., publication date, author information, and DOI), a script first removes these fields to facilitate the subsequent named entity recognition and relation extraction. Due to copyright restrictions, some records contain only the article title; these records are also deleted.
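A minimal sketch of this filtering step follows, assuming the export uses the PubTator text layout in which title and abstract lines read `PMID|t|text` and `PMID|a|text`. The function names are hypothetical and the actual pre-processing script may differ.

```python
def parse_pubtator_text(raw):
    """Collect {pmid: {"title": ..., "abstract": ...}} from PubTator-format text.

    Lines that are not title or abstract lines are ignored.
    """
    docs = {}
    for line in raw.splitlines():
        parts = line.split("|", 2)
        # Title and abstract lines look like "PMID|t|text" and "PMID|a|text".
        if len(parts) == 3 and parts[0].isdigit() and parts[1] in ("t", "a"):
            field = "title" if parts[1] == "t" else "abstract"
            doc = docs.setdefault(parts[0], {"title": "", "abstract": ""})
            doc[field] = parts[2].strip()
    return docs


def drop_title_only(docs):
    """Remove records that have a title but no abstract text."""
    return {pmid: doc for pmid, doc in docs.items() if doc["abstract"]}


# Made-up input: the second record has an empty abstract and is dropped.
raw = "1|t|First title\n1|a|First abstract.\n\n2|t|Second title\n2|a|\n"
kept = drop_title_only(parse_pubtator_text(raw))
print(sorted(kept))
```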
Named Entity Recognition
We recognize 33 entity types by combining PubTator, dictionary- and rule-based methods, and manual annotation. PubTator recognizes only 6 entity types: Gene, Disease, Chemical, Mutation, Species, and CellLine. To expand the coverage of biomedical concepts and make the plague knowledge graph more detailed and comprehensive, we add another 27 entity types. Nine of them (Amino Acid, Peptide, Protein, Lipid, Enzyme, Nucleotide, Nucleic Acid, Toxin and Vaccine) refine the Chemical type provided by PubTator, while the remaining 18 (Viruses, Phenomenon, Technique, Anatomy, Diagnosis, Geographic Location, Symptom, Environment, Social Sciences, Etiology, Assay, Genome, Equipment, Therapeutics, Time, Person, SNV, and Food) are constructed manually according to the text content and actual needs.
A plague entity dictionary is constructed from the obtained entities and their types. Because the same entity can appear under several different entry terms, each entity is uniquely identified by its NCBI_ID. Medical Subject Headings (MeSH) is the database we consult most frequently when querying entities. It is an authoritative hierarchical subject term list compiled by the National Library of Medicine for indexing articles in PubMed. When assigning entity categories, we select the most appropriate subject term as the entity type based on the hierarchy of the MeSH thesaurus, preferring subject terms that are frequently cited and that match the actual text content. The entity is then assigned its unique MeSH number. Since PubTator makes some recognition errors, such as classifying "California" as a disease, we manually annotate the entities it fails to identify and correct the wrongly annotated entities and their NCBI_ID.
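The dictionary logic can be sketched as follows: entry terms are mapped to a single identifier, and manual corrections override automatic annotations. This is a simplified illustration with hypothetical function names; the identifier `PLACEHOLDER_ID` is a stand-in, not a real MeSH number.

```python
def build_dictionary(annotations, corrections=None):
    """Map each lower-cased entry term to one (entity_type, identifier) pair.

    `annotations` is an iterable of (term, entity_type, identifier) tuples.
    `corrections` maps a term to the manually curated pair and takes priority.
    """
    dictionary = {}
    for term, entity_type, identifier in annotations:
        # The first annotation seen for a term is kept.
        dictionary.setdefault(term.lower(), (entity_type, identifier))
    for term, pair in (corrections or {}).items():
        dictionary[term.lower()] = pair
    return dictionary


def group_aliases(dictionary):
    """Group entry terms that share an identifier, i.e. aliases of one entity."""
    aliases = {}
    for term, (_, identifier) in dictionary.items():
        aliases.setdefault(identifier, set()).add(term)
    return aliases


annotations = [
    ("mouse", "Species", "10090"),
    ("mice", "Species", "10090"),
    ("California", "Disease", "PLACEHOLDER_ID"),  # recognition error
]
# Manual correction of the error; the identifier is a placeholder.
corrections = {"California": ("Geographic Location", "PLACEHOLDER_ID")}

dictionary = build_dictionary(annotations, corrections)
print(group_aliases(dictionary)["10090"])
```

Keying aliases by identifier rather than by surface form is what lets "mouse" and "mice" resolve to the same node later in the graph.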
Table 1 shows the data obtained after named entity recognition. Each record includes the PMID, starting position, end position, entity, type and attribute.
Table 1 The result data of named entity recognition
PMID | Starting position | End position | Entity | Type | Attribute
896271 | 461 | 465 | mice | Animals | 10090
235822 | 971 | 980 | tularemia | Disease | MESH:D014406
896271 | 461 | 465 | guinea pigs | Animals | 10141
1472717 | 1016 | 1020 | LcrD | Proteins | MeSH:C071579
848916 | 229 | 245 | Benzylpenicillin | Amides | MeSH:D010400
… | … | … | … | … | …
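Records of this shape can be read from tab-separated annotation lines. The sketch below assumes the column order PMID, start, end, mention, type, identifier, which matches Table 1; the `Annotation` type and the function name are illustrative, not part of PubTator.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    pmid: str
    start: int
    end: int
    entity: str
    entity_type: str
    attribute: str


def parse_annotations(raw):
    """Parse tab-separated annotation lines and skip every other line."""
    records = []
    for line in raw.splitlines():
        fields = line.split("\t")
        if len(fields) >= 5 and fields[1].isdigit() and fields[2].isdigit():
            # The identifier column may be missing for unnormalized mentions.
            attribute = fields[5] if len(fields) > 5 else ""
            records.append(Annotation(fields[0], int(fields[1]), int(fields[2]),
                                      fields[3], fields[4], attribute))
    return records


line = "235822\t971\t980\ttularemia\tDisease\tMESH:D014406"
record = parse_annotations(line)[0]
# The offsets span the mention: 980 - 971 equals len("tularemia").
print(record.entity, record.end - record.start)
```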
As shown in Table 2, each entry of the constructed entity dictionary includes the article PMID, entity name, entity type and the unique NCBI_ID.
Table 2 The dictionary of entity
PMID | Entity name | Entity type | NCBI_ID
117851 | human | Species | 9606
200563 | mouse | Species | 10090
200563 | phenol | Chemical | MESH:D019800
24786165 | lymphadenopathy | Disease | MESH:D000072281
24786165 | vomiting | Symptom | MESH:D014839
23588087 | IL-10 | Peptide | MESH:D016753
1695896 | insertion mutagenesis | Phenomena | MESH:1695896
32315702 | JC221 | Viruses | 2654973
21712421 | alpha-D-galactose | Enzyme | MESH:D000519
21219468 | c-di-GMP signalling | Nucleotide | MESH:C062025
… | … | … | …
Relation Extraction
The abstracts are first split into sentences by scripts and the sentences are then de-duplicated, yielding 26,695 sentences in total. Stanford OpenIE is used to extract entity relations semi-automatically, guided by the entity vocabulary. OpenIE outputs triples of the form (subject, relationship, object). Because this research targets a vertical knowledge domain, its relations are more complex than those of the open domain and need to be extracted in more detail. OpenIE, however, is a general-domain relation extraction tool and typically returns several candidate triples per sentence, so manual screening and proofreading are required after extraction. In total, we obtain 9,583 relations, covering relation categories such as gene-disease, disease-drug, disease-test, gene-protein and disease-species. Each pair of entities has detailed relation information. Examples are shown in Table 3.
Table 3 The detailed information of relation
subject | relation | object | sentence
deer mice | are occasionally exposed to | Y. pestis | 29700709|a|While they may not be primary reservoirs, results supported the premise that deer mice are occasionally exposed to and infected by Y. pestis and instead may be spillover hosts.
Biofilm formation | is critical for transmission of | Y. pestis | 30333962|a|Biofilm formation is critical for blocking flea foregut and hence for transmission of Y. pestis by flea biting.
Y. pestis | is in | Russia | 30380206|a|We studied the prevalence of the intact and disrupted porin genes among 240 strains of Y. pestis from 39 natural centers in Russia and abroad, and 68 strains of Yersinia pseudotuberculosis from different geographical regions.
PsaA | is a key | Y. pestis mammalian virulence determinant | 31138630|a|PsaA is a key Y. pestis mammalian virulence determinant that forms fimbriae.
PsaA | forms | fimbriae | 31138630|a|PsaA is a key Y. pestis mammalian virulence determinant that forms fimbriae.
MyD88 | play a central role for the biphasic inflammatory response to | pulmonary Y. pestis infection | 30642901|a|Together these findings indicate a central role for MyD88 during the biphasic inflammatory response to pulmonary Y. pestis infection.
HmsA | inhibits | Y. pestis virulence | 33026884|a|Conclusion: HmsA inhibits Y. pestis virulence, but this effect may be mediated by indirect effects on pathogenesis, iron homeostasis and/or other cellular processes.
Neuraminidase | was revealed in bacteria affiliated to | Past. pestis | 331768|a|Neuraminidase was revealed in bacteria affiliated to Past. pestis (causative agents of pseudotuberculosis and Y. enterocolitica).
DNA probe | for detecting | Yersinia pestis | 1298882|t|[DNA probe for detecting Yersinia pestis and serovariant I of Yersinia pseudotuberculosis by detecting specific DNA repeating sequences].
LcrH | is necessary for the normal response of Y. pestis to | ATP | 2707857|a|These findings show that LcrH is necessary for the normal response of Y. pestis to ATP and that LcrH contributes to Ca2+ responsiveness
… | … | … | …
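The sentence preparation and the dictionary-guided screening of triples can be sketched as below. The sketch assumes that the OpenIE triples are already available as dictionaries with `subject`, `relation` and `object` keys; it does not call Stanford OpenIE, and the regular-expression splitter is a naive stand-in for the actual splitting script.

```python
import re


def split_sentences(abstract):
    """Split after ., ! or ? when followed by whitespace and a capital letter.

    Abbreviations followed by a lower-case word, such as "Y. pestis",
    are therefore not split.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", abstract)
    return [p.strip() for p in parts if p.strip()]


def deduplicate(sentences):
    """Drop repeated sentences while keeping the original order."""
    return list(dict.fromkeys(sentences))


def filter_triples(triples, entity_terms):
    """Keep triples whose subject and object both mention a dictionary term."""
    terms = [t.lower() for t in entity_terms]

    def mentions(text):
        lowered = text.lower()
        return any(term in lowered for term in terms)

    return [t for t in triples
            if mentions(t["subject"]) and mentions(t["object"])]


text = "PsaA forms fimbriae. PsaA forms fimbriae. HmsA inhibits Y. pestis virulence."
sentences = deduplicate(split_sentences(text))

candidates = [
    {"subject": "PsaA", "relation": "forms", "object": "fimbriae"},
    {"subject": "findings", "relation": "show", "object": "response"},
]
kept = filter_triples(candidates, ["PsaA", "fimbriae", "Y. pestis"])
print(len(sentences), len(kept))
```

Substring matching is deliberately permissive and can match short terms inside longer words, which is one reason the filtered triples still need the manual screening described above.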
Graph Storage and Plague Knowledge Base Platform Construction
The previous steps yield the plague-related entities and their association relations. In this subsection, we construct a plague knowledge graph called PlagueKG by batch importing the extracted and integrated knowledge into the Neo4j graph database. The construction process consists of three parts:
(1) Using scripts to classify entities and relations according to their types, and automatically importing the classified data together with other previously extracted knowledge, such as MeSH numbers and sentence numbers, into an Excel file;
(2) Converting the file into .csv format and importing the nodes and relations into the graph database;
(3) Starting the Neo4j graph database, after which add, delete, modify and query operations can be performed on it.
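The import step can be sketched as follows. The sketch writes CSV directly rather than going through Excel, and the `Entity` label, the `RELATED_TO` relationship type, the column names and the file names are all assumptions made for illustration; the schema used in PlagueKG is not specified here.

```python
import csv
import io


def write_csv(handle, header, rows):
    """Write a header line and data rows for use with Cypher's LOAD CSV."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


# Cypher statements, to be run in Neo4j once the files are in its import
# directory. The relation text is stored as a property because plain Cypher
# cannot take the relationship type from a CSV column.
LOAD_NODES = """
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
MERGE (e:Entity {id: row.id})
SET e.name = row.name, e.type = row.type
"""

LOAD_RELATIONS = """
LOAD CSV WITH HEADERS FROM 'file:///relations.csv' AS row
MATCH (s:Entity {id: row.subject_id})
MATCH (o:Entity {id: row.object_id})
CREATE (s)-[:RELATED_TO {relation: row.relation, sentence: row.sentence}]->(o)
"""

# Node rows taken from Table 2, written to memory here instead of to disk.
nodes = [("9606", "human", "Species"), ("10090", "mouse", "Species")]
buffer = io.StringIO()
write_csv(buffer, ["id", "name", "type"], nodes)
print(buffer.getvalue())
```

Using `MERGE` on the identifier keeps one node per entity even when several dictionary rows share the same NCBI_ID.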
Neo4j can display query results in three forms: graph, table, and text. Figure 2 shows the relations between plague and genes.
Furthermore, we built a plague knowledge base platform. The platform consists of a base layer, a data layer, a processing layer, a service layer and an application layer. The front end is built with Python and the Django web framework, and the back end uses MySQL and the Neo4j graph database to manage and visualize the data.