PlagueKG: A plague knowledge graph based on biomedical literature mining

doi:10.21203/rs.3.rs-1536910/v1

Research Article

https://doi.org/10.21203/rs.3.rs-1536910/v1

This work is licensed under a CC BY 4.0 License

Version 1

Background

Plague is a severe infectious disease caused by Yersinia pestis. Its exceedingly high mortality rate seriously threatens the lives of humans and animals. A vast amount of plague-related knowledge now exists in the literature, so it is particularly important to extract useful information from this literature and represent it intuitively as a knowledge graph. Such a graph can help researchers quickly understand the complex pathogenesis of plague and potential therapeutic approaches, and can suggest potential mechanisms of action for candidate drugs. Consequently, vaccine research and development could be accelerated.

Results

This paper identifies and extracts plague-related entities and relations from 5388 abstracts obtained automatically from the PubMed biomedical literature database. We then construct a plague knowledge graph called PlagueKG, which contains 9633 nodes of 33 types (such as disease, gene, protein, species, symptom, treatment and geographic location) and 9466 association relations (such as disease-gene, gene-protein and disease-species). The Neo4j graph database is used to store the relational data as triples. Finally, a multi-factor correlation knowledge graph centered on plague is constructed, together with a plague knowledge base platform.

Conclusions

We extracted and integrated knowledge from existing plague-related literature using text mining techniques, and constructed a plague knowledge graph, which shows detailed plague-related knowledge in an intuitive and clear way. To the best of our knowledge, it is the first plague knowledge graph that is built using literature mining techniques. In addition, our plague knowledge base platform successfully managed and visualized a large amount of structured data related to plague. Researchers can acquire integrated plague information more conveniently by using this platform. It provides more direct and comprehensive knowledge of the disease.

Keywords: Plague; Biomedical literature mining; Knowledge graph; Knowledge base

Plague, also known as the Black Death, is a zoonotic disease established in stable epidemic foci in the Americas, Africa, and Eurasia. It was once a Class A biological and chemical weapon in warfare, and its causative bacterium, Yersinia pestis, has been listed as a Class A bioterrorism agent [¹]. Its strong infectivity and high mortality pose a great danger to humans and animals. The disease has five common forms: bubonic plague, pneumonic plague, septicemic plague, meningeal plague and pharyngeal plague.

Plague remains an extremely serious, potentially fatal disease and a global public health problem. It can be transmitted between humans and animals by direct contact with animal carcasses, by animal bites, and by consumption of animal meat and its derivatives. Indirect contact through parasites can also cause infection. Infection may also occur under other circumstances, for example, primary pneumonia infection resulting from dissection of animal carcasses or inhalation of infectious aerosol droplets [²,³]. Yersinia pestis may cause skin ulceration at its site of entry, reported as carbuncles and ulcers, as well as pustules, spots, petechiae, bruises, and gangrene [⁴]. The disease is also accompanied by chest pain, cough, expectoration and other symptoms. Without prompt treatment, patients can die from heart failure and shock. Therefore, it is necessary to further study the potential pathogenic genes, risk factors and precursors of plague infection in order to find effective prevention and treatment methods.

With the explosive growth of domain data, a large amount of plague-related knowledge is scattered across the literature. To conduct multivariate correlation analysis and determine the impact of different factors on plague, researchers have no choice but to retrieve and integrate information from different databases and then organize it manually. This manual work is tedious and complicated, and constructing numerous relational databases is also time-consuming. Because human resources are limited, the resulting thematic databases are incomplete and cannot be updated in real time. This greatly inconveniences those who study plague and slows the progress of research. Therefore, there is an urgent need to build an integrated plague knowledge base platform.

This paper addresses the issues above and the fact that a plague knowledge graph has not yet been studied. We use natural language processing to collect and organize plague-related knowledge from the literature, and adopt a semi-automatic approach to gather and sort the data scattered across bioinformatics and literature databases. In this way we integrate the plague-related information that is implicit in the literature, such as genes, proteins, detection methods, treatments, and geographic locations, establish logical relations among these entities, and form a plague ontology knowledge base. On this basis, the plague knowledge graph and knowledge base platform are constructed. This reduces the time researchers spend on collection and retrieval and lays a foundation for the effective use of this knowledge.

In this paper, we combine information extracted from scientific papers with existing knowledge bases, and use PubTator and open information extraction (OpenIE) to extract entities and relations semi-automatically. After manual annotation and review, we construct a plague-related knowledge graph in the Neo4j graph database, which presents the relations in an intuitive graphical form. PlagueKG can help us understand this complex disease, for example by exploring the precursors and sequelae of plague infection and its therapeutic drugs and methods. PlagueKG includes 9633 nodes of 33 types, 9466 relations of 292 types, and 9583 triples, which connect diseases, genes, proteins, symptoms, diagnosis, detection technology, equipment, geographical location and other entities to provide practical and accurate plague knowledge.

There are currently two mainstream ways to build such databases: one is based on sequencing data [⁵], and the other is based on literature mining. The Gene Ontology Annotation (GOA) [⁶] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [⁷] are the most commonly used gene annotation databases. Gene Ontology (GO) [⁸] is widely used in bioinformatics. KEGG is a comprehensive database that integrates genomic, chemical and systemic functional information, and its metabolic pathway database is an authoritative public resource. Other resources include pathway databases such as Reactome [⁹] and BioCarta [¹⁰], protein annotation databases such as Swiss-Prot [¹¹] and PDB [¹²], and the lncRNA-disease association database LncRNADisease [¹³]. Some comprehensive databases, such as Entrez [¹⁴], BIND [¹⁵], BioGRID [¹⁶], BGVD [¹⁷] and ChickVD [¹⁸], provide a large amount of information about genes and gene products, including pathway information and interaction information between genes.

To help researchers quickly understand the current state of research, databases based on literature mining have become a hot topic. The osteosarcoma-gene association database [¹⁹] is a literature mining database that contains 911 genes and 81 microRNAs. PharmKG [²⁰] is a multi-relational biomedical knowledge graph that includes over half a million entities of genes, drugs and diseases and builds relationships between each pair of entities. It has 29 types of relationship and more than 8000 entity vocabularies. To better understand the genetic impact on human diseases, PhenoModifier, a manually curated database, provides a more complete spectrum of the genetic factors that contribute to human phenotypic variation. The database has a total of 3078 modification records, involving 288 different diseases, 2126 genetic modifier variants and 843 different modifier genes [²¹]. The Text-mined Hypertension, Obesity and Diabetes candidate gene database (T-HOD) [²²] covers three common diseases. RAvariome is a database of genetic risk variants for rheumatoid arthritis [²³].

Although knowledge bases built by expert curation are highly accurate, their coverage is not comprehensive and they are not updated in a timely manner. The main reason is that manual review and editing are time-consuming and laborious, which limits domain coverage and means that updates cannot keep up with the growth of newly published literature. Moreover, many biomedical fields still lack corresponding gene annotation resources, and the gene information related to these fields is scattered across thousands of publications and has not been systematically collected.

Knowledge graph technology provides a means to extract structured knowledge from massive amounts of text, which makes literature knowledge easier to extract and display. A knowledge graph is essentially a complex network that reflects the relations between entities. It is composed of nodes and edges that can formally describe the real world and its relations. The nodes in a biomedical knowledge graph represent different types of biological entities (e.g., genes, transcripts, proteins, pathways, diseases, symptoms, drugs, side effects), while the edges represent the logical or biological relations between entities (such as interaction, regulation, inhibition and inclusion). In recent years, the development of entity recognition and automatic relation extraction technologies and tools has provided a convenient way to construct and update biomedical knowledge graphs.

However, to the best of our knowledge, there has not yet been any research on the construction of a plague knowledge graph. The lack of a bioinformatics ontology library for plague has greatly hindered the automated construction of a plague knowledge graph and knowledge base, as well as the accuracy of entity recognition and relation extraction. Plague knowledge is scattered across a massive body of literature and has not been collected and sorted systematically. Hence, there is an urgent need to extract this information and build a plague knowledge graph and a knowledge base platform.

In this work, we design a workflow for plague-related literature mining that identifies plague-related entities and extracts the relations between them. We split the obtained abstracts into 26695 independent sentences, run the annotation pipeline to mark the 33 entity types mentioned (e.g., gene, protein, drug, treatment, species, and geographical location), and construct the plague entity dictionary. We then use several methods to process the relations between entities, and finally construct the plague knowledge graph in Neo4j.

The workflow is shown in Figure 1.

Materials and tools

We construct the plague knowledge graph by a semi-automatic and semi-manual approach. Firstly, we use PubTator to recognize the entities in the abstract, and then extract the association relations (subject-relation-object) by OpenIE. In this section, we introduce the datasets and tools used in this study briefly.

PubMed:

PubMed is a free literature search engine. Although access to full text is restricted, abstracts generally cover the purpose, methods, results and conclusions of a study and therefore convey the main idea of the paper. In this paper, we obtain the relevant abstracts automatically and use natural language processing techniques for text preprocessing. We also remove irrelevant data such as author details and publisher information, and retain only the PubMed Unique Identifier (PMID), title and abstract for the subsequent knowledge extraction work.

PubTator:

PubTator Central (PTC) [²⁴] is a Web-based system that can be accessed interactively through a Web browser and downloaded in bulk via File Transfer Protocol (FTP). PubTator is selected as the entity recognition tool for three reasons: (i) it requires only the PMID of an article to retrieve the title and abstract; (ii) it integrates multiple named entity recognition tools to recognize five types of entities, namely GeneTUKit for gene extraction [²⁵], tmVar for variant extraction [²⁶], DNorm for disease extraction [²⁷], a dictionary-based method for chemical extraction, and SR4GN for species extraction [²⁸]; and (iii) the identified genes and variants are marked with gene ID numbers and variant SNP numbers from the National Center for Biotechnology Information (NCBI) Entrez gene database, so that detailed information on genes and variants can be easily obtained.

OpenIE:

In this paper, we use the OpenIE tool for semi-automatic relation extraction and filtering. OpenIE is a relation extraction tool that extracts structured triples from plain text without specifying the relationships in advance [²⁹], and it does not require an extensively annotated training corpus.

Neo4j:

Neo4j is a native graph database engine with its own storage structure, index-free adjacency for neighboring nodes, and corresponding graph traversal algorithms, which speed up retrieval. Because traversal follows stored relationships directly, query performance depends mainly on the part of the graph that is visited rather than on the total amount of data, so it remains high as the data grow. The Neo4j database is also scalable and flexible. Unlike some other graph databases, Neo4j stores and processes data in a native graph data structure. As an open source database, Neo4j's community edition has attracted wide third-party use and promotion [³⁰].

Data source

In this paper, we use the PubMed database as the data source for the plague knowledge graph. We search PubMed with "pestis" as the keyword and obtain the PMIDs of all relevant articles. We then automatically retrieve the 5388 corresponding abstracts by PMID and store them in txt format. To enrich the content of the knowledge base, all aliases of each entity are obtained from the NCBI databases, which facilitates later retrieval from the knowledge graph.
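The retrieval step can be sketched as follows. The paper does not state which interface was used to query PubMed, so the use of the NCBI E-utilities endpoints below is an assumption, and the function names are illustrative. The sketch only builds the request URLs; it does not send them.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities (assumed interface, not named in the paper).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, retmax: int = 10000) -> str:
    """Build the URL that asks PubMed for the PMIDs matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids: list[str]) -> str:
    """Build the URL that fetches plain-text abstracts for a batch of PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


search = esearch_url("pestis")
fetch = efetch_url(["896271", "235822"])
```

In practice the PMIDs would be fetched in batches and the responses written to txt files, as described above.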

Data Pre-processing

Because the PubTator export of the target literature includes information unrelated to this research (e.g., publication time, author information, and DOI), the articles are first pre-processed by a script that removes it, which facilitates the subsequent named entity recognition and relation extraction. Records that contain only an article title, because the abstract is unavailable owing to copyright restrictions, are also deleted.
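A minimal sketch of this pre-processing step is given below. It assumes the export uses the PubTator text layout, in which title and abstract lines look like `PMID|t|...` and `PMID|a|...`; the function name is hypothetical and the authors' actual script is not described in the paper.

```python
def keep_title_and_abstract(pubtator_text: str) -> dict[str, dict[str, str]]:
    """Keep only PMID, title and abstract; drop records without an abstract."""
    records: dict[str, dict[str, str]] = {}
    for line in pubtator_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3 or parts[1] not in ("t", "a"):
            continue  # skip annotation rows, blank lines and other metadata
        pmid, kind, text = parts
        field = "title" if kind == "t" else "abstract"
        records.setdefault(pmid, {})[field] = text.strip()
    # Title-only records carry no abstract text to mine, so they are removed.
    return {pmid: rec for pmid, rec in records.items() if rec.get("abstract")}


sample = (
    "1|t|Title one\n"
    "1|a|Abstract one.\n"
    "1\t0\t5\tTitle\tSpecies\t9606\n"
    "\n"
    "2|t|Only a title\n"
    "2|a|\n"
)
cleaned = keep_title_and_abstract(sample)
```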

Named Entity Recognition

We recognize 33 types of entities through a combination of PubTator, dictionary- and rule-based methods, and manual annotation. PubTator can recognize only 6 types of entities: Gene, Disease, Chemical, Mutation, Species, and CellLine. Therefore, to expand the coverage of biomedical concepts and make the plague knowledge graph more detailed and comprehensive, we add another 27 entity types. Nine of them (Amino Acid, Peptide, Protein, Lipid, Enzyme, Nucleotide, Nucleic Acid, Toxin and Vaccine) are derived by subdividing the Chemical type provided by PubTator, while the remaining 18 (Viruses, Phenomenon, Technique, Anatomy, Diagnosis, Geographic Location, Symptom, Environment, Social Sciences, Etiology, Assay, Genome, Equipment, Therapeutics, Time, Person, SNV, Food) are constructed manually and individually according to the text content and actual needs.

A plague entity dictionary is constructed from the obtained entities and their types. During construction, the same entity may correspond to several different entry terms, so each entity must be uniquely identified by its NCBI_ID. Medical Subject Headings (MeSH) is the database we consult most frequently when performing entity queries. It is an authoritative hierarchical list of subject terms compiled by the National Library of Medicine for indexing articles in PubMed. When retrieving entity categories, we select the most appropriate subject term as the entity type based on the hierarchy of the MeSH thesaurus, following the principle that the subject term is frequently cited and matches the actual text content. The entity is then assigned its unique MeSH number. Because PubTator makes some errors when identifying entities, such as classifying "California" as a disease, we manually annotate the entities it fails to identify and correct the annotated entities and their NCBI_ID.

The resultant data obtained after the named entity recognition is shown in Table 1. It includes PMID, Starting position, End position, Entity, Type and Attribute.

Table 1 The result data of named entity recognition

PMID	Starting position	End Position	Entity	Type	Attribute
896271	461	465	mice	Animals	10090
235822	971	980	tularemia	Disease	MESH:D014406
896271	461	465	guinea pigs	Animals	10141
1472717	1016	1020	LcrD	Proteins	MeSH:C071579
848916	229	245	Benzylpenicillin	Amides	MeSH:D010400
…	…	…	…	…	…
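As an illustration of how rows such as those in Table 1 might be handled in a script, the sketch below parses tab-separated annotation rows into typed records and checks that the span length matches the entity text, which is one simple way to flag rows for manual review. The class and function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    pmid: str
    start: int
    end: int
    text: str
    entity_type: str
    identifier: str

    @property
    def span_length_ok(self) -> bool:
        """True when the character span is as long as the entity text."""
        return self.end - self.start == len(self.text)


def parse_mentions(tsv: str) -> list[Mention]:
    """Parse rows of PMID, start, end, entity, type and attribute."""
    mentions = []
    for line in tsv.splitlines():
        cols = line.split("\t")
        if len(cols) != 6 or not cols[1].isdigit() or not cols[2].isdigit():
            continue  # header, ellipsis row or malformed line
        pmid, start, end, text, etype, ident = cols
        mentions.append(Mention(pmid, int(start), int(end), text, etype, ident))
    return mentions


rows = (
    "PMID\tStarting position\tEnd Position\tEntity\tType\tAttribute\n"
    "896271\t461\t465\tmice\tAnimals\t10090\n"
    "235822\t971\t980\ttularemia\tDisease\tMESH:D014406\n"
)
mentions = parse_mentions(rows)
```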

As shown in Table 2, the constructed entity dictionary includes article PMID, Entity name, Entity type and uniquely specified NCBI_ID.

Table 2 The dictionary of entity

PMID	Entity name	Entity type	NCBI_ID
117851	human	Species	9606
200563	mouse	Species	10090
200563	phenol	Chemical	MESH:D019800
24786165	lymphadenopathy	Disease	MESH:D000072281
24786165	vomiting	Symptom	MESH:D014839
23588087	IL-10	Peptide	MESH:D016753
1695896	insertion mutagenesis	Phenomena	MESH:1695896
32315702	JC221	Viruses	2654973
21712421	alpha-D-galactose	Enzyme	MESH:D000519
21219468	c-di-GMP signalling	Nucleotide	MESH:C062025
…	…	…	…
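The role of the NCBI_ID as a unique key can be sketched as follows: several surface forms of the same entity collapse onto one identifier, and a lower-cased alias index supports later lookup. This is a minimal sketch under the assumption that the dictionary is keyed by NCBI_ID; the function name and data layout are illustrative only.

```python
def build_dictionary(rows):
    """Group entity names by NCBI_ID and index every alias.

    `rows` is an iterable of (entity_name, entity_type, ncbi_id) tuples.
    """
    by_id = {}
    alias_to_id = {}
    for name, entity_type, ncbi_id in rows:
        entry = by_id.setdefault(ncbi_id, {"type": entity_type, "names": set()})
        entry["names"].add(name)
        alias_to_id[name.lower()] = ncbi_id  # case-insensitive lookup key
    return by_id, alias_to_id


by_id, alias_to_id = build_dictionary(
    [
        ("human", "Species", "9606"),
        ("mouse", "Species", "10090"),
        ("mice", "Species", "10090"),
    ]
)
```

Here "mouse" and "mice" share the identifier 10090, so both names resolve to a single dictionary entry.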

Relation Extraction

In this paper, the abstracts are first divided into sentences by scripts and the sentences are then de-duplicated, yielding a total of 26,695 sentences. Stanford OpenIE is used to extract entity relations semi-automatically according to the entity vocabulary. The output of OpenIE takes the form of triples, {'subject:', 'relationship: ','object:'}. Because this research belongs to a vertical knowledge domain, the relations are more complex than in the open domain and need to be extracted in more detail. OpenIE, however, is a general-domain relation extraction tool that returns several candidate extractions per sentence, so further manual screening and proofreading are required after extraction. A total of 9583 relations are obtained, covering relation categories such as gene-disease, disease-drug, disease-test, gene-protein and disease-species. Each pair of entities has detailed relation information. Examples are shown in Table 3.

Table 3 The detailed information of relation

subject	relation	object	sentence
deer mice	are occasionally exposed to	Y. pestis	29700709|a|While they may not be primary reservoirs, results supported the premise that deer mice are occasionally exposed to and infected by Y. pestis and instead may be spillover hosts.
Biofilm formation	is critical for transmission of	Y. pestis	30333962|a|Biofilm formation is critical for blocking flea foregut and hence for transmission of Y. pestis by flea biting.
Y. pestis	is in	Russia	30380206|a|We studied the prevalence of the intact and disrupted porin genes among 240 strains of Y. pestis from 39 natural centers in Russia and abroad, and 68 strains of Yersinia pseudotuberculosis from different geographical regions.
PsaA	is a key	Y. pestis mammalian virulence determinant	31138630|a|PsaA is a key Y. pestis mammalian virulence determinant that forms fimbriae.
PsaA	forms	fimbriae	31138630|a|PsaA is a key Y. pestis mammalian virulence determinant that forms fimbriae.
MyD88	play a central role for the biphasic inflammatory response to	pulmonary Y. pestis infection	30642901|a|Together these findings indicate a central role for MyD88 during the biphasic inflammatory response to pulmonary Y. pestis infection.
HmsA	inhibits	Y. pestis virulence	33026884|a|Conclusion: HmsA inhibits Y. pestis virulence, but this effect may be mediated by indirect effects on pathogenesis, iron homeostasis and/or other cellular processes.
Neuraminidase	was revealed in bacteria affiliated to	Past. pestis	331768|a|Neuraminidase was revealed in bacteria affiliated to Past. pestis (causative agents of pseudotuberculosis and Y. enterocolitica).
DNA probe	for detecting	Yersinia pestis	1298882|t|[DNA probe for detecting Yersinia pestis and serovariant I of Yersinia pseudotuberculosis by detecting specific DNA repeating sequences].
LcrH	is necessary for the normal response of Y. pestis to	ATP	2707857|a|These findings show that LcrH is necessary for the normal response of Y. pestis to ATP and that LcrH contributes to Ca2+ responsiveness
…	…	…	…
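The steps around OpenIE (sentence splitting, de-duplication, and vocabulary-based screening of candidate triples) can be sketched as below. This is not the authors' script and it does not call Stanford OpenIE itself: the candidate triples are assumed to have been produced already, the regular-expression sentence splitter is deliberately naive, and all function names are hypothetical.

```python
import re


def split_sentences(abstract: str) -> list[str]:
    """Naive split after ., ! or ? when the next token starts with a capital.

    Abbreviations such as "Y. pestis" are left intact because the following
    word is lower-case.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", abstract)
    return [p.strip() for p in parts if p.strip()]


def deduplicate(sentences: list[str]) -> list[str]:
    """Remove repeated sentences while keeping the original order."""
    return list(dict.fromkeys(sentences))


def filter_triples(triples, vocabulary):
    """Keep triples whose subject and object both mention a dictionary entity."""
    vocab = {term.lower() for term in vocabulary}

    def mentions_entity(phrase: str) -> bool:
        lowered = phrase.lower()
        return any(term in lowered for term in vocab)

    return [t for t in triples if mentions_entity(t[0]) and mentions_entity(t[2])]


text = (
    "PsaA is a key Y. pestis mammalian virulence determinant that forms fimbriae. "
    "PsaA is a key Y. pestis mammalian virulence determinant that forms fimbriae. "
    "It was studied."
)
sentences = deduplicate(split_sentences(text))
candidates = [
    ("PsaA", "forms", "fimbriae"),
    ("PsaA", "is", "key"),
    ("It", "was", "studied"),
]
kept = filter_triples(candidates, {"PsaA", "fimbriae", "Y. pestis"})
```

The triples that survive this screening would still go through the manual proofreading described above.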

Graph Storage and Plague Knowledge Base Platform Construction

Based on the previous work, we have obtained the plague-related entities and their association relations. In this subsection, we construct a plague knowledge graph called PlagueKG by batch importing the extracted and integrated knowledge into the Neo4j graph database. The construction process consists of three parts:

(1) Using scripts to classify entities and relations according to their types, and automatically importing the classified data and other previously extracted knowledge, such as MeSH numbers and sentence numbers, into an Excel file;

(2) Converting the file into .csv format and importing the nodes and relations into the graph database;

(3) Starting the Neo4j graph database, after which records can be added, deleted, modified and queried.
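The three steps above can be sketched as follows. The column names, the `Entity` label, the `RELATED` relationship type and the file names are assumptions made for illustration; the paper does not describe the actual schema. The Cypher strings use Neo4j's `LOAD CSV` clause and are shown as text only, since running them requires a Neo4j instance with the CSV files placed in its import directory.

```python
import csv
import io


def _to_csv(header, rows) -> str:
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def triples_to_csv(triples):
    """Split typed triples into a node table and a relation table.

    Each triple is (subject, subject_type, relation, object, object_type).
    """
    nodes = {}
    relations = []
    for subj, subj_type, relation, obj, obj_type in triples:
        nodes.setdefault(subj, subj_type)
        nodes.setdefault(obj, obj_type)
        relations.append((subj, relation, obj))
    nodes_csv = _to_csv(("name", "type"), nodes.items())
    relations_csv = _to_csv(("subject", "relation", "object"), relations)
    return nodes_csv, relations_csv


# Hypothetical import statements; labels and property names are assumptions.
LOAD_NODES = """
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
MERGE (n:Entity {name: row.name})
SET n.type = row.type
"""

LOAD_RELATIONS = """
LOAD CSV WITH HEADERS FROM 'file:///relations.csv' AS row
MATCH (s:Entity {name: row.subject})
MATCH (o:Entity {name: row.object})
MERGE (s)-[:RELATED {relation: row.relation}]->(o)
"""

nodes_csv, relations_csv = triples_to_csv(
    [("deer mice", "Species", "are occasionally exposed to", "Y. pestis", "Species")]
)
```

Storing the extracted relation phrase as a property of one generic relationship type is a simplification chosen here because plain Cypher cannot take the relationship type from a CSV column; the 292 relation types reported in the paper may well be modelled differently.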

Neo4j can display query results in three forms: graph, table, and text. Figure 2 shows the relations between plague and genes.

Furthermore, we built a plague knowledge base platform. The platform contains a base layer, a data layer, a processing layer, a service layer and an application layer. The front end is built with Python and the Django web framework, and the back end uses MySQL and the Neo4j graph database to manage and visualize the data.

Plague Knowledge Graph

PlagueKG contains 9,633 nodes of 33 types, 9,466 relations of 292 types and 9,583 triples that connect diseases, genes, proteins, symptoms, diagnosis, detection technology, equipment, geographical location and other entities. The knowledge graph provides a visualization of plague-related content that allows researchers to quickly understand the disease. By reducing the time and energy spent on reading literature, it may help accelerate the development of a more effective vaccine and serves as an important reference for researchers. Figure 3 shows the whole plague knowledge graph.
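Summary figures of this kind can be derived directly from the triple list. The sketch below shows one way to count nodes, node types, relation types and distinct triples; the triple layout and the function name are assumptions, and the two example triples are toy data with illustrative type labels rather than entries taken from PlagueKG.

```python
from collections import Counter


def graph_summary(triples):
    """Count nodes, node types, relation types and distinct triples.

    Each triple is (subject, subject_type, relation, object, object_type).
    """
    nodes = {}
    for subj, subj_type, _relation, obj, obj_type in triples:
        nodes[subj] = subj_type
        nodes[obj] = obj_type
    return {
        "nodes": len(nodes),
        "node_types": len(set(nodes.values())),
        "relation_types": len({t[2] for t in triples}),
        "triples": len(set(triples)),
        "nodes_per_type": Counter(nodes.values()),
    }


summary = graph_summary(
    [
        ("PsaA", "Protein", "forms", "fimbriae", "Anatomy"),
        ("HmsA", "Protein", "inhibits", "Y. pestis virulence", "Phenomenon"),
    ]
)
```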

Plague Knowledge Base Platform

According to the system framework requirements and the research content, the platform comprises four major functional modules: the Plague Knowledge Graph Visualization Module, the Plague Ontology Data Sheet Module, the Text Auto-annotation Module and the Manager Module. The Manager Module is subdivided into three sub-modules: Data Upload, Data Management and Neo4j Browser. The system displays the plague-related relations and provides a convenient retrieval function for researchers. The home page of the system is shown in Figure 4.

Plague knowledge graph visualization module

This module provides users with a visual presentation function for plague knowledge graph. As shown in Figure 5, users can access the data by directly querying entity or relation. The results can be exported in three formats: csv, json, and png.

Plague Data sheet module

This module provides a tabular listing of the plague ontology database, which contains entity categories, entity numbers, inter-entity relations and the corresponding source sentences. Users can browse, search and download it. As shown in Figure 6, users can search the table and filter its contents according to their needs, and can also query by entity type through a drop-down list. The data can be exported in xlsx format.

Text annotation module

The main function of this module is to provide automatic annotation of biomedical concepts (e.g., chemicals, species, proteins) in plague-related PubMed abstracts. The corresponding abstracts can be searched by PMID, title, or keyword. In the search results, different entity types are highlighted in the text in different colors. As shown in Figure 7, the automatically annotated information includes the entity type and its unique NCBI_ID. This makes it much easier for researchers to access the information.
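The annotation display can be illustrated with a small sketch that wraps each recognized span in a marker carrying its entity type and identifier. The platform itself highlights entities in color; the bracketed plain-text markup and the function name used here are assumptions chosen to keep the example self-contained.

```python
def annotate(text: str, spans) -> str:
    """Insert [entity|type|id] markers for non-overlapping character spans.

    `spans` is a list of (start, end, entity_type, identifier) tuples.
    """
    pieces = []
    cursor = 0
    for start, end, entity_type, identifier in sorted(spans):
        pieces.append(text[cursor:start])  # unannotated text before the span
        pieces.append(f"[{text[start:end]}|{entity_type}|{identifier}]")
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last span
    return "".join(pieces)


sentence = "Plague infects mice and humans."
start = sentence.index("mice")
marked = annotate(sentence, [(start, start + len("mice"), "Species", "10090")])
```

A web front end would replace the bracket markers with colored HTML elements, one color per entity type.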

Manager module

This module is designed for managers and biomedical experts. It contains three sub-modules: Data Upload, Data Management and Neo4j Browser. As shown in Figure 8, the login interface allows the manager to access the back end to update and maintain the data regularly. As shown in Figure 9, the Data Upload module supports uploading and exporting plague ontology data in .xlsx format and plague text information in .txt format, so that the literature information can be kept in sync with PubMed. The Data Management module provides operations to add, delete, modify, query and update data, as shown in Figure 10. As shown in Figure 11, the Neo4j Browser module provides access to the underlying data.

Plague-related Disease, Symptoms, Anatomy and Chemical

The constructed knowledge graph shows that plague mainly presents in three forms: pneumonic plague, bubonic plague and septicemic plague. Plague is accompanied by fever, hemoptysis, tachypnea, abdominal pain, cramps, diarrhea, joint pain, shaking chills, headache, anorexia, bloody sputum, hypoxemia, skin ulceration and other symptoms. In terms of anatomy, the graph shows that the lungs, liver, spleen, bone marrow and blood are vulnerable to plague.

Currently, there are more than 20 antibiotics available for the treatment of plague, such as streptomycin, kanamycin, tobramycin, doxycycline, trifluoperazine, amoxapine, doxapram, netilmicin, gentamicin, ciprofloxacin, ofloxacin, rifampicin, cefoperazone, cefotaxime, ceftazidime, ceftriaxone, amikacin, lomefloxacin and moxifloxacin. Streptomycin and doxycycline are considered the first choice for the treatment of plague and are known as the "gold standard". Studies have shown that quinolones are more active against plague, although further research is needed before they are used for treatment. Trifluoperazine, amoxapine and doxapram can provide 40–60% protection against pneumonic plague, while netilmicin, gentamicin, ciprofloxacin and ofloxacin provide 40–60% protection against plague. Netilmicin, gentamicin, ciprofloxacin and ofloxacin may be alternatives for the treatment of human pneumonic plague. Rifampicin is highly effective in the prevention and treatment of experimental plague, while cefoperazone, cefotaxime, ceftazidime and ceftriaxone show broad prospects in the treatment of heterogenous plague. The aminoglycosides (i.e., streptomycin, kanamycin, tobramycin, gentamicin and amikacin) and cephalosporins (i.e., ceftriaxone and ceftazidime) are very effective in preventing and treating plague caused by F1+ and F1- strains.

Plague-related Geographic Locations, Species, Food and Environment

Plague has an extremely wide distribution and is found almost all over the world, including Mongolia, India, Madagascar, the USSR, the USA, China, Italy and other countries. The disease is mainly transmitted by fleas, such as the Oriental rat flea, Eumolpianus eumolpid and Oropsylla montana. Mammals such as cats, dogs, sheep, camels, cottontail rabbits, gerbils and rhesus macaques can also spread plague. Deltamethrin and fluometuron can control the spread of plague by eliminating fleas. Surprisingly, Coptis chinensis can also be used against plague.

According to the graph, the plague pathogen can also be present in raw meat and seafood, and it can survive in bottled water for a long time. Temperature is an important factor in the spread of plague: the warmer and wetter the environment, the more easily plague spreads. Mild winters and early summers are susceptible seasons. Cool spring water is also likely to spread plague, and the pathogen can be found in contaminated soil.

Plague-related Gene, Protein and Enzyme

The knowledge graph shows that CRP, lcrH and Pla are virulence factors of plague. Pla is a unique gene encoding coagulase and fibrinase in Yersinia pestis and is involved in the transmission of plague. Although higB1, higB3, higB5, hicA2, and tox are expressed in Yersinia pestis, they have no toxic activity. tcaA, tcaB, and tccC1 contain a few mutations in Yersinia pestis. carB, fadJ, gluM, gltX, ileS, malE, nusA, ribD, and rlmL can be used to identify Yersinia species.

Many enzymes are also associated with Yersinia pestis, such as cellulases, dihydroorotase, dihydrofolate reductase, neuraminidase, phosphofructokinase, thymidine monophosphate kinase, phosphoglucomutase, coagulase, DsbA, D-alanine-D-alanine ligase, dioxygenases and gamma-glutamyltransferase. Phospholipase D is an enzyme necessary for the survival of Yersinia pestis in its host intestine. 1-Deoxy-D-xylulose 5-phosphate reductoisomerase (DXR) is an important enzyme for the synthesis of isoprenoids in Yersinia pestis. E3 ubiquitin ligases contribute to the virulence of plague. Protein-tyrosine phosphatase (PTP) is essential for the pathogenicity of Yersinia pestis. Adenylate cyclase is a probable factor in Yersinia pestis pathogenicity, while adenosine deaminase can be used to differentiate Yersinia pestis from pseudotuberculous microorganisms. Isocitrate lyase can be used to identify Yersinia pestis. DsbA is required for efficient Yop secretion by Yersinia pestis.

The graph shows that TLR2, Ail, YscF, OmpF, OmpC, OmpX, CRP, YopJ, YopE, YopH, YopM, LcrG, ybtP, ybtQ, and PchG are proteins related to plague. Among them, the cyclic AMP-binding protein is a high molecular weight compound of 180 kD present in Yersinia pestis. Yfe and Feo play an important role in lymphatic gland plague. The Ail of Yersinia pestis is identical to that of Yersinia pseudotuberculosis. YscF is a surface-expressed protein of the Yersinia pestis type III secretion complex. YadC can be used as a vaccine component against Yersinia pestis, and RovA can enhance the virulence of Yersinia pestis by directly upregulating the psa locus.

Plague-related Phenomena, Diagnosis and Equipment

The knowledge graph also contains phenomena closely related to plague, such as phagocytosis. Virulence belongs to Microbiological Phenomena. Innate immunity and seroconversion belong to Immune System Phenomena. Population sensing belongs to both Environmental Phenomena and Cell Physiological Phenomena. Future warming belongs to Environmental Phenomena. Quorum sensing and endocytosis belong to Cell Physiological Phenomena. Respiration, inhalation, heat shock and fibrinolysis belong to Physiological Phenomena. Multidrug resistance (MDR), SNPs and Variable Number Tandem Repeats (VNTRs) belong to Genetic Phenomena. Electron transport belongs to Biochemical Phenomena.

At present, the main tests used to detect Yersinia pestis include the haemagglutination (HA) test, fluorescent antibody staining, the latex agglutination test, immunohistochemistry, immunochromatography assay (ICA), fluorescent in situ hybridization (FISH), enzyme-linked immunoassay (ELISA), enzyme-linked immunospot analysis (ELISPOT), Western blotting and sodium dodecyl sulfate-polyacrylamide gel electrophoresis.

Confocal laser scanning microscopy, 3IS-RFLP, the ElectraSense platform and microspheres are commonly used devices for the detection of plague.

Conclusion And Future Work

In this paper, a total of 5388 plague-related abstracts were collected through literature mining. After named entity recognition and relation extraction, a plague knowledge graph was constructed that contains 33 entity types, 292 relation types, 9433 nodes and 9466 edges. The knowledge graph gives a more intuitive and clear view of a range of plague-related biomedical concepts (e.g., chemicals, genes, species, proteins and diseases) and their correlations. These data are valuable for the study of plague and provide a reference for researchers. This work could help accelerate vaccine development. Our plague knowledge base platform can manage and present a large amount of structured plague data.
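Summary figures of this kind (entity types, relation types, nodes and edges) can be derived directly from a list of typed triples. The following is a minimal sketch on three toy edges; the `Edge` record, the relation labels and the `summarize` function are assumptions for illustration and are not the pipeline used to build PlagueKG.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """One typed triple: (head, head_type) -[relation]-> (tail, tail_type)."""
    head: str
    head_type: str
    relation: str
    tail: str
    tail_type: str


def summarize(edges):
    """Count distinct entity types, relation types, nodes and edges."""
    node_types = {}
    for e in edges:
        node_types[e.head] = e.head_type
        node_types[e.tail] = e.tail_type
    return {
        "entity_types": len(set(node_types.values())),
        "relation_types": len({e.relation for e in edges}),
        "nodes": len(node_types),
        "edges": len(edges),
    }


# Toy data; relation labels are hypothetical.
toy_edges = [
    Edge("plague", "Disease", "caused_by", "Yersinia pestis", "Species"),
    Edge("YopH", "Protein", "associated_with", "plague", "Disease"),
    Edge("YopE", "Protein", "associated_with", "plague", "Disease"),
]

print(summarize(toy_edges))
```

This sketch assumes each node name maps to a single entity type; if the same name could carry several types, nodes would need to be keyed by (name, type) instead.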

Because organizing entity and relation data in a semi-automatic, semi-manual way takes a great deal of time and effort, we will propose named entity recognition algorithms and relation extraction models tailored to plague in subsequent work. Intelligent question answering and inference algorithms will be developed on top of the constructed knowledge graph. In addition, we will enrich and update the knowledge base regularly by adding new plague-related data, so as to provide researchers with a more practical and precise service.

Abbreviations

PlagueKG: Plague Knowledge Graph
OpenIE: Open information extraction
GOA: Gene Ontology Annotation
KEGG: Kyoto Encyclopedia of Genes and Genomes
GO: Gene Ontology
PDB: Protein Data Bank
BIND: Biomolecular Interaction Network Database
BioGRID: Biological General Repository for Interaction Datasets
BGVD: Bovine Genome Variation Database
ChickVD: Chicken Variation Database
RAvariome: Rheumatoid Arthritis variome
T-HOD: Text-mined Hypertension, Obesity and Diabetes candidate gene database
SNPs: Single nucleotide polymorphisms
CNVs: Copy number variations
PMID: PubMed Unique Identifier
PTC: PubTator Central
FTP: File Transfer Protocol
NCBI: National Center for Biotechnology Information
MeSH: Medical Subject Headings
PTP: Protein-tyrosine phosphatase
DXR: 1-Deoxy-D-xylulose 5-phosphate reductoisomerase
MDR: Multidrug resistant
VNTRs: Variable Number Tandem Repeats
HA: Haemagglutination
ICA: Immunochromatography assay
FISH: Fluorescent in situ hybridization
ELISA: Enzyme-linked immunoassay
ELISPOT: Enzyme-linked immunospot analysis

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

The data that support the findings of this study are available from the authors on request.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by the Inner Mongolia Science and Technology Major Special Projects (2019ZD016, 2021ZD0005), the Natural Science Foundation of Inner Mongolia Autonomous Region (2019MS03014), and the Department of Education of Inner Mongolia Autonomous Region (grant ID not yet announced).

Authors' contributions

J.G. designed and supervised the research work. J.L. conducted the biomedical text mining, wrote the manuscript and participated in the development of the knowledge portal. B.F. developed the knowledge platform. Y.J. participated in data collection and sorting.

Acknowledgements

We would like to express our sincere appreciation to Professor Weiguang Zhou, an animal infectious disease expert at Inner Mongolia Agricultural University and Director of the Animal Infectious Diseases Branch of the Chinese Society of Animal Husbandry and Veterinary Medicine, and to his team for their help and guidance in data collation and review.

Conflict of interest

The authors have no conflict of interest to declare.

Authors' information

J.G. is a professor and doctoral supervisor at Inner Mongolia Agricultural University. Her research interests cover big data intelligence and knowledge discovery, machine learning, and big data analysis in bioinformatics.

J.L. is a PhD student at Inner Mongolia Agricultural University. Her research interests cover biomedical literature mining and natural language processing.

B.F. is a postgraduate student at Inner Mongolia Agricultural University. His research interests cover biomedical literature mining and natural language processing.

Y.J. is currently pursuing a Master of Data Science and Decisions at the University of New South Wales in Australia. Her main research interests include causal inference, machine learning and data mining.

Corresponding author

Correspondence to J.G. Email: [email protected].

