4.1 Establishment of the schema layer of the public security knowledge graph based on the case text
To establish the schema layer of the graph, the case text data were first pre-processed and analyzed to extract the core concepts, and the extracted entities and relations were classified and layered along five intelligence factors (person, thing, material, time and location) to form a conceptual framework. This framework was then combined with a universal ontology to form the public security ontology knowledge library. As the storage carrier of intelligence knowledge, the public security ontology library can collect, organize and share continuously iterated and updated public security intelligence knowledge and business knowledge on a logical basis. Accordingly, the problem that knowledge is easily lost in public security information services can be effectively mitigated.
4.1.1 Case text acquisition and pre-processing
The related data were retrieved from the judgment document website (https://wenshu.court.gov.cn/), covering 36 charges and 60,746 cases in four parts. The first part includes 18,182 case texts under 10 charges of producing and selling fake and inferior commodities. The second part includes 8,802 case texts under 8 charges of infringing intellectual property rights. The third part includes 35 case texts under 2 charges of endangering public health. The fourth part includes 33,727 case texts under 16 charges of destroying the protection of environmental resources. In this study, 300 case texts were sampled per charge for manual processing. The involved key industries, key sites, key parts, key personnel, main materials, main species and main crime methods were extracted from the case texts according to standards formulated from the characteristics of common cases and the related national standards.
4.1.2 Establishment of public security ontology
Following the life-cycle method described above, the present ontology was established based on the data structured from the case texts. The detailed procedures are described below.
1) Analyze the ontology demand and investigate the reusable ontology
In the public security domain, the ontology was established from the case text data. A survey of the related literature showed that a public-domain ontology can be expanded from a universal ontology knowledge library; this study therefore expanded the encyclopedic knowledge tree TermTree.
2) Establish the domain core concepts
By summarizing the structured data of the case texts and compiling statistics by charge, the frequency of each lexical term in every field was recorded, and the high-frequency words in the top quarter of the ranking were screened out as the core concepts of the cases under each charge. For example, the key concepts of the charge of illegal logging were obtained in this way, and the knowledge nodes of the schema layer were established on the basis of these core concepts.
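The top-quarter frequency screening described above can be sketched as follows; the term lists stand in for the structured fields extracted from real case texts and are purely illustrative.

```python
from collections import Counter

def top_quartile_terms(term_lists):
    """Count term frequencies across case texts and keep the
    highest-frequency quarter of the vocabulary as core concepts."""
    counts = Counter(t for terms in term_lists for t in terms)
    # Rank terms by descending frequency and keep the first quarter.
    ranked = [term for term, _ in counts.most_common()]
    k = max(1, len(ranked) // 4)
    return ranked[:k]

# Hypothetical structured fields from illegal-logging case texts.
cases = [
    ["logging", "timber", "forest", "chainsaw"],
    ["logging", "timber", "forest"],
    ["logging", "timber", "permit"],
    ["logging", "transport"],
]
print(top_quartile_terms(cases))  # ['logging']
```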
3) Establish the conceptual taxonomic hierarchies and define the knowledge nodes
These core conceptual factors were classified into five categories (person, thing, material, time and position). The object attribute hierarchy was established by referring to the semantic descriptions of OpenSchema; for example, the semantic hierarchy of a person can be obtained from https://schema.org.cn/Person. The terms were then filled into this hierarchy and semantic structure to acquire the final ontology schema in the public security domain.
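A minimal slice of the resulting schema can be sketched as nested records with the five top-level categories; the attribute names and subclasses shown here are illustrative assumptions, not the actual OpenSchema definitions.

```python
# Illustrative slice of the public security ontology schema.
schema = {
    "Person": {
        "attributes": ["name", "identity", "age"],
        "subclasses": ["Criminal_suspect", "Criminal_victim"],
    },
    "Thing": {"attributes": [], "subclasses": []},
    "Material": {"attributes": [], "subclasses": []},
    "Time": {"attributes": [], "subclasses": []},
    "Position": {"attributes": [], "subclasses": []},
}

def category_of(subclass, schema):
    """Return the top-level category a subclass belongs to, if any."""
    for cat, spec in schema.items():
        if subclass in spec["subclasses"]:
            return cat
    return None

print(category_of("Criminal_suspect", schema))  # Person
```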
By adding the public security ontology pattern to the universal ontology knowledge tree, the public security TermTree ontology can be obtained. The core codes are given below.
④ for i in range(0, len(p)):
      thisterm = p.iloc[i]
      # Add the new term to the term tree
      try:
          termtree.add_term(term=thisterm["term"],
                            base=thisterm["base"],
                            term_type=thisterm["term_type"])
      except Exception as e:
          print(e)
Using these codes, the ontology list in the public security domain was traversed to extract the ontology knowledge and convert it into JSON format. By writing this ontology into the universal ontology, the public security TermTree ontology was finally obtained.
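The traversal and JSON conversion step can be sketched like this; the ontology rows below are hypothetical examples mirroring the term/base/term_type fields used in the snippet above.

```python
import json

# Hypothetical ontology list with the term/base/term_type fields above.
ontology_list = [
    {"term": "cocaine", "base": "cb", "term_type": "Material"},
    {"term": "dump the body", "base": "cb", "term_type": "Scene_event"},
]

def ontology_to_json(rows):
    """Traverse the ontology list and serialize it as JSON records."""
    return json.dumps(rows, ensure_ascii=False, indent=2)

payload = ontology_to_json(ontology_list)
print(payload)
```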
4) Ontology evaluation and evolution
The ontology should be constantly updated and maintained according to actual requirements. New intelligence requirements should be analyzed by returning to the first step and re-running the life cycle of ontology construction.
4.2 Establishment of the data layer of the public security knowledge graph based on the case text
Based on the knowledge system of the schema layer, the data layer of the graph can unearth intelligence clues from the case texts to be processed, providing a novel text-structuring means for information mining from case texts in the public security domain. It helps uncover hidden entity relations in a huge case text library and thus form intelligence clues for public security work. Establishing the data layer of the knowledge graph involves two steps: knowledge labeling and entity linking. First, the entity set in the texts is parsed via knowledge labeling; the entities are then correlated via entity linking to form the graph.
Specifically, knowledge labeling was performed based on the established public security ontology and Baidu's open-source knowledge labeling tool 'Jieyu', while entity linking was performed based on LTP, the open-source natural language processing tool from the Harbin Institute of Technology, and Baidu's open-source natural language processing tool PaddleNLP (Zhao et al., 2020).
4.2.1 Knowledge labeling
An important problem to be solved in knowledge labeling is entity disambiguation: a word can have different semantic meanings in different contexts. For example, the word 'skating' in the two sentences 'Yu goes skating in the gymnasium' and 'Yu goes skating using his curling' has different meanings: the former refers to a kind of sport while the latter refers to taking cocaine. How to discriminate the meanings of words and thus achieve accurate knowledge labeling is the main issue. Based on the open-source knowledge labeling tool 'Jieyu', this study performed knowledge labeling in the public security domain using the established domain knowledge ontology.
First, the case texts were pre-processed via word segmentation and part-of-speech tagging to form a series of tagged lexical items. Next, in combination with the knowledge ontology tree, the noun phrases were classified via named entity recognition to obtain the ontology class each entity belongs to. Finally, the knowledge ontology node corresponding to each word was retrieved to obtain the knowledge of the entity word and complete the knowledge labeling. Taking the sentence 'Zhang killed Zhao and dumped the body into the river' as an example, the detailed labeling process is illustrated in Figure 1.
The complete sentence is first segmented into a series of lexical items, and the part-of-speech information of these items is obtained via part-of-speech tagging. For example, 'killing' can be understood as an event; its label is the class of the knowledge ontology that the lexical item belongs to, namely the scene-event class on the knowledge ontology tree. Finally, the corresponding knowledge of the word is retrieved from that class on the tree, completing the knowledge labeling process. The core codes are given below.
⑤
>>> from paddlenlp import Taskflow
>>> wordtag = Taskflow("knowledge_mining",
model="wordtag",
linking=True,
task_path="./custom_task_path/")
>>> wordtag("Zhang killed Zhao and dumped the body into the river")
Specifically, the first line imports the Taskflow executor from paddlenlp to obtain the 'Jieyu' knowledge labeling tool; the task_path argument assigns the custom directory where the public security knowledge ontology is stored; and the final line performs knowledge labeling on the input sentence.
⑥ [
    {
        "text": "Zhang killed Zhao and dumped the body into the river",
        "items": [
            {
                "item": "Zhang",
                "offset": 0,
                "wordtag_label": "Person_entity",
                "length": 2
            },
            {
                "item": "Kill",
                "offset": 2,
                "wordtag_label": "Scene_event",
                "length": 2,
                "termid": "Scene_event_cb_kill"
            },
            {
                "item": "Zhao",
                "offset": 4,
                "wordtag_label": "Person_entity",
                "length": 2
            },
            {
                "item": "and",
                "offset": 6,
                "wordtag_label": "Conjunction",
                "length": 1,
                "termid": "Conjunction_cb_and"
            },
            {
                "item": "Dump the body",
                "offset": 7,
                "wordtag_label": "Scene_event",
                "length": 2,
                "termid": "Scene_event_cb_dump the body"
            },
            {
                "item": "into the river",
                "offset": 9,
                "wordtag_label": "Position",
                "length": 2,
                "termid": "Position_cb_into the river"
            }
        ]
    }
]
Here the text field represents the input text and items the list of parsed lexical-item results: item is the lexical item text, offset is the index of the beginning of the item in the text, length is the length of the item, wordtag_label is the class assigned from the ontology knowledge library, and termid is the knowledge node retrieved from that class.
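Given output in this format, the entity terms needed for later entity linking can be pulled out with a small filter; the parsed result below reproduces the example above (label spellings normalized), and the skipped label set is an illustrative choice.

```python
def extract_entities(result, skip_labels=("Conjunction",)):
    """Collect (item, wordtag_label) pairs from parsed output,
    skipping function words such as conjunctions."""
    entities = []
    for parsed in result:
        for it in parsed["items"]:
            if it["wordtag_label"] not in skip_labels:
                entities.append((it["item"], it["wordtag_label"]))
    return entities

# Parsed output reproducing the example above.
result = [{
    "text": "Zhang killed Zhao and dumped the body into the river",
    "items": [
        {"item": "Zhang", "offset": 0, "wordtag_label": "Person_entity", "length": 2},
        {"item": "Kill", "offset": 2, "wordtag_label": "Scene_event", "length": 2},
        {"item": "Zhao", "offset": 4, "wordtag_label": "Person_entity", "length": 2},
        {"item": "and", "offset": 6, "wordtag_label": "Conjunction", "length": 1},
        {"item": "Dump the body", "offset": 7, "wordtag_label": "Scene_event", "length": 2},
        {"item": "into the river", "offset": 9, "wordtag_label": "Position", "length": 2},
    ],
}]
print(extract_entities(result))
```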
4.2.2 Entity linking
A. Relationship parsing based on dependency grammar analysis
Dependency grammar analysis, a key technique in natural language processing, obtains the syntactic structure of a sentence by recognizing the dependency relations among its words. Based on this syntactic structure, relation parsing in this study was performed with the LTP-based Biaffine dependency parsing model (Dozat et al., 2017), with Stanford Dependencies Chinese (Chang et al., 2009) as the annotation standard. The extraction rules are described in Table 1.
Table 1. Relation extraction rules

| Relation type | Label | Component explanation | Example |
| Subject-predicate relation | SBV | subject-verb | Yu sold cocaine to Zhao (Yu <– sell) |
| Verb-object relation | VOB | direct object, verb-object | Yu sold cocaine to Zhao (sell –> cocaine) |
| Indirect-object relation | IOB | indirect object, indirect-object | Yu sold cocaine to Zhao (sell –> Zhao) |
| Fronting-object relation | FOB | fronting object, fronting-object | Yu kills anyone (person <– kill) |
| Pivot | DBL | double | Yu ran off with money (with –> money) |
| Attributive-centered relation | ATT | attribute | First-grade goods (first-grade <– goods) |
| Adverbial-verb structure | ADV | adverbial | Very dangerous (very <– dangerous) |
| Verb-complement structure | CMP | complement | Finish committing a crime (commit –> finish) |
| Coordinating relation | COO | coordinate | Yu and Zhao (Yu –> Zhao) |
| Preposition-object relation | POB | preposition-object | In the trade zone (in –> zone) |
| Left adjunction relation | LAD | left adjunct | Yu and Zhao (and <– Zhao) |
| Right adjunction relation | RAD | right adjunct | Partners (partner –> s) |
| Independent structure | IS | independent structure | - |
| Core relation | HED | head | - |
B. Relation linking based on text matching
The present study adopted SimBERT, the text similarity matching model in Baidu's open-source natural language processing toolkit PaddleNLP (Su, 2020). First, the sentences were fed into the pre-trained model to generate sentence vectors, and the cosine similarity between the sentence vectors was calculated to form a similarity matrix. For example, in Fig. 2, 'Yu committed a crime together with Yao' is similar to 'Yu and Yao committed a crime together'. The sentence is marked by the purple box; a similar sentence is denoted as 1 and a dissimilar one as 0, yielding the data Mask_1_0_0_0_0. The similarity matrix can be obtained by combining multiple sentences. The core codes are given below.
⑦ import numpy as np

# Generate the sentence vectors
def extract_emb_feature(model, tokenizer, sentences, max_len, mask_if=False):
    mask = generate_mask(sentences, max_len)
    token_ids_list = []
    segment_ids_list = []
    for sen in sentences:
        token_ids, segment_ids = tokenizer.encode(sen, first_length=max_len)
        token_ids_list.append(token_ids)
        segment_ids_list.append(segment_ids)
    result = model.predict([np.array(token_ids_list), np.array(segment_ids_list)])
    if mask_if:
        # Zero out the padded positions before pooling
        result = result * mask
    return np.mean(result, axis=1)

# Generate the padding mask used above
def generate_mask(sen_list, max_len):
    len_list = [len(i) if len(i) <= max_len else max_len for i in sen_list]
    array_mask = np.array([np.hstack((np.ones(j), np.zeros(max_len - j))) for j in len_list])
    return np.expand_dims(array_mask, axis=2)
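The cosine-similarity matrix between the sentence vectors returned by extract_emb_feature can then be computed with plain NumPy; the vectors below are stand-ins for real SimBERT embeddings.

```python
import numpy as np

def cosine_similarity_matrix(vecs):
    """Row-normalize the sentence vectors, then take pairwise dot products."""
    vecs = np.asarray(vecs, dtype=float)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    unit = vecs / norms
    return unit @ unit.T

# Stand-in sentence vectors (real ones come from the SimBERT model above).
vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sim = cosine_similarity_matrix(vecs)
print(np.round(sim, 2))
```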
By packaging the above logic into a similarity function, the similarity between the lexical items output by knowledge labeling and the entity items of the triples output by dependency grammar analysis was calculated and matched. For each triple, the entity texts at indexes 0 and 2 were matched against the labeled entity terms by similarity; the two matched entities were then linked with the triple's relation to output the graph. The core codes are given below.
⑧ for i in depout:
      a = max(similarity([[i[0], s] for s in nerout]), key=lambda x: x["similarity"])
      b = max(similarity([[i[2], s] for s in nerout]), key=lambda x: x["similarity"])
      pprint(select(a) + '->' + i[1] + '->' + select(b))
Zhang->kill->Zhao
Zhang->dump the body->into the river
C. Data format conversion and visualization
The final output triples were converted in format for visual display. During data conversion, each entity in a triple is assigned a randomly generated unique identifier; the class from the ontology annotation serves as the entity tag and the relation as the relation type. The unique identifiers of the left and right entities are filled into the 'source entity id' and 'target entity id' fields to obtain the graph data for visualization. Taking the above triple output as an example, the processed Excel dataset is listed below, in which the Vertexes table records the entity terms and the Edges table records the relation terms.
Table 2. Vertexes table

| Entity id | Entity name | Entity tag | Attribute name_1 | Attribute value_1 |
| ID-Zhang | Zhang | Person | Identity | Criminal suspect |
| ID-Zhao | Zhao | Person | Identity | Criminal victim |
| ID-He | In the river | Position | | |
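The conversion step described above can be sketched as follows: uuid4 supplies the randomly generated unique identifiers, the ontology annotation classes serve as entity tags, and the triples are the example output above; field names are illustrative.

```python
import uuid

def triples_to_graph(triples, tags):
    """Convert (head, relation, tail) triples into vertex and edge records."""
    ids = {}
    vertexes, edges = [], []
    for head, rel, tail in triples:
        for name in (head, tail):
            if name not in ids:
                # Randomly generated unique identifier for each entity.
                ids[name] = str(uuid.uuid4())
                vertexes.append({"entity_id": ids[name],
                                 "entity_name": name,
                                 "entity_tag": tags.get(name, "")})
        edges.append({"source_entity_id": ids[head],
                      "target_entity_id": ids[tail],
                      "relation_type": rel})
    return vertexes, edges

triples = [("Zhang", "kill", "Zhao"), ("Zhang", "dump the body", "In the river")]
tags = {"Zhang": "Person", "Zhao": "Person", "In the river": "Position"}
vertexes, edges = triples_to_graph(triples, tags)
print(len(vertexes), len(edges))  # 3 2
```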