4.1 Dataset
In this experiment, the SemEval-2010 Task 8 dataset was used. The dataset was built around nine predefined, mutually exclusive relationship types and contains 10,717 examples: 8,000 for training and 2,717 for testing. Each example is annotated with one of the nine relationship types or with the Other relationship. The distribution of the nine relationship types is shown in Table 1:
Table 1
Relationship Distribution of SemEval-2010 Task 8 Dataset
| Relationship Type | Training Set | Testing Set |
| --- | --- | --- |
| Cause-Effect | 1003 | 328 |
| Component-Whole | 941 | 312 |
| Entity-Destination | 845 | 292 |
| Product-Producer | 717 | 261 |
| Entity-Origin | 716 | 258 |
| Member-Collection | 690 | 233 |
| Message-Topic | 634 | 231 |
| Content-Container | 540 | 192 |
| Instrument-Agency | 504 | 156 |
| Other | 1410 | 454 |
In addition to the annotated relationship type, each example also contains two annotated entities, e1 and e2. All relationship types other than Other are directional; for example, Cause-Effect(e1, e2) and Cause-Effect(e2, e1) are different relationships. Therefore, experiments usually set 19 relationship classes (9 types × 2 directions, plus Other) for prediction.
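For illustration, the 19-way label set can be constructed as follows. This is a minimal sketch; the label strings and their ordering are our own convention, not the official scorer's:

```python
# Nine directed relation types, each in two directions, plus Other = 19 classes.
RELATION_TYPES = [
    "Cause-Effect", "Component-Whole", "Content-Container",
    "Entity-Destination", "Entity-Origin", "Instrument-Agency",
    "Member-Collection", "Message-Topic", "Product-Producer",
]

LABELS = ["Other"] + [
    f"{rel}({a},{b})"
    for rel in RELATION_TYPES
    for a, b in (("e1", "e2"), ("e2", "e1"))
]

assert len(LABELS) == 19
```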
In this paper, the macro-averaged F1 value computed by the official scoring script provided with the SemEval-2010 Task 8 dataset was used for evaluation. Under this scheme, the macro-averaged F1 is computed over the nine actual relationships (excluding the Other type), and the directionality of the relationships is taken into account. Computing F1 requires precision and recall, as shown in equations (9) to (11):
$$\text{precision}=\frac{TP}{TP+FP}\tag{9}$$

$$\text{recall}=\frac{TP}{TP+FN}\tag{10}$$

$$\text{F1}=\frac{2\times \text{precision}\times \text{recall}}{\text{precision}+\text{recall}}\tag{11}$$
where true positive (TP) is the number of positive predictions that are correct, false positive (FP) is the number of positive predictions that are wrong, and false negative (FN) is the number of positive examples that were incorrectly predicted as negative.
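The official Perl scorer remains authoritative; purely as a simplified sketch of the scheme above (assuming gold and predicted labels are strings in the 19-way directional format, e.g. "Cause-Effect(e1,e2)"), the macro-averaged F1 over the nine types could be computed as:

```python
from collections import Counter

def macro_f1(gold, pred, relation_types):
    """Macro-averaged F1 over the nine relation types, excluding Other.

    Labels are directional strings; a prediction counts as correct only
    if both type and direction match. Counts are then pooled per type.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        g_type, p_type = g.split("(")[0], p.split("(")[0]
        if g == p:
            if g_type != "Other":
                tp[g_type] += 1
        else:
            if p_type != "Other":
                fp[p_type] += 1  # wrong type or wrong direction
            if g_type != "Other":
                fn[g_type] += 1  # the gold relation was missed
    f1_scores = []
    for rel in relation_types:
        prec = tp[rel] / (tp[rel] + fp[rel]) if tp[rel] + fp[rel] else 0.0
        rec = tp[rel] / (tp[rel] + fn[rel]) if tp[rel] + fn[rel] else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Note that in this sketch a prediction with the correct type but the wrong direction counts as both a false positive and a false negative for that type, which matters for the analysis in Section 4.5.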
4.2 Hyper-parameter settings
The hyper-parameter settings are listed in Table 2:
Table 2
Hyper-parameter Settings

| Parameter | Value |
| --- | --- |
| Batch_size | 8 |
| Max_sequence_length | 384 |
| Learning_rate | 2e-5 |
| Train_epoch | 5 |
| Adam_epsilon | 1e-8 |
| dropout_rate | 0.1 |
| seed | 42 |
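For concreteness, these settings map directly onto a standard PyTorch fine-tuning setup. The following is a minimal sketch, assuming the Hugging Face transformers library; the checkpoint name and variable names are our assumptions, not taken from the paper's code:

```python
import random

import numpy as np
import torch
from transformers import BertModel, BertTokenizer

# Fix all random seeds (seed = 42 in Table 2) for reproducibility.
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

# The checkpoint name is an assumption; the paper does not name one here.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

batch_size = 8                     # Batch_size
max_sequence_length = 384          # Max_sequence_length
train_epochs = 5                   # Train_epoch
dropout = torch.nn.Dropout(p=0.1)  # dropout_rate

# Adam-style optimizer with the learning rate and epsilon from Table 2.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5, eps=1e-8)
```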
4.3 Comparison of experimental results
Table 3 compares the performance of the proposed model with various neural network models on the SemEval-2010 Task 8 dataset, showing that the proposed method achieves good results. The highest F1 value is shown in bold.
Table 3
Comparison of F1 Values of Different Models

| Model | F1/% |
| --- | --- |
| RNN | 77.6 |
| Bi-LSTM | 82.7 |
| CNN+softmax | 82.7 |
| CR-CNN | 84.1 |
| Attention Bi-LSTM | 85.2 |
| Attention CNN | 85.9 |
| BERT-base | 87.1 |
| R-BERT | 89.25 |
| R-BERT+SDP | **89.97** |
The results in the table show that pre-training-based models clearly outperform neural network models such as CNN and LSTM. This paper therefore also built on a pre-training model and selected R-BERT as the baseline. R-BERT extends the pre-trained encoder by highlighting entity information with special identifiers that mark entity locations; it achieved the best results at the time, with an official F1 value of 89.25%. On this basis, in this paper the shortest dependency path was obtained through dependency parsing and integrated into the R-BERT model, so that the model could learn the contextual information of sentences. After the shortest dependency path is introduced, the F1 value reaches 89.97%, which demonstrates that the context information provided by dependency parsing is effective.
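The fusion described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it omits R-BERT's per-component tanh and linear projections, assumes the SDP token positions are known in advance, and (for brevity) applies the same spans to every example in a batch:

```python
import torch
import torch.nn as nn

class RBertWithSDP(nn.Module):
    """Sketch of an R-BERT-style classifier extended with an SDP vector.

    Simplifications: R-BERT's per-component tanh + linear layers are
    omitted, and the entity/SDP spans are applied batch-wide.
    """

    def __init__(self, encoder, hidden_size=768, num_labels=19, dropout=0.1):
        super().__init__()
        self.encoder = encoder              # e.g. a Hugging Face BertModel
        self.dropout = nn.Dropout(dropout)
        # [CLS] + entity 1 + entity 2 + SDP -> four concatenated vectors.
        self.classifier = nn.Linear(4 * hidden_size, num_labels)

    @staticmethod
    def span_mean(hidden, span):
        # Average the hidden states of the tokens in [start, end).
        start, end = span
        return hidden[:, start:end, :].mean(dim=1)

    def forward(self, input_ids, attention_mask, e1_span, e2_span, sdp_span):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        cls_vec = hidden[:, 0, :]                    # [CLS] hidden vector
        e1_vec = self.span_mean(hidden, e1_span)     # entity 1 (incl. <e1> tags)
        e2_vec = self.span_mean(hidden, e2_span)     # entity 2 (incl. <e2> tags)
        sdp_vec = self.span_mean(hidden, sdp_span)   # shortest dependency path tokens
        rep = torch.cat([cls_vec, e1_vec, e2_vec, sdp_vec], dim=-1)
        return self.classifier(self.dropout(rep))    # logits over 19 classes
```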
4.4 Ablation Experiments
The above experimental results validate the method proposed in this paper. To further understand which factors, besides BERT itself, contributed to the results, three ablation experiments were designed. Because the entity tags "<e1>" and "<e2>" emphasize the entities and add boundary information, which significantly improves classification, these tags were retained in every ablation experiment.
In the first experiment, a [CLS] token was added before the input sentence, and the hidden-layer vector of this token alone was used as the sentence representation for classification. In the second experiment, the [CLS] vector and the hidden vector of the entity dependency path were concatenated as the sentence representation; here the entity dependency path did not contain entity information. In the third experiment, the [CLS] vector and the hidden vectors of the entities were concatenated as the sentence representation; in this case the entity information included the entity tags and thus the entity boundary information. In Table 4, SDP denotes the shortest dependency path and ENT denotes the entity vectors.
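Under the same assumptions as the sketch in Section 4.3, the four representations compared in Table 4 differ only in which vectors are concatenated; a hypothetical helper makes this explicit:

```python
import torch

def build_representation(cls_vec, ent_vecs=None, sdp_vec=None):
    """Compose the classifier input for each ablation variant.

    [CLS]         -> build_representation(cls_vec)
    [CLS]+SDP     -> build_representation(cls_vec, sdp_vec=sdp_vec)
    [CLS]+ENT     -> build_representation(cls_vec, ent_vecs=(e1_vec, e2_vec))
    [CLS]+ENT+SDP -> build_representation(cls_vec, (e1_vec, e2_vec), sdp_vec)
    """
    parts = [cls_vec]
    if ent_vecs is not None:
        parts.extend(ent_vecs)   # entity vectors (tags provide boundary info)
    if sdp_vec is not None:
        parts.append(sdp_vec)    # shortest-dependency-path vector
    # The classifier's input dimension must match the variant being trained.
    return torch.cat(parts, dim=-1)
```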
Table 4
Comparison of Different Components of BERT-based Method
| Relationship Representation | F1/% |
| --- | --- |
| [CLS] | 87.99 |
| [CLS]+SDP | 89.15 |
| [CLS]+ENT | 89.23 |
| [CLS]+ENT+SDP | **89.97** |
It can be seen from Table 4 that adding the entity identifiers improves the results, as they provide the model with entity boundary information and emphasize the entities. Using the hidden vector of the entity dependency path as the sentence representation performs almost as well as using the entity hidden vectors, though the entity information gives the slightly better result. These results show that the model can exploit context information, but still needs entity information as a supplement. After combining the entity information with the context information provided by dependency parsing, the model predicts the classification best.
4.5 Case study
This section analyzes the results of the R-BERT model and the model proposed in this paper in detail, and compares their results for each relationship type, as shown in Table 5.
Table 5
Comparison of F1 Values of Various Relationship Types
| Relationship | R-BERT | BERT+ENT+SDP |
| --- | --- | --- |
| Cause-Effect | 93.11 | 92.47 |
| Component-Whole | 87.34 | 87.72 |
| Content-Container | 90.03 | 92.93 |
| Entity-Destination | 94.18 | 93.68 |
| Entity-Origin | 89.14 | 89.15 |
| Instrument-Agency | 82.87 | 84.04 |
| Member-Collection | 87.82 | 88.48 |
| Message-Topic | 90.77 | 91.31 |
| Product-Producer | 87.82 | 90.00 |
| Other | 64.22 | 67.14 |
| Official Score | 89.23 | 89.97 |
The results in the table show that, after the introduction of the entity dependency path, classification improves over the baseline for most relationship types, most noticeably for Content-Container, Product-Producer, and Instrument-Agency. This indicates that the experiment successfully integrates the entity dependency path into the pre-training model, and that doing so benefits relationship classification.
However, the classification effect for Cause-Effect and Entity-Destination has not improved but instead dropped noticeably. We therefore reviewed the classification results of the two models in detail and extracted examples that each model misclassified. Table 6 provides detailed examples of classification errors for these two types.
Table 6
Comparative Examples of Results Generated by Models
| Example Sentence | Official (gold) | R-BERT | BERT+ENT+SDP |
| --- | --- | --- | --- |
| A few days before the service, Tom Burris had thrown into Karen's <e1> casket </e1> his wedding <e2> ring </e2>. | Entity-Destination(e2, e1) | Other | Entity-Destination(e1, e2) |
| Each time a <e1> neuron </e1> unleashes its tiny <e2> jolt </e2>, it needs to replenish its stores of energy for the next spark. | Cause-Effect(e1, e2) | Other | Cause-Effect(e2, e1) |
| These wind <e1> turbines </e1> generate <e2> electricity </e2> from naturally occurring wind. | Cause-Effect(e1, e2) | Content-Container(e1, e2) | Cause-Effect(e2, e1) |
The examples in the table show that, for these two types, the model proposed in this paper predicted the correct relationship type but the wrong relationship direction, whereas the baseline model predicted relationship types entirely different from the gold labels. Taking Cause-Effect as an example: with the recall rates of the two models roughly equal, our model labels more examples as Cause-Effect than the baseline does, and the wrong-direction predictions among them count against it, so its precision for this type is lower. As a result, its F1 value for Cause-Effect falls below the baseline's.
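To make the mechanism concrete, consider an illustrative example with made-up counts. Suppose there are 100 gold Cause-Effect instances and each model labels 90 of them correctly in both type and direction. If the baseline labels the remaining 10 as Other (10 FN, 0 FP for this type), while our model labels them with the correct type but the wrong direction (10 FN and 10 FP, since direction is scored), then

$$\text{precision}_{\text{baseline}}=\frac{90}{90}=100\%,\qquad \text{precision}_{\text{ours}}=\frac{90}{100}=90\%,$$

with both recalls at 90%, giving F1 values of about 94.7% and 90.0% respectively: the same recall with lower precision reproduces the drop observed for Cause-Effect.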
The above results show that the proposed method not only allows the model to learn the context information provided by dependency parsing, but also improves the model's predictions. However, for some relationship types the model underuses this context information, yielding the correct relationship type but the wrong relationship direction. This indicates that there is still room for improvement in exploiting context information, which will be the focus of our future work.