Our model consists of five parts, as shown in Fig. 1. The first part contains two embeddings, a word embedding and an improved Pinyin-character embedding, which convert a sentence into word vectors and Pinyin vectors that the model can process. The second part is two encoders that handle the vectors from the two embeddings: Pinyin vectors and word vectors are processed by the Pinyin Encoder and the Word Encoder respectively, so that the information contained in the two representations does not interfere. After the two encoders have processed the Pinyin and word streams, the third part fuses their features through TextCNN, which operates on fixed-length windows and can therefore extract features without introducing sequence-ordering (timing) problems. The fourth part, a BiLSTM, then processes the merged sequence and models its temporal dependencies. Finally, the CRF layer produces the output through probability calculation.
1 Character Representation
In Chinese NER, the input of the model is a sentence, represented by s1. Before processing a sentence, we use the BERT tokenizer to split it into individual characters, denoted here by c. We thus obtain s1 = {c1, ..., ci, ..., cn}.
After obtaining the characters of the sentence, we need to extract the Pinyin of each Chinese character. In our model, we use a dictionary of length 1325, containing 1238 Pinyin tokens and 87 other commonly used characters. We compare each character with the dictionary: if the current character is a Chinese character, we obtain its Pinyin and look up its code in the dictionary; if the current character cannot be converted to Pinyin, we directly look up the corresponding symbol code in the Pinyin dictionary. We use s2 to represent the Pinyin sequence and p to represent a Pinyin token, so s2 = {p1, ..., pi, ..., pn}. Because Pinyin tokens and characters cannot be processed by the model directly, we use the word2vec [12] method to convert them into vectors. We use a word embedding layer and a Pinyin embedding layer, denoted word_e and pinyin_e respectively, which produce the character vector word_v and the Pinyin vector pinyin_v. We obtain the following expressions:
$$word\_v_{i}=word\_e\left({c}_{i}\right)$$
$$pinyin\_v_{i}=pinyin\_e\left({p}_{i}\right)$$
In this model, we do not use a pre-trained embedding module; both embedding layers are randomly initialized.
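A minimal sketch of the two embedding layers, in PyTorch, is shown below. The vocabulary size and embedding dimension are assumptions for illustration; only the Pinyin dictionary length of 1325 is fixed by the description above, and both layers are randomly initialized as stated.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 21128   # assumed BERT-style character vocabulary size
PINYIN_SIZE = 1325   # 1238 Pinyin tokens + 87 other common symbols
EMB_DIM = 128        # assumed embedding dimension

# word_e and pinyin_e: randomly initialized embedding layers (no pre-training)
word_e = nn.Embedding(VOCAB_SIZE, EMB_DIM)
pinyin_e = nn.Embedding(PINYIN_SIZE, EMB_DIM)

# c_ids / p_ids: index sequences from the tokenizer and the Pinyin dictionary
c_ids = torch.randint(0, VOCAB_SIZE, (1, 20))   # one sentence, 20 characters
p_ids = torch.randint(0, PINYIN_SIZE, (1, 20))

word_v = word_e(c_ids)      # word_v_i = word_e(c_i)
pinyin_v = pinyin_e(p_ids)  # pinyin_v_i = pinyin_e(p_i)
```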
2 Two Parallel Encoders
We use two parallel encoders to process the outputs of the two embeddings. We believe that the sequential (timing) information carried by Pinyin and by characters differs in Chinese NER tasks; if Pinyin and characters are mixed together, their individual characteristics are lost. Therefore, instead of merging the Pinyin vectors and word vectors obtained above into one encoder, we use two encoders to process the two vectors separately. In both the word encoder and the Pinyin encoder, we use the transformer encoder to extract features.
Transformer [8] is a seq2seq model based on the attention mechanism. It consists of encoders and decoders, and we use its encoder part. The transformer encoder comprises Input Embedding, Positional Encoding, Multi-Head Attention, Add & Norm, and Feedforward + Add & Norm. The embedding layer uses the word embedding and the Pinyin embedding mentioned above; the remaining four layers are then stacked to form a transformer encoder.
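A sketch of the two parallel encoders using PyTorch's built-in transformer encoder layers is given below. The layer count, head count, and feed-forward size are assumptions, and the positional encoding of Sect. 2.1 would be added to the inputs beforehand; the point illustrated is only that the two streams are encoded independently.

```python
import torch
import torch.nn as nn

def make_encoder(d_model=128, nhead=8, num_layers=2):
    # one stack of standard transformer encoder layers (assumed hyperparameters)
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                       dim_feedforward=512, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

word_encoder = make_encoder()     # processes word vectors only
pinyin_encoder = make_encoder()   # processes Pinyin vectors only

word_v = torch.randn(1, 20, 128)    # stands in for the word embedding output
pinyin_v = torch.randn(1, 20, 128)  # stands in for the Pinyin embedding output

word_h = word_encoder(word_v)        # (batch, seq_len, d_model)
pinyin_h = pinyin_encoder(pinyin_v)  # encoded independently of word_h
```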
2.1 Positional Encoding
Positional Encoding provides position information. It is needed because of the attention mechanism: unlike LSTM, attention has no inherent notion of sequence order, so ordering must be injected explicitly through Positional Encoding.
$${p}_{i,2j}=\sin\left(\frac{i}{10000^{2j/d}}\right)$$
$${p}_{i,2j+1}=\cos\left(\frac{i}{10000^{2j/d}}\right)$$
where i is the position of a character in the sentence, j indexes the dimensions of the vector, and d is the embedding dimension. With this encoding, the ordering of the sequence is preserved.
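The following sketch implements these formulas: even dimensions use the sine term and odd dimensions the cosine term. The maximum length and dimension d are assumed values; the resulting matrix would be added to the embedding output before it enters an encoder.

```python
import torch

def positional_encoding(max_len=512, d=128):
    pe = torch.zeros(max_len, d)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)     # position i
    div = torch.pow(10000.0, torch.arange(0, d, 2).float() / d)     # 10000^(2j/d)
    pe[:, 0::2] = torch.sin(pos / div)   # p_{i,2j}
    pe[:, 1::2] = torch.cos(pos / div)   # p_{i,2j+1}
    return pe

# e.g. x = word_v + positional_encoding(word_v.size(1), word_v.size(2))
```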
2.2 Multi-Head Attention
The attention mechanism focuses on the more relevant parts of the input. From the input, three matrices, Query, Key and Value, are generated; Query and Key are used to compute the correlation between each character and the others, and these correlations weight the Values. Multi-head attention maps a Query and a set of Key-Value pairs to an output, using several such projections in parallel. The formulas for Query, Key, Value and Attention are:
$$Query=input\cdot {W}_{Q}$$
$$Key=input\cdot {W}_{K}$$
$$Value=input\cdot {W}_{V}$$
$$Attention\left(Q,K,V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{K}}}\right)V$$
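The sketch below follows these formulas for a single head; multi-head attention repeats this with several (W_Q, W_K, W_V) projections and concatenates the results. All shapes and weight matrices here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attention(inputs, W_Q, W_K, W_V):
    Q = inputs @ W_Q                       # Query = input * W_Q
    K = inputs @ W_K                       # Key   = input * W_K
    V = inputs @ W_V                       # Value = input * W_V
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_K)
    return F.softmax(scores, dim=-1) @ V            # Attention(Q, K, V)

x = torch.randn(1, 20, 128)                         # (batch, seq_len, d_model)
W_Q, W_K, W_V = (torch.randn(128, 64) for _ in range(3))
out = attention(x, W_Q, W_K, W_V)                   # (1, 20, 64)
```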
2.3 Add & Norm
There are two operations in this layer:
1. Residual connection: add the input of the previous layer to its output.
2. Layer normalization of the resulting hidden representation.
2.4 Feedforward + Add & Norm
The feedforward network is simply a two-layer linear mapping with an activation function in between. Residual connection and normalization are then applied as described above. A combined sketch of both sub-layers is given below.
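This sketch shows one "Add & Norm" step followed by the feedforward sub-layer with its own residual connection and normalization. The module name and sizes are assumptions for illustration only.

```python
import torch.nn as nn

class AddNormFeedForward(nn.Module):
    def __init__(self, d_model=128, d_ff=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # two-layer linear mapping with an activation in between
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x, sublayer_out):
        h = self.norm1(x + sublayer_out)    # Add & Norm after attention
        return self.norm2(h + self.ff(h))   # Feedforward + Add & Norm
```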
3 Feature Fusion Layer
After obtaining the outputs of the two encoders, the two vectors need to be merged. However, we cannot simply concatenate the outputs of the two encoders and pass them directly to the decoder, because processing them together in that way would still lose the sequential information of Pinyin and characters. At this point, we need a feature extraction method that operates locally rather than over the whole sentence, namely TextCNN. TextCNN processes sentence context over a fixed window length and can better extract the Pinyin and character features of each individual character.
In 2014, Yoon Kim modified the input layer of CNN and proposed a text classification model, TextCNN [7]. TextCNN is a simple and effective deep-learning algorithm for short text classification tasks. Due to its promising performance, TextCNN has become a baseline model for text classification [13]. With the development of NLP, many researchers have begun to apply TextCNN to NER. For example, in 2021, Jun Kong et al. proposed a multi-level CNN with an attention mechanism [14]; this model uses CNN to extract text features for NER.
Similar to image convolution, TextCNN slides a kernel whose width equals the word embedding dimension over the sequence, extracting features about the current character and its context, as shown in Fig. 2. In our model, we set the kernel size to 3 and the number of kernel channels to 512. TextCNN can better condense the concatenated word and Pinyin features. We added the TextCNN module in our experiments and found that it can indeed integrate character and Pinyin information to achieve better results: in the ablation study, the model with TextCNN outperforms the model without it.
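A sketch of this fusion step is shown below: the word and Pinyin encoder outputs are concatenated per character and condensed with a one-dimensional convolution using kernel size 3 and 512 output channels, as stated above. The use of padding to keep the sequence length, and the input dimension, are assumptions.

```python
import torch
import torch.nn as nn

class TextCNNFusion(nn.Module):
    def __init__(self, d_model=128, out_channels=512, kernel_size=3):
        super().__init__()
        # convolution over the concatenated word + Pinyin features
        self.conv = nn.Conv1d(2 * d_model, out_channels,
                              kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, word_h, pinyin_h):
        x = torch.cat([word_h, pinyin_h], dim=-1)   # (batch, seq_len, 2*d_model)
        x = x.transpose(1, 2)                       # Conv1d expects (batch, channels, seq_len)
        return self.conv(x).transpose(1, 2)         # (batch, seq_len, 512)

fusion = TextCNNFusion()
fused = fusion(torch.randn(1, 20, 128), torch.randn(1, 20, 128))  # (1, 20, 512)
```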
4 BiLSTM Module
In the previous processing, we considered the ordering of Pinyin and characters only separately; we did not consider their ordering jointly. Through TextCNN, we have fused the features of Pinyin and characters. In 1997, LSTM, with its memory cell and gating mechanism, was proposed to address gradient vanishing, gradient explosion, and long-distance information transmission in text sequences. We therefore use a BiLSTM to process the output of TextCNN.
where W denotes the weight matrices and b the bias vectors, σ is the activation function, ct is the updated cell state at time t, and it, ft, and ot are the outputs of the input gate, forget gate, and output gate, respectively. Finally, ht is the output of the entire LSTM unit at time t. The LSTM is computed as follows:
$${i}_{t}=\sigma \left({W}_{xi}{x}_{t}+{W}_{hi}{h}_{t-1}+{W}_{ci}{c}_{t-1}+{b}_{i}\right)$$
$${f}_{t}=\sigma \left({W}_{xf}{x}_{t}+{W}_{hf}{h}_{t-1}+{W}_{cf}{c}_{t-1}+{b}_{f}\right)$$
$${c}_{t}={f}_{t}{c}_{t-1}+tanh\left({W}_{xc}{x}_{t}+{W}_{hc}{h}_{t-1}+{b}_{c}\right){i}_{t}$$
$${o}_{t}=\sigma \left({W}_{xo}{x}_{t}+{W}_{ho}{h}_{t-1}+{W}_{co}{c}_{t}+{b}_{o}\right)$$
$${h}_{t}={o}_{t}tanh\left({c}_{t}\right)$$
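A minimal sketch of the BiLSTM step over the fused features follows. The hidden size is an assumption; with bidirectional=True, the forward and backward hidden states are concatenated for each character.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=512, hidden_size=256,
                 batch_first=True, bidirectional=True)

fused = torch.randn(1, 20, 512)   # stands in for the TextCNN fusion output
lstm_out, _ = bilstm(fused)       # (1, 20, 512): forward + backward states per character
```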
5 CRF Module
Conditional Random Field (CRF) is a basic model in NLP, widely used in scenarios such as word segmentation, entity recognition, and part-of-speech tagging. LSTM can effectively process long-distance information in text, but it cannot learn the dependencies between labels. For example, under the "BIO" annotation scheme, an I label can only follow a B label, and this constraint cannot be enforced by the LSTM alone. This is where the CRF comes in: it learns the relationships between tags from the dataset and improves the model by correcting the output labels through probability calculation. In our model, the output of the BiLSTM is mapped to tag scores through a linear layer and used as the input of the CRF layer; the final results are obtained after CRF correction.
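A sketch of this final step is given below, assuming the third-party pytorch-crf package for the CRF layer (an assumption, not necessarily the implementation used here). A linear layer maps the BiLSTM output to per-tag emission scores, the CRF scores whole label sequences during training, and Viterbi decoding yields the corrected output labels.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party pytorch-crf package (assumed)

NUM_TAGS = 7                               # assumed size of the BIO tag set
emission_layer = nn.Linear(512, NUM_TAGS)  # classifies the BiLSTM output per character
crf = CRF(NUM_TAGS, batch_first=True)

lstm_out = torch.randn(1, 20, 512)         # stands in for the BiLSTM output
tags = torch.zeros(1, 20, dtype=torch.long)  # gold labels for training

emissions = emission_layer(lstm_out)
loss = -crf(emissions, tags)               # negative log-likelihood for training
best_path = crf.decode(emissions)          # Viterbi-decoded label sequence
```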