Our model consists of five parts, as shown in Fig. 1. The first part contains two embeddings, a word embedding and an improved Pinyin-character embedding, which convert a sentence into word vectors and Pinyin vectors that the model can process. The second part is two encoders that handle the vectors from the two embeddings: Pinyin vectors and word vectors are processed by the Pinyin Encoder and the Word Encoder respectively, so that the information contained in the two representations does not interfere. After the two encoders have processed the Pinyin and word streams, the third part fuses their features through TextCNN, which operates on fixed-length windows and can therefore extract features without introducing sequence-ordering (timing) problems. The fourth part, a BiLSTM, then processes the merged sequence and models its temporal dependencies. Finally, the CRF layer produces the output through probability calculation.
1 Character Representation
In Chinese NER, the input of the model is a sentence, represented by s1. Before processing a sentence, we use the BERT tokenizer to split it into individual characters, denoted here by c. We thus obtain s1 = {c1, ..., ci, ..., cn}.
After obtaining the characters of the sentence, we need to extract the Pinyin of each Chinese character. In our model, we use a dictionary of length 1325, containing 1238 Pinyin tokens and 87 other commonly used characters. We compare each character with the dictionary: if the current character is a Chinese character, we obtain its Pinyin and look up its code in the dictionary; if the current character cannot be converted to Pinyin, we directly look up the corresponding symbol code in the Pinyin dictionary. We use s2 to represent the Pinyin sequence and p to represent a Pinyin token, so s2 = {p1, ..., pi, ..., pn}. Because Pinyin tokens and characters cannot be processed by the model directly, we use the word2vec [12] method to convert them into vectors. We use a word embedding layer and a Pinyin embedding layer, denoted word_e and pinyin_e respectively, which produce the character vector word_v and the Pinyin vector pinyin_v. We obtain the following expressions:
$$word\_v_{i}=word\_e\left({c}_{i}\right)$$
$$pinyin\_v_{i}=pinyin\_e\left({p}_{i}\right)$$
In this model, we do not use a pre-trained embedding module; both embedding layers are randomly initialized.
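A minimal sketch of the two embedding layers, in PyTorch, is shown below. The vocabulary size and embedding dimension are assumptions for illustration; only the Pinyin dictionary length of 1325 is fixed by the description above, and both layers are randomly initialized as stated.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 21128   # assumed BERT-style character vocabulary size
PINYIN_SIZE = 1325   # 1238 Pinyin tokens + 87 other common symbols
EMB_DIM = 128        # assumed embedding dimension

# word_e and pinyin_e: randomly initialized embedding layers (no pre-training)
word_e = nn.Embedding(VOCAB_SIZE, EMB_DIM)
pinyin_e = nn.Embedding(PINYIN_SIZE, EMB_DIM)

# c_ids / p_ids: index sequences from the tokenizer and the Pinyin dictionary
c_ids = torch.randint(0, VOCAB_SIZE, (1, 20))   # one sentence, 20 characters
p_ids = torch.randint(0, PINYIN_SIZE, (1, 20))

word_v = word_e(c_ids)      # word_v_i = word_e(c_i)
pinyin_v = pinyin_e(p_ids)  # pinyin_v_i = pinyin_e(p_i)
```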
2 Two Parallel Encoders
We use two parallel encoders to process the outputs of the two embeddings. We believe that the sequential (timing) information carried by Pinyin and by characters differs in Chinese NER tasks; if Pinyin and characters are mixed together, their individual characteristics are lost. Therefore, instead of merging the Pinyin vectors and word vectors obtained above into one encoder, we use two encoders to process the two vectors separately. In both the word encoder and the Pinyin encoder, we use the transformer encoder to extract features.
Transformer [8] is a seq2seq model based on the attention mechanism. It consists of encoders and decoders, and we use its encoder part. The transformer encoder comprises Input Embedding, Positional Encoding, Multi-Head Attention, Add & Norm, and Feedforward + Add & Norm. The embedding layer uses the word embedding and the Pinyin embedding mentioned above; the remaining four layers are then stacked to form a transformer encoder.
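A sketch of the two parallel encoders using PyTorch's built-in transformer encoder layers is given below. The layer count, head count, and feed-forward size are assumptions, and the positional encoding of Sect. 2.1 would be added to the inputs beforehand; the point illustrated is only that the two streams are encoded independently.

```python
import torch
import torch.nn as nn

def make_encoder(d_model=128, nhead=8, num_layers=2):
    # one stack of standard transformer encoder layers (assumed hyperparameters)
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                       dim_feedforward=512, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

word_encoder = make_encoder()     # processes word vectors only
pinyin_encoder = make_encoder()   # processes Pinyin vectors only

word_v = torch.randn(1, 20, 128)    # stands in for the word embedding output
pinyin_v = torch.randn(1, 20, 128)  # stands in for the Pinyin embedding output

word_h = word_encoder(word_v)        # (batch, seq_len, d_model)
pinyin_h = pinyin_encoder(pinyin_v)  # encoded independently of word_h
```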
2.1 Positional Encoding
Positional Encoding provides position information. It is needed because of the attention mechanism: unlike LSTM, attention has no inherent notion of sequence order, so ordering must be injected explicitly through Positional Encoding.
$${p}_{i,2j}=\sin\left(\frac{i}{10000^{2j/d}}\right)$$
$${p}_{i,2j+1}=\cos\left(\frac{i}{10000^{2j/d}}\right)$$
where i is the position of a character in the sentence, j indexes the dimensions of the vector, and d is the embedding dimension. With this encoding, the ordering of the sequence is preserved.
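The following sketch implements these formulas: even dimensions use the sine term and odd dimensions the cosine term. The maximum length and dimension d are assumed values; the resulting matrix would be added to the embedding output before it enters an encoder.

```python
import torch

def positional_encoding(max_len=512, d=128):
    pe = torch.zeros(max_len, d)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)     # position i
    div = torch.pow(10000.0, torch.arange(0, d, 2).float() / d)     # 10000^(2j/d)
    pe[:, 0::2] = torch.sin(pos / div)   # p_{i,2j}
    pe[:, 1::2] = torch.cos(pos / div)   # p_{i,2j+1}
    return pe

# e.g. x = word_v + positional_encoding(word_v.size(1), word_v.size(2))
```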
2.2 Multi-Head Attention
The attention mechanism focuses on the more relevant parts of the input. From the input, three matrices, Query, Key and Value, are generated; Query and Key are used to compute the correlation between each character and the others, and these correlations weight the Values. Multi-head attention maps a Query and a set of Key-Value pairs to an output, using several such projections in parallel. The formulas for Query, Key, Value and Attention are:
$$Query=input\cdot {W}_{Q}$$
$$Key=input\cdot {W}_{K}$$
$$Value=input\cdot {W}_{V}$$
$$Attention\left(Q,K,V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{K}}}\right)V$$
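The sketch below follows these formulas for a single head; multi-head attention repeats this with several (W_Q, W_K, W_V) projections and concatenates the results. All shapes and weight matrices here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attention(inputs, W_Q, W_K, W_V):
    Q = inputs @ W_Q                       # Query = input * W_Q
    K = inputs @ W_K                       # Key   = input * W_K
    V = inputs @ W_V                       # Value = input * W_V
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_K)
    return F.softmax(scores, dim=-1) @ V            # Attention(Q, K, V)

x = torch.randn(1, 20, 128)                         # (batch, seq_len, d_model)
W_Q, W_K, W_V = (torch.randn(128, 64) for _ in range(3))
out = attention(x, W_Q, W_K, W_V)                   # (1, 20, 64)
```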
2.3 Add & Norm
There are two operations in this layer:
1. Residual connection: add the input of the previous layer to its output.
2. Layer normalization of the resulting hidden representation.
2.4 Feedforward + Add & Norm
The feedforward network is simply a two-layer linear mapping with an activation function in between. Residual connection and normalization are then applied as described above. A combined sketch of both sub-layers is given below.
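This sketch shows one "Add & Norm" step followed by the feedforward sub-layer with its own residual connection and normalization. The module name and sizes are assumptions for illustration only.

```python
import torch.nn as nn

class AddNormFeedForward(nn.Module):
    def __init__(self, d_model=128, d_ff=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # two-layer linear mapping with an activation in between
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x, sublayer_out):
        h = self.norm1(x + sublayer_out)    # Add & Norm after attention
        return self.norm2(h + self.ff(h))   # Feedforward + Add & Norm
```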
3 Feature Fusion Layer
After obtaining the outputs of the two encoders, the two vectors need to be merged. However, we cannot simply concatenate the outputs of the two encoders and pass them directly to the decoder, because processing them together in that way would still lose the sequential information of Pinyin and characters. At this point, we need a feature extraction method that operates locally rather than over the whole sentence, namely TextCNN. TextCNN processes sentence context over a fixed window length and can better extract the Pinyin and character features of each individual character.
In 2014, Yoon Kim modified the input layer of CNN and proposed a text classification model, TextCNN [7]. TextCNN is a simple and effective deep-learning algorithm for short text classification tasks. Due to its promising performance, TextCNN has become a baseline model for text classification [13]. With the development of NLP, many researchers have begun to apply TextCNN to NER. For example, in 2021, Jun Kong et al. proposed a multi-level CNN with an attention mechanism [14]; this model uses CNN to extract text features for NER.
Similar to image convolution, TextCNN slides a kernel whose width equals the word embedding dimension over the sequence, extracting features about the current character and its context, as shown in Fig. 2. In our model, we set the kernel size to 3 and the number of kernel channels to 512. TextCNN can better condense the concatenated word and Pinyin features. We added the TextCNN module in our experiments and found that it can indeed integrate character and Pinyin information to achieve better results: in the ablation study, the model with TextCNN outperforms the model without it.
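A sketch of this fusion step is shown below: the word and Pinyin encoder outputs are concatenated per character and condensed with a one-dimensional convolution using kernel size 3 and 512 output channels, as stated above. The use of padding to keep the sequence length, and the input dimension, are assumptions.

```python
import torch
import torch.nn as nn

class TextCNNFusion(nn.Module):
    def __init__(self, d_model=128, out_channels=512, kernel_size=3):
        super().__init__()
        # convolution over the concatenated word + Pinyin features
        self.conv = nn.Conv1d(2 * d_model, out_channels,
                              kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, word_h, pinyin_h):
        x = torch.cat([word_h, pinyin_h], dim=-1)   # (batch, seq_len, 2*d_model)
        x = x.transpose(1, 2)                       # Conv1d expects (batch, channels, seq_len)
        return self.conv(x).transpose(1, 2)         # (batch, seq_len, 512)

fusion = TextCNNFusion()
fused = fusion(torch.randn(1, 20, 128), torch.randn(1, 20, 128))  # (1, 20, 512)
```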
4 BiLSTM Module
In the previous processing, we considered the ordering of Pinyin and characters only separately; we did not consider their ordering jointly. Through TextCNN, we have fused the features of Pinyin and characters. In 1997, LSTM, with its memory cell and gating mechanism, was proposed to address gradient vanishing, gradient explosion, and long-distance information transmission in text sequences. We therefore use a BiLSTM to process the output of TextCNN.
where W denotes the weight matrices and b the bias vectors, σ is the activation function, ct is the updated cell state at time t, and it, ft, and ot are the outputs of the input gate, forget gate, and output gate, respectively. Finally, ht is the output of the entire LSTM unit at time t. The LSTM is computed as follows:
$${i}_{t}=\sigma \left({W}_{xi}{x}_{t}+{W}_{hi}{h}_{t-1}+{W}_{ci}{c}_{t-1}+{b}_{i}\right)$$
$${f}_{t}=\sigma \left({W}_{xf}{x}_{t}+{W}_{hf}{h}_{t-1}+{W}_{cf}{c}_{t-1}+{b}_{f}\right)$$
$${c}_{t}={f}_{t}{c}_{t-1}+tanh\left({W}_{xc}{x}_{t}+{W}_{hc}{h}_{t-1}+{b}_{c}\right){i}_{t}$$
$${o}_{t}=\sigma \left({W}_{xo}{x}_{t}+{W}_{ho}{h}_{t-1}+{W}_{co}{c}_{t}+{b}_{o}\right)$$
$${h}_{t}={o}_{t}tanh\left({c}_{t}\right)$$
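A minimal sketch of the BiLSTM step over the fused features follows. The hidden size is an assumption; with bidirectional=True, the forward and backward hidden states are concatenated for each character.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=512, hidden_size=256,
                 batch_first=True, bidirectional=True)

fused = torch.randn(1, 20, 512)   # stands in for the TextCNN fusion output
lstm_out, _ = bilstm(fused)       # (1, 20, 512): forward + backward states per character
```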
5 CRF Module
Conditional Random Field (CRF) is a basic model in NLP, widely used in scenarios such as word segmentation, entity recognition, and part-of-speech tagging. LSTM can effectively process long-distance information in text, but it cannot learn the dependencies between labels. For example, under the "BIO" annotation scheme, an I label can only follow a B label, and this constraint cannot be enforced by the LSTM alone. This is where the CRF comes in: it learns the relationships between tags from the dataset and improves the model by correcting the output labels through probability calculation. In our model, the output of the BiLSTM is mapped to tag scores through a linear layer and used as the input of the CRF layer; the final results are obtained after CRF correction.
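A sketch of this final step is given below, assuming the third-party pytorch-crf package for the CRF layer (an assumption, not necessarily the implementation used here). A linear layer maps the BiLSTM output to per-tag emission scores, the CRF scores whole label sequences during training, and Viterbi decoding yields the corrected output labels.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party pytorch-crf package (assumed)

NUM_TAGS = 7                               # assumed size of the BIO tag set
emission_layer = nn.Linear(512, NUM_TAGS)  # classifies the BiLSTM output per character
crf = CRF(NUM_TAGS, batch_first=True)

lstm_out = torch.randn(1, 20, 512)         # stands in for the BiLSTM output
tags = torch.zeros(1, 20, dtype=torch.long)  # gold labels for training

emissions = emission_layer(lstm_out)
loss = -crf(emissions, tags)               # negative log-likelihood for training
best_path = crf.decode(emissions)          # Viterbi-decoded label sequence
```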