Background
Recently, how to structuralize electronic medical records (EMRs) has attracted considerable attention from researchers. Extracting clinical concepts from EMRs is a critical part of EMR structuralization. The performance of clinical concept extraction will directly affect the performance of the downstream tasks related to EMR structuralization. We propose a new modeling method based on candidate window classification, which is different from mainstream sequence labeling models, to improves the performance of clinical concept extraction tasks under strict standards by considering the overall semantics of the token sequence instead of the semantics of each token. We call this model as slide window model.
Method
In this paper, we comprehensively study the performance of the slide window model in clinical concept extraction tasks. We model the clinical concept extraction task as the task of classifying each candidate window, which was extracted by the slide window. The proposed model mainly consists of four parts. First, the pre-trained language model is used to generate the context-sensitive token representation. Second, a convolutional neural network (CNN) is used to generate all representation vector of the candidate windows in the sentence. Third, every candidate window is classified by a Softmax classifier to obtain concept type probability distribution. Finally, the knapsack algorithm is used as a post-process to maximize the sum of disjoint clinical concepts scores and filter the clinical concepts.
Results
Experiments show that the slide window model achieves the best micro-average F1 score(81.22%) on the corpora of the 2012 i2b2 NLP challenges and achieves 89.25% F1 score on the 2010 i2b2 NLP challenges under the strict standard. Furthermore, the performance of our approach is always better than the BiLSTM-CRF model and softmax classifier with the same pre-trained language model.
Conclusions
The slide window model shows a new modeling method for solving clinical concept extraction tasks. It models clinical concept extraction as a problem for classifying candidate windows and extracts clinical concepts by considering the semantics of the entire candidate window. Experiments show that this method of considering the overall semantics of the candidate window can improve the performance of clinical concept extraction tasks under strict standards.