Collecting Specialty-related Medical Terms: Development and Evaluation of a Resource for Spanish

doi:10.21203/rs.3.rs-118585/v1

Research Article

This work is licensed under a CC BY 4.0 License

Background: Ontologies and controlled vocabularies are fundamental resources for Information Extraction (IE) from clinical texts using Natural Language Processing (NLP). Standard language resources available in the healthcare domain, such as the UMLS Metathesaurus or SNOMED CT, are widely used for this purpose. A known limitation is the lexical ambiguity of clinical language, particularly regarding short forms. Many of these short forms are unambiguous within documents limited to a given clinical specialty. For this and other NLP tasks, identifying the specialty through document classification would be of great value.

Methods: This paper addresses this limitation by proposing and applying a method that automatically extracts Spanish medical terms, classified and weighted per sub-domain, using Spanish MEDLINE titles and abstracts as input. The hypothesis is that biomedical NLP tasks benefit from collections of domain terms that are specific to clinical sub-domains. We use PubMed queries to generate sub-domain-specific corpora from Spanish titles and abstracts, from which token n-grams are collected and metrics of relevance, discriminatory power, and broadness per sub-domain are computed.
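
The n-gram collection and per-sub-domain weighting described above can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the three metrics below are illustrative assumptions rather than the definitions used for SCOVACLIS: relevance is taken as the share of a sub-domain's documents that contain the term, discriminatory power as that share divided by the sum of shares over all sub-domains, and broadness as the number of sub-domains in which the term occurs. The function names, the whitespace tokenizer, and the toy corpora are likewise hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n_max=2):
    """Yield every token n-gram of length 1..n_max as a space-joined string."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def term_stats(corpora, n_max=2):
    """Compute illustrative per-sub-domain metrics for each n-gram.

    corpora maps a sub-domain name to a list of document strings.
    Returns {term: {sub_domain: (relevance, discrimination, broadness)}}.
    All three metric definitions are assumptions made for this sketch.
    """
    # Document frequency of each n-gram within each sub-domain.
    df = {dom: Counter() for dom in corpora}
    for dom, docs in corpora.items():
        for doc in docs:
            # Naive lowercase whitespace tokenization (assumption).
            df[dom].update(set(ngrams(doc.lower().split(), n_max)))

    stats = defaultdict(dict)
    for dom, docs in corpora.items():
        for term, count in df[dom].items():
            # Share of this sub-domain's documents containing the term.
            relevance = count / len(docs)
            # Sum of the same share over every sub-domain.
            total = sum(df[d][term] / len(corpora[d]) for d in corpora)
            discrimination = relevance / total
            # Number of sub-domains in which the term appears at all.
            broadness = sum(1 for d in corpora if df[d][term] > 0)
            stats[term][dom] = (relevance, discrimination, broadness)
    return stats


# Toy stand-ins for sub-domain corpora built from Spanish titles.
corpora = {
    "cardiologia": ["infarto agudo de miocardio", "insuficiencia cardiaca aguda"],
    "neumologia": ["insuficiencia respiratoria aguda", "asma bronquial"],
}
stats = term_stats(corpora)
print(stats["infarto"]["cardiologia"])
print(stats["insuficiencia"]["neumologia"])
```

In this toy data, a term confined to one sub-domain (such as "infarto") receives the maximum discrimination score under the assumed definition, while a term shared across sub-domains (such as "insuficiencia") is down-weighted and has a higher broadness.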

Results: The generated term set, called SCOVACLIS (Spanish Core Vocabulary About Clinical Specialties), was made available to the scientific community and used in a text classification problem, obtaining an improvement of 6 percentage points in the F-measure compared to the baseline using a Multilayer Perceptron, thus demonstrating the hypothesis that a specialized term set improves NLP tasks.

Conclusion: The creation and validation of SCOVACLIS support the hypothesis that specific term sets reduce the level of ambiguity when compared to a specialty-independent and broad-scope vocabulary.

Bioinformatics

Natural Language Processing

Vocabulary

Medical sub-language

Clinical Specialty

Medical Sub-domain