Background: Ontologies and controlled vocabularies are fundamental resources for Information Extraction (IE) from clinical texts using Natural Language Processing (NLP). Standard language resources available in the healthcare
domain such as the UMLS Metathesaurus or SNOMED CT are widely used for this purpose. A known limitation is the lexical ambiguity of clinical language, particularly regarding short forms. Many of these, however, are unambiguous within
documents limited to a given clinical specialty. For this and other NLP tasks, the identification of the specialty using document classification would be of great value.
Methods: This paper addresses this limitation by proposing and applying a method that automatically extracts Spanish medical terms classified and weighted per sub-domain, using Spanish MEDLINE titles and abstracts as input. The hypothesis is that biomedical NLP tasks benefit from collections of domain terms that are specific to clinical sub-domains. We use PubMed queries that generate sub-domain-specific corpora from Spanish titles and abstracts, from which token n-grams are collected and metrics of relevance, discriminatory power, and broadness per sub-domain are computed.
Results: The generated term set, called SCOVACLIS (Spanish Core Vocabulary About Clinical Specialties), was made available to the scientific community and used in a text classification problem, where it improved the F-measure by 6 percentage
points over the baseline using a Multilayer Perceptron, thus supporting the hypothesis that a specialized term set improves NLP tasks.
Conclusion: The creation and validation of SCOVACLIS support the hypothesis that specific term sets reduce the level of ambiguity when compared to a specialty-independent and broad-scope vocabulary.
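
As an illustration of the Methods step, the following minimal sketch collects token n-grams from sub-domain corpora and computes three per-term scores. The exact metric definitions are not given in this abstract, so the formulas here are assumptions: relevance is taken as the term's relative frequency within a sub-domain, discriminatory power as the share of the term's occurrences that fall in that sub-domain, and broadness as the number of sub-domains in which the term appears. The function names, whitespace tokenization, and toy corpora are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """Yield all token n-grams of length 1..n_max as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def term_metrics(corpora, n_max=2):
    """Score n-grams per sub-domain.

    corpora: dict mapping a sub-domain name to a list of document strings.
    Returns {term: {sub_domain: {"relevance", "discrimination", "broadness"}}}.
    The three formulas are illustrative assumptions, not the paper's definitions.
    """
    # Count n-grams separately for each sub-domain corpus.
    counts = {domain: Counter() for domain in corpora}
    for domain, docs in corpora.items():
        for doc in docs:
            counts[domain].update(ngrams(doc.lower().split(), n_max))

    # Total occurrences of each term across all sub-domains.
    totals = Counter()
    for domain_counts in counts.values():
        totals.update(domain_counts)

    metrics = {}
    for domain, domain_counts in counts.items():
        domain_size = sum(domain_counts.values())
        for term, freq in domain_counts.items():
            metrics.setdefault(term, {})[domain] = {
                # Assumed: relative frequency of the term inside this sub-domain.
                "relevance": freq / domain_size,
                # Assumed: fraction of all occurrences that fall in this sub-domain.
                "discrimination": freq / totals[term],
                # Assumed: number of sub-domains in which the term occurs.
                "broadness": sum(1 for c in counts.values() if term in c),
            }
    return metrics


# Toy corpora standing in for PubMed-derived titles and abstracts (hypothetical data).
toy_corpora = {
    "cardiologia": ["infarto agudo de miocardio", "infarto de miocardio"],
    "neumologia": ["enfermedad pulmonar", "infarto pulmonar"],
}
scores = term_metrics(toy_corpora)
```

In this toy setting, a term that occurs in only one sub-domain receives the maximum discriminatory power and the minimum broadness, which is the kind of term a specialty-specific vocabulary would favor.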