MapAffil is one of the highest-performing bibliographic datasets for mapping author affiliation strings to their respective cities and geocodes, and we used its most recent release of geoparsed PubMed affiliations to generate training and testing data for our NLP model.3 The MapAffil 2018 dataset is based on a snapshot of PubMed taken in December 2018 and maps all PubMed author affiliations within that snapshot to cities and their geocodes worldwide, along with extracted disciplines, inferred GRIDs, and assigned ORCIDs. The complete dataset was downloaded as a single tab-delimited, Latin-1-encoded TSV file (only the City column uses non-ASCII characters) and then converted to a Parquet file for our model, yielding approximately 52 million authorships. These authorships were then reduced to approximately 20 million unique affiliation texts, which served as the training and test data for our NLP geoparsing model.
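As a rough illustration of this conversion step, the following sketch reads the tab-delimited MapAffil file with pandas and writes it to Parquet; the file names and the affiliation column name are illustrative assumptions rather than the exact identifiers used in our pipeline.

```python
import pandas as pd

# Sketch of converting the MapAffil 2018 TSV release to Parquet; the file
# names and the affiliation column name are illustrative assumptions.
df = pd.read_csv(
    "mapaffil2018.tsv",
    sep="\t",
    encoding="latin-1",  # only the City column contains non-ASCII characters
    dtype=str,
)

# Parquet's columnar, compressed layout makes repeated scans of the
# ~52 million authorship rows much faster than re-parsing the TSV.
df.to_parquet("mapaffil2018.parquet", engine="pyarrow", index=False)

# Collapse authorships to unique affiliation strings for model training.
unique_affiliations = df["affiliation"].dropna().drop_duplicates()
```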
To clean our dataset, we took several measures to filter out "noisy" free-text affiliations that were particularly ambiguous (Fig. 1). First, we removed all affiliations that carried minimal information about academic institutions or geographical locations, rendering them impossible to geocode. To classify an affiliation as impossible to resolve, we used spaCy's named entity recognition (NER) to identify all ORGs (companies, agencies, institutions) and GPEs (geopolitical entities, i.e., countries, cities, states) in each free-text affiliation. We used spaCy's English transformer pipeline (en_core_web_trf) with a batch size of 8000, disabling the following components: "tok2vec", "tagger", "parser", "attribute_ruler", and "lemmatizer". Based on these spaCy outputs, affiliations with neither an ORG nor a GPE detected were removed from the full dataset. Affiliations in which the only ORG detected was "Department of…" or "Division of…" and no GPE was detected were removed as well, as these texts carried no identifying information for location inference (a minimal sketch of this NER filtering step is shown after Table 1). Next, affiliations with no country labeled in the MapAffil dataset were also removed. In addition, all affiliations denoted by the prefixes FROMPMC, FROMNIH, and FROMPAT were removed, as these were supplemented with data from PubMed Central, NIH grants, and Microsoft Academic Graph to compensate for affiliations missing from PubMed before 1988 or for non-first authors; these affiliation texts were found to be far less reliable in assigning affiliation data to specific authors.4 To address the issue of certain affiliations containing information for multiple authors, affiliations that exceeded 200 characters (about 7% of MapAffil affiliations) or contained semicolons were also excluded, as these were frequently observed to combine affiliation information for more than one author. Affiliations with incomplete city information were excluded from the training dataset but set aside for later use as a validation set for our model, since MapAffil failed to extract a city for a number of affiliation texts. Table 1 provides examples of each type of affiliation removed from the original dataset, illustrating why it was necessary to filter these out. Ultimately, our dataset was left with approximately 16 million "clean" affiliations.
Table 1
Examples of noisy affiliations that were removed from training and testing data
Removal criterion | Examples |
No ORG, No GPE | • Due to the number of contributing authors, the affiliations are provided in the Supplemental Material. • Authors' affiliations are listed at the end of the article. • Departments of Pediatrics. • Affiliation: unknown; Email: unknown. |
No country labeled by MapAffil | • Department of Psychology. • Division of Cardiology. • Editor. • Associate Professor. |
PMC/NIH/PAT prefix | • FROMPMC: From the Laboratories of The Rockefeller Institute for Medical Research • FROMNIH: BOSTON UNIVERSITY MEDICAL CAMPUS • FROMPAT: Russian Academy of Sciences |
Over 200 characters | • Resource for Biocomputing, Visualization, and Informatics, University of California, San Francisco, CA 94143, USA and National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. (250 characters) • Department of Electrical Engineering, Stanford University, Stanford, CA 94035, USA, Department of Bioinformatics, Bina Technologies, Redwood City, CA 94065, USA, Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA, Mayo Clinics, Department of Health Sciences Research, Rochester, MN 55902, USA, Department of Statistics, Stanford University, Stanford, CA 94035, USA and Department of Health Research and Policy, Stanford University, Stanford, CA 94035, USA. (500 characters) |
Contains semicolons | • Investigation performed at The Carrell Clinic, Dallas, Texas, USA; Department of Orthopaedics, Washington University School of Medicine, St Louis, Missouri, USA; Department of Orthopaedics and Rehabilitation, Vanderbilt University Medical Center, Nashville, Tennessee, USA; and Reedsburg Area Medical Center, Reedsburg, Wisconsin, USA. • Novartis Institutes for Biomedical Research, Oncology Disease Area, Basel 4002, Switzerland; Cambridge, MA 02139, USA; and Emeryville, CA 94608, USA. |
Incomplete city labeling by MapAffil | • University of Minnesota, USA. (MapAffil “city”: MN, USA) • Department of Psychology, Florida State University. (MapAffil “city”: FL, USA) • Health Care Management Department, Wharton School, University of Pennsylvania. (MapAffil “city”: PA, USA) |
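Below is a minimal sketch of the NER-based filtering step described above, assuming a list of affiliation strings named `affiliations`; the helper function and its exact keep/drop rules are our illustrative reading of the criteria, not the production code.

```python
import spacy

# Load the English transformer pipeline and disable the components listed in
# the text; names absent from this pipeline are simply skipped.
nlp = spacy.load("en_core_web_trf")
for name in ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]:
    if name in nlp.pipe_names:
        nlp.disable_pipe(name)

def has_location_signal(doc):
    """Keep an affiliation only if NER finds a GPE, or an ORG that is not a
    bare 'Department of ...' / 'Division of ...' fragment."""
    orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    gpes = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    if gpes:
        return True
    informative_orgs = [
        org for org in orgs
        if not org.lower().startswith(("department of", "departments of", "division of"))
    ]
    return bool(informative_orgs)

# `affiliations` is assumed to be an iterable of free-text affiliation strings.
keep_mask = [
    has_location_signal(doc)
    for doc in nlp.pipe(affiliations, batch_size=8000)
]
```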
The dataset was then preprocessed and split into training and testing sets, enabling the NLP model to predict the city/state/country for each free-text affiliation. First, the affiliation texts and their corresponding cities were extracted from a subset of the top 1,000,000 most common authorships in the clean dataset and stored in separate lists. The affiliation texts were converted into numerical features using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. TF-IDF is a numerical statistic that captures the importance of each word in a text relative to the entire dataset; this vectorization step down-weighted words that were common across all affiliations while up-weighting words that were more specific to a particular affiliation.5 Stop words (common words like "the," "and," etc.) were also removed during this process to reduce noise and improve model performance. In addition to TF-IDF, we experimented with the Bag of Words (BoW) feature engineering technique, which represents a document as an unordered collection of words while keeping track of word frequency.6 However, we observed that TF-IDF vectorization resulted in higher F1 scores across the various text classifiers, and we ultimately selected it as the final vectorizer for our geoinference NLP model. After each affiliation text was transformed into a numerical vector with TF-IDF, these vectors served as the input features for the model, and the list of corresponding cities (formatted as "city, state, country") served as the target labels that the model aimed to predict.
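The sketch below illustrates this featurization and split with scikit-learn; the variable names and the split ratio are illustrative assumptions rather than our exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# `affiliation_texts` and `city_labels` ("city, state, country") are assumed
# to be parallel lists extracted from the clean dataset; the names and the
# split ratio here are illustrative assumptions.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    affiliation_texts, city_labels, test_size=0.1, random_state=42
)

# TF-IDF down-weights terms common to most affiliations (e.g. "university")
# and up-weights location-specific terms; English stop words are dropped.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)  # fit on training data only
X_test = vectorizer.transform(X_test_text)
```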
The text classifier of choice for the model was LinearSVC, a Support Vector Machine known for its effectiveness in high-dimensional spaces and on multi-class data. LinearSVC is similar to SVC with the kernel parameter set to 'linear'; however, it is implemented with the liblinear library instead of libsvm, offering greater flexibility in selecting penalties and loss functions. This implementation also handles large amounts of data more efficiently, making it well suited for scaling to a substantial number of samples. LinearSVC was selected after extensive experimentation with other leading text classification algorithms widely used in natural language processing: Random Forest, an ensemble learning technique that builds numerous decision trees; Logistic Regression, a linear algorithm that models the relationship between the input features and the outcome by estimating probabilities with a sigmoid function; and Multinomial Naive Bayes, a probabilistic classification algorithm based on Bayes' theorem that is particularly suited to text classification tasks with discrete features. Table 2 presents the overall accuracy and F1 scores achieved by the different NLP text classifiers; for each classifier, the table includes results for the two types of feature engineering (TF-IDF and BoW). Figure 2 graphically represents the relative accuracies of these classifiers (LinearSVC, Random Forest, Logistic Regression, and Multinomial Naive Bayes).
Table 2
Overall accuracy and F1 scores achieved by different NLP algorithms and feature types
Classifier | Vectorizer | Accuracy | F1 | Training Size | Test Size |
Linear SVC | TF-IDF | 0.91 | 0.89 | 10,000 | 1,000 |
Random Forest | TF-IDF | 0.87 | 0.84 | 10,000 | 1,000 |
Logistic Regression | TF-IDF | 0.77 | 0.70 | 10,000 | 1,000 |
Multinomial Naive Bayes | TF-IDF | 0.44 | 0.38 | 10,000 | 1,000 |
Linear SVC | BoW | 0.91 | 0.88 | 10,000 | 1,000 |
Random Forest | BoW | 0.86 | 0.84 | 10,000 | 1,000 |
Logistic Regression | BoW | 0.86 | 0.83 | 10,000 | 1,000 |
Multinomial Naive Bayes | BoW | 0.56 | 0.47 | 10,000 | 1,000 |
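As a rough illustration of how this comparison could be run (not our exact benchmarking code), the sketch below fits each classifier with each vectorizer on a 10,000-affiliation training sample and scores it on a held-out 1,000; the weighted F1 averaging and the default hyperparameters are assumptions.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# `train_texts`/`train_labels` (10,000 affiliations) and `test_texts`/`test_labels`
# (1,000 affiliations) are assumed to be sampled from the clean dataset.
classifiers = {
    "Linear SVC": LinearSVC(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
}
vectorizers = {
    "TF-IDF": TfidfVectorizer(stop_words="english"),
    "BoW": CountVectorizer(stop_words="english"),
}

for vec_name, vectorizer in vectorizers.items():
    for clf_name, clf in classifiers.items():
        model = make_pipeline(clone(vectorizer), clone(clf))
        model.fit(train_texts, train_labels)
        preds = model.predict(test_texts)
        acc = accuracy_score(test_labels, preds)
        # Weighted averaging is an assumption about how F1 was aggregated
        # across the many city classes.
        f1 = f1_score(test_labels, preds, average="weighted")
        print(f"{clf_name} + {vec_name}: accuracy={acc:.2f}, F1={f1:.2f}")
```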