We first describe a typical ParseGBIF workflow and then show the results of the workflow application to the distribution records of the family Myrtaceae.
Typical ParseGBIF workflow
ParseGBIF can be run as a single workflow, or its functions can be used individually to address specific tasks (Table 1, Fig. 1). ParseGBIF can also be run in association with other packages such as GridDER (Grid Detection and Evaluation in R, https://biogeographylab.github.io/GridDER.github.io/ ) or BDC16. After converting the GBIF data to a desirable format and selecting the most informative records, the resulting datasets can be downloaded and used in further analyses, e.g. spatial analyses.
Table 1
A typical workflow for a ParseGBIF project.
Project step | Relevant functions |
GBIF data preparation | download_gbif_data_from_doi, prepare_gbif_occurrence_data, select_gbif_fields, extract_gbif_issue |
Check species names against WCVP database | wcvp_get_data, wcvp_check_name, wcvp_check_name_batch, standardize_scientificName |
Collectors Dictionary | collectors_prepare_dictionary, collectors_get_name, generate_collection_event_key |
Selecting the master digital voucher | select_digital_voucher |
Export dataset for further analysis | export_data, parseGBIF_summary |
GBIF data download and preparation
To download GBIF occurrence data, the DOI URL address generated by the manual search previously performed on the GBIF portal is entered as a parameter in the download_gbif_data_from_doi function. The function will download and unzip the file in the indicated destination folder. To prepare occurrence data downloaded from GBIF to be used by ParseGBIF functions, it is necessary to run prepare_gbif_occurrence_data. This will select the desired fields (“standard” or “all”) and rename them. The “standard” format has 54 data fields, and the “all” format, 257 data fields. The selection is enabled by the select_gbif_fields function.
As part of the data preparation, it is possible to check whether downloaded records have been identified by GBIF as having “issues”, i.e. either altered by GBIF during processing or flagged as potentially invalid, unlikely, or suspicious. From the list of all issues identified by GBIF (https://gbif.github.io/gbif-api/apidocs/org/gbif/api/vocabulary/OccurrenceIssue.html) we selected only those potentially influencing geospatial interpretation of records (33 categories, Supplementary Table S1). Each issue category was characterised as having potentially no, low, medium or high impact on coordinate precision, and a corresponding selection score (0, -1, -3 or -9) was assigned to each record. The lowest selection score (-9) indicates records unsuitable for spatial analysis. The function extract_gbif_issue produces a summary showing the total number of records in each issue category.
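The scoring described above can be illustrated with a short Python sketch. It is not the package's implementation: the assignment of individual issues to impact levels is hypothetical (the actual assignments are in Supplementary Table S1), and it assumes that issues arrive as a semicolon-separated string and that a record carrying several issues takes the score of its most severe one.

```python
from collections import Counter

# Scores for the four impact levels named in the text.
IMPACT_SCORE = {"none": 0, "low": -1, "medium": -3, "high": -9}

# Hypothetical assignment of GBIF issues to impact levels (illustration only;
# the real assignments are listed in Supplementary Table S1).
ISSUE_IMPACT = {
    "GEODETIC_DATUM_ASSUMED_WGS84": "low",
    "COORDINATE_ROUNDED": "low",
    "COUNTRY_COORDINATE_MISMATCH": "medium",
    "ZERO_COORDINATE": "high",
    "COORDINATE_OUT_OF_RANGE": "high",
}


def split_issues(issue_field):
    """Split a semicolon-separated issue string into a list of issue names."""
    return [name.strip() for name in issue_field.split(";") if name.strip()]


def selection_score(issue_field):
    """Assumed rule: a record takes the score of its most severe issue."""
    scores = [IMPACT_SCORE[ISSUE_IMPACT[name]]
              for name in split_issues(issue_field) if name in ISSUE_IMPACT]
    return min(scores, default=0)


def issue_summary(issue_fields):
    """Count how many records carry each issue."""
    counts = Counter()
    for field in issue_fields:
        counts.update(set(split_issues(field)))
    return counts


# Three invented records: two issues, one issue, no issues.
records = ["COORDINATE_ROUNDED;ZERO_COORDINATE", "COORDINATE_ROUNDED", ""]
scores = [selection_score(r) for r in records]
summary = issue_summary(records)
```

Here `scores` holds one selection score per record and `summary` the number of records per issue, which mirrors the kind of table that extract_gbif_issue is described as producing.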
Species name check against WCVP database
The World Checklist of Vascular Plants (WCVP) database (Royal Botanic Gardens Kew: https://powo.science.kew.org/about-wcvp) can be downloaded to a folder of the user's choice or into memory using the wcvp_get_data function (Table 1). The output has 33 columns.
Species’ names can be checked against the WCVP database one by one or in batch mode. To verify individual names, the function wcvp_check_name is used. To check names in batch mode, the function wcvp_check_name_batch is used. It takes the occurrence data (“occ”) and the WCVP names list (“wcvp_names”) generated in the previous steps.
To bring species’ names in line with the WCVP format, the function standardize_scientificName inserts a space between the hybrid separator (x) and the specific epithet, and also standardizes abbreviations of infrataxa (variety, subspecies, form). The function get_lastNameRecordedBy returns the surname of the main collector in the “recordedBy” field.
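The kind of name clean-up described above might look like the following Python sketch. The target abbreviations ("subsp.", "var.", "f."), the list of spelling variants and the function name `standardize_name` are assumptions for illustration, not the behaviour of standardize_scientificName. Only the "×" sign is handled; a plain letter "x" fused to an epithet is left alone because it cannot be told apart from an epithet that begins with x without more context.

```python
import re

# Hypothetical spelling variants mapped to assumed target abbreviations.
RANK_VARIANTS = {
    "subsp.": ("subspecies", "subsp", "ssp"),
    "var.": ("variety", "var"),
    "f.": ("forma", "form", "fo"),
}


def standardize_name(name):
    """Normalise hybrid-sign spacing and infraspecific rank abbreviations."""
    # Put a space between the hybrid sign and an epithet fused to it.
    name = re.sub(r"×(?=\w)", "× ", name)
    for target, variants in RANK_VARIANTS.items():
        # Match a variant only when it stands alone between two spaces,
        # with or without a trailing full stop.
        pattern = r"(?<=\s)(?:" + "|".join(variants) + r")\.?(?=\s)"
        name = re.sub(pattern, target, name)
    # Collapse any repeated whitespace.
    return re.sub(r"\s+", " ", name).strip()


# Invented test strings; the infraspecific names are not real taxa.
examples = [
    "Eucalyptus ×hybrida",
    "Eugenia uniflora ssp minor",
    "Myrcia splendens variety alba",
]
cleaned = [standardize_name(n) for n in examples]
```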
Creating Collectors Dictionary
A necessary step for parsing duplicate records is to have a robust key for each unique collecting event (aka ‘gathering’). The inclusion of the plant family name in the collection event key, proposed by Nicolson17, avoids the merging of distinct collection events in the case of low collection numbers and common surnames. Including the collection event date in the key would be the most powerful way to avoid the merging of distinct collection events; however, collection date information is often not present in occurrence data records, limiting the effectiveness of such an approach. We therefore generated a string that combined the plant family name + the first collector’s surname + the collection number (Moura et al. 2022). For the event key to be effective it is essential to record the collector surname consistently, and for this purpose we provide a collector dictionary, i.e. a curated list of the main collector surnames extracted from the GBIF dataset.
The collectors_prepare_dictionary function extracts the surname of the main collector from the “recordedBy” field and generates a list relating the surname of the main collector to the downloaded data. This is followed by the standardisation of the main collector’s last name in the “nameRecordedBy_Standard” field, with respect to letter case and non-ASCII characters. This ensures that a botanical collector is always recognized by the same surname. Where the searched name is present in the collector’s dictionary, the function retrieves the surname of the main collector as recorded in the “recordedBy” field, and the “CollectorDictionary” field is marked as ‘checked’. If the name is not recognised, the function returns the surname of the main collector, extracted automatically from the “recordedBy” field, and the record remains unchecked. Unchecked records are then reviewed manually.
The Collector Dictionary supplied with ParseGBIF contains all of the checked collector surnames that were included in the GBIF download.
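A minimal Python sketch of building a family + surname + collection number key is given here for illustration. The function names, the underscore separator, the upper-casing, and the rule of keeping only the digits of the collection number are assumptions rather than the behaviour of generate_collection_event_key.

```python
import re
import unicodedata


def standardize_surname(surname):
    """Upper-case a surname and strip diacritics so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", surname)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ascii_only.upper().strip()


def collection_event_key(family, surname, record_number):
    """Build FAMILY_SURNAME_NUMBER; return None if any part is missing."""
    # Assumed rule: keep only the digits of the collection number.
    digits = re.sub(r"\D", "", record_number or "")
    name = standardize_surname(surname or "")
    if not (family and name and digits):
        return None  # incomplete key: the record cannot be matched to duplicates
    return f"{family.upper()}_{name}_{int(digits)}"


key = collection_event_key("Myrtaceae", "Gardner", "417")
no_number = collection_event_key("Myrtaceae", "Gardner", "s.n.")
```

Returning `None` for an incomplete key reflects the idea that such records cannot be grouped with duplicates and have to be handled individually.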
Digital voucher selection
Where duplicate records of unique collection events are present, the function select_digital_voucher selects the best record as the basis for the single unique collection event record. The best record is defined as the one with the highest total score, calculated as the sum of spatial data quality + record completeness.
Spatial data is scored based on the following GBIF issues (see section 2.1 above):
- Not applicable, “selection_score” equals 0
- Does not affect coordinate accuracy, “selection_score” equals −1
- Potentially affects coordinate accuracy, “selection_score” equals −3
- Records to be excluded from spatial analysis, “selection_score” equals −9
Record completeness is scored as the number of the following flags that are TRUE (each contributes a “selection_score” of 1):
- Is there information about the collector?
- Is there information about the collection number?
- Is there information about the year of collection?
- Is there information about the institution code?
- Is there information about the catalogue number?
- Is there information about the locality?
- Is there information about the municipality of collection?
- Is there information about the state/province of collection?
- Is there information about the country (using a GBIF issue COUNTRY_INVALID)?
- Is there information about the field notes?
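The scoring and selection rule above can be sketched in Python as follows. The field names are Darwin Core-style stand-ins chosen for illustration, the country check is simplified to a non-empty field rather than the COUNTRY_INVALID issue, and ties are resolved in favour of the first record seen; none of this is taken from select_digital_voucher itself.

```python
# Hypothetical field names standing in for the completeness checks listed above.
COMPLETENESS_FIELDS = [
    "recordedBy", "recordNumber", "year", "institutionCode", "catalogNumber",
    "locality", "municipality", "stateProvince", "countryCode", "fieldNotes",
]


def completeness_score(record):
    """One point for every listed field that holds a non-empty value."""
    return sum(1 for field in COMPLETENESS_FIELDS if record.get(field))


def total_score(record):
    """Spatial quality score (0, -1, -3 or -9) plus completeness points."""
    return record.get("spatial_score", 0) + completeness_score(record)


def select_voucher(duplicates):
    """Return the highest-scoring record; ties go to the first one seen."""
    return max(duplicates, key=total_score)


# Two invented duplicates of one collection event.
detailed = {"recordedBy": "Gardner", "recordNumber": "417",
            "locality": "example locality", "spatial_score": -9}
clean = {"recordedBy": "Gardner", "recordNumber": "417", "spatial_score": -1}
voucher = select_voucher([detailed, clean])
```

In this example the more detailed record loses because its spatial score of -9 outweighs its extra completeness point.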
The accepted TAXON_NAME selected is that which has most frequently been applied to the duplicate records at, or below, the rank of species. Where two names are applied with equal frequency, the first TAXON_NAME listed in alphabetical order is chosen to enable automation of the process. Where there is no identification at or below the rank of species, then the unique collection event is indicated as unidentified.
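The name selection rule can be expressed compactly. The Python sketch below assumes that the input list already contains only identifications at or below species rank, with `None` or an empty string for records lacking one; the function name is hypothetical.

```python
from collections import Counter


def choose_taxon_name(names):
    """Most frequent name wins; ties are broken alphabetically."""
    identified = [name for name in names if name]
    if not identified:
        return None  # the collection event is treated as unidentified
    counts = Counter(identified)
    best = max(counts.values())
    return min(name for name, n in counts.items() if n == best)


# Invented placeholder names: a two-way tie, a clear majority, and no names.
tie = choose_taxon_name(["Name b", "Name a", "Name b", "Name a"])
majority = choose_taxon_name(["Name b", "Name a", "Name b"])
unidentified = choose_taxon_name([None, ""])
```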
Where unique collection event duplicates cannot be parsed because the collection event key is incomplete, each record is treated as a unique collection event lacking duplicates. Where the collection event key is incomplete because the collector surname could not be recovered, then the record is tagged with the label, UNKNOWN-COLLECTOR.
Merged records
Once a master record is selected for each unique collection event key, duplicates can be removed from the dataset. For each complete unique collection event key, data fields that are empty in the digital voucher record will be populated with data from the respective duplicates. During content merging, we indicate fields associated with the description, location, and data of the unique collection event.
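The merge step can be sketched as filling gaps in the voucher from its duplicates. In the Python sketch below the function name, the field names and the rule of taking the first non-empty value in duplicate order are assumptions; the returned list of filled fields stands in for the indication of merged fields mentioned above.

```python
def merge_into_voucher(voucher, duplicates, fields):
    """Fill empty voucher fields from duplicates; report which were filled."""
    merged = dict(voucher)
    filled = []
    for field in fields:
        if merged.get(field):
            continue  # keep the voucher's own value
        for duplicate in duplicates:
            if duplicate.get(field):
                merged[field] = duplicate[field]
                filled.append(field)
                break  # assumed rule: first non-empty value wins
    return merged, filled


# Invented records for one collection event.
voucher = {"locality": "", "habitat": None, "year": "1838"}
duplicates = [
    {"locality": "example locality", "habitat": ""},
    {"habitat": "forest", "year": "1840"},
]
merged, filled = merge_into_voucher(
    voucher, duplicates, ["locality", "habitat", "year"])
```

The voucher's own year is kept even though a duplicate disagrees, since only empty fields are populated.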
Outputs
Records can be exported at three levels of detail:
- All data: all records processed and merged, plus all records removed, either because they are duplicates of a unique collection event, or unusable. The records can be separated into three datasets by filtering the “parseGBIF_dataset_result” field by “useable”, “unusable” and “duplicates”.
- Usable records: unique collection events complete with WCVP taxonomic identification and usable spatial data. Duplicate and unusable records are excluded.
- Unusable records: unique collection events which lack key taxonomic and/or spatial data. These data do not support taxonomic or spatial analyses but likely include records that could be made usable through further manual checks.
- Duplicate records of unique collection events: duplicate records of the usable data, which could be useful for quality control or for verifying outlier records using a manual check.
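Separating the full export into its three datasets amounts to grouping on the “parseGBIF_dataset_result” field. The Python sketch below assumes the export has been read into a list of dictionaries; the function name and the example rows are invented.

```python
from collections import defaultdict


def split_by_result(records, field="parseGBIF_dataset_result"):
    """Group exported records by the value of the result field."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(field)].append(record)
    return dict(groups)


# Invented rows standing in for the "all data" export.
all_data = [
    {"id": 1, "parseGBIF_dataset_result": "useable"},
    {"id": 2, "parseGBIF_dataset_result": "duplicates"},
    {"id": 3, "parseGBIF_dataset_result": "unusable"},
    {"id": 4, "parseGBIF_dataset_result": "useable"},
]
datasets = split_by_result(all_data)
```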
Summary views of each export are available through the parseGBIF_summary function, which summarizes the results of the select_digital_voucher (“occ_digital_voucher”) and export_data (“all_data”) functions. It provides a general summary with totals for the entire dataset: total number of records, total number of unique collection events, total number of duplicate records of unique collection events, and total number of usable records. The function export_data, with all records merged, provides a summary of the merged fields, a summary of the merged fields for usable records only (including the frequency of merge actions on fields in the usable dataset), and a summary of the unusable records (including the frequency of merge actions on fields in the unusable dataset).
Example application
All available Myrtaceae records were downloaded from GBIF (https://doi.org/10.15468/dl.ykrqqv, 2 May 2023). The parameters of the download were: 1. Basis of record: preserved specimen; 2. Occurrence status: present; 3. Scientific name: Myrtaceae. In total, the GBIF download contained information on 13,147 unique taxa, of which 9,669 had geographic coordinates suitable for spatial analysis (Table 2, Supplementary Figure S1).
Table 2
GBIF and ParseGBIF datasets
Dataset | Records (all data) | Number of taxa (all data) | Records/Taxon (all data) | Records (suitable for spatial analysis) | Number of taxa (suitable for spatial analysis) | Records/Taxon (suitable for spatial analysis) |
GBIF | 1,301,480 | 13,147 | 98.9 | 840,937 | 9,669 | 86.9 |
ParseGBIF | 1,301,480 | 6,513 | 199.8 | 930,616 | 5,964 | 156.0 |
Records modified (merged) by ParseGBIF workflow
Using ParseGBIF, 1,301,480 records representing 6,513 unique taxa downloaded from GBIF were parsed into 930,616 unique collection event records representing 5,964 unique taxa (or 61.7% of the originally downloaded records). The parsing of duplicate records resulted in a doubling of the average number of records per taxon, from 98.9 to 199.8, when all records were considered. When only records suitable for spatial analysis were included, the parsing of duplicates resulted in a 1.8-fold increase, from 86.9 to 156.0. Of the unique collection events recovered, 252,723 (27%) had duplicates with non-identical spatial data.
In total, 37,156 records representing 3,874 taxa were modified by the ParseGBIF workflow; of those taxa, 92% (3,546) were suitable for spatial analysis (Fig. 1). The largest number of modifications to the occurrence records related to the ‘Habitat’ field, followed by the ‘Field Notes’ and ‘Municipality’ fields (Supplementary Table S2). Taxon richness calculated for the parsed (merged) records was positively and significantly correlated with the overall taxon richness calculated for the ParseGBIF output (Pearson's product-moment correlation 0.6475, t = 40.5, df = 2271, p-value < 0.0001). This suggests that the areas of high taxonomic richness of Myrtaceae overlapped with the areas where the density of modified records was also high.
Myrtaceae taxon richness
According to both GBIF and ParseGBIF data, the taxon diversity of Myrtaceae was, in general, below 100 taxa per 1 x 1 degree grid square (ca. 110 x 110 km at the equator, Supplementary Fig. S1, Fig. 2). The highest taxon richness was found in the coastal areas of Brazil, and in Australia (Supplementary Fig. S1 and Fig. 2). The largest difference between the downloaded GBIF and ParseGBIF taxon richness was found in Brazil (up to 50%, Fig. 3). However, the difference was generally rather small (below 20%) when expressed in taxon numbers (Supplementary Fig. S3). Raw GBIF and ParseGBIF taxon counts in 1 x 1 degree squares were positively and significantly correlated (Pearson's product-moment correlation 0.9699, t = 276.85, df = 4830, p-value < 2.2e-16).
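The t statistics reported with the two Pearson correlations above follow from the standard relation t = r·sqrt(df) / sqrt(1 − r²), where df = n − 2. The short Python check below recomputes them from the rounded r values given in the text; the function name is illustrative, and small differences from the reported values are expected because r is rounded to four decimals.

```python
import math


def pearson_t(r, df):
    """t statistic for a Pearson correlation r with df = n - 2."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)


# Merged-record richness vs overall ParseGBIF richness (r = 0.6475, df = 2271).
t_merged = pearson_t(0.6475, 2271)
# Raw GBIF vs ParseGBIF taxon counts per grid square (r = 0.9699, df = 4830).
t_grid = pearson_t(0.9699, 4830)
```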
Collector dictionary
The Myrtaceae dataset comprised 155,192 unique ‘recordedBy’ names, of which 8.28% (12,855) were manually reviewed. The remaining names were suitable for automated processing. The unique collection event with the most duplicates was MYRTACEAE_GARDNER_417 (Campomanesia hirsuta Gardner), which had 32 duplicates in 13 herbaria (E, K, NHMUK, UEFS, S, F, US, CAS, NY, A, GH, W and MNHN).