Diet breadth datasets
To predict pollen specialist and generalist bee species we fit predictive models using bee species with known diet breadths for the United States. Our diet breadth data came from two sources. First, we used a list of pollen specialist bee species and their host plants in the eastern, central and western United States (compiled from Fowler 2020a, b, Fowler and Droege 2020); this list categorizes bee species as pollen specialists using multiple lines of evidence, of which pollen data are a main part. The host plants of some bee species on this list spanned multiple plant families, and we reclassified these bee species as a pollen generalists.
Second, we used a bee-pollen database we compiled ourselves using a literature survey. We searched the scientific literature between September 2019 and August 2022 for articles or books in English, German, French and Portuguese that reported descriptions of bees’ pollen hosts using primary pollen data or that were secondary sources synthesizing bee’s pollen hosts from primary data. Here, primary pollen data are defined as the plant genera or families found in bees’ scopal/corbiculate loads or nest provisions, or, rarely, the plants that the authors observed bees pollen-foraging from. We searched Google Scholar and Web of Science using the search terms “bee,” “mono/oligo/poly,” “lecty/lege/lectic,” “pollen host plant,” “pollen host,” “pollen specialization,” “pollen diet breadth,” and “host preference,” or the comparable search terms in German, French and Portuguese. We also searched for papers cited within the articles found using this search. For each bee species we classified a plant taxon as its pollen host if it made up at least 5% of the total scopal/corbiculate load or nest provision, following (Cane and Sipes 2006). We also considered a plant taxon to be a pollen host if the authors of the paper observed the bee collecting pollen from that plant taxon (although such studies were a minority). We classified bee species on this list as pollen specialists if their pollen hosts came from within one family and as pollen generalists if their host plants spanned more than one plant family. Our full list of specialist and generalist bees and the sources used to classify them is provided in the Appendix S2. Because this second dataset relied on a broader range of studies, occasionally a bee species classified as a specialist by the first dataset was classified as a generalist by the second dataset. In such cases we used the classification of the second dataset.
Visitation dataset
The visitation data we used to predict bee species’ diet breadths came from bee species visitation records found in Global Biotic Interactions (GloBI) (https://www.globalbioticinteractions.org/; Poelen et al. 2014). GloBI is an open dataset indexer that unifies species interactions across scientific literature, specimens from natural history collections and online observations. GloBI provided bee species visitation records across large spatial scales and with large sample sizes – the total dataset contains 259,210 bee visitation records between bees and plants (i.e., prior to filtering by geography or other variables, see below). We used version 0.5 of the GloBI indexed dataset (GloBI Community 2022).
We first filtered the GloBI database to only include interactions between bees and plants, searching for interactions between individuals within the seven families of bees and individuals within the kingdom Plantae. We excluded records where plants were not identified to at least the genus level and bees to the species level, as well as all duplicate records from the data. Hereafter, we refer to this dataset as the ‘visitation dataset.’ Note, this may exclude some species that have unresolved taxonomy and/or are rare – and thus harder to classify.
We further filtered the data to only include bee species that occur in the contiguous United States. Because not all records in the visitation dataset have geographic coordinates, we determined which bee species occur in the contiguous United States using a list from Chesshire et al. (2023), which was compiled using specimen records from GBIF and SCAN (GBIF.org 2021a, b, c). We then excluded cleptoparasites (Michener 2000; Appendix S2) and non-native bee species, using a list of non-native bee species for the United States (Russo 2016; Appendix S1: Table S1). We also removed records of eight bee species for which we did not have any phenological data (see ‘Estimating geographical and phenological predictors’). Finally, we excluded species we did not have diet breadth data for (Appendix S2).
We updated bee taxon names from the visitation dataset to the current valid name following the same methodology outlined in Chesshire et al. (2023; see Appendix S1 Supporting Methods for more details). We updated plant taxon names from the visitation dataset using The Plant List (www.theplantlist.org). Although this list is outdated (last updated in 2013), we opted to use it because this was the taxonomic standardization method employed by the plant phylogeny we used (Jin and Qian 2019; see next section). We accessed The Plant List using the R package Taxonstand (Cayuela et al. 2012) and updated plant family names separately, using the World Flora Online (http://www.worldfloraonline.org/). Plant taxa not on The Plant List were also updated using the World Flora Online.
After these filtering steps and taxon name updates, our visitation dataset contained 150,880 records of 682 bee species visiting 1,185 plant genera: 50,858 were records of pollen specialist bees, from 477 bee species, and 100,022 were records of generalist bees from 205 bee species. The records came from 40 sources total (Appendix S1: Table S2), with 22.6% of the records coming from observations, such as iNaturalist, 65.3% of the records coming from museum specimens (typically the flowering plant species the specimen was collected from), and 12.1% of the records coming from the scientific literature or unknown, compiled, sources (note, it is possible that some records in the “unknown” category may contain pollen data).
Bee and plant phylogenies
Since there are not species-level phylogenies of bees for all genera, we used a genus-level bee phylogeny from Hedtke et al. (2013). This phylogeny was missing 5 bee genera from our visitation dataset (of 63 total). To add them, we reconstructed the phylogeny using the R package U.PhyloMaker (Jin and Qian 2022), using the Hedtke phylogeny as the megatree. To add new taxa to this megatree, we used “scenario2,” from U.PhyloMaker; this adds genera randomly within their families (Jin and Qian 2022).
To make a phylogeny of plants, we used the megatree from Smith and Brown (2018), which we accessed using the R package V.PhyloMaker (Jin and Qian 2019). We pruned the tree to be a genus-level phylogeny. This phylogeny was missing 103 of the plant genera in our visitation dataset, of 1185 plant genera total, and we added them randomly within their family using scenario 2 from V.PhyloMaker (Jin and Qian 2019).
Occurrence dataset
To obtain geographic and phenological predictors for all bee species in our interaction dataset, we used specimen records from North America (Chesshire et al. 2023). This dataset included specimens in Canada, Mexico and Alaska. To ensure independence of the specimen records, we filtered the dataset to have one record of each species per combination of latitude, longitude and collection date. Latitude and longitude coordinates were rounded to three decimal places prior to this filtering step. We removed geographic outliers, defined as specimens collected at least 1500-km from any other specimen of the same species. Hereafter, we refer to this as the Chesshire et al. 2023 occurrence dataset.
Estimating visitation predictors
We used the visitation dataset to identify which plant taxa a bee species visits and how many.
To quantitatively measure which plant taxa a bee species visits, we used a multivariate approach: we first built a matrix with bee species as rows and plant genera as columns, with matrix cells filled by the number of interactions observed. We used the matrix and the Morisita-Horn index to calculate the difference between each pair of bee species in the plant genera visited. We took the first two eigenvectors of the resulting distance matrix to use in our models to get an eigenvalue for each bee species. Similar eigenvalues indicate bee species visit similar plant taxa. We also estimated these eigenvalues for interactions between bee species and plants at the plant-family level.
To estimate how many plant taxa each bee species visits, we calculated the diversity of plant genera and families visited using the inverse Simpson index. The diversity of plant families a bee visited was strongly correlated with phylogenetic diversity of plant genera it visited (r > 0.7) and the Simpson diversity of plant genera visited (r > 0.7). We thus excluded this variable from our final model.
Finally, we estimated the phylogenetic diversity of plant genera a bee species visits. In contrast to taxonomic diversity, phylogenetic diversity will be higher for bee species that visit distantly related plant genera than for bee species that visit closely related plant genera. We estimated phylogenetic diversity using a phylogenetic generalization of the inverse Simpson index (Chao et al. 2010, 2014), and hereafter refer to this metric as “phylogenetic Simpson diversity.” We also estimated phylogenetic richness. However, this variable was strongly correlated with a number of other variables in the model (r > 0.7). We estimated phylogenetic Simpson diversity using the function ‘hill_phylo’ from the R package hillR (Li 2018). More details about how we calculated this metric are provided in the Supporting Methods (Appendix S1).
Estimating bee phylogenetic predictors
We included bee phylogenetic information in our model, following the approach used in Lucas (2020). For each bee species, we calculated the phylogenetic distance to each bee genus in the dataset, and used these distances as predictor variables, resulting in 63 phylogenetic predictor variables, one for each bee genus. We used the function ‘cophenetic’ from the package ape to calculate pairwise phylogenetic distance between bee genera (Paradis and Schliep 2019).
Estimating geographic and phenological predictors
To estimate each bee species’ approximate geographic location, we used the median latitude and longitude of the specimen records in the Chesshire et al. 2023 occurrence dataset. To estimate each bee species’ extent of occurrence, we created a minimum convex polygon from all records and calculated the area in hectares. For species with fewer than four unique latitude-longitude combinations, there were too few points to create a minimum convex polygon. For these, we randomly added points within 100-km of existing specimen records to reach the four points needed to create the minimum convex polygon. We also calculated the sample size of each bee species in the dataset, a measure of the bee species’ regional abundance. Geographic analyses were conducted using the R packages sf (Pebesma 2018) and sp (Pebesma and Bivand 2005; Bivand et al. 2013). The minimum convex polygons were created using the function ‘chull’ from the R package grDevices.
To calculate phenological predictors, we excluded specimen records without collection dates (12.4% of records). We estimated the approximate time of year the bee was active by calculating the median date of collection. To estimate the length of the bee’s flight period, we subtracted the beginning of the bee species’ activity period from the end of its activity period by subtracting the 10th percentile of the bee’s collection dates from the 90th percentile, following Harrison et al. (2019).
Analyses
All analyses were conducted in R version 4.2.1 (R Core Team 2022). The code and data for running the analyses are available on Zenodo: https://doi.org/10.5281/zenodo.8347146.
1) How often do pollen specialist bees visit their host plants?
To assess how often specialist bees visit their host plants, we used the visitation dataset. We calculated the proportion of times a specialist bee species was visiting its host plant out of all visits recorded. For this analysis, we excluded bee species with fewer than 20 records, to avoid assessing the visitation records of incompletely sampled species. This left us with a sample size of 300 specialist bee species and 49,710 records.
We also conducted two post-hoc analyses to investigate why some pollen specialist bee species mostly visited non-host plants. First, we investigated if this was a statistical artifact driven by bees with small sample sizes, which are less likely to be representative of their true population. To do this, we visually examined the relationship between a bee species’ sample size and the proportion of visits to its host plant. We also calculated the Pearson correlation coefficient of this relationship.
Second, we examined whether these pollen specialist bees are explained by male bees, which do not collect pollen for their offspring and, as a result, might nectar at non-host plants more frequently. Because the sex of the bee was not specified in most records in our visitation dataset, we narrowed down our analysis to bee species with at least 10 records each for male and female bees, resulting in 260 specialist bee species from 34,822 records. We then compared the percentage of visits made by male and female bees to their pollen hosts using a paired Wilcoxon signed ranks test.
2) Can we predict pollen specialist and generalist bees?
To predict whether a bee species is a specialist or generalist, we used a random forest model for binary classification, using the R package randomForest (Liaw and Wiener 2002). Random forests are a type of supervised machine learning, which make no distributional assumptions and can detect complex, non-linear relationships. In our random forest, we used the default parameters from the R package: decision trees were created using bootstrapped samples the same size as the data, and ten random predictor variables were considered at each tree split. The decision trees were optimized by finding the tree with the smallest node impurity. The full set of predictor variables used in our random forest model are described in Table 1.
Table 1. Predictor variables considered in the random forest models to predict bee diet breadth. We removed some predictors due to collinearity (Pearson correlation coefficient > 0.7). Variables we excluded for this reason are indicated with a “no” in the “Included?” column.
Predictor variable
|
Description
|
Included?
|
Dataset used
|
Phylogenetic richness
|
Faith's phylogenetic diversity of plant genera visited
|
no
|
Global Biotic Interactions
|
Phylogenetic Simpson diversity
|
Phylogenetic Simpson diversity of plant genera visited
|
yes
|
Global Biotic Interactions
|
Simpson diversity (plant genus)
|
Simpson diversity of plant genera visited
|
yes
|
Global Biotic Interactions
|
Simpson diversity (plant family)
|
Simpson diversity of plant families visited
|
no
|
Global Biotic Interactions
|
Identity of plant genera visited
|
First and second eigenvalues of Morisita-Horn distance-matrix for plant genera visited
|
yes
|
Global Biotic Interactions
|
Identities of plant families visited
|
First and second eigenvalues of Morisita-Horn distance-matrix for plant families visited
|
yes
|
Global Biotic Interactions
|
Median latitude
|
Median latitude of bee specimen records in North America
|
yes
|
Chesshire et al. 2023
|
Median longitude
|
Median longitude of bee specimen records in North America
|
yes
|
Chesshire et al. 2023
|
Regional abundance
|
Number of specimen records in North America
|
yes
|
Chesshire et al. 2023
|
Extent of occurrence
|
Area in hectares of minimum convex polygon for specimen records in North America
|
yes
|
Chesshire et al. 2023
|
Median day-of-year
|
Median day-of-year of collection
|
yes
|
Chesshire et al. 2023
|
Duration of flight season
|
90% quantile of day-of-year of collection - 10% quantile of day-of-year of collection
|
yes
|
Chesshire et al. 2023
|
Pairwise phylogenetic distance
|
Phylogenetic distance to each bee genus in the dataset
|
yes
|
Hedtke et al 2013
|
To assess model performance, we used k-fold cross validation, in which separate datasets are used to train and test the model. In this process, the data are divided into k folds: k-1 folds are used to train the model and the remaining fold is used to test the model. This is repeated until all k folds have been used to test the model.
To evaluate how effectively our model predicts specialist bees in novel regions or phylogenetic groups, we used a special type of k-fold cross validation. While dividing our data into training and testing sets, we used spatial and phylogenetic blocking (Bahn and McGill 2013; Roberts et al. 2017). This approach leads to the creation of datasets that are either spatially or phylogenetically independent. It provides more accurate assessment of predictive power than the conventional random selection of folds (Bahn and McGill 2013; Roberts et al. 2017). By using this technique, we can assess how well our model performs when dealing with bee species located in different regions or originating from distinct families compared to those used to train the model. As a baseline, we also used random-stratified blocking to see how blocking methods affected our results.
For phylogenetic blocking, we blocked bees by family. However, the smallest family in our dataset, Melittidae, had only three generalist bee species. We therefore combined this family with Colletidae, the second smallest bee family in our dataset. We grouped by sample size rather than by phylogenetic distance because Melittidae is the likely basal family of bees (Danforth et al. 2013).
For the spatial blocking methods, we removed all spatial predictors from the models. For the phylogenetic blocking methods, we removed all phylogenetic predictors from the models. We did this to avoid extrapolating outside the predictor space used to train the model. The three blocking methods (random, spatial, phylogenetic) are described in more detail in the Supporting Methods (Appendix S1).
We used the same metrics to assess model performance for all blocking methods. As measures of overall model performance, we used the area under the receiver operator curve (AUC) and balanced accuracy (the arithmetic mean of specialist and generalist prediction accuracies); both are insensitive to class imbalance, which we had in our dataset (70% specialist species and 30% generalist species). We also calculated the prediction accuracies of specialists and generalists.
We found that model performance was similar between random-stratified blocking and the other two blocking methods (see Appendix S1: Figure S1). For simplicity, we report here model performance metrics for spatial and phylogenetic blocking methods only and provide the metrics for random blocking in Appendix S1 (Figure S1).
We also conducted a comparison between the random forest models and a simpler phylogenetic model. In this simpler model, we predicted that the diet breadth of a bee species was the same as the diet breadth of the majority of bee species within its genus, based on the training data. For bee species with no congeners in the training data, we predicted its diet breadth to be the same as the majority of bee species within its family. We evaluated the performance of this simpler model using spatial cross validation, employing the same methods as for the random forest model.
3) What variables are most important for predicting specialist and generalist bees?
To determine what variables are important for predicting specialist and generalist bees we used the “importance” function in the package randomForest to calculate each variable’s importance. The function calculates the change in the error rate of the model when a predictor variable is permuted, divided by the standard deviation of the difference. To rank the importance values, we took the mean of the importance values for each predictor. The means were calculated by aggregating across all model runs from all blocking methods. We also assessed the importance of phylogenetic predictors in aggregate by removing them from spatially blocked models and calculating the change in the models’ mean accuracy and AUC.