Total dataset
From the 13,078 specimens, we first removed duplicates (i.e., specimens collected at the same site and at the same time: 41.6%) and specimens that did not belong to the C. stoebe complex (1.5%). We further removed specimens for which it was not possible to identify the cytotype (10.7%) or to gather information on locality or collection date (6.8%). After these cleaning steps, we added the cytogeographical data, resulting in a total dataset of 5,821 Eurasian records (Fig. 6). We used this total dataset to explore the entire Eurasian range of the C. stoebe complex and to estimate which part of the ranges are native for both cytotypes.
Estimation of the native and expanded ranges of tetraploids
To delineate the native and expanded ranges of tetraploids, we developed a conceptual framework combining three approaches commonly employed to unveil native range expansions and cryptic invasions3,10: 1) revision of occurrence data, 2) phylogeographic analyses, i.e., analyses on contemporary spatial patterns of molecular variation, and 3) review of floristic publications. Among these three approaches, our delineation primarily relied on the revision of occurrence data.
Previous studies that used occurrence data to delineate native and expanded ranges have typically focused on spatio-temporal distribution patterns. Specifically, regions where occurrences had been recorded before or after a certain temporal threshold were regarded as being part of the native and expanded range, respectively3,10,45. However, this approach can be strongly biased by spatio-temporal patterns in collection efforts (reviewed by Lang et al.46). Instead of relying on a spatio-temporal approach, we focused on distribution patterns in natural habitats, independently of collection time. Given that the range expansion of tetraploids is largely driven by human activities5,19, it is crucial to recognize which natural habitats were already available before human activities completely transformed the European landscape2. For the light-demanding C. stoebe, this definition encompasses naturally canopy-free habitats, including zonal steppes, as well as extrazonal habitats such as rock outcrops and treeless habitats at high altitudes (i.e., “relictual habitats”). The occurrences of tetraploid populations in these relictual habitats were particularly interesting to us because C. stoebe is a predominantly barochorous species, which limits its uphill dispersal31. Consequently, the colonization of such relictual sites is unlikely to have occurred recently without the presence of nearby source populations over an extended period.
After defining the historical habitat requirements, we used our total dataset to identify geographical regions where tetraploids regularly occur in natural steppes or relictual habitats. We conducted these assessments across eleven geographical regions, which aligns with previous research investigating range expansions in a region-focused manner3,45. However, in contrast to previous efforts, our estimation accounted for the occurrence of a reference taxon, a recommended strategy for addressing spatial sampling bias in herbarium studies46. In particular, in regions where tetraploids were not found in natural steppes or relictual habitats, we examined whether diploids occurred at such natural sites. This was crucial in ensuring that the absence of tetraploid records at natural sites did not result from insufficient sampling activity. Because both closely related cytotypes share similar ecological niches14,19-21,31-34, it was anticipated that they would have had an equal opportunity to occupy relictual habitat types in regions where both cytotypes coexisted over an extended period. Thus, if only diploids were observed in the relictual habitats of a particular region, it is unlikely that this region was part of the native range of tetraploids. Diploids are considered to exhibit a historically widespread distribution across the entire study range5,19. As such, diploids should have had sufficient time to colonize numerous relictual habitats at high elevations and rock outcrops throughout Central and Western Europe, making them an appropriate reference for evaluating sampling efforts in relictual habitats across the eleven geographical regions. Further details on the methods of our estimation of the native and expanded ranges, including the detailed assessment across the eleven geographical regions are provided in the Supplementary Note 2.
As a second step, we investigated whether our estimated range delineation is concordant with the spatial patterns of molecular variation, observed from five published and three unpublished phylogeographic datasets. Analyzing spatial patterns of contemporary genetic diversity is a conventional method for identifying signatures of recent or historic cryptic invasions (reviewed by Morais & Reichard10). Together, the geographical distribution of genetic diversity within both cytotypes (especially with a view on rare alleles) and the distribution of closely related taxa that share ribotypes with allotetraploid C. stoebe supported our occurrence data-based assessment of the native and expanded ranges of tetraploids (see Supplementary Note 2 for details). In addition, the phylogeographic data facilitated decision-making processes in regions where tetraploids may be native but currently have sparse distributions (e.g., Ukrainian steppes, which have undergone extensive conversion to agricultural land in recent decades).
Finally, we conducted an extensive literature survey of publications proposing a non-native origin of distinct tetraploid populations. Combining occurrence data from herbarium specimen with information gleaned from local floristic publications is a suitable approach to uncover range expansions3,45. We found that our assessments were largely congruent with expert knowledge from local florists (see Supplementary Note 2 for details). Note that the same three approaches and criteria were applied to assess the native range of diploids showing no credible evidence of a recent range expansion in diploids (Supplementary Note 2).
We utilized ArcGIS 9.2 (ESRI, Redlands, U.S.A) to delineate the estimated border of the native range of tetraploids along the geographical barriers that we identified to separate the native and expanded ranges (e.g., rivers and mountains, see Supplementary Note 2). Occurrence data falling within this border were categorized as native, whereas those lying outside were classified as being part of the expanded range. Occurrence data within a 50-kilometer buffer around the border underwent individual confirmation of their range affiliation. We conducted this individual verification to avoid misassignments potentially arising from imprecise border delineation or inaccuracies in the georeferencing of respective study populations.
Comparison of our estimated native range with samplings from previous research
We conducted a comprehensive review of published studies comparing tetraploid C. stoebe populations from Europe and North America. We included only studies that explicitly investigated post-introduction evolution or ecological differences between the native and non-native ranges. Doing so, our literature survey was rather conservative because there are many more studies that have used tetraploid populations from their expanded range but without focusing on a native vs. non-native comparison. From the relevant publications, we took the provided GPS coordinates of the European tetraploid populations. We then assessed how many of the study populations fell either into or outside the border of our estimated native range. As above, occurrence data within a 50-kilometer buffer around the border underwent individual verification.
Focal dataset and study range
Our assessment identified two regions where tetraploids occur beyond their native range: 1) northwest from the native range encompassing Central, Western and Northern Europe and 2) east from the native range, specifically south and east of the Don River. For all analyses described below, we used our “focal dataset” focusing on our “study range”, i.e., the native range and the expansion toward Central, Western and Northern Europe (Fig. 6). Records east from the native range were not included because the estimated delineation of the native range may be less precise in European Russia due to lower sampling efforts. We had 4,417 records from Central, Western and Northern Europe but only 285 from the expanded range in European Russia despite both ranges having similar sizes. The lower sampling density in European Russia can be attributed to the relatively sparse distribution of herbaria (Supplementary Fig. 1). Note that including the data east from the native range did not change the general pattern in any of the presented results (data not shown).
Ruderal vs. natural habitat type assignment
Habitat information was extracted from the labels of the herbarium specimens or from field notes of the cytogeographical collections (Supplementary Tables 3 and 4). With the available information, we classified the habitat types according to the European classification system of habitats (EUNIS). For 28.8% of our records, we could not retrieve sufficient habitat information to classify them into the aforementioned categories. Following the approach outlined in Broennimann et al.14, habitat types classified as diluvial sediments (EUNIS category C), natural and semi-natural grasslands (E), and natural rock outcrops (H) were classified as “natural” habitats. In contrast, agricultural habitats (I) and transport networks, extractive sites, urban and industrial habitats (J) were classified as “ruderal” habitats.
Data analyses
To predict spatio-temporal patterns in the proportion of tetraploids, we fitted GAMs using the package mgcv 1.8-4147 in R 4.3.348. We used a binomial response in our GAMs to accommodate the inherent inconsistencies of herbarium collection efforts across space and time46. Specifically, spatio-temporal dynamics in species occurrences should not be predicted by the absolute numbers of specimens collected but rather in relation to a reference species, and this is particularly important in research on range expansions11,12,46. Diploid C. stoebe is a suitable reference taxon because it shows a stable distribution across the sympatric ranges of both cytotypes over the last two centuries (Supplementary Note 2) and both cytotypes show comparable ecological niches14,19-21,31-34. At the same time, using diploids as a reference directly addresses the conundrum that polyploid plants become more frequent than diploids in some, but not all, environmental contexts in the Anthropocene22,39.
We first predicted the proportion of tetraploids as a function of time in the native vs. expanded ranges using a logistic thin plate spline-based smoother function on the predictors “year by range” + “range”. To study habitat-specific temporal patterns in the proportion of tetraploids, we ran another two GAMs, one separate GAM for each range, using the predictors “year by habitat type” + “habitat type”. To account for spatial autocorrelation, all GAMs included a spline-on-the-sphere smoothing term based on latitude and longitude49. Concurvity among year, latitude and longitude was always below 0.15 indicating very low multi-collinearity49. To identify significant predictors, model performances were compared based on the Akaike information criterion (AIC), using ΔAIC ≤ -2 as a threshold for significance. Model structures of the GAMs can be found in Supplementary Table 5.
To assess the overall realized climatic niches, we used the 19 standard Bioclim variables bio1–bio19 from Chelsa 2.150 to perform a PCA using the vegan 2.6-4 package51. The resulting orthonormal system of principal components maximizes the present climatic variation, which was shown to appropriately quantify the realized climatic niches of diploid and tetraploid C. stoebe14. To estimate the percentage of overlap between the realized climatic niches of diploids and tetraploids across the study range, we used the 19 Bioclim variables to calculate dynamic range boxes, which quantify size and overlap of n-dimensional hyper volumes52. We similarly compared the overlap between the climatic niches of tetraploids in their native and expanded ranges. We then performed niche-over time plots according to Broennimann et al.14. To ensure conservative niche limits, we removed occurrence data out of the 5 and 95 percentiles. These outliers may reflect artifacts from the modeled climatic data or sites in terms of macroclimate but are significantly influenced by favorable microclimatic conditions14.
To study relationships between the dispersal routes and climatic dissimilarity, a MESS analysis was computed using the R-package dismo 1.3-1453. The native climatic niche of tetraploids was estimated from climatic data of all grid cells occupied by tetraploids in their native range. We then compared the dissimilarity of this native niche with each grid cell in the expanded range of tetraploids14,54. To visualize spatio-temporal patterns of niche dissimilarity of tetraploids in their expanded range, we plotted the spatial distribution of MESS values on a map with dispersal routes that were predicted by a minimum cost arborescence algorithm according to Hordijk & Broennimann55.
To evaluate the determinants of the initial spread of tetraploids, we first estimated the residence time of tetraploids in each pixel of the expanded range (i.e., years elapsed since the first record in distinct 10 km × 10 km pixels). This residence time was used as a response variable in a BRT to estimate the relative importance of several predictors for how early or how late distinct pixels got colonized. BRTs are frequently used to identify predictors of species distribution56. They are particularly suitable to analyze large datasets with numerous different predictor variables and are relatively insensitive to collinearity and missing values in the predictor variables57. We fitted our BRT using the R-package dismo, assuming a Gaussian error distribution, a bag fraction of training data of 0.5, a tree complexity of 1, a learning rate of 0.01, and a tolerance of 0.01. We then assessed the significance of predictor variables using model simplification based on model-internal cross-validations. Our predictor variables included four types of data: 1) spatial data: latitude, longitude and the spatial distance to the native range, 2) climatic data: a precipitation gradient (loadings of the first PCA axis), a temperature gradient (loadings of the second PCA axis) and the climatic distance to the native range (niche dissimilarity from the MESS analysis), 3) dispersal corridor data and 4) urbanization data. The latter two data categories encompassed spatio-temporally explicit data, extracted from 10 km × 10 km pixels. We used data from 1990 as it was the earliest year with high quality data available across the data sources. Moreover, in 1990, approximately half of the pixels were colonized. The dispersal corridor data included the density of railways (loge-transformed) and roads (loge-transformed), and a connectivity index (loge-transformed). Railway and road density data (km lengths within the 10 km × 10 km pixels) were extracted from the dataset in Garcia-López et al.18. The connectivity index was calculated as the reciprocal of the estimated travel time to the nearest city with at least 50,000 inhabitants from the Agglomeration Index database58. The urbanization data included human population density (loge-transformed) and proportion data of urban vs. rural landscape (logit-transformed), both downloaded from the Global Human Settlement Layer17, and the percentage of landscape covered by impervious structures (loge-transformed) from the annual maps of global artificial impervious area59.
To analyze the determinants of the current occurrence, we performed a second BRT, using binomial occurrence data (i.e., diploid vs. tetraploid) from 1989–2023 as a response variable. This BRT was fitted to the same predictor variables and using the same settings as the previous BRT, except that we assumed a Bernoulli error distribution. Moreover, for the spatio-temporally explicit predictors related to urbanization and dispersal corridors (extracted from 10 km × 10 km pixels), we retrieved data from the distinct collection year or used interpolated data for years where no data was available.