2.1. Species and study areas considered
Our primary focus was on species of the Lomas ecosystems, situated at the base of the Andean Mountain range along the coast of Peru. To compile an initial list of Lomas species, we thoroughly reviewed literature, including journal articles, local floristic books, and handouts, reporting botanical and floristic site-specific data for Lomas localities, primarily in Peru. Taxa identified only to the genus level were excluded from our analysis. Moreover, we categorized the species based on their native and endemic status and elevation preferences. Elevation preferences were determined using the average height of occurrence records for the studied regions (see section 2.2), aligning with delineations such as South America, Andes, Peru, and Lomas. The categorization process involved defining elevation ranges based on quantile thresholds within the Andes. These thresholds, established at the 0.33 and 0.66 quantiles (584 m and 2382 m, respectively) of the altitudinal data, served as benchmarks to classify species according to their median height of occurrence. Tailored to accommodate non-normal distributions, this approach allowed us to investigate how plant species at specific elevation ranges might respond to changing habitat conditions.
2.2. Species occurrence data
Following the compilation of the species list, geo-referenced occurrence data were obtained from various databases, including the Global Biodiversity Information Facility (GBIF), iNaturalist, Integrated Digitized Biocollections (IDigBio) using the spocc package (Owens et al., 2023) and Botanical Information and Ecology Network (BIEN) with the BIEN package (Maitner, 2023). Additional data were sourced from the National Forestry and Wildlife Service of Peru (SERFOR). Given that some of these species also inhabit other areas in South America than the Lomas, we collected occurrence data from this broader region to correctly model the species niche. Several cleaning and filtering steps were employed to ensure data quality. We restricted coordinate uncertainty to 1 km and filtered observations by the year of occurrence, excluding records from before 1975. Records from botanical gardens and museums were eliminated using the CoordinateCleaner package (Zizka et al., 2019). Duplicate entries were removed, and centroids were detected, as poorly geo-referenced occurrences are often mistakenly attributed to the centroid of the country or its capital (Serra-Diaz et al., 2017). Utilizing the rgdal package (Bivand et al., 2023), records were assigned the EPSG:4326 coordinate reference system, and points outside the South American region were excluded. Finally, species with a sample size fewer than 30 occurrence points were excluded from the analysis (Wisz et al., 2008). This rigorous filtering process resulted in a final dataset comprising 149 species from the initial list of 587 species (Table S1). Our occurrence dataset was limited by the scarcity of reported occurrences, as more than 74% of the total species list could not be included in the analysis.
2.3. Environmental data
Forty-one predictor variables at a 30 arcsec resolution (~ 1km) were preselected to characterize the study environment. This set includes 19 bioclimatic variables obtained from CHELSA version 2.1 (Information S1; Karger et al., 2017), representing the average macroclimate over 1981–2010 ('current' conditions). The topographic data (elevation, slope, aspect) utilized in this study were retrieved from the digital elevation model products of the 250 m Global Multi-resolution Terrain Elevation Data 2010 dataset (GMTED2010) (Danielson & Gesch, 2011). Sine and cosine transformations were applied to maintain a continuous representation of circular variables. These transformations integrated landscape orientation, particularly southwest orientation in this coastal region, aligning with expectations of landscape features facing seaward (Hesse, 2012; Moat et al., 2021). Five soil variables, i.e. organic carbon, sand fraction, electrical conductivity (EC), coarse fragments, and pH, were extracted from the SoilGrids database (Poggio et al., 2021). Remote sensing-derived cloud observations at 1 km² resolution, representing fog, were obtained from EarthEnv (Wilson & Jetz, 2016). Cloud cover was selected considering August as the representative month for Lomas fog (Ruhm et al., 2022). The Topographic Wetness Index (TWI), as a proxy for topography-driven soil moisture, was calculated using the System for Automated Geoscientific Analyses (SAGA) GIS software, as per Kopecký et al. (2021). This calculation was based on Digital Elevation Models (DEM) obtained from the United States Geological Survey (USGS) through their EarthExplorer tool. We employed the Multiple-flow FD8 algorithm to achieve precision and incorporated a flow dispersion close to 1.0 while considering the local slope gradient. Moreover, to assess aridity in this region, we utilized the Global Aridity Index, according to Zomer et al. (2022). All predictors were projected to a 1 x 1 km² grid (EPSG:4326) using the raster package (Hijmans, 2023). Environmental values were extracted at 10,000 points within a 2 km buffer around occurrence points, chosen for their relevance to the specific Lomas environment.
To mitigate overfitting in our model, we conducted a pairwise Spearman correlation test on the predictor variables (using values extracted at 10,000 points within a 2 km buffer around occurrence points; Figure S1). Variables with high correlation (Spearman's coefficients > 0.7) were systematically removed from the analysis, selecting the ecologically most relevant variables to include in the model, ensuring it remained concise and effective (Dormann et al., 2013). This process resulted in a final set of 18 predictor variables, including bioclimatic factors (BIO1, BIO2, BIO3, BIO5, BIO9, BIO12, BIO15, BIO19), topographic attributes (elevation, slope, aspect), cloud cover metrics (mean cloud cover in August, interannual variability of cloud cover), and soil properties (organic carbon at a depth of 5–15 cm, sand fraction at a depth of 0-5cm, cation exchange capacity at a depth of 0–5 cm, fraction of coarse fragments at a depth of 0–5 cm).
To project changes in species distribution under future climatic scenarios, bioclimatic variables were extrapolated for 2071–2100 ('future' conditions) using data from CHELSA. We focused on two SSP scenarios, encompassing a fusion of Representative Concentration Pathways (RCPs) and Shared Socioeconomic Pathways (SSPs). The SSP1-2.6 scenario represents a strong mitigation scenario, exploring the impact of limiting global warming to 2°C by the century's end. In contrast, SSP3-7.0 reflects a no-additional-climate-policy scenario, projecting high greenhouse gas emissions doubling from current levels by 2100 (Arias et al., 2021). These scenarios were established by integrating global climate models within the Coupled Model Intercomparison Project Phase 6 (CMIP6) framework. They are derived from five CMIP6 General Circulation Models (GCMs): GFDL-ESM4/3, IPSL-CM6A-LR, MPI-ESM1-2-HR, MRI-ESM2-0, and UKESM1-0-LL, providing a comprehensive representation of potential future climatic conditions (Zhou et al., 2023). Unlike the macroclimate variables, no estimated future data is available for the edaphic, topographic and cloud cover variables. Therefore, we assumed that these variables remained constant in all models.
2.4. Species distribution model
This study employed the MaxEnt (Maximum Entropy) algorithm to model the impact of environmental drivers on the geographic distribution and habitat suitability of Lomas plant species (Phillips & Dudik, 2008). MaxEnt utilizes species presence-only location data and environmental predictor data as input. MaxEnt predicts the Relative Occurrence Rate (ROR) per grid cell, representing the relative probability of a species being present in each cell and indicating the suitability of each cell as a habitat. The algorithm begins with a uniform distribution, assuming an equal probability of presence in all cells. It then identifies the distribution with maximum entropy that aligns with observed occurrence data while respecting environmental constraints. This resulting distribution is translated into a map with ROR values, providing insights into habitat suitability across the landscape. Operating as a presence-only algorithm, it necessitates background locations where presence is unknown. These background points were generated using a 2D kernel-density estimate via the MASS package (Venables & Ripley, B., 2002), with 10,000 background points sampled around known species presence locations. This approach, suggested by Lake et al., 2020 and Vollering et al., 2019, aimed to align the spatial density of background points with species occurrences, mitigating the risk of sampling points too far from species locations and reducing spatial bias. The resulting 2D kernel-density layer ensured that background points accurately reflected the observed spatial distribution of species presence data.
To address bias and optimize model complexity, hyperparameters were fine-tuned using the package ENMeval2.0 (Kass et al., 2021). Tuning parameters involved selecting linear, quadratic, and product feature classes, which is crucial in determining constraints on environmental predictors during model training and influencing the potential shape of the marginal response curve. A linear feature was chosen to align the mean of the predictor at the predicted species occurrence approximately with the observed mean, while a quadratic feature matched the variance (Merow et al., 2013). The product feature constrained the covariance of two pairs of predictors, corresponding to an interaction term in regression. Regularization was performed with coefficients 0.5, 1, and 2 to mitigate overfitting risk, minimizing the Akaike Information Criterion (AIC) among competing models. Smaller coefficients (e.g., 0.5) provide less regularization, allowing for more complex relationships, whereas larger coefficients (e.g., 2) encourage stronger regularization, favoring simpler relationships. AIC aids in model selection by favoring models that fit the data well while maintaining simplicity. To evaluate model performance, we employed spatially blocked 10-fold cross-validation using the blockCV package (Valavi et al., 2019). This approach partitions data into training and test subsets for evaluation (Merow et al., 2013). The Continuous Boyce Index (CBI) served as our primary assessment metric, chosen for its accuracy when utilizing only presence data, outperforming the Area Under the Receiver-operating Characteristic Curve (AUC) (Jiménez & Soberón, 2020). The CBI reflects the correlation between predicted habitat suitability and the distribution of occurrence records. In addition to predicted-to-expected ratio curves, the CBI provides insights into robustness, habitat suitability resolution, and deviation from randomness (Hirzel et al., 2006). Its values range between − 1 and 1, where values > 0 indicate a better-than-random fit, with higher CBI values generally corresponding to more reliable predictions (Hirzel et al., 2006). This rigorous evaluation strategy ensures a comprehensive assessment of model accuracy and reliability.
The final model was utilized to project current and future habitat suitability trends across the western side of the Andes Mountains and Lomas regions under two different climate scenarios and for the five GCMs using the package dismo (Hijmans et al., 2023). The predictions included the Andes but were specifically limited to the Lomas regions, with the ultimate boundary being the Peruvian desert (latitudes between 5°8' and 21°25' S). To address considerable variability among alternative models (Araújo & New, 2007), the final model prediction was derived as the unweighted average of the five GCMs. To estimate occurrence probability, we applied a complementary log-log transformation, converting continuous ROR values into a probability-like scale (Phillips et al., 2017). This transformation facilitated thresholding, creating binary (presence-absence) maps (Saupe et al., 2015; Syfert et al., 2014). Cells in the raster with modeled probabilities exceeding the threshold were considered suitable, with the choice of threshold significantly impacting inferred extinction risks (Nenzén & Araújo, 2011). To ensure the robustness of our analysis, we initially explored two threshold selection methods: a fixed 10% probability threshold and a prevalence-based threshold set equal to the ratio of presence points to total observations, resulting in a 50% probability threshold. The choice of threshold for converting logistic maps to binary maps significantly influenced the assessment of species distribution and extinction risk. Higher threshold values (50%) led to more projected extinctions. In light of these outcomes, we opted to proceed with the 10% threshold for our scenarios comparison and for comparing species endemic status and elevations.
Within the Andean and Lomas ecosystems, we identified species with the highest absolute horizontal and elevational distribution shifts (Figure S2) and species that are expected to lose their habitat under future SSP scenarios (Table S2). Moreover, we used binary maps to examine the effects of habitat change on characteristic Lomas species, including Heliotropium arborescens, Oziroe biflora, Solanum peruvianum, Stenomesson flavum, Tillandsia latifolia, and Trixis cacalioides. These species are present in the Lomas of Lima and play a crucial role in the ecological structure of the region (Table S3). By highlighting the current threats that these species face, particularly in the context of urbanization pressures, this analysis helps us identify plants with potential applications in urban green strategies.
2.5. Statistical analysis
To quantify changes in habitat suitability of all 149 species, we utilized centroid detection along latitude, longitude, and elevation dimensions. Employing the dismo package (Hijmans et al., 2023), we determined the center of gravity for current and future modelled distributions, enabling subsequent distance and angle calculations. The raster package (Hijmans, 2023) facilitated the derivation of total change and percentage between current and future suitable areas. To explore the projected range shifts in both the Andes and Lomas regions, we separately analyzed vertical (elevational) and horizontal shifts under each climate scenario. Our aim was to discern significant differences between scenarios, highlighting their critical impact on species distribution. Due to the non-normal distribution of our data, we opted for the Wilcoxon signed-rank test, considering a specific paired experimental design in this case. Additionally, to evaluate the expected changes in species distribution between low-, middle-, and high-elevation species, we employed the Kruskal-Wallis test and Dunn's post hoc test with Benjamini-Hochberg adjustment to correct for false discovery rates. Similarly, with respect to endemic status, we investigated the range shift distances for both endemic and non-endemic species using the Mann-Whitney test. These analyses, considering both elevation and endemic status, were conducted separately for each scenario and each region (Andes and Lomas). Finally, we assessed habitat suitability changes under a 10% occurrence threshold, calculating total changes in suitable habitat area (and percentage change) for the Andes and Lomas under the two future scenarios using the Wilcoxon signed-rank test for paired designs. This analysis of suitable habitat changes also considered species elevation and endemic status in both regions, employing Kruskal-Wallis and Mann-Whitney tests, respectively. We used a significance level (α = 0.05) for all analyses in this study. All aspects of the analysis, from statistical tests to visualizations, were conducted in RStudio (version 4.3.0), and binary map arrangements were performed with QGIS (version 3.28.2).