Study Area
The initial modelled study area covered the Arabian/Persian Gulf basin (subsequently referred to as the Gulf), an extension of the Indian Ocean, with restricted water inflow, marked by environmental gradients and well delineated marine habitats15,24,25. The basin is also characterized by an inclined seafloor towards the shelf break, with an average depth of 35 m and maximum depth of 100 m at the Strait of Hormuz. Due to the shallow nature of the Gulf, most parts are within the photic zone, with a few exceptions along the Iranian coast3. The extrapolation area upon which the initial model is projected includes the entire world’s oceans and seas (global extent) and suitable environments were predicted based on the lateral extent and resolution of several remotely sensed environmental variables.
Species data
The occurrence data of the selected four species (i.e., P. planatus, P. pertusus, P. arietinus and C. hemprichii) were collected from three major sources: (1) distributional data from benthic foraminiferal studies compiled from various records of their occurrences in the Gulf24–28(Fig. 5), (2) 171 global occurrence data for these species from13, and (3) 247 distributional records from29 including published and unpublished records around the globe (Supplementary Table 2). The Gulf dataset consists of 355 sampled locations (Supplementary Table 3), which were divided into training and validation datasets in an 90/10 split. Specifically, the sparse data from the eastern part of the Gulf, along the Iranian coast30 was deliberately included in the test-split dataset to evaluate the robustness and predictive power of the model built. The data from sources 2 and 3 were merged and checked for duplicates; these data were then applied as a ground truth-dataset for testing the performance of spatial-model extrapolations on a global scale.
Predictor variables
Ecologically meaningful environmental variables (EMEVs) are hosted by BIO-ORACLE31,32 and MARSPEC33 and made available through the ‘sdmpredictors’ package within the R-software environment. The package has several functions that facilitate listing, extraction, and management of global EMEVs. It also facilitates the integration of EMEVs with other bioclimatic modelling pipelines34,35. For the initial model, the EMEVs were cropped at the regional level (i.e., the Gulf basin); for subsequent projections, data were evaluated at the global scale. Some of the relevant EMEVs are summarized in Fig. 6 (Supplementary Table 4), including their corresponding units and spatial resolution. The EMEVs are relevant at the macroecological level and are specifically designed for assessing marine biodiversity associated with sea surface, bathyal and epibenthic habitats for predicting impacts of climate change on marine systems (Supplementary Table 4). These variables include water temperature, salinity, nutrients, chlorophyll, current velocity, phytoplankton, primary productivity, iron, and light reaching the bottom, with a uniform spatial resolution of 5 arcmin globally (9.2 km at the equator). The present-day environmental conditions were based on monthly averages spanning the years 2000–2014, and we used the Bio-ORACLE version 2 datasets that improved the reliability of marine data layers based on in situ quality-controlled data36,34.
Developing the model
The species-level habitat suitability inferences produced in this study are based on a maximum-likelihood interpretation of a maximum entropy model by using an infinitely weighted logistic regression for model fitting37,38. These are all implemented in the ‘MIAmaxent’ R package that allows for user-controlled transformation of a selection of EMEVs by nested model comparison, flexible model evaluation and projection37,39. The SDM is a type of gradient analysis aimed at establishing the relationship between a modelled target and its environment. The entire SDM procedure is summarized in the flowchart (Fig. 7).
The modelling task is essentially composed of two parts. (1) Initial modelling of the actual known distributions of species in a well-defined basin and subsequent validation. This entails data input, pre-processing, and eventual predictions over a range of explanatory data. This step captures the relationships between response and predictor variables and helps to understand strengths and weaknesses in the model. (2) Spatial extrapolation over a global extent. This is the applied part of the model, where relative probability of occurrence is projected over a large geographical extent beyond the modelled area and compared to ground-truth data. Briefly, we extracted values of a set of pre-screened EMEVs at the location of the species using a buffer. This involves running several modelled scenarios with many EMEVs (224) to create a shortlist. This process addresses issues of multicollinearity, statistical errors, and the information criteria in each EMEV using a generalized linear model and eliminate selection bias based on prior knowledge of the ecological requirements of the selected species.
In each scenario, we built a training model and compare it to the actual distribution pattern while documenting the performance of each with the Area Under the Receiver Operating Characteristics (AUROC) value. The AUROC is a graph that shows the performance of a classification model at all classification thresholds. The AUROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. That is, the higher the AUROC, the better the performance of the model at distinguishing between environment and response variable (presence/absence) that are correlated and those that are not40. At the end, a set of EMEVs were shortlisted. Subsequently, the obtained data were rescaled and transformed before applying multivariate data-reduction techniques to relevant ecological EMEVs to the occurrence data. The presence-absence-response variable (PA-RV) was selected through series of spline-type transformations (forward hinge, reverse hinge, threshold) and Chi-squared inferential tests aimed at retaining the EMEVs that efficiently explain most variation in the PA-RV. Finally, the model is trained by establishing a relationship between the PA-RV and EMEV’s through the estimation of a target probability distribution of maximum entropy based on the most uniform distribution.
Model Performance Evaluation
There are several metrics for assessing the quality of a model such as the Frequency of Observed Presence (FOP) plots for the subset of EMEV’s. FOP plots for a given explanatory variable help to capture the relationship between the occurrence data and the ecologically relevant variables. The rate of occurrence of the response variable across intervals or levels of the explanatory-variable density is low when the response curve deviates strongly from the pattern in the FOP. A local regression ("loess") of the FOP values is added to the plot as a line. The data density is plotted in the background (grey) to help in visualization where FOP values are certain. Similarly, AUROC is applied to the initial 10% occurrence validation data split to obtain an unbiased measure of model performance because the validation dataset must be independent from the training one. AUROC discriminating abilities are > 0.9 describes a ‘‘very good’’, 0.8–0.9 ‘‘good’’ and 0.7–0.8 ‘‘useful’’ models respectively9. This is also the preferred threshold used by most researchers compared to reusing the training data (90%). It indicates whether the model can generalize better on patterns of the modelled target or is simply producing patterns that are specific to the training data.
Modelling Environment and Tools
Analyses were conducted using the R Statistical language41 (version 4.1.1; on Windows 10 x64 (build 21390)), using the packages ggplot242 (version 3.3.5), stringr43 (version 1.4.0), tidyr44 (version 1.1.4), readxl45 (version 1.3.1), readr46 (version 2.0.2), dplyr47 (version 1.0.7), ggthemes48 (version 4.2.4), rsample49 (version 0.1.0), MIAmaxent50,51 (version 1.2.0), tibble52 (version 3.1.5), report53 (version 0.4.0), sf54 (version 1.0.3), sp55 (version 1.4.5), raster56 (version 3.5.2), rgdal57 (version 1.5.27), sdmpredictors35 (version 0.2.10), magrittr58 (version 2.0.1) and tidyverse59 (version 1.3.1).