The study area extends from Safaga to Ras Gharib and covers approximately 10,537 km². It is situated between latitudes 26°40'00'' and 28°20'00'' N and longitudes 32°50'00'' and 34°00'00'' E (Fig. 1). The watershed is characterized by various physiographic features, including mountains, hills, main wadis, and streams. The elevation ranges from 0 m (Red Sea coast in the east) to 2,068 m (mountainous areas in the west) above mean sea level. The slope angle varies between 0° and 72° (with an average of 8.2° and a standard deviation of 10.4°). Approximately 16.1% of the total area has a slope greater than 30°, 16.4% has a slope between 15° and 30°, 61.7% has a slope between 5° and 15°, and 5.8% has a slope of less than 5°. Precipitation is typically infrequent and occurs in the form of heavy thunderstorms from November to April. Unfortunately, precipitation records are scarce, as there are few precipitation stations in the area.
The study area comprises various rock units: a basement complex in the west, sedimentary rocks, and alluvial soils (wadi deposits), which occupy approximately 40.8%, 15.1%, and 44.1% of the study area, respectively. The area under study has been largely developed, and future planning and development will be affected by flood hazards. In 2014 and 2016, the study area experienced numerous flood events caused by heavy, short-duration rainfall that led to devastating destruction (Fig. 2). The area is dissected by numerous main streams (e.g., Wadi Abu Naakhra, Bali, Aish, Milaha, Abu Had, and Gharib), making it flood-prone. The low-lying areas in the eastern part of the study area are particularly vulnerable to flooding originating in the western part. Because the area is repeatedly subject to flood damage, it undergoes cascading changes over time, which constrains spatial flood assessment; erroneous site information can cause significant problems in the spatial analysis. Moreover, drainage structures and water supply systems may influence flood vulnerability assessments. Flood risk is aggravated by the change in land use in the eastern portion of the study area from desert to residential areas and infrastructure, together with the lack of action plans and the inadequacy of engineering solutions to prevent flood events.
Data and Methodology
This study demonstrates the use of different datasets in flood susceptibility mapping. Several critical steps were followed in the methodology to ensure the reliability of the resulting models. These steps are shown in Fig. 3 and are explained in the following sections.
Data used
Table 1 describes the various datasets that were collected and processed for this study. Field surveys were conducted to collect various features and evidence related to the consequences of the flood events that affected the study area. Questionnaires administered to residents of the area (local people and Bedouins) and historical documents (from the Civil Defense Agency and the Department for Transport) were collected and used to understand previous flood events. Photographs were taken and archived to document various flood events that affected different parts of the study area. Remote sensing data were acquired for the study area, including Landsat 8 OLI (Operational Land Imager) imagery (acquired in 2019, 30-m spatial resolution) from the Earth Explorer website (https://earthexplorer.usgs.gov). The image mosaic (30-m resolution) was created by stacking bands 1–7 and then fused with the panchromatic band (15-m resolution) to generate the final image mosaic (15-m resolution). Additional high-resolution images were obtained from Astro Digital (2.5-m resolution) and Google Earth Pro. Remote sensing imagery was used to create the land use/land cover, flood inventory, lithology, and hydrolithology unit layers. In addition, a 30-m resolution DEM was obtained from ALOS World 3D-30m. The DEM was used to generate various datasets (for example, elevation, slope aspect, slope angle, plan and profile curvatures, LS, TWI, and SPI). Finally, a 1:100,000-scale geologic map was prepared and digitized to delineate the different lithological and hydrolithological units. The data of this study were stored in a digital GIS database with a uniform projection (UTM zone 36, WGS84 datum).
Table 1
Data utilized and applied in the current work.
| No. | Dataset Source | Dataset Year & Characteristics | Data Style | Resolution & Scale | Generated Layers |
|-----|----------------|--------------------------------|------------|--------------------|------------------|
| 1 | Remote sensing data: Earth Explorer website (https://earthexplorer.usgs.gov) | Landsat-8 (OLI, 11 bands), 2014, 2016, 2019; Astro Digital (2014 & 2016); Google Earth (various years) | Grid | 30 m, 15 m; 2.5 m; < 1 m | LULC layer; flooded areas after the 2014 & 2016 events; inundated areas after the 2016 flood events; verification of flood locations after the 2014 & 2016 events; verification and updating of hydrolithology units |
| 2 | Geologic map; Topographic map | Quadrangle, 1985; Sheets, 1975 | Polygon; Lines | 1:100,000 | Lithology units; soil drainage; verification of wadis and streams |
| 3 | Digital Elevation Model (ALOS World 3D-30m) | DEM | Grid | 30 m | Altitude, slope aspect, slope angle, TWI, LS, SPI, plan and profile curvature |
| 4 | Field investigation: field questionnaires, historical data, & photographs | Information on the areas flooded and destroyed by the 2014 & 2016 flood events | Points/Polygon | Field trips | Inundated and damaged areas in the 2014 & 2016 events; verification of lithology and hydrolithology unit maps |
Flood inventory map
Based on historical data and previous flood events, flooded areas were extracted to construct an inventory map. The inventory map is an extremely crucial element in flood susceptibility modeling (Sarkar and Mondal 2020). Several authors have pointed out that areas that have been exposed to past flood events under the same conditions are most likely to be vulnerable to current flood events (Fotovatikhah et al. 2018). To prepare susceptibility maps, it is necessary to determine the relationship between the inventory map (existing problems) and the various factors relevant to susceptibility (Petley 2008). Different types of data (e.g., historical records, field visits, and satellite imagery interpretation) were used to generate an inundation inventory layer (Fig. 2b). Previously flooded areas (in the form of points) were extracted by comparing the study area before and after the flood events (2014 and 2016) through visual inspection of 1) high-resolution imagery (Google Earth and Astro Digital imagery) and 2) medium-resolution imagery (Landsat-8 OLI). Flooded site data were examined and identified during field investigations following the 2014 and 2016 flood events (Fig. 2). Additional inundation data, in the form of point coordinates, were collected from the Civil Defense Agency and from news reports spanning the past three decades. To isolate the exact flooded areas using medium- and high-resolution remote sensing images, Landsat-8 (2014) imagery with a spatial resolution of 15 m and Astro Digital (2016) imagery with a spatial resolution of 2.5 m were used for the two time periods. Cloud-free images were acquired before and after the flood events in 2014 and 2016. The flooded areas were extracted by visual inspection of the true color images (bands 1, 2, and 3 in RGB) using ArcGIS 10.8 software (Fig. 4). The inundated areas identified using satellite imagery were verified using field investigations and civil defense data.
Finally, a point feature layer (420 flooded locations) of the inundated locations was created to produce the flood inventory layer (Fig. 1b). The data points were randomly partitioned using R statistical software into training and validation datasets (Naimi and Araújo 2016). Following the general trend in the literature, the inventory dataset was divided into 70% (295 flood locations) for training and 30% (125 flood locations) for validation (Wang et al. 2019) (Fig. 1b).
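The random 70/30 partition described above can be sketched as follows. This is an illustrative NumPy analogue, not the authors' R workflow; the location IDs and the random seed are synthetic.

```python
import numpy as np

# Hedged sketch: randomly partition a flood inventory of 420 point
# locations into ~70% training / ~30% validation, mirroring the split
# described in the text. Location IDs and seed are synthetic.
rng = np.random.default_rng(seed=42)        # fixed seed for reproducibility
locations = np.arange(420)                  # 420 flooded-location IDs
shuffled = rng.permutation(locations)

n_train = int(round(0.7 * len(locations)))  # 294 here; the paper reports 295/125
train_ids, valid_ids = shuffled[:n_train], shuffled[n_train:]

print(len(train_ids), len(valid_ids))       # 294 126
```

The exact 295/125 counts in the paper reflect a slightly different rounding of 70% of 420; the principle (a disjoint, random, exhaustive split) is the same.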
Flood-related factors (FRFs)
The determination of key flood-related factors (FRFs) is essential for flood susceptibility modeling (Sanyal and Lu 2004), and they vary according to catchment characteristics (Waqas et al. 2021). Rainfall is considered the most influential factor in the occurrence of floods. Lawal et al. (2012) pointed out that several other flood-related factors also contribute significantly to flood hazards. Runoff along the catchment depends on the catchment characteristics (e.g., catchment area, topography, and LULC types) (Hölting and Coldewey 2019). In the current study, eleven flood-related factors (FRFs) were selected as thematic layers based on sound information from the literature (the most commonly used factors in flood vulnerability assessments), data availability for the current study area, and field investigation (Al-Juaidi et al. 2018; Kanani-Sadata et al. 2019; Liu et al. 2019; Paul et al. 2019; Wang et al. 2019; Vojtek and Vojteková 2019). These FRFs include altitude, slope aspect, lithology, land use/land cover (LULC), slope length (LS), topographic wetness index (TWI), slope angle, profile curvature, plan curvature, stream power index (SPI), and hydrolithology units (Fig. 5). They were generated and stored in spatial database themes with a grid cell size of 30 × 30 m in an ArcGIS environment for data processing. A digital elevation model (DEM) of the study area with a spatial resolution of 30 m was obtained from ALOS World 3D-30m, from which eight layers were generated. Of these, five factors (slope aspect, slope angle, altitude, plan curvature, and profile curvature) were extracted using ArcGIS 10.8 software, and the other three (TWI, LS, and SPI) were generated using SAGA software. The remaining factors, namely lithology, land use/land cover (LULC), and hydrolithology units, were extracted using remote sensing images (Landsat 8 OLI and Google Earth), geological maps, and field surveys.
Different types of FRFs were used in the present study, such as nominal (lithology, slope aspect, land use/land cover, and hydrolithology unit layers) and ordinal (altitude, TWI, slope angle, LS, profile curvature, plan curvature, and SPI).
Altitude
According to several authors, altitude is influenced by various factors (e.g., lithologic unit, wind action, precipitation, and erosion) (Waqas et al. 2021). Elevation is considered an influential factor in the occurrence of flooding: low-elevation (flat) regions are more susceptible to flooding than higher-elevation areas because water flows from high altitudes to lower areas (Kia et al. 2012; Cao et al. 2016). The altitude layer was extracted from the DEM using ArcGIS and ranges from 0 m to 2,173 m (Fig. 5a).
Slope-aspect
The slope aspect is the direction of the maximum inclination of the Earth’s surface. It affects the direction of runoff, which maintains the soil moisture (Chu et al. 2020). The slope aspect may indirectly affect flooding, as inclined shaded regions are characterized by relatively high soil moisture, indicating high runoff (Islam et al. 2021). The slope-aspect theme was created from the DEM in the ArcGIS platform and divided into nine categories (Fig. 5b).
Lithology
Because of the varying permeability of rocks and sediments in a watershed, lithological units play a crucial role in hydrological processes (variations in the quantity and rate of water flow and sediment production) (Ward and Robinson 2000). The drainage density depends on the type of surface material. Çelik et al. (2012) and Srivastava et al. (2014) indicated that a low drainage density is associated with highly resistant rock or highly permeable subsoil material. Stefanidis and Stathis (2013) concluded that flood hazard zones are influenced by geological units, especially torrential formations characterized by erodibility and permeability. In the current study, lithological units were generated from lithological maps (1:100,000 scale). Six main geological units were identified: (1) wadi deposits (alluvium), (2) sandstone, (3) limestone, (4) evaporites, (5) shales, and (6) basement rocks (Fig. 5c).
Land use/land cover (LULC)
Land use/land cover (LULC) type plays a critical role in runoff velocity, interception, infiltration, and evapotranspiration (Benito et al. 2010; Yalcin et al. 2011). Various LULC features can affect infiltration and surface flow generation in a catchment (Rahmati et al. 2015). Tehrany et al. (2019) indicated that forested areas can infiltrate more water into the subsurface than other LULC types. Many studies have shown that LULC types have a significant impact on distinguishing flood-vulnerable areas (Karlsson et al. 2017; Komolafe et al. 2018). The LULC layer was generated from 2018 Landsat-8 (OLI) satellite imagery and classified into five categories using supervised classification in ENVI 5.4 software: wetlands, bare rock, bare soil, built-up area, and sandy soil with trees (Fig. 5d).
Slope length (LS)
The slope length (LS) is one of the influential factors determining soil erosion: soil erosion accelerates with increasing slope length owing to the greater accumulation of surface runoff (Bera 2017). LS captures the combined effects of slope length and steepness and affects particle transport (soil loss) and upland (mountainous) hydrological processes (Park et al. 2019). In this study, LS was calculated from the DEM layer according to the slope gradient and specific catchment area using SAGA software, based on the universal soil loss equation (USLE) (Eq. 1) (Moore and Burch 1986):
$$LS={\left(\frac{{A}_{s}}{22.13}\right)}^{0.4}{\left(\frac{\text{sin}{\beta }}{0.0896}\right)}^{1.3}$$
1
where \({A}_{s}\) (m²) is the specific catchment area, and β is the slope angle in degrees. In this study, the slope length (LS) ranged from 0 to 59.1 (Fig. 5e).
Topographic wetness index (TWI)
The TWI reflects the variation in the quantity of water gathered in a basin (wetness values) and relates the specific catchment area to the gradient (Beven 2011; Gokceoglu et al. 2005). TWI can be strongly correlated with locations within a catchment that have a high potential for flooding (Chen and Yu 2011; Manfreda et al. 2011; Abdel Hamid et al. 2020). Tehrany et al. (2019) pointed out that flat areas can absorb more water than steep terrain. Accordingly, areas near drainage networks and flat lands (flood-prone areas) have higher TWI values than sloping areas (Meles et al. 2020; Zhang et al. 2020). The TWI value was calculated using Eq. (2) (Beven and Kirkby 1979):
$$TWI=\text{ln}\left(\frac{A}{\text{tan}{\beta }}\right)$$
2
where A is the cumulative basin area (m²), and β is the slope angle (in degrees) at a point. In this work, the TWI layer was created using SAGA-GIS software, with values ranging from 1.5 to 22.8 (Fig. 5f).
Slope angle
The slope angle is a crucial physiographic element for flood behavior and occurrence (Mukerji et al. 2009; Meraj et al. 2015). High-gradient areas allow less time for percolation, which accelerates runoff velocity; the resulting accumulation of immense runoff in lower-lying areas (around rivers or in flat areas) makes those areas more vulnerable to flooding (Stevaux et al. 2020). The slope-angle layer was generated from the DEM using ArcGIS. The slope angle ranged from 0° to 72° (Fig. 5g).
Plan and profile curvatures
The curvature represents the slope shape and the terrain morphology. It is one of the key terrain elements used in several geomorphometric works (Rau et al. 2019; Torcivia and López 2020). Curvature is a major flood-controlling factor in flood vulnerability mapping (Ahmadlou et al. 2019). Cao et al. (2016) reported that curvature has a significant impact on surface flow and infiltration. Shahabi et al. (2020) stated that areas with zero curvature values are more prone to flooding than areas with positive or negative curvatures. Curvature can be represented by the plan and profile curvatures. The plan curvature is directly correlated with the convergence and dispersion of surface runoff (Nasiri Aghdam et al. 2016), while Xiao et al. (2019) indicated that profile curvature impacts material deposition on the slope (by controlling whether the deposition of these materials increases or decreases). In the present study, plan and profile curvatures were generated from the DEM layer using ArcGIS software. The values of the plan and profile curvatures ranged from −0.0249 to 0.0233 and from −0.0193 to 0.0208, respectively (Fig. 5h and i).
Stream power index (SPI)
The SPI is a crucial hydrological factor that plays a vital role in assessing the spatial variation of flood-vulnerable areas (Deepak et al. 2020). SPI directly correlates with the erosive power of the catchment, the soil water content status in a basin, and the discharge relative to a specific area within the watershed (the power of flood water to flow downward) (Cao et al. 2016). High SPI values indicate high flood power, whereas lower values indicate that the terrain in the watershed has the potential to impound flow (Turoglu and Dolke 2011). The SPI of the catchment was calculated using Eq. (3) (Moore and Wilson 1992; Wu et al. 2020a).
$$SPI={A}_{s}\text{tan}{\beta }$$
3
where \({A}_{s}\) is the specific catchment area, and β is the local slope angle (in degrees). The SPI values in the current study range from 0 to 3.24 (Fig. 5j).
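The three DEM-derived indices above (Eqs. 1–3) can be computed cell by cell once the slope angle and specific catchment area are known. The following is a minimal NumPy sketch on three synthetic cells, not the SAGA-GIS implementation used in the study:

```python
import numpy as np

# Hedged sketch of Eqs. 1-3 on synthetic raster cells (the study used SAGA).
# A_s: specific catchment area (m^2 per unit contour width); beta: slope (deg).
A_s = np.array([50.0, 500.0, 5000.0])
beta = np.radians(np.array([2.0, 10.0, 30.0]))

LS = (A_s / 22.13) ** 0.4 * (np.sin(beta) / 0.0896) ** 1.3   # Eq. 1
TWI = np.log(A_s / np.tan(beta))                             # Eq. 2
SPI = A_s * np.tan(beta)                                     # Eq. 3

# Steeper cells with larger contributing area get higher LS and SPI values.
print(np.round(LS, 2), np.round(TWI, 2), np.round(SPI, 1))
```

Note that β must be converted to radians before applying `sin` and `tan`, since the equations define it in degrees.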
Hydrolithology units
Some soil types have a decisive influence on rainfall runoff mechanisms. The higher the infiltration rate, the less likely is the occurrence of flooding (Fluegel 1995; Phillips et al. 2019; Xie et al. 2019). In this study, a hydrolithology unit map was created by integrating Landsat 8 satellite imagery (OLI), Google Earth imagery, geological data, and field investigations. According to the national soil classes and soil taxonomy, the hydrolithology unit map of the study area was classified into three categories: well-drained, semi-drained, and impervious (Fig. 5k).
Theoretical background of methods used
Problems related to natural hazards, such as floods, landslides, and ground subsidence, have been identified and solved using various machine learning techniques (MLTs) (Park et al. 2014; Shi et al. 2016; Zhou et al. 2018; Ghorbanzadeh et al. 2019; Kavzoglu et al. 2019; Sevgen et al. 2019; Eini et al. 2020). Despite the continued advantages of MLTs as a powerful method, human expertise still plays an essential role in hazard assessment (Marjanović et al. 2011). In the current study, seven MLTs were utilized to evaluate their effectiveness in flood susceptibility mapping. These include SVM, RF, MARS, BRT, FDA, GLM, and MDA, which are discussed in detail in the following sections.
SVM model
The key elements of the SVM model are its use for classification and regression, which derive from statistical learning theory (Vapnik 1998, 2013; Christianini and Shawe-Taylor 2000). SVM is a supervised learning method that deals with binary classification problems (Amiri et al. 2019). It minimizes classification errors and determines the optimal separating boundary (Vapnik 1998), which provides a key advantage in effectively identifying and analyzing factors (Micheletti et al. 2014). SVM has been used to map flood-prone areas (Yang and Cervone 2019). Many authors have provided detailed studies on SVM techniques (e.g., Yao et al. 2008).
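As an illustration of the binary flood/non-flood classification described above, the following hedged scikit-learn sketch (an analogue of the R workflow, not the authors' code) trains an SVM on synthetic points described by two standardized flood-related factors:

```python
import numpy as np
from sklearn.svm import SVC

# Hedged illustration: a binary SVM separating synthetic flood (1) /
# non-flood (0) points, e.g. standardized altitude vs. TWI. Data synthetic.
rng = np.random.default_rng(0)
X_flood = rng.normal(loc=[-1.0, 1.0], scale=0.5, size=(60, 2))  # low, wet
X_dry = rng.normal(loc=[1.0, -1.0], scale=0.5, size=(60, 2))    # high, dry
X = np.vstack([X_flood, X_dry])
y = np.array([1] * 60 + [0] * 60)

svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
proba = svm.predict_proba([[-1.0, 1.0]])[0, 1]  # susceptibility at a low, wet cell
print(round(proba, 2))
```

With `probability=True`, the fitted model yields a continuous susceptibility score per cell rather than only a hard class label, which is what a susceptibility map requires.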
RF model
Random forest (RF) is an ensemble learning approach based on regression trees, where many classification trees are aggregated to quantify a classification (Calle and Urrea 2010; Micheletti et al. 2014; Thanh Noi and Kappas 2018; Hawryło et al. 2018). The RF model is a robust ML model owing to several advantages, including a large number of trees in the analysis, insensitivity to noise, unbiased estimation of the generalization error, acceptance of most types of data, and determination of significant variables (Breiman 2001; Rodrigues and De la Riva 2014; Kim et al. 2018). RF can overcome outliers in predictors, automatically deal with missing data, and increase diversity among classification trees (Breiman and Cutler 2004). The RF model was run in R software (version 3.5.3) using the "randomForest" package (Breiman and Cutler 2015).
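The variable-importance capability mentioned above can be sketched with scikit-learn's RF implementation (a stand-in for the R "randomForest" package) on synthetic data where only the first of three factors carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hedged sketch (scikit-learn analogue of R's randomForest): fit an RF on
# synthetic data where only factor 0 is informative, then inspect the
# variable importances the text refers to.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)  # class depends on factor 0 only

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
print(np.round(rf.feature_importances_, 2))  # factor 0 should dominate
```

The importances sum to 1 across factors, so they can be read directly as relative contributions.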
MDA model
MDA, a supervised classification algorithm, is a form of linear discriminant analysis (LDA) in which an observation is assigned to the closest group (Fraley and Raftery 2002). The normal distribution of variables is used to calculate the distance to the nearest group, assuming that the variability and correlation between variables are uniform (Lombardo et al. 2006). MDA applies multiple normal distributions within every class. The MDA discriminant value can be derived as a linear combination using Eq. (4) (Hair et al. 1998):
$$Y={W}_{1}{X}_{1}+{W}_{2}{X}_{2}+\dots +{W}_{n}{X}_{n}$$
4
where Y represents the discriminant value, Wi (i = 1, 2, 3, ..., n) are the discriminant weights, and Xi (i = 1, 2, 3, ..., n) are the independent variables. The MDA analysis was run in R software (version 3.5.3) using the "mda" package (Hastie et al. 2017).
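The linear discriminant score described above is simply a weighted sum of the independent variables; a minimal numeric illustration (weights and variable values are synthetic) is:

```python
import numpy as np

# Hedged numeric illustration of the discriminant value Y = W1*X1 + ... + Wn*Xn.
# Discriminant weights and factor values below are synthetic.
W = np.array([0.8, -0.3, 0.5])  # discriminant weights W1..W3
X = np.array([2.0, 1.0, 4.0])   # independent variables X1..X3 for one site

Y = W @ X                       # discriminant value: 0.8*2 - 0.3*1 + 0.5*4
print(Y)                        # 3.3 (up to floating-point rounding)
```

In MDA proper, such scores are computed per mixture component and the site is assigned to the nearest group; the linear combination above is the building block.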
MARS model
MARS is a powerful regression algorithm owing to its flexibility in predicting events (Adnan et al. 2019). MARS considers both linear and nonlinear relationships between independent and dependent factors and expresses these relationships as coefficients used to calculate the effects of these factors separately (Gu and Wahba 1991; Busto Serrano et al. 2020). MARS has been used in various applications to evaluate relationships in different disciplines (e.g., geophysics, climatology, ecology, and geomorphology) (Deichmann et al. 2002; Hjort and Luoto 2013; Abdulelah et al. 2019). It also allows the determination of the relative importance of the independent variables in the predictions (Adnan et al. 2019). MARS splits the dataset into multiple splines on an equivalent interval basis; each spline can be subdivided into subclasses by generating knots (Friedman 1991). The MARS predictor can be determined using Eq. 5, according to Hastie et al. (2001):
$$f\left(x\right)={\beta }_{0}+\sum _{j=1}^{P}\sum _{b=1}^{B}\left[{\beta }_{jb}^{\left(+\right)}\text{max}\left(0,{x}_{j}-{H}_{bj}\right)+{\beta }_{jb}^{\left(-\right)}\text{max}\left(0,{H}_{bj}-{x}_{j}\right)\right]$$
5
where x, f(x), P, and B are the input, the output, the number of predictor variables, and the number of basis functions, respectively. max(0, x − H) and max(0, H − x) are basis functions (BFs) and need not be present if their coefficients are 0. The H values are referred to as knots.
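The hinge pair in Eq. 5 can be made concrete with a one-variable sketch; the knot location and coefficients below are synthetic:

```python
import numpy as np

# Hedged sketch of the MARS hinge pair in Eq. 5: max(0, x - H) and
# max(0, H - x) around a knot H. The knot and coefficients are synthetic.
H = 2.0                           # knot location
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

bf_plus = np.maximum(0.0, x - H)  # active to the right of the knot
bf_minus = np.maximum(0.0, H - x) # active to the left of the knot

# One-variable MARS-style approximation: f(x) = b0 + b_plus*BF+ + b_minus*BF-
b0, b_plus, b_minus = 1.0, 0.5, -0.25
f = b0 + b_plus * bf_plus + b_minus * bf_minus
print(f)                          # piecewise-linear, with a kink at x = H
```

Each knot thus contributes a pair of piecewise-linear segments, and the forward/backward steps described next decide which such pairs are kept.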
Applying the MARS algorithm involves three steps: (1) applying a stepwise forward algorithm to select spline basis functions, (2) deleting BFs until the "best" set is found by applying a stepwise backward algorithm, and (3) giving the final MARS approximation some degree of continuity by applying a smoothing method. The generalized cross-validation (GCV) criterion was applied to delete BFs in order of least contribution, using Eq. 6 (Craven and Wahba 1979):
$$GCV=\frac{\frac{1}{N}\sum _{i=1}^{N}{\left[{y}_{i}-f\left({x}_{i}\right)\right]}^{2}}{{\left[1-\frac{C\left(B\right)}{N}\right]}^{2}}$$
6
where N is the number of data points, and C(B) is a complexity penalty that increases with the number of BFs in the model and is determined by Eq. 7:
$$\text{C}\left(\text{B}\right)=\left(B+1\right)+dB$$
7
Here, d represents the penalty for each BF incorporated in the model; it can also be considered a smoothing parameter. The MARS technique was run in R software (version 3.5.3) using the "MARS" package (Deichmann et al. 2002).
GLM model
The generalized linear model (GLM) is a linear regression model that can quantify and incorporate spatial and temporal variables (McCullagh and Nelder 1989; Guisan et al. 2002; Ozdemir and Altural 2013). The use of GLM can increase the accuracy and quality of the results because it uses multiple regression to develop a clear relationship between the dependent and independent variables (Scott et al. 1991). Moreover, it can predict numerous events, as it can identify the best regression model (Federici et al. 2007; Payne 2015). Several authors have applied GLM to different spatial models (Bolker et al. 2009; Dumbser et al. 2020). The relationship between the response variable and the explanatory variables can be constructed using the GLM link function (Ahmedou et al. 2016; Kéry and Royle 2016; Soch et al. 2017). The predictions and variances of the response factors were estimated using Equations (8) and (9):
$${\mu }_{i}=E\left[{Y}_{i}\right]= {g}^{-1}\left(\sum _{j}{X}_{ij}{\beta }_{j}+{\epsilon }_{i}\right)$$
8
$$var\left[{Y}_{i}\right]=\frac{\varphi V\left({\mu }_{i}\right)}{{\omega }_{i}}$$
9
Yi denotes the vector of response parameters, Xij is the matrix of explanatory parameters, βj is the vector of floating variables, εi is the interference terms, g(x) is the corresponding link function, V(x) is the variance function, ϕ is the dispersion parameter of V(x), and ωi is the weight of the ith observed value.
In this work, it is assumed that Y is the response parameter representing the flooded area in a grid cell, and Xi is the i-th flood-related factor. The occurrence probability of a flooding event Y is then represented by Eq. (10), and by logistic transformation, the link function g(yi) is represented by Eq. (11):
$$P=\frac{\text{exp}\left({c}_{0}+{c}_{1}{X}_{1}+{c}_{2}{X}_{2}+\dots +{c}_{i}{X}_{i}\right)}{1+\text{exp}\left({c}_{0}+{c}_{1}{X}_{1}+{c}_{2}{X}_{2}+\dots +{c}_{i}{X}_{i}\right)}$$
10
$$g\left({y}_{i}\right)={c}_{0}+\sum {c}_{i}{x}_{i}+ {\epsilon }_{i}$$
11
where P is the probability of occurrence of event Y, and \({c}_{0}\); \({c}_{1}\);...;\({c}_{i}\) are logistic regression coefficients, and εi is the residual error.
In the present study, R software was used to construct the GLM model. A Gaussian family was specified as the link function for normally distributed response data. The independent factors were entered into the model separately, using a smoothing spline with only two degrees of freedom in a polynomial of degree 2 to avoid overfitting (Aertsen et al. 2009).
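The logistic transformation in Eqs. 10–11 can be checked numerically. The coefficients and factor values below are synthetic, chosen only to show how a linear predictor maps to a flood probability:

```python
import numpy as np

# Hedged numeric sketch of Eq. 10: logistic probability of flooding from a
# linear predictor. Coefficients c and factor values X are synthetic.
c0 = -1.0
c = np.array([0.8, 0.4])              # coefficients of two flood factors
X = np.array([2.0, 1.5])              # factor values at one grid cell

eta = c0 + c @ X                      # link-scale linear predictor (Eq. 11)
P = np.exp(eta) / (1.0 + np.exp(eta)) # Eq. 10; equivalently 1/(1 + exp(-eta))
print(round(P, 3))                    # ≈ 0.769
```

Whatever the coefficients, Eq. 10 guarantees 0 < P < 1, which is what makes the logistic link suitable for a binary flooded/non-flooded response.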
FDA model
Ramsay and Dalzell (1991) proposed the FDA model as a statistical method for analyzing the effects of factors. The crucial concept of FDA is to treat an observed object with functional properties as a single entity, regardless of the order of the observed values (Battista et al. 2016; Wagner-Muns et al. 2018). It can also perform unsupervised discrimination, in which each class is divided into subcategories with unique values (Chamroukhi et al. 2012; Zou et al. 2019). FDA is a nonparametric method that is widely used in classification problems (Lu 2007; Seifi Majdar and Ghassemian 2017). Ray et al. (2019) summarized the FDA model as a combination of regression models, one fitted for each category in the modeling analysis when complex class models are applied. The basic tasks in applying the FDA model include 1) implementing a functional data representation by selecting training and testing datasets, 2) using functional principal component analysis (FPCA) to extract functional data features, 3) using machine learning methods to classify the data features, and 4) testing the datasets to verify the validity of the classification method. In this study, the FDA model was used to generate a flood vulnerability map using the species distribution modeling (SDM) package in R software (Naimi and Araújo 2016).
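The four steps above can be sketched loosely in scikit-learn, with ordinary PCA standing in for FPCA and a standard classifier for step 3. This is only a structural analogue on synthetic data, not the SDM-package FDA used in the study:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Loose, hedged analogue of the four FDA steps (FPCA swapped for ordinary
# PCA; data are synthetic): extract component features, then classify them.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
X[:, 0] *= 3.0                             # make the signal direction dominant
y = (X[:, 0] > 0).astype(int)

model = make_pipeline(PCA(n_components=3), LogisticRegression())
model.fit(X[:140], y[:140])                # steps 1-3: represent, extract, classify
acc = model.score(X[140:], y[140:])        # step 4: validate on held-out data
print(round(acc, 2))
```

The pipeline mirrors the FDA recipe: a feature-extraction stage feeding a classifier, validated on data withheld from training.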
BRT model
Friedman (2001) proposed BRT, which integrates statistical and machine-learning techniques. The advantages of the BRT model are: 1) the ability to improve model performance by fitting and combining several models, 2) no data transformation or outlier removal is required, 3) sophisticated nonlinear relationships can be fitted, and 4) interaction effects between variables are automatically accounted for (Schapire 2003; Elith et al. 2008; Park and Kim 2019). The combined strength of the regression tree and boosting algorithms can improve model accuracy and minimize variance (Aertsen et al. 2010). Model accuracy is improved by boosting, a powerful learning method that iteratively fits new trees to the residual errors of the existing tree ensemble (Doepke et al. 2017). The BRT model was run in R software (version 3.5.3) using the "brt" package (Ridgeway and Southworth 2013).
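The iterative residual-fitting described above can be observed directly in scikit-learn's gradient boosting implementation (an analogue of R's boosted regression trees, not the authors' code), whose per-iteration training loss shrinks as trees are added:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hedged sketch (scikit-learn analogue of BRT): boosting fits each new
# shallow tree to the residual errors of the current ensemble. Data synthetic.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
y = ((X[:, 0] + 0.5 * X[:, 1]) > 0).astype(int)

brt = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=3
).fit(X[:300], y[:300])

# Training deviance drops across iterations, showing the iterative fit.
print(round(brt.train_score_[0], 3), round(brt.train_score_[-1], 3))
print(round(brt.score(X[300:], y[300:]), 2))  # held-out accuracy
```

`max_depth=2` keeps the individual trees weak, which is the usual BRT configuration: many shallow trees combined by boosting rather than a few deep ones.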
The FRFs effectiveness and contribution
Multicollinearity analysis is a technique used to determine the effectiveness of independent variables in a model (Dormann et al. 2012). It is a statistical method in which independent parameters in a model are highly correlated using multiple regression techniques, and the parameters with high collinearity are deleted (Saha 2017). The multicollinearity technique uses two indicators, namely variance inflation factors (VIF) and tolerance (TOL) (Eqs. 12 and 13):
$$TOL= 1-{R}_{J}^{2}$$
12
$$VIF=\frac{1}{TOL}$$
13
\({R}_{J}^{2}\) represents the coefficient of determination of the regression of explanatory factor J on the remaining explanatory factors. Previous studies have shown that a TOL < 0.10 and a VIF > 5 indicate multicollinearity problems (Hosmer and Lemeshow 1989; Menard 2001).
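Eqs. 12–13 can be computed directly with a least-squares regression of each factor on the others. The sketch below uses synthetic data in which one factor is deliberately built to be nearly collinear with the other two:

```python
import numpy as np

# Hedged sketch of Eqs. 12-13: TOL and VIF for one explanatory factor from
# the R^2 of regressing it on the remaining factors. Data are synthetic;
# x2 is constructed to be nearly collinear with x0 and x1.
rng = np.random.default_rng(4)
x0, x1 = rng.normal(size=200), rng.normal(size=200)
x2 = 0.9 * x0 + 0.9 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x0, x1, x2])

def vif(X, j):
    """VIF of column j: regress X[:, j] on the other columns (least squares)."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1.0 - resid.var() / X[:, j].var()          # R_J^2
    tol = 1.0 - r2                                  # Eq. 12
    return 1.0 / tol                                # Eq. 13

print(round(vif(X, 2), 1))  # well above the VIF > 5 multicollinearity cutoff
```

In practice the factor with the largest VIF above the cutoff would be dropped and the VIFs recomputed, since removing one collinear factor changes the others' values.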
Evaluating the importance of the independent factors is crucial for flood susceptibility analysis. It can be applied to determine the contribution of the various flood-related factors and accurately determine their role in model production. Several methods have been applied to evaluate relationships between related factors and events; among those that have received much attention are random forest (RF) and partial least squares (PLS) (Wang et al. 2016; Huang et al. 2018). PLS was used in this study. PLS is a strong multivariate regression technique that enables a broad spectrum of analyses to be performed (Martens and Martens 2000). It has many advantages: it allows a quick understanding of the essential patterns of variation in the data; it is suitable for analyzing noisy, collinear, and even incomplete data; and it helps to detect errors in the input data (Wold et al. 2001). PLS was used for multivariate calibration of a dependent parameter against many independent parameters, making it suitable for selecting the critical factors in the analysis. Details of PLS functions and applications are explained in various studies (e.g., Hastie et al. 2001; Abdi 2010; Lowry and Gaskin 2014). In the present study, the contribution and importance of all FRFs to flood occurrence were evaluated using partial least squares (PLS).
Modelling prediction and performance
Evaluating the predictive accuracy and performance of the susceptibility models used is critical. The cross-validation approach using the receiver operating characteristic (ROC) curve and the area under the curve (AUC) has been applied quantitatively and graphically by various authors (Akgun et al. 2012; Ozdemir and Altural 2013; Youssef and Hegab 2019). The cross-validation approach offers many advantages, including quantitative evaluation of model prediction, determination of the better prediction approach, the ability to compare the predictive capabilities of different models, the ability to distinguish the least and most vulnerable areas, identification of the influencing factors and their contribution to prediction, evaluation of the effectiveness of the input parameters, and improvement of the quality of model prediction. The ROC method is a statistical indicator of model performance based on the rates of true and false positives (sensitivity and 1 − specificity) (Chung and Fabbri 2003; Mathew et al. 2009). An acceptable susceptibility model must have an AUC value between 0.5 and 1; a higher AUC value (equal to or close to 1.0) indicates greater effectiveness, accuracy, and reliability, whereas an AUC value of less than 0.5 indicates a random model (Marzban 2004). Sajedi-Hosseini et al. (2018) stated that the overall performance of a model can be categorized by its AUC value as follows: incompetent model (AUC from 0.5 to 0.6), poor performance (AUC from 0.6 to 0.7), moderate performance (AUC between 0.7 and 0.8), and high fitness and performance (AUC > 0.8).
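The AUC validation described above reduces to comparing predicted susceptibility scores against observed flood/non-flood labels. A minimal hedged sketch with synthetic labels and scores (mimicking a well-performing model, not any of the study's actual outputs):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hedged sketch of the AUC evaluation: compare predicted susceptibility
# scores against observed flood (1) / non-flood (0) labels. All values
# are synthetic, mimicking a well-performing model.
rng = np.random.default_rng(6)
y_true = np.array([1] * 50 + [0] * 50)
scores = np.concatenate([
    rng.uniform(0.4, 1.0, 50),  # flooded sites: mostly high susceptibility
    rng.uniform(0.0, 0.6, 50),  # non-flooded sites: mostly low susceptibility
])

auc = roc_auc_score(y_true, scores)
# Per the categories above, AUC > 0.8 indicates high fitness; 0.5 is random.
print(round(auc, 2))
```

The AUC equals the probability that a randomly chosen flooded site receives a higher score than a randomly chosen non-flooded site, which is why 0.5 corresponds to a random model.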