Study Area
The study area is the lower portion of Darby Creek (DC), along the southwest border of Philadelphia, PA, USA, shown in Fig. 143. The alluvial channel of the creek flows through a fully urbanized floodplain that is subject to frequent flooding, and the population residing near the creek is significantly exposed to floods44. The reach considered in this study extends approximately 15 river kilometers (rkm) from the Mt. Moriah Cemetery (upstream) to the confluence with the Delaware River (downstream) and carries alluvial deposits through an urbanized setting45. Darby Creek plays an important role in the adjacent environment and ecology, and it is also a flood-prone area46. It also offers a unique habitat for various plant and animal species47,48.
Preparing Hydraulic Dataset in iRIC
Hydraulic models are simulated in the iRIC platform to generate the dataset for the ML classifiers and the DNN regression model. iRIC is a numerical tool capable of modelling rainfall-runoff generation, flooding, and sediment dynamics. It receives terrain and hydraulic data (e.g., water surface elevation, roughness) for model calibration. FaSTMECH (Flow and Sediment Transport with Morphological Evolution of Channels) is used as the solver in this study50. The terrain data is discretized into computational cells of 5 m2 each. As higher discharges from the upstream side of the river are responsible for morphological changes, high discharge values from the largest flood events in Darby Creek are chosen to create scenarios for the AI models. Multiple scenarios are created using various constant discharge values at the upstream boundary of DC within a certain range. The discharge data for observed flood events in the time span of 14th July to 16th September is obtained from USGS peak streamflow data (USGS gage 01475548)51. A set of discharge values is chosen to execute the Machine Learning and Deep Learning algorithms used in this study: 37, 42, 45, 50, 52, 61, 83, 95, 99 and 164 m3 per second (cms). The outputs generated by iRIC are water surface elevation and flooding depth. A set of urban hydraulic features, i.e., the amount of impervious area within the contributing area of a specific location and the downstream distance from hydraulic structures such as stormwater outfalls and dams, is introduced in this study to integrate the effect of urban attributes with flooding extent and magnitude. Furthermore, the average slope of the contributing area is derived through GIS analysis and incorporated to represent flow accumulation at a specific location. Hydraulic model calibration requires elevation data for the floodplain and bathymetry of the channel. Water surface elevation upstream of the USGS Cobbs Creek gage at Mt. Moriah Cemetery for the flooding event of 30th August 2009 is utilized to calibrate the hydraulic model43.
AI Models
The quantification of flood extent and depth by the ML framework was tackled in three steps. First, exploratory analysis and feature engineering are performed to study and transform the entire dataset prepared from the multiple geographic and hydraulic features listed in Table 1. Second, after analyzing the dataset and applying the necessary transformations to the features, classifiers such as logistic regression (LR), K-nearest neighbors (KNN), decision trees (DT), and support vector machines (SVM) are trained on the data prepared in the first step to locate, or classify, the flooded locations for each upstream-discharge scenario. Third, an artificial DNN is used to build a regression model that predicts the depth of water within the computational domain. The ML classifiers and the neural-network-based regression model are evaluated using several error metrics, e.g., F1-score, Jaccard similarity score, and Root Mean Square Error (RMSE). The algorithms are tuned and optimized by altering the hyperparameters to reduce the error and obtain satisfactory performance. The ML workflow of flood prediction is described in Fig. 2. The entire process can be divided into groups of tasks, i.e., data collection, exploratory data analysis, feature engineering, model training, model evaluation, model deployment, and model improvement. Details are provided in the following sections. The steps are further categorized into distinct groups, namely transformer, estimator, and evaluator.
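The transformer, estimator, and evaluator grouping described above maps naturally onto a scikit-learn pipeline. The sketch below illustrates this on synthetic stand-in data (the feature values and labels here are illustrative, not the study's dataset): a MinMaxScaler transformer, a LogisticRegression estimator, and the F1 and Jaccard evaluators, with the 80/20 train-test split used in the study.

```python
# Sketch of the transformer -> estimator -> evaluator workflow with
# scikit-learn, on synthetic placeholder data (not the study's dataset).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler        # transformer
from sklearn.linear_model import LogisticRegression   # estimator
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, jaccard_score   # evaluators

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                    # stand-ins for elevation, slope, etc.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # stand-in for the flooded label

# 80/20 train-test split, as described in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = Pipeline([("scale", MinMaxScaler()),
                ("model", LogisticRegression())])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

f1 = f1_score(y_test, y_pred)
jac = jaccard_score(y_test, y_pred)
```

The same pipeline skeleton applies to the other classifiers by swapping the estimator step.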
Table 1
Full descriptions of the predictor and target variables used to train/test the ML classifiers and DNN regression model.
Features | Full descriptions |
x1 | x-coordinate of every location in the model domain |
x2 | y-coordinate of the same location |
x3 | Elevation of the same location in meters |
x4/y2 | Depth of water in meters |
x5/y1 | Flooded location (binary) |
x6 | Average slope of the contributing area of every point, in percent |
x7 | Number of impervious locations in the contributing area |
x8 | Downstream distance from the stormwater outfalls |
x9 | Downstream distance from the dams |
x10 | Upstream river discharge in m3/s |
Feature Engineering
Scikit-learn is used as the ML library for feature engineering in Python52. It offers several classification, regression, and clustering algorithms, including LR, KNN, DT, and SVM, which are used as binary classifiers for identifying flooded locations in this study. Modules needed for the ML and Deep Learning algorithms, such as optimization, linear algebra, integration, interpolation, and special functions, can be accessed through SciPy41. The independent variables for the binary classifiers and the DNN regression model are listed in Table 1. Flooded location, denoted y1, is used as the target variable for the binary classifiers, and water depth, y2, is the target variable for the DNN model. Spatial information, coordinates, and elevation values are obtained from the original Digital Elevation Model (DEM) of the study area using ArcGIS Pro. Water depth and discharge values are extracted by simulating multiple hydraulic models in the iRIC platform. The average slope and the number of impervious cells of the contributing area of every point of the DEM are urban hydraulic features, which have not previously been introduced as training features for AI models. ArcPy, a Python site package that offers an effective and efficient way to perform geographic data analysis, data conversion, data management, and map automation, was utilized to generate the contributing area of every cell in the model domain53; each contributing area corresponds to the upstream area draining to that cell. No modification of the data type was needed for the flooded-location variable, as iRIC-FaSTMECH generates it directly as a binary variable. The main data frame is constructed by concatenating the datasets derived from the different upstream discharge (Q) scenarios.
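The concatenation of per-scenario datasets into one main data frame could be sketched as follows. In the study each scenario table comes from an iRIC-FaSTMECH run; here the per-scenario tables are synthetic placeholders, with the constant upstream discharge of each scenario tagged as feature x10 (the column names and depth formula are illustrative only).

```python
# Hedged sketch: building the main data frame by concatenating one
# table per upstream-discharge scenario. Values are synthetic; in the
# study each table holds iRIC-FaSTMECH outputs for the model domain.
import numpy as np
import pandas as pd

discharges = [37, 42, 45, 50, 52, 61, 83, 95, 99, 164]  # m3/s, from the text
n_cells = 100  # placeholder for the number of computational cells

frames = []
for q in discharges:
    elev = np.linspace(0.0, 5.0, n_cells)   # stand-in elevation profile
    df = pd.DataFrame({
        "x3_elevation_m": elev,
        "y2_depth_m": np.maximum(0.0, q / 50.0 - elev),  # toy depth model
        "x10_discharge_cms": q,              # constant within each scenario
    })
    frames.append(df)

# One row per (cell, scenario) pair, as in the concatenated dataset
main_df = pd.concat(frames, ignore_index=True)
```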
Before initiating the learning process, feature importance was analyzed. The feature engineering tasks used in this study to prepare the datasets for the ML/DL algorithms include numerical imputation, outlier detection and removal based on standard deviation, splitting into training/testing datasets, and scaling by normalization. The train-test split proportion is set to 80/20 for both the ML classifiers and the DNN regression model: eighty percent (80%) of the data is used for training and the rest for testing. Eq. 1 shows how the normalization of the features is performed, where X denotes the feature vector including all the features used to train/test the models. Preparation of the dataset for training the DNN is identical to the preparation of the training dataset for the ML classifiers.
\({X}_{norm}=\frac{X-{X}_{min}}{{X}_{max}-{X}_{min}}\) | (1) |
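Eq. (1) applied column-wise to a feature matrix can be written in a few lines; the array below is a small illustrative example, not the study's feature matrix.

```python
# Min-max normalization of Eq. (1), applied per feature column.
import numpy as np

X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 50.0]])   # illustrative feature matrix, one column per feature

X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)   # every column mapped to [0, 1]
```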
Identifying Flooded Locations with ML Classifiers
Logistic Regression (LR)
Linear regression finds a function that relates a continuous dependent variable, y, to one or more predictors (independent features x1, x2, etc.). LR is a variation of linear regression, used when the dependent variable/outcome, y1, is categorical. It generates a formula that forecasts the probability of each category as a function of the independent features. Logistic regression fits a special s-shaped curve by taking the linear regression output and converting it into a probability with the sigmoid function 𝜎54.
\({\text{h}}_{{\theta }}\left(\text{x}\right)={\sigma }\left({{\theta }}^{\text{T}}\text{X}\right)=\frac{{\text{e}}^{({{\theta }}_{0}+{{\theta }}_{1}{\text{x}}_{1}+{{\theta }}_{2}{\text{x}}_{2}+...)}}{1+{\text{e}}^{({{\theta }}_{0}+{{\theta }}_{1}{\text{x}}_{1}+{{\theta }}_{2}{\text{x}}_{2}+...)}}\) | (2) |
The probability of category 1 (a location being flooded) = 𝑃(𝑌=1|𝑋) = \({\sigma }\left({{\theta }}^{\text{T}}\text{X}\right)=\frac{{\text{e}}^{\left({{\theta }}^{\text{T}}\text{X}\right)}}{1+{\text{e}}^{\left({{\theta }}^{\text{T}}\text{X}\right)}}\). Therefore, LR passes the features (e.g., x1 = elevation, x2 = slope of the contributing area, x3 = water depth, etc.) through the logistic/sigmoid function and treats the outcome as a probability. The goal of the LR algorithm is to identify the best parameters θ for ℎ𝜃(𝑥) = 𝜎(𝜃𝑇𝑋), such that the algorithm forecasts whether a cell in the model domain is flooded or not.
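Eq. (2) amounts to a dot product followed by the sigmoid. The sketch below makes this concrete with made-up parameter values θ; in the study these are learned from the training data.

```python
# Sketch of Eq. (2): P(flooded | x) = sigmoid(theta^T x).
# The theta values here are illustrative, not fitted coefficients.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 0.8])   # [theta_0, theta_1, theta_2], made up
x = np.array([1.0, 0.3, 2.0])        # [1, x1, x2] with the bias term prepended

p_flooded = sigmoid(theta @ x)       # P(Y = 1 | X)
label = int(p_flooded >= 0.5)        # classify as flooded if p >= 0.5
```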
Decision Tree (DT)
Decision tree learning is one of the predictive modelling approaches used in statistics, data mining, and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item, e.g., the features listed in Table 1 (represented in the branches), to conclusions about the item's target value, e.g., the binary decision on whether a location is flooded (represented in the leaves)55. The DecisionTreeClassifier from scikit-learn is used to perform the classification of flooded locations44.
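A minimal DecisionTreeClassifier sketch is shown below on synthetic stand-in features (an elevation-like and a depth-like variable); the study's model is trained on the full feature set of Table 1.

```python
# DecisionTreeClassifier sketch on synthetic data; the labeling rule
# ("flooded" if depth exceeds elevation) is a toy stand-in.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(300, 2))       # columns: elevation-like, depth-like
y = (X[:, 1] > X[:, 0]).astype(int)        # toy "flooded" label

# max_depth limits tree growth, one common way to control overfitting
tree = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X, y)
train_acc = tree.score(X, y)
```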
Support Vector Machine (SVM)
SVM works by mapping data to a high-dimensional feature space so that data points can be categorized even when the data are not otherwise linearly separable. A separator between the categories is found, and the data are transformed in such a way that the separator can be drawn as a hyperplane. Following this, the characteristics of new data can be used to predict the group to which a new record should belong56.
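This mapping to a high-dimensional space is what the kernel does. The sketch below uses scikit-learn's SVC with an RBF kernel on a synthetic circular pattern that is not linearly separable in the original two dimensions (the data are placeholders, not the study's features).

```python
# SVC sketch: an RBF kernel implicitly maps the data to a space where a
# separating hyperplane exists. Synthetic non-linearly-separable data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)  # inside/outside a circle

svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
acc = svm.score(X, y)
```

A linear kernel would fail on this pattern, which is exactly the situation the kernel trick addresses.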
K Nearest Neighbors (KNN)
The principle of KNN is based on the concept that the k closest objects, or most similar cases, in the p-dimensional feature space (the number of dimensions is identical to the number of features listed in Table 1) determine the class of an unknown observation, i.e., whether a location is flooded. KNN assigns each of the n observations (rows in the flood prediction data frame) to the class most common among its k nearest neighbors. In the nearest-neighbor limit this approach partitions the entire data space into Voronoi cells46. When features are measured in different physical units with vastly varying scales, normalizing the training features can improve the accuracy of the KNN algorithm, as it depends on distances between data points for classification47.
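The scale-sensitivity point can be demonstrated directly: when one feature spans a numerically huge range, it dominates the Euclidean distance and drowns out the feature that actually carries the signal. The features and labeling rule below are synthetic stand-ins, not the study's data.

```python
# KNN sketch showing why normalization matters: a large-scale feature
# (here an area-like value) dominates distances unless features are
# rescaled. Data and labels are synthetic placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)
elev = rng.uniform(0, 10, 300)           # meters (small numeric range)
area = rng.uniform(1e4, 1e6, 300)        # m2 (huge numeric range, irrelevant)
X = np.column_stack([elev, area])
y = (elev < 5).astype(int)               # label depends only on elevation

# Without scaling, the area column dominates the distance metric
acc_raw = KNeighborsClassifier(n_neighbors=5).fit(X, y).score(X, y)

# After min-max scaling, both features contribute comparably
X_scaled = MinMaxScaler().fit_transform(X)
acc_scaled = KNeighborsClassifier(n_neighbors=5).fit(X_scaled, y).score(X_scaled, y)
```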
Predicting Flood Depth with Deep Neural Network
The goal of the DNN regression model is to predict the water depth (y2) using the existing features derived from the hydraulic model and GIS data. To do this, the full set of hydraulic variables/features listed in Table 1 is used to train/test the DNN model. The open-source library TensorFlow is used in this study to construct the DNN model, as it has a particular focus on the training and inference of DNNs48. Training a model with TensorFlow Keras typically starts by defining the model architecture.
The input layer contains the features, denoted xi in general, as in the binary classification problem. The weights applied to the features, the aggregation of multiple features, and the activations before the output layer are denoted W, z, and a, respectively. Finally, the target variable (water depth) is generated by the output layer. In Fig. 3(a), it can be observed that introducing a neural network improves the prediction performance significantly by introducing non-linearity between the input and target features. The activation function used to introduce this non-linearity is the ReLU (rectified linear unit) function, shown in Fig. 3(b). This function returns the standard ReLU activation, max(x, 0), the element-wise maximum of the input tensor x and 0. The DNN has four layers in total, including a normalized input feature layer, two hidden layers, and a linear single-output layer. The network has 4,609 trainable parameters and 11 non-trainable parameters.
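The forward pass through this architecture can be sketched in plain NumPy. The layer widths and weight values below are illustrative placeholders; the study's actual model is built with TensorFlow Keras and its weights are learned during training.

```python
# NumPy sketch of the forward pass: normalized inputs -> two hidden
# ReLU layers -> single linear output (predicted water depth).
# Layer widths and random weights are illustrative only.
import numpy as np

def relu(z):
    # ReLU activation: element-wise max(z, 0), as described in the text
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
n_features = 8   # e.g., the predictor features of Table 1

W1, b1 = rng.normal(size=(n_features, 16)), np.zeros(16)  # hidden layer 1
W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)          # hidden layer 2
W3, b3 = rng.normal(size=(16, 1)), np.zeros(1)            # linear output layer

x = rng.uniform(size=(1, n_features))   # one normalized input sample
a1 = relu(x @ W1 + b1)                  # a = relu(z), z = x W + b
a2 = relu(a1 @ W2 + b2)
depth_pred = a2 @ W3 + b3               # linear output: predicted depth
```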
Urban hydraulic feature importance is studied by analyzing the sensitivity of the target variable, water depth, to changes in the feature values, and by the Permutation Feature Importance (PFI) technique, within the computational domain. The values of impervious area, average slope of the contributing area, and downstream distance (DD) from the stormwater outfalls (SO) and dams are varied (by 5%, 10%, and 20%) to observe the impact on the target variable in the DNN regression model. The RMSE values are obtained from the difference between the series of the target variable (water depth) produced by the DNN model with the changed features and the series produced before the change. In the PFI technique, the DNN model is run with the values of one specific feature, e.g., the impervious area of the contributing area, permuted/shuffled while the other features are kept unchanged, and the change in the RMSE value is recorded41.
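The PFI loop can be sketched with a simple regressor standing in for the DNN: shuffle one feature column at a time, re-evaluate, and record the increase in RMSE. The data, model, and feature meanings below are synthetic placeholders, not the study's trained DNN.

```python
# Hedged sketch of Permutation Feature Importance: a feature whose
# shuffling raises RMSE the most matters most to the model. A linear
# model on synthetic data stands in for the study's DNN.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))   # stand-ins for imperviousness, slope, DD
# Toy target: feature 0 matters most, feature 1 a little, feature 2 not at all
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X, y)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

base = rmse(y, model.predict(X))
importance = {}
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])   # shuffle one feature only
    importance[j] = rmse(y, model.predict(Xp)) - base  # RMSE increase
```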