An urban flooding nowcasting method was developed to improve the dynamic nowcasting of urban flooding using a data-driven model and real-time monitoring data. The method consists of three modules: data sources, real-time deduction of water levels across the whole system with a machine learning (ML) algorithm, and rapid nowcasting of urban flooding with a data-driven method.
As shown in Fig. 1, these modules are data sources (the dataset is established with a hydraulic model), real-time monitoring (monitoring sites are selected and the system-wide water level is deduced from them) and urban flooding prediction.
Figure 1 Flowchart for construction of urban flooding prediction with real-time data (UFP-RD) methods
2.1 Data sources
A hydraulic model was constructed in InfoWorks ICM to simulate the urban drainage system and flooding scenarios; it comprises a conduit drainage model and an urban flood model. Operation data for the river pump gates were obtained from the management department, and the model was calibrated with these data and with sensor measurements. Over several rainfall events used for validation, the \({R}^{2}\) of the hydraulic model at each flow meter was 0.92–0.98. Then, 120 different rainfall events were simulated and used to train the UFP-RD model.
2.2 Real-time monitoring
Optimizing the arrangement of monitoring sites is an important part of real-time monitoring: the aim is to obtain as much information about the urban drainage system as possible from a limited number of sites. We therefore use Principal Component Analysis (PCA) to reduce the dimensionality of the data and optimize the monitoring sites. The principal components obtained from the training set are first ranked by eigenvalue, and for each component the point with the maximum loading is taken as a candidate monitoring site. The top-ranked candidates are then selected according to the number of monitoring points required.
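The ranking step described above can be sketched as follows. This is a minimal illustration rather than the procedure used in this study: it assumes PCA is applied to the covariance of mean-centred simulated depths (no scaling), that a node already selected by a higher-ranked component is skipped, and the function name `select_monitoring_sites` is hypothetical.

```python
import numpy as np


def select_monitoring_sites(depths, n_sites):
    """Pick candidate monitoring sites from PCA loadings.

    depths  : array-like of shape (n_samples, n_nodes) with simulated depths
    n_sites : number of monitoring sites required
    """
    X = np.asarray(depths, dtype=float)
    # Assumption: PCA on the covariance matrix of the mean-centred data.
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # rank components by eigenvalue

    sites = []
    for k in order:
        # Node with the largest absolute loading on component k.
        node = int(np.argmax(np.abs(eigvecs[:, k])))
        # Assumption: a node chosen by an earlier component is not repeated.
        if node not in sites:
            sites.append(node)
        if len(sites) == n_sites:
            break
    return sites
```

Because distinct components can share the same maximum-loading node, the function may return fewer than `n_sites` indices; how such ties should be handled is a design choice not specified here.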
We use ML techniques to quantify the cross-scale correlation between local and global water levels. XGBoost is selected as the ML algorithm because it is fast, efficient, accurate and fault-tolerant compared with other ML methods (Chen and Guestrin 2016). The inputs of the XGBoost model are the water depths at the monitoring sites and the simulated precipitation events from the training set, and its output is the water depth at all other points in the system, excluding the monitoring sites. The underlying theory and calculation process are given in the supplementary materials.
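To make the input and output layout concrete, the sketch below shows one way the simulated results might be arranged into training samples. The function name `build_local_global_samples`, the per-time-step row layout and the use of a single rainfall value per step are assumptions made for illustration, not details taken from this study.

```python
def build_local_global_samples(depths, rainfall, monitored):
    """Arrange simulated results into (features, targets) pairs.

    depths    : list of rows, one per time step; each row holds the
                simulated water depth at every node
    rainfall  : list of rainfall values, one per time step (assumed layout)
    monitored : indices of the nodes chosen as monitoring sites
    """
    n_nodes = len(depths[0])
    monitored_set = set(monitored)
    # Every node that is not a monitoring site becomes a prediction target.
    target_nodes = [j for j in range(n_nodes) if j not in monitored_set]

    features, targets = [], []
    for row, rain in zip(depths, rainfall):
        # Inputs: depths at the monitoring sites plus the rainfall value.
        features.append([row[j] for j in monitored] + [rain])
        # Outputs: depths at all remaining nodes.
        targets.append([row[j] for j in target_nodes])
    return features, targets, target_nodes
```

The resulting `features` and `targets` could then be passed to a regressor of choice, for example one XGBoost regressor per target node.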
2.3 Urban flooding risk prediction
A data-driven method is used to predict the water level at the monitoring sites, which forms the basis of the urban flooding prediction model. In this study, an XGBoost model is used to build a time-series model of the water level at the monitoring sites; its inputs are an accurate precipitation forecast and the monitored water levels at the corresponding time (Fang et al. 2021). The predicted water levels at the monitoring sites are then used as the input of the local-global deduction (LGD) model (Section 2.2) to obtain the urban flooding prediction.
Figure 2 illustrates the proposed UFP-RD model used for urban flooding risk prediction. The water depths observed at the monitoring sites at the current time are fed to the UFP-RD model to predict urban flooding over the prediction step. When the observations for the next time step become available, the model is run again with them and the prediction results are updated. The predicted water depth is then converted to a probabilistic flood risk map. For the conduit component of the drainage system, the risk can be expressed as:
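The two-stage chain described above (forecast the levels at the monitoring sites, then deduce the levels elsewhere with the LGD model) can be sketched as below. Both models are passed in as plain callables; `site_model` and `lgd_model` are hypothetical stand-ins for the trained XGBoost models, and the feature ordering (levels followed by rainfall) is an assumption.

```python
def predict_flooding(forecast_rain, observed_levels, site_model, lgd_model):
    """Run one nowcast step.

    forecast_rain   : forecast precipitation for the prediction step
    observed_levels : current water levels at the monitoring sites
    site_model      : callable mapping [levels..., rain] to predicted
                      levels at the monitoring sites (hypothetical)
    lgd_model       : callable mapping [levels..., rain] to predicted
                      levels at the remaining nodes (hypothetical)
    """
    # Stage 1: time-series prediction at the monitoring sites.
    site_inputs = list(observed_levels) + [forecast_rain]
    predicted_sites = site_model(site_inputs)

    # Stage 2: local-global deduction for the rest of the system.
    lgd_inputs = list(predicted_sites) + [forecast_rain]
    predicted_other = lgd_model(lgd_inputs)
    return predicted_sites, predicted_other
```

Keeping the two stages as separate callables means either model can be retrained or replaced without changing the other.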
$$\begin{array}{c}{\widehat{RD}}_{i}=\left\{\begin{array}{c}1-{\left(\frac{{\widehat{H}}_{F,i}-{H}_{B,i}}{{H}_{T,i}-{H}_{B,i}}\right)}^{2}, {\widehat{H}}_{F,i}\le {H}_{B,i}\\ 1, {\widehat{H}}_{F,i}>{H}_{B,i}\end{array}\right. \#\left(1\right)\end{array}$$
where \({\widehat{RD}}_{i}\) is the flood risk of the \(i\)th manhole in this rainfall event, \({\widehat{H}}_{F,i}\) is the predicted water level in the \(i\)th manhole at the most unfavorable moment, \({H}_{B,i}\) is the bottom elevation of the \(i\)th manhole, and \({H}_{T,i}\) is the ground elevation at the \(i\)th manhole. For the urban ground segment, the risk can be expressed as:
$$\begin{array}{c}{\widehat{RG}}_{i}=\left\{\begin{array}{c}0, E\left({\widehat{H}}_{F,i}\right)\le 2\\ \frac{\left({\widehat{H}}_{F,i}-{H}_{G,i}\right)-2}{13}, 2<E\left({\widehat{H}}_{F,i}\right)\le 15\\ 1, E\left({\widehat{H}}_{F,i}\right)>15\end{array}\right. \#\left(2\right)\end{array}$$
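where \({\widehat{RG}}_{i}\) is the flood risk at the \(i\)th ground point in this rainfall event, \({\widehat{H}}_{F,i}\) is the predicted water level at the \(i\)th ground point at the most unfavorable moment, and \({H}_{G,i}\) is the ground elevation at the \(i\)th ground point. Consistent with the middle branch of Eq. (2), \(E\left({\widehat{H}}_{F,i}\right)\) is read here as the predicted ponding depth \({\widehat{H}}_{F,i}-{H}_{G,i}\).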
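The piecewise mapping of Eq. (2) can be written as a short function. The sketch assumes that \(E\left({\widehat{H}}_{F,i}\right)\) is the ponding depth \({\widehat{H}}_{F,i}-{H}_{G,i}\) and that levels and elevations are expressed in the same unit as the thresholds 2 and 15; the function name `ground_risk` and its parameter names are hypothetical.

```python
def ground_risk(predicted_level, ground_elevation, lower=2.0, upper=15.0):
    """Flood risk at a ground point following the piecewise form of Eq. (2).

    Assumption: the depth tested against the thresholds is
    predicted_level - ground_elevation, in the same unit as the thresholds.
    """
    depth = predicted_level - ground_elevation
    if depth <= lower:
        return 0.0  # ponding at or below the lower threshold: no risk
    if depth > upper:
        return 1.0  # ponding above the upper threshold: full risk
    # Linear ramp between the thresholds: (depth - 2) / 13 by default.
    return (depth - lower) / (upper - lower)
```

With the default thresholds, a ponding depth of 8.5 lies midway between 2 and 15 and therefore maps to a risk of 0.5.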
Figure 2 The schematic of UFP-RD model prediction process
2.4 Model evaluation
To evaluate the proposed model, we compared UFP-RD with a conventional ML algorithm (the XGBoost model) for urban flooding risk prediction. Model performance was evaluated with four indicators: Maximum Error (MAXE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and the Coefficient of Determination (\({R}^{2}\)), defined in Eqs. (3)-(6).
$$\begin{array}{c}MAXE=Max\left(\left|{y}_{i}-{\widehat{y}}_{i}\right|\right)\#\left(3\right)\end{array}$$
$$\begin{array}{c}MAE=\frac{1}{m}{\sum }_{i=1}^{m}\left|{y}_{i}-{\widehat{y}}_{i}\right|\#\left(4\right)\end{array}$$
$$\begin{array}{c}RMSE=\sqrt{\frac{1}{m}{\sum }_{i=1}^{m}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}\#\left(5\right)\end{array}$$
$$\begin{array}{c}{R}^{2}=1-\frac{{\sum }_{i=1}^{m}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{{\sum }_{i=1}^{m}{\left({y}_{i}-\stackrel{-}{y}\right)}^{2}}\#\left(6\right)\end{array}$$
where \({y}_{i}\) is the measured value of FMPs, \({\widehat{y}}_{i}\) is the predicted value of FMPs, \(\stackrel{-}{y}\) is the average value of \({y}_{i}\), and \(m\) is the number of samples. Values of MAXE, MAE and RMSE closer to 0 indicate better model results. \({R}^{2}\) does not exceed 1 (it can be negative when the model fits worse than the mean of the measurements), and the closer it is to 1, the better the fit (Chu et al. 2020).
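For reference, Eqs. (3)-(6) translate directly into a few lines of code. The sketch below uses only the standard library; the function name `evaluate` and the dictionary keys are illustrative choices.

```python
import math


def evaluate(measured, predicted):
    """Compute MAXE, MAE, RMSE and R2 as defined in Eqs. (3)-(6)."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(measured)
    errors = [y - y_hat for y, y_hat in zip(measured, predicted)]

    maxe = max(abs(e) for e in errors)                # Eq. (3)
    mae = sum(abs(e) for e in errors) / m             # Eq. (4)
    rmse = math.sqrt(sum(e * e for e in errors) / m)  # Eq. (5)

    mean_y = sum(measured) / m
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    # Eq. (6); undefined (division by zero) if all measurements are equal.
    r2 = 1.0 - ss_res / ss_tot
    return {"MAXE": maxe, "MAE": mae, "RMSE": rmse, "R2": r2}
```

For example, with measurements `[1, 2, 3, 4]` and predictions `[1, 2, 3, 6]`, the single error of 2 gives MAXE = 2, MAE = 0.5, RMSE = 1 and \({R}^{2}\) = 1 - 4/5 = 0.2.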
In addition to these indicators, two probabilistic indicators, the Cumulative Distribution Function (CDF) Consistency Histogram (CCH) and the Metric Consistency Deviation (CD), were used to evaluate the consistency of the predicted and measured frequency distribution histograms (Chen et al. 2020); the method was slightly modified in this study. The CCH is a bar chart with the same number of columns as the corresponding CDF, and the value of its \(i\)th column is defined as:
$$\begin{array}{c}Bin {frequency}_{i}=\frac{\frac{{bin}_{i}^{e}}{{bin}_{i}^{o}}}{\left({\sum }_{j=1}^{n}\frac{{bin}_{j}^{e}}{{bin}_{j}^{o}}\right)}\#\left(7\right)\end{array}$$
where \({bin}_{i}^{e}\) is the frequency of the predicted values in the \(i\)th column of the CDF, \({bin}_{i}^{o}\) is the frequency of the measured values in the \(i\)th column of the CDF, and \(n\) is the number of columns in the CDF. When the model performs very well, the CDFs of the predicted and measured values are identical, and in that case every column of the CCH equals \(\frac{1}{n}\). Because visual assessment of the CCH is highly subjective, Chen et al. proposed CD as a more objective measure of the CCH (Chen et al. 2020):
$$\begin{array}{c}CD=\frac{n}{2n-2}{\sum }_{i=1}^{n}\left|{bin}_{i}-\frac{1}{n}\right|\#\left(8\right)\end{array}$$
where \(n\) is the number of columns in the CCH and \({bin}_{i}\) is the value of the \(i\)th column. CD lies between 0 and 1, and the closer it is to 0, the more stable the model.
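Eqs. (7) and (8) can be checked numerically with the sketch below. The function names `cch` and `consistency_deviation` are hypothetical, and raising an error when a measured bin frequency is zero is an assumption, since Eq. (7) is undefined in that case.

```python
def cch(predicted_freq, measured_freq):
    """Column values of the CCH following Eq. (7)."""
    if len(predicted_freq) != len(measured_freq):
        raise ValueError("both histograms need the same number of columns")
    ratios = []
    for e, o in zip(predicted_freq, measured_freq):
        if o == 0:
            # Assumption: empty measured bins are rejected, not smoothed.
            raise ValueError("measured bin frequency must be non-zero")
        ratios.append(e / o)
    total = sum(ratios)
    if total == 0:
        raise ValueError("all predicted bin frequencies are zero")
    # Normalise so that the column values sum to 1.
    return [r / total for r in ratios]


def consistency_deviation(bins):
    """CD following Eq. (8): 0 for a uniform CCH, 1 for a single spike."""
    n = len(bins)
    if n < 2:
        raise ValueError("CD needs at least two columns")
    return n / (2 * n - 2) * sum(abs(b - 1.0 / n) for b in bins)
```

The factor \(\frac{n}{2n-2}\) normalises the sum: when all of the mass falls in one column, the absolute deviations add up to \(\frac{2\left(n-1\right)}{n}\), so CD equals 1, whereas identical predicted and measured histograms give a uniform CCH and a CD of 0.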