In the process of rapid urbanization, the underlying surface conditions of cities have undergone great changes, causing larger and faster hydrological responses (Wang et al. 2020). Moreover, extreme rainfall events triggered by global warming (Tabari 2020) have caused many cities around the globe to suffer frequent and severe urban floods (Wu et al. 2020). According to the SHELDUS database (https://cemhs.asu.edu/sheldus), from 1960 to 2016, direct property losses caused by floods in urban areas amounted to $107.8 billion, affecting 20141 urban counties in the U.S. What makes this problem more complicated is that floodwater moves and ponding levels change over time (Alizadeh Kharazi and Behzadan 2021). Therefore, real-time monitoring of dynamic ponding levels during urban floods is urgently needed; it is essential for emergency managers to prioritize relief efforts and plan effective disaster prevention and mitigation measures (Rosser et al. 2017).
Traditionally, the most common methods for monitoring urban flood information are instrumental measurement and remote sensing. Instrumental measurement relies on various sensors based on different principles and is convenient, fast, and highly precise (Huang and Shu 2016). Its disadvantage is that, owing to high cost and susceptibility to external interference, almost no sensors dedicated to urban flood monitoring are deployed on streets (Wang et al. 2018). Remote sensing, in turn, covers larger spatial scales than other methods (Joint and Groom 2000), but the data it provides tend to have low spatial and temporal resolution (Rosser et al. 2017), which does not match the highly heterogeneous urban underlying surface (Wang et al. 2017). Researchers have therefore been seeking an efficient, low-cost monitoring method that meets data accuracy requirements.
With the continuous development of information technology, the construction of the “smart city” (Batty et al. 2012) has emerged. Big data is a crucial technology supporting the realization of “smartness” in all fields of smart cities (Lim et al. 2018; Bibri 2018). With the arrival of the big data era, data sources have expanded greatly. Researchers and practitioners have begun to focus on various new data sources, such as surveillance video, social media, and mobile applications, which create a new opportunity to collect large amounts of informative image data (Witherow et al. 2019).
Surveillance video and social media record the process of urban pluvial floods in the form of images. The application of images obtained from these two new data sources, usually georeferenced, time-stamped, and cost-effective, to monitoring urban floods has received increasing attention. For example, Alizadeh Kharazi and Behzadan (2021) presented a deep neural network approach to estimate ponding areas from street photos, and Liu et al. (2015) manually read ponding levels from surveillance videos.
Computer vision, a field of study that seeks to develop techniques to help computers “see” and understand the content of digital images such as photographs and videos, provides key technical support for extracting flood information from images. Computer vision tasks are commonly divided into five categories: image classification, object detection, target tracking, semantic segmentation, and instance segmentation (Li et al. 2020). The technology has spawned many fast-growing, practical applications, such as face recognition (Sinha et al. 2006), image retrieval (Paulin et al. 2015), computer games (Hämäläinen and Höysniemi 2003), biometric recognition (Fada et al. 2017), and smart cars (Luu et al. 2019).
There have been preliminary applications of computer vision to obtaining urban flood information, which can be classified into two topics: extracting ponding areas and estimating ponding levels. For the former, Witherow et al. (2019) proposed an image processing pipeline for detecting ponding areas on inundated roadways by registering image pairs, and Jafari et al. (2021) leveraged object-based image analysis (OBIA) to identify ponding areas from images, introducing temporal smoothness to refine the segmentation.
Regarding the estimation of ponding levels, two kinds of computer vision techniques have been widely used. The first kind relies on the characteristics of the image itself, identifying ponding levels with image processing algorithms. For example, Yu and Hahn (2010) proposed a difference-image-based ponding level measurement using sparsely sampled images in the time domain, which references an indicator (an invariant feature in the image) to identify ponding levels. Sakaino (2016) developed a two-step histogram-based method to estimate ponding levels by supervising the region of interest (ROI) across consecutive frames. This kind of method requires high-resolution images in which clear water surface lines are visible, and the reliability of its estimates may be reduced in complex and noisy scenes.
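The difference-image idea can be illustrated with a minimal sketch: pixels that change strongly between a dry reference frame and the current frame, within the columns covering a fixed indicator, are treated as water, and the topmost changed row approximates the water surface line. The function name, threshold value, and column-window convention are illustrative assumptions, not the cited authors' implementation.

```python
import numpy as np

def waterline_row(frame_dry, frame_wet, col_range, thresh=30):
    """Locate the water surface row in a grayscale frame pair.

    frame_dry: reference frame without ponding (2-D uint8 array).
    frame_wet: current frame (same shape).
    col_range: (start, end) columns covering a fixed indicator in the scene.
    Returns the topmost row where a strong change occurs, or None.
    """
    diff = np.abs(frame_wet.astype(np.int16) - frame_dry.astype(np.int16))
    changed = diff[:, col_range[0]:col_range[1]] > thresh
    rows = np.where(changed.any(axis=1))[0]
    return int(rows.min()) if rows.size else None

# Synthetic example: water fills the indicator columns from row 60 down.
dry = np.zeros((100, 50), dtype=np.uint8)
wet = dry.copy()
wet[60:, 20:30] = 200
print(waterline_row(dry, wet, (15, 35)))  # → 60
```

Converting the detected row to a physical level then requires the known geometry of the indicator, which is why such methods depend on high-resolution imagery of a visible reference feature.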
The second kind leverages computer vision to detect reference objects in the image and identify ponding levels indirectly. For example, Jiang et al. (2019) employed a Convolutional Neural Network (CNN) to automatically detect a reference object (a traffic bucket) in flooded and non-flooded images, and the difference in bounding box heights between the two periods is used to estimate ponding levels. A similar approach was adopted by Alizadeh Kharazi and Behzadan (2021), who developed a tilt correction technique to correct the error caused by the inclination of reference objects. Some researchers have tried to estimate ponding levels based on dynamic reference objects that exist more widely in the scene, predicting graded ponding levels rather than specific values. For example, Feng et al. (2020) leveraged Mask R-CNN combined with body keypoint detection to divide ponding levels into five ranks, while Chaudhary et al. (2019) improved Mask R-CNN by adding a level layer to directly predict the ponding levels of five object classes.
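The bounding-box comparison underlying these reference-object methods reduces to a simple proportional calculation, sketched below under two stated assumptions: the camera pose is unchanged between the two frames, and the reference object's full physical height is known. The function name and parameters are illustrative, not the cited authors' code.

```python
def ponding_depth_from_boxes(real_height_m, dry_box_px, flood_box_px):
    """Estimate ponding depth from a partially submerged reference object.

    real_height_m: known physical height of the reference object
                   (e.g. a traffic bucket), in metres.
    dry_box_px:    bounding-box height in pixels in the non-flooded frame.
    flood_box_px:  bounding-box height in pixels in the flooded frame.

    With an unchanged camera view, the metres-per-pixel scale is
    real_height_m / dry_box_px, so the pixel shrinkage of the box maps
    directly to the submerged depth.
    """
    metres_per_px = real_height_m / dry_box_px
    submerged_px = dry_box_px - flood_box_px
    return max(0.0, submerged_px * metres_per_px)

# A 0.70 m bucket spanning 140 px when dry and 90 px when flooded:
print(ponding_depth_from_boxes(0.70, 140, 90))  # → 0.25 (metres)
```

Tilt of the reference object shortens the apparent box height and biases this estimate upward, which is the error the tilt correction technique of Alizadeh Kharazi and Behzadan (2021) addresses.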
Overall, current research primarily estimates either the ponding level at a single point or the ponding area and its boundaries in the image, but has not yet monitored the ponding distribution across the image scene during urban floods. Such a distribution could be obtained by taking advantage of dynamic reference objects that can, over time, cover the whole scene. Estimating the spatial-temporal ponding distribution may make the following contributions. Firstly, information assimilation can reduce the uncertainty caused by single-point information, improving the accuracy of predicted ponding levels in the scene. Secondly, analyzing the differences in ponding levels at different points gives a better understanding of where urban ponding is severe and how the flooding process develops, which is conducive to the arrangement of drainage facilities and to effective disaster prevention and mitigation.
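One simple way to turn scattered point estimates into a scene-wide distribution is spatial interpolation; the inverse-distance-weighting sketch below is only an illustration of that step, not the method proposed in this study, and all names and the power parameter are assumptions.

```python
import numpy as np

def idw_surface(points, values, grid_x, grid_y, power=2.0, eps=1e-6):
    """Interpolate scattered ponding-level estimates onto a grid
    using inverse-distance weighting (IDW).

    points: (n, 2) array of (x, y) scene coordinates where ponding
            levels were estimated from reference objects.
    values: (n,) array of estimated ponding levels at those points.
    Returns a (len(grid_y), len(grid_x)) ponding surface.
    """
    pts = np.asarray(points, dtype=float)
    vals = np.asarray(values, dtype=float)
    gx, gy = np.meshgrid(grid_x, grid_y)
    # Distance from every grid cell to every sample point (broadcast).
    d = np.sqrt((gx[..., None] - pts[:, 0]) ** 2 +
                (gy[..., None] - pts[:, 1]) ** 2)
    w = 1.0 / (d ** power + eps)          # closer samples weigh more
    return (w * vals).sum(axis=-1) / w.sum(axis=-1)

# Two point estimates, 0.1 m at x=0 and 0.3 m at x=10, on one grid row:
surface = idw_surface([(0, 0), (10, 0)], [0.1, 0.3], [0, 5, 10], [0])
```

The midpoint of the grid row lands exactly between the two samples, so equal weights yield their mean (0.2 m); in practice outlier detection would precede any such interpolation, as unreliable single-point estimates would otherwise distort the whole surface.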
This study aims to obtain the overall spatial-temporal urban ponding distribution from surveillance videos with the help of computer vision techniques. The proposed method makes full use of the movement characteristics of dynamic reference objects. The remainder of this paper is organized as follows: Section 2 introduces the study area and the data collected; Section 3 describes the proposed method, including ponding level identification, outlier detection, and ponding distribution determination; Section 4 presents the evaluation results of the trained object detection model and the estimated ponding distribution maps in two cases; Section 5 discusses the findings and limitations; and Section 6 concludes.