This literature review examines how computer vision (CV) and instance-based image analysis have impacted urban design, shifting the focus from traditional aesthetic approaches to human-centric urban spaces. It traces the evolution from basic data collection to advanced technologies like CV for understanding cities, blending technological objectivity with insights into human behavior. The review also addresses the social and ethical implications of these technological advances in shaping urban environments.
Urban design theories have evolved from an aesthetic focus to a deeper understanding of human interactions within urban spaces [1]. This shift, most notable in the mid-20th century, moved from prescriptive theories to empirical studies, driven by concerns over the alienating effects of modernist designs. It underscores the importance of human-centric approaches in urban planning, exemplified by Lynch's work on mental mapping and Whyte's studies of public space use, which continue to shape contemporary urban design through their emphasis on the vitality and use of space [2, 3].
Traditional data collection methods in urban studies, such as images, videos, and direct observation, were labor-intensive and difficult to scale. The advent of advanced sensing technologies and geotagged imagery enabled more efficient and extensive urban analysis [1]. Whyte's 1980 Street Life Project, which used time-lapse filming of pedestrian behavior, marked an early integration of technology and detailed observation in urban studies [4], revealing key behavioral trends.
Modern urban studies have evolved by incorporating hybrid sensing, big data, and AI, revolutionizing the analysis of physical and socioeconomic conditions and human dynamics [1]. The application of computer vision technologies in urban management represents a progression in employing image-based AI for data collection [4]. However, these objective approaches potentially overlook the complexities of human perceptions [5].
A balanced approach integrating objective and subjective measures is therefore necessary. Subjective measures, derived from interviews and surveys, offer deeper insights into human behavior by considering the cognitive mapping of environments [2]. However, traditional methods for collecting perception data often lack consistency and reliability, and they are time-consuming, expensive, and challenging to interpret [5].
By combining advanced technologies like CV with subjective insights, urban studies can comprehensively understand urban spaces, informing design decisions that cater to human needs and experiences while addressing social and ethical implications.
2.1 Technological Advancements in Urban Studies: Sensing, Big Data, and AI Integration
Street-level imagery has become vital in various research areas, including urban planning, public health, and real estate, due to its accessibility, coverage, and objective views [1]. Computer vision techniques, which convert images and videos into numerical matrices by analyzing Red, Green and Blue (RGB) pixel values, require extensive data to train models for accurately differentiating human actions from background noise [4]. Semantic image segmentation (SiS), as described by Csurka et al. [6], is crucial in computer vision, where each pixel in an image is assigned to a specific semantic class to understand different image parts.
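To make this concrete, the sketch below (using OpenCV and NumPy; the file name is a placeholder) shows an image represented as an RGB pixel matrix, alongside a mocked-up semantic segmentation map in which every pixel carries a class index:

```python
import cv2
import numpy as np

# Load an image: OpenCV returns a (height, width, 3) matrix of 8-bit
# intensities, in BGR channel order by default.
img = cv2.imread("street_scene.jpg")        # placeholder file name
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # reorder channels to RGB
print(rgb.shape, rgb.dtype)                 # e.g. (1024, 2048, 3) uint8

# A semantic segmentation map has the same spatial dimensions but a
# single channel: each pixel holds a class index (here 0 = road,
# 1 = building, 2 = sky, purely for illustration).
seg_map = np.zeros(rgb.shape[:2], dtype=np.uint8)
seg_map[: rgb.shape[0] // 3, :] = 2         # label the top third "sky"

# Per-class pixel shares are then a simple histogram over the map.
share = np.bincount(seg_map.ravel(), minlength=3) / seg_map.size
print(dict(zip(["road", "building", "sky"], share.round(3))))
```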
Implementing activity surveys with computer vision involves data collection and processing pipelines that handle distinct video and GPS data streams. Adhering to guidelines that ensure videos reflect human perception and remain consistent enough for accurate tracking is crucial [7]. Integrating these data streams enables activity data to be mapped accurately in physical space, enhancing understanding of human interactions in urban environments.
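As one concrete way to integrate the two streams, a timestamp can be derived for every video frame from the recording's start time and frame rate, and each frame then matched to the nearest GPS fix. The sketch below uses pandas; the file name, start time, frame rate, and duration are hypothetical:

```python
import pandas as pd

# Hypothetical inputs: a GPS log (timestamp, lat, lon) from a data
# logger, and a video whose start time and frame rate are known.
gps = pd.read_csv("gps_log.csv", parse_dates=["timestamp"])
video_start = pd.Timestamp("2024-03-01 10:00:00")
fps = 30.0
n_frames = 5400  # three minutes of video at 30 fps

# Derive a timestamp for every frame from start time and frame rate.
frames = pd.DataFrame({"frame": range(n_frames)})
frames["timestamp"] = video_start + pd.to_timedelta(frames["frame"] / fps, unit="s")

# Attach to each frame the nearest GPS fix in time; merge_asof requires
# both tables to be sorted by the key column.
located = pd.merge_asof(
    frames.sort_values("timestamp"),
    gps.sort_values("timestamp"),
    on="timestamp",
    direction="nearest",
)
```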
2.2 Evaluating Urban Environments: Semantic Segmentation and Computer Vision Methods
The "Urban Visual Intelligence" framework, elucidated by Zhang et al. [1], integrates AI with imagery to analyze urban environments, addressing physical and socioeconomic dimensions. This framework overcomes the limitations of traditional methods, providing a nuanced perspective that includes observing urban environments at a human scale, deriving semantic information from imagery, quantifying physical environments, and exploring their physical and socioeconomic interplay.
Challenging the assumption of linear relationships between the built environment and walking behavior, Liu et al. [8] introduced an alternate perspective. Their research suggests that intrinsic motivations and utilitarian travel needs may influence walking desires, indicating a potential saturation point in the built environment's influence on walking. Addressing the geographical gap in research, the study explores non-linear associations between street view-derived environmental characteristics and pedestrian walking duration in Amsterdam, focusing on identifying influencing features and understanding their variations across different times, including weekdays and weekends.
To support this analysis, Liu et al. [8] extracted street environmental features through semantic segmentation with a fully convolutional neural network (CNN), specifically an Xception-71 backbone pre-trained on the Cityscapes dataset of pixel-level annotated street scenes from 50 cities, which performed favorably compared to alternative CNN architectures. Yan et al. [9] complemented this approach with an urban perception evaluation framework that analyzes a large collection of old-city landscape street images, using semantic segmentation to categorize data by landscape spatial elements. The study quantifies elements such as building area, road area, green viewing rate, human and vehicle flow, and sky area, filtering out images with extreme proportions of certain elements to maintain accuracy. Their findings reveal an average greenness rate of 30.14% in the analyzed images.
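Metrics of this kind reduce to the share of pixels assigned to a given class. The sketch below computes a green viewing rate from segmentation maps and filters out extreme values before averaging; the vegetation class ID, the filtering thresholds, and the mock data are assumptions for illustration, not the cited studies' exact settings:

```python
import numpy as np

# Cityscapes-style train IDs; vegetation = 8 is assumed here and should
# be checked against the label map of the network actually used.
VEGETATION_ID = 8

def green_view_rate(seg_map: np.ndarray) -> float:
    """Share of pixels labelled as vegetation, in percent."""
    return 100.0 * float(np.mean(seg_map == VEGETATION_ID))

# Mock a batch of segmentation maps purely for illustration.
rng = np.random.default_rng(0)
segmentation_maps = [rng.integers(0, 19, size=(224, 320)) for _ in range(100)]

# Drop images with extreme proportions before averaging
# (illustrative bounds; the studies' exact cut-offs are not given here).
rates = np.array([green_view_rate(m) for m in segmentation_maps])
valid = rates[(rates > 0.5) & (rates < 90.0)]
print(f"average greenness rate: {valid.mean():.2f}%")
```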
Shifting the focus from social media imagery, Duarte & Ratti [10] emphasize the use of specialized urban cameras designed to collect visual data about cities. This shift away from the user-generated, geotagged photographs used in previous studies to assess urban attractiveness or aesthetic appeal represents a move towards more objective urban data collection [10]. Additionally, Lee et al. [11] highlight the use of computer vision technologies, such as semantic segmentation and edge detection, to evaluate urban design quality. Semantic segmentation assesses aspects like enclosure, openness, greenery, and the ratio of feature areas, while edge detection quantifies complexity. The resulting data are processed and interpreted using a machine learning model and the SHAP algorithm, contributing to a comprehensive understanding of urban design elements and their impact on urban life.
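One simple proxy for the "complexity" that Lee et al. [11] derive from edge detection is edge density, the fraction of pixels an edge detector marks. The sketch below uses the Canny operator with illustrative thresholds; the original study's exact operator and parameters may differ:

```python
import cv2
import numpy as np

def edge_density(image_path: str) -> float:
    """Fraction of pixels marked as edges, a simple stand-in for the
    visual complexity measure described above."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, threshold1=100, threshold2=200)
    return float(np.mean(edges > 0))
```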
2.3 Applications of Computer Vision Segmentation in Urban Design and Pedestrian Behavior Analysis
Computer vision segmentation techniques are invaluable for understanding and optimizing urban spaces for pedestrians, assessing factors that influence walkability and pedestrian behavior, among other aspects (see Fig. 1). Walking shapes urban experiences and community dynamics [11], and street environment qualities like green spaces, building layouts, and open areas directly affect the appeal of walking. Pedestrian satisfaction relates to perceptions of imageability, enclosure, human scale, openness, and complexity [12], all measurable through segmentation.
Segmentation enables virtual assessments of pedestrian volumes, a key walkability indicator [13]. Street View Imagery (SVI) combined with segmentation underpins indices such as the Visual Walkability Index [14], which evaluates aspects like crowdedness and obstacles and reveals walkability variations across locations. Urban design qualities such as area ratio, enclosure, openness, and complexity are extracted through semantic segmentation [11]: enclosure (the D:H ratio) and openness (the visible sky proportion) affect pedestrian comfort, while complexity enhances satisfaction. Greenery analysis, such as the Green View Index (GVI) derived from SVI [14], assesses urban greenery's pedestrian perspective, shading, and aesthetic value. Health studies use SVI to analyze links between physical activity, neighborhood greenery, sidewalk quality, and recreational facilities [14]. Urban perception research leverages SVI's human-centered street characterization to assess safety, wealth, and vibrancy, integrating surveys and audio data. In transportation, SVI enables virtual street audits, road analysis, pedestrian volume assessment [13], traffic indicators, and exploration of cycling patterns based on greenery and architecture [14]. Monitoring pedestrian and vehicle traffic informs planning decisions, and historical change detection through image comparisons supports urban heritage conservation.
However, using SVI raises ethical concerns regarding transparency, accuracy, and potential profiling [4], necessitating ongoing policy discussions for responsible and ethical implementation in urban studies.
2.4 Methods and Models of Segmentation
Computer vision is a crucial component of artificial intelligence. It enables the extraction of valuable insights from visual data, which is indispensable for urban analysis. This section discusses the design of our model, which leverages established computer vision models and algorithms. A pilot approach was adopted to rigorously evaluate the novel computer vision framework developed for this study, centering on an in-depth analysis of one student's navigational experience through the campus. This methodological choice was driven by the intent to closely monitor and adapt the analytical process in real-time, ensuring the comprehensive testing of our software's capabilities and the framework's analytical precision.
Advancements in hardware and research into convolutional neural networks (CNNs) have led to sophisticated deep-learning models for image analysis. Notable examples include VGGNet [15], the YOLO (You Only Look Once) object detection framework [16, 17], and the Fast R-CNN [18, 19] for object detection and classification.
Object detection involves identifying and localizing objects within an image. Classification assigns labels to entire images or objects, which is essential for interpreting visual content. Segmentation offers a more granular approach, dividing an image into meaningful regions by assigning labels to individual pixels and grouping them into semantically coherent clusters; this level of detail is crucial for applications like land cover delineation and augmented reality. Semantic segmentation classifies each pixel into predefined categories, while instance segmentation additionally differentiates between individual object instances.
Our model represents an advancement in computer vision applications for urban space analysis, blending the YOLO framework's object detection and segmentation capabilities with geospatial data integration. Using the Ultralytics library to run the YOLO segmentation model for prediction, the framework processes video frames to extract masks, calculate segmented object areas, and identify focal objects through bounding boxes.
For each frame, our algorithm calculates the total frame area (Area = height × width) and retrieves the label of each detected object (e.g., "car", "person"). It then finds the contours defining the object's shape within the mask, computes the total area they enclose using Green's theorem (summing the area of each contour), and derives the share of the frame the object covers: percentage of coverage = (object area / total area) × 100.
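A minimal sketch of this step, assuming the Ultralytics YOLO segmentation API and OpenCV, whose cv2.contourArea implements the Green's-theorem area computation; the checkpoint and frame file names are placeholders:

```python
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")          # placeholder segmentation checkpoint
frame = cv2.imread("frame_0001.jpg")    # placeholder video frame
result = model(frame)[0]

if result.masks is not None:
    for box, mask in zip(result.boxes, result.masks.data):
        label = result.names[int(box.cls)]      # e.g. "car", "person"
        binary = (mask.cpu().numpy() > 0.5).astype(np.uint8)
        # Mask tensors come at the model's working resolution, so measure
        # the total frame area in that same coordinate space.
        total_area = binary.shape[0] * binary.shape[1]
        # Recover the contours of the object's shape; cv2.contourArea
        # applies Green's theorem to the area each contour encloses.
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        object_area = sum(cv2.contourArea(c) for c in contours)
        coverage = 100.0 * object_area / total_area
        print(f"{label}: {coverage:.1f}% of frame")
```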
The proposed algorithm determines the object closest to the video frame's center by calculating the frame's center coordinates (center_x, center_y) and iterating over the detected bounding boxes. For each bounding box, it computes the Euclidean distance between the box's center and the frame's center, verifying that the frame's center falls within the bounding box for accuracy. If an object lies closer than the current minimum distance, the code updates that minimum and stores the object's label and frame number. This process enables the detailed extraction of object coverage percentages and the dynamic tracking of prominent objects over time, enriching the analysis of urban open spaces.
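The selection rule might look as follows; this is a sketch over an Ultralytics prediction object, with the containment check as described above (function and variable names are our own):

```python
import math

def closest_object_to_center(result, frame_shape):
    """Return the label of the detected object whose bounding-box centre
    lies nearest the frame centre; `result` is an Ultralytics prediction."""
    h, w = frame_shape[:2]
    center_x, center_y = w / 2.0, h / 2.0
    closest_label, closest_dist = None, float("inf")
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        bx, by = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        dist = math.hypot(bx - center_x, by - center_y)
        # Only consider boxes that actually contain the frame centre,
        # as a sanity check on "focus of attention".
        contains_center = x1 <= center_x <= x2 and y1 <= center_y <= y2
        if contains_center and dist < closest_dist:
            closest_dist = dist
            closest_label = result.names[int(box.cls)]
    return closest_label
```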
Integrating GPS location data from data loggers or mobile applications attaches a movement location to each video frame. This comprehensive extraction of visual elements, including classified objects, pedestrian and vehicle counts, and focus of attention, is compiled into a CSV file and integrated into a GIS platform for spatial analysis within urban studies. The model's utility lies in analyzing visual attention patterns within open urban spaces, employing a corpus of systematically documented videos to identify areas that capture visual attention during navigation.
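Assembled per frame, the output could be written out as follows; the column names and values here are illustrative, not the study's exact schema:

```python
import csv

# Hypothetical per-frame records combining the segmentation pass with
# the GPS alignment sketched earlier.
rows = [
    {"frame": 0, "lat": 52.3702, "lon": 4.8952,
     "focal_object": "person", "person_count": 4, "vehicle_count": 1,
     "sky_pct": 22.5, "building_pct": 41.0},
]

with open("urban_attention.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
# With lat/lon per frame, the CSV can be loaded into a GIS platform
# such as QGIS as a delimited-text point layer for spatial analysis.
```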
The proposed framework quantifies observable urban elements, such as the proportion of sky, landscape features, architectural structures, and human presence, offering insights into the visual structure and built environment quality. It drives code that assesses visual engagement and focus within urban spaces by capturing focal points and gaze duration. Integrating computer vision analysis with geospatial data enhances the understanding of visual attention dynamics; see Fig. 2 for the full Python model process. Producing quantified outcomes as geo-referenced CSV files enables visual attention data to be correlated with other spatial datasets. This multidimensional analysis provides a holistic understanding of sensory experiences in urban environments, advancing urban studies by supporting informed decision-making grounded in both the temporal dynamics and the geographic context of urban open spaces.
This model's design and functionality represent a significant application of computer vision to urban space analysis. It affords researchers and urban planners a powerful tool for understanding the intricate dynamics of urban open spaces, facilitating informed decision-making and advancing urban studies.
2.4.1 Model Methodology
For this research, we use the following dataset to detect and segment objects in open urban spaces. The dataset comprises 2,763 images across 16 distinct classes, with an average image resolution of 2048 × 1024 pixels. These images were collected from various sources, including ‘Cityscapes’ and other online repositories via ‘Roboflow’ datasets, to create a comprehensive dataset tailored to the research needs. Specific classes required consolidation and preprocessing from diverse sources to ensure representativeness for the study (see Figs. 3 and 4).
The dataset underwent preprocessing to address class imbalances and resizing requirements. Automatic orientation adjustments were applied, and images were resized to 320 × 224 pixels. Data augmentation techniques were employed to achieve class balance, including horizontal and vertical flips and slight rotations within ±5 degrees. The accompanying figures illustrate the augmentation and the model's perception of classes such as sky, road, and person.
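A sketch of such an augmentation pipeline, here expressed with the Albumentations library (the original preprocessing was performed in Roboflow, so these parameters are indicative only):

```python
import albumentations as A
import numpy as np

# Placeholder image and label mask standing in for a real sample.
image = np.zeros((1024, 2048, 3), dtype=np.uint8)
mask = np.zeros((1024, 2048), dtype=np.uint8)

# Resize plus the flips and slight rotations described above; applying
# the same transform to image and mask keeps labels aligned.
transform = A.Compose([
    A.Resize(height=224, width=320),   # the 320 x 224 resize step
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Rotate(limit=5, p=0.5),          # rotations within +/- 5 degrees
])
augmented = transform(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]
```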
The YOLOv8 segmentation model was fine-tuned using a custom dataset of 5,000 images across five distinct classes, divided into training (80%), validation (10%), and testing (10%) sets. Training ran for 100 epochs with images resized to 640 × 640 pixels, a batch size of 16, and the AdamW optimizer. The model achieved a mean Average Precision (mAP@0.5) of 43.8 on the held-out test set. Class-specific mAP analysis demonstrated the model's segmentation proficiency across various object categories.
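For reference, this training configuration can be reproduced with the Ultralytics API roughly as follows; "urban.yaml" is a hypothetical dataset descriptor pointing at the 80/10/10 splits, and the base checkpoint is an assumption:

```python
from ultralytics import YOLO

# Start from a pretrained YOLOv8 segmentation checkpoint (assumed size).
model = YOLO("yolov8s-seg.pt")
model.train(
    data="urban.yaml",   # hypothetical dataset descriptor
    epochs=100,
    imgsz=640,
    batch=16,
    optimizer="AdamW",
)
# Evaluate on the held-out test split; the returned metrics object
# includes mAP@0.5 for the segmentation masks.
metrics = model.val(split="test")
```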
Comparative assessment against state-of-the-art approaches emphasized the proposed method's competitiveness. Qualitative inspection through visual examples showcased the model's accuracy in segmenting objects across diverse scenarios. Analysis of the loss curve during training, see Fig. 5, provided insights into convergence behavior, with no significant signs of overfitting or underfitting observed. The results contextualized the achieved mAP score, outlining the strengths and limitations of the YOLO model for segmentation tasks and suggesting avenues for future research and enhancement, see Figs. 5 and 6.