Data-Driven City: An Innovative Approach to Urban Area Delineation

doi:10.21203/rs.3.rs-4642145/v1

This work is licensed under a CC BY 4.0 License

Version 1

This study introduces a data-driven, bottom-up approach to urban delineation, integrating feature engineering with the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, marking a significant shift from traditional methodologies reliant on simplistic OpenStreetMap (OSM) road node data aggregations. By employing a broad array of OSM categories and refining data selection through feature engineering, our research significantly enhances the precision and relevance of urban clustering. Using Bavaria, Germany, as a case study, we demonstrate that feature engineering effectively reduces noise and mitigates common DBSCAN clustering pitfalls by filtering out irrelevant and autocorrelated data. The method's robustness is validated through a comprehensive assessment involving accuracy metrics, optimal clustering selections based on entropy values, and empirical and theoretical confirmations using nighttime light data and Zipf’s Law, respectively. This study contributes to urban studies by providing a scalable, replicable model that incorporates advanced data processing techniques and multidimensional data sources, supporting improved urban planning and policy-making while effectively delineating urban boundaries in varied settings.

Earth and environmental sciences/Environmental social sciences/Sustainability

Earth and environmental sciences/Environmental social sciences/Socioeconomic scenarios

OpenStreetMap

Feature Engineering (FE)

DBSCAN

Data-Driven City

In the wake of rapid global urbanization, our understanding of urban spaces has evolved significantly. We are currently experiencing a transformation that is reshaping cities across the world, with predictions indicating that by 2050, the global urban population will surge by 2.5 billion, increasing the urbanization rate to 68% (United Nations, 2018). This unprecedented expansion underscores the critical need for dynamic methodologies in urban planning and analysis to address complex urban phenomena ranging from urban ecology to climate impacts (Grimmond, 2007; Svirejeva-Hopkins et al., 2004).

Traditional approaches to urban area delineation, typically based on administrative boundaries or physical metrics such as population density, are increasingly inadequate. These methods fail to capture the timely and heterogeneous nature of urban expansion, leading to discrepancies in urban planning and policy-making (Eaton and Eckstein, 1997; Dobkins and Ioannides, 2001). As urban areas continue to evolve, the disparity in regional standards further complicates the objective analysis of urbanization levels (Cohen, 2004; Satterthwaite, 2010). In response, there has been a significant shift toward data-driven approaches in urban studies. These methods leverage diverse data sources to capture dynamic urban transformations more effectively, addressing the need for adaptable and timely urban management solutions.

OpenStreetMap (OSM) represents a transformative development in urban data sources. As a freely available and community-updated platform, OSM offers an unprecedented level of access to comprehensive, real-time geographic data. Its global coverage and up-to-date nature give it a clear advantage over traditional data sources, providing detailed insights into urban dynamics without the constraints of cost or outdated information (Haklay and Weber, 2008). Despite these advantages, direct use of raw OSM data presents challenges, primarily due to issues of data quality and the lack of refined information necessary for detailed urban analysis (Mooney and Minghini, 2017). While previous studies have used OSM for various urban analyses, there remains a significant gap in methodologies that effectively harness OSM data through advanced analytical techniques. There is a particular need for comprehensive approaches that integrate feature engineering (FE) and clustering techniques, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), to enhance the precision and applicability of OSM data in urban studies (Graser et al., 2014; Neis and Zielstra, 2014).

The basis of this study lies in the application of feature engineering and DBSCAN clustering to enhance the use of OpenStreetMap (OSM) data for urban boundary delineation, aiming to develop a robust model that more accurately reflects the complexities of urban environments. Feature engineering transforms raw OSM data into a format that highlights essential urban characteristics, making the data more actionable for urban studies. DBSCAN, which does not require a predefined number of clusters, then allows for a more natural and accurate representation of urban structures (Brinkhoff, 2016; Ye et al., 2019).

This paper aims to develop a methodological approach to the data-driven city that leverages OpenStreetMap (OSM) data to redefine the urban area. By filtering and clustering relevant OSM data, this study addresses critical gaps in urban studies, particularly in how urban areas are delineated and understood in the context of rapid urbanization and data proliferation. This research makes four primary contributions: (1) It employs a multisource big data approach to provide a comprehensive definition of urban areas. (2) From the extensive pool of POI data available in OSM, it applies feature engineering techniques to selectively extract the data points that are most indicative of urban characteristics. (3) It uses clustering techniques that are not limited by prior knowledge, such as DBSCAN, which allows for a more organic understanding of urban data distributions. (4) To ensure the robustness and scientific reliability of the methodology, it validates its findings using other forms of big data.

This paper is organized as follows: Section 2 reviews the relevant literature. Section 3 describes the data and methods employed in this study. Section 4 details an experiment conducted in Bavaria State, Germany, to test a data-driven city model. Finally, Sections 5 and 6 discuss the results and provide a conclusion of the study's findings.

2.1 Varying perspectives of delineating urban area

This section explores the diverse methodologies used to define and identify urban areas, highlighting the evolution from traditional methods to contemporary, data-driven approaches. Traditionally, urban areas have been delineated by administrative boundaries, which are legally defined and provide clear parameters for urban classifications (Ma, 2005; Qin and Zhang, 2014). These boundaries, while providing legal clarity, often do not reflect the dynamic nature of urban growth and expansion, which can extend beyond these predefined limits. The reasons for defining urban areas in this way are not always transparent, leading to differences in urban area delineations that are not comparable across different countries (Qin and Zhang, 2014). Furthermore, administrative boundaries often lag behind the current state of urban development, making it challenging for urban planners and policymakers to accurately track and respond to rapid urbanization (Feng and Wang, 2022).

Transitioning to more nuanced methodologies, modern approaches incorporate a variety of metrics such as population density, economic activities, and land use patterns. These data-driven models offer a flexible framework that adapts to the changing realities of urbanization, capturing the complexities of urban sprawl more effectively (Cockx et al., 2018; Fox et al., 2018). For example, urban population metrics consider the density and distribution of populations to define urban areas, providing a dynamic parameter that adjusts to demographic changes over time (Parr, 2007; Seto et al., 2013). Further, economic activities such as retail distribution and employment density can serve as indicators of urban vitality, while land use patterns, observing changes in land coverage and utilization, reflect urban expansion and the conversion of rural areas into urban settings (Satterthwaite, 2010; Potts, 2018). Despite the advantages of data-driven models, choosing appropriate thresholds for metrics like population density or economic activity can significantly influence urban delineations, sometimes leading to inconsistencies across different studies or geographic contexts.

Advancements in technology and the availability of large datasets have significantly transformed the delineation of urban areas. Geographic Information Systems (GIS) and remote sensing are pivotal in this transformation, enabling detailed mapping and analysis of urban growth and land use changes (Liu, 2021; Shi et al., 2015). The advent of big data analytics complements these technological advances, providing deep insights into urban dynamics through extensive demographic, economic, and environmental data, facilitating more sophisticated urban modeling and decision-making processes (Zhang et al., 2020). Urban morphology analysis through remote sensing technologies, such as the use of nighttime lighting images or high-resolution satellite data like MODIS and Landsat, has been widely adopted to map urban areas (Jun et al., 2021; Hu and Zhang, 2013). These methods, despite their realism and large-scale applicability, often struggle with issues like cloud cover and varying sensor resolutions, which can obscure the true extent of urban areas (Frate et al., 2004; Zhang and Seto, 2013). Remote sensing, while providing broad coverage and objective data, faces challenges such as the need for frequent calibration and the risk of misclassification, especially in regions where urban and rural features intermingle (Parekh et al., 2021; Sinha et al., 2016).

An effective method for urban delineation should involve leveraging a comprehensive collection of urban-related big data sources. This integration facilitates the construction and formation of data-driven cities by combining the clear legal authority of administrative definitions with the adaptability and precision of data-driven models. This approach enables a more thorough understanding of urban dynamics, blending historical city contexts with their modern growth patterns, thus supporting urban planners and policymakers in efficiently managing rapid urbanization (Feng and Wang, 2022). This strategic use of diverse data sources enhances the ability to monitor and respond to the evolving landscape of urban environments, ensuring that urban planning and policies are grounded in both current and predictive urban realities.

2.2 Identifying urban area through OSM data, Feature Engineering, and DBSCAN

Data-driven approaches to urban boundary delineation are transforming the way we understand and map urban spaces, adapting to their evolving nature. These methodologies enhance traditional techniques by leveraging diverse data sources, extracting pertinent urban-specific data, and utilizing advanced clustering methods that do not depend on predefined assumptions about urban areas. Reflecting this shift, the UN-Habitat document (2022a) asserts that 'metropolises are not defined neither by their population, territorial extension nor by the number of their local jurisdictions, but by their functional geography' (p. 3). This statement underscores the significance of methodologies that enable a dynamic and flexible understanding of urban areas, effectively adapting to their evolving characteristics (Neis & Zipf, 2012).

Urban geography has traditionally relied on data from government or proprietary sources, which can be restrictive due to cost, update frequency, and access limitations. OpenStreetMap (OSM), as a freely available and continuously updated repository of global data, presents a significant shift in data sourcing for urban analysis (Kunze, 2015; Chen, 2016; Zhai, 2019; Xu and Gao, 2016; Xue et al., 2020). OSM's comprehensive geographic details offer an alternative that not only enhances coverage but also includes user-generated updates that capture changes in real-time (Yu et al. 2013). However, challenges with data quality, which can vary widely in accuracy and detail due to its crowd-sourced nature, necessitate sophisticated validation techniques to ensure the reliability of OSM data for critical urban planning and analysis (Mooney & Minghini, 2017).

Feature engineering becomes a critical next step in enhancing the use of OpenStreetMap (OSM) data for urban analysis. In traditional urban studies, direct use of raw OSM data often fails to capture the nuanced dynamics of urban environments. Feature engineering addresses this by transforming raw data into a refined format that emphasizes the most informative aspects of urban areas (Ribeiro et al. 2016, Fan et al. 2019). The process of feature engineering is typically divided into three key stages: feature selection, feature extraction, and feature construction (Wei et al. 2021). Feature selection focuses on dimensionality reduction, aiming to minimize the number of data features while preserving essential information. This step is crucial for simplifying the data analysis process without losing significant insights. Feature extraction transforms a set of features into physically or statistically significant indicators, further refining the data for precise analytical tasks. Lastly, feature construction involves creating new features from existing data through various techniques, including combining different attributes to form new, meaningful indicators. By selecting and engineering features that reflect key urban characteristics, analysts can develop models that more accurately represent urban complexities. This methodological shift is crucial for overcoming the limitations of direct data analysis, which may overlook subtle but critical urban patterns due to noise and irrelevant information (Waring et al. 2020).

Building on the foundational work of feature engineering, the transition to clustering methodologies represents a natural progression in identifying urban areas. Traditional clustering techniques often hinge on predetermined assumptions about data structures, such as the number of clusters, which may not accurately reflect the intricate and varied patterns found within urban data. In contrast, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) offers a robust alternative by clustering data based on local density rather than preset parameters, thereby aligning more closely with the actual spatial distribution of urban features (Ester et al., 1996; Khan et al., 2014; Li et al., 2020). This approach is particularly effective in capturing the heterogeneous nature of urban patterns that vary significantly in density and scale, thus providing a more accurate depiction of urban environments.

By synthesizing OSM data, feature engineering, and DBSCAN clustering, this approach provides a robust framework for data-driven urban area delineation. This integration allows for a comprehensive and nuanced analysis of urban dynamics, supporting urban planners and policymakers with tools that are adaptable to the rapid changes characteristic of modern urban environments. The application of these methodologies addresses critical gaps in traditional urban analysis and offers a scalable and replicable system for defining urban boundaries and identifying urban areas.

3.1 Data collection

This research integrates four datasets to support a comprehensive urban area analysis: (a) OpenStreetMap (OSM) data, (b) Sentinel-2 10m Land Use/Land Cover data, (c) nighttime light data, and (d) LandScan data, each contributing uniquely to the urban clustering and validation process.

(a) The OSM data was utilized primarily for its rich categorization of Points of Interest (POIs), which are essential for urban clustering. This dataset, sourced from Geofabrik (https://download.geofabrik.de/) for the year 2021, comprises elements such as nodes, ways, relations, and areas. The study focused on parsing and classifying over 20 million POI entries across 29 predefined categories, as detailed in Table 1, using Python 3.7.0, with further geographical projection into EPSG:25832 - ETRS89 / UTM zone 32N for precise spatial analysis.

(b) The Sentinel-2 10m Land Use/Land Cover data, retrieved from https://www.arcgis.com/home/item.html?id=fc92d38533d440078f17678ebc20e8e2, offers detailed land cover types crucial for distinguishing urban from non-urban areas. This 2021 dataset, developed by interpreting ESA Sentinel-2 imagery, categorizes land into nine types: water, trees, flooded vegetation, crops, built areas, bare ground, snow/ice, clouds, and rangeland. The built areas are utilized as ground truth for urban extent in our feature engineering and entropy-based cluster selection, aiding in more accurately delineating urban boundaries.

(c) Nighttime light data, crucial for validating the accuracy and relevance of our clustering results, was sourced from the Suomi National Polar-orbiting Partnership (Suomi NPP) satellite's Visible Infrared Imaging Radiometer Suite (VIIRS). This dataset, from June 2021 and available at https://eogdata.mines.edu/products/vnl/, captures artificial lighting indicative of human activities. Extensive processing, including data correction, resampling, and cropping, was conducted to ensure alignment with the urban clusters identified in our study.

(d) LandScan data, critical for analyzing population distribution within the urban clusters identified in our study, was sourced from the Oak Ridge National Laboratory (https://landscan.ornl.gov/). This high-resolution dataset furnishes population density estimates per square kilometer, which are instrumental in verifying how demographic distributions align with the spatial patterns determined through our clustering approach, thereby substantiating the urban delineations.

Table 1. Classification and Description of Points of Interest (POI) Types in OpenStreetMap

OSM Type Name	Description of POI Types
aerialway	station, pylon, cabin
aeroway	aerodrome, runway, terminal, apron
amenity	arts center, atm, bank, bar, bench, bicycle parking, bicycle rental, fountain, sustenance, education, transportation, financial, healthcare, public service, facilities, waste management…
barrier	bollard, gate, block, linear barriers, access control on highways…
boundary	boundary
building	apartments, building, hotel, house, accommodation, commercial, religious, civic/amenity, agricultural/plant production, sports, storage, cars, power/technical buildings, other buildings, additional attributes…
craft	beekeeper, blacksmith, boatbuilder, brewery, carpenter, clockmaker, electronics_repair, embroiderer, goldsmith, handicraft, hvac, jeweller, locksmith, painter, photographer, plumber, pottery, roofer, shoemaker, stonemason, tailor, tiler, watchmaker, winery, wickerwork
emergency	fire hydrant, defibrillator, ambulance station, emergency_ward_entrance, medical rescue, firefighters, lifeguards, assembly point, other structures…
geological	outcrop, glacier, palaeontological_site, volcano, geothermal, geological_fault
healthcare	alternative, birthing_center, blood_donation, clinic, dentist, doctor, laboratory, midwife, optometrist, pharmacy, physiotherapist, rehabilitation, sample_collection, speech_therapist, vaccination_centre
highway	bus stop, crossing, motorway junction, roads, link roads, special road types, paths, sidewalk/crosswalk, cycleway, lifecycle, attributes, other highway features…
historic	memorial, monument…
landuse	commercial, industrial, farmland, forest, meadow, developed land, rural and agricultural land, waterbody…
leisure	park, picnic table, playground, swimming pool…
man_made	antenna, flagpole, monitoring station, tower…
military	airfield, barracks, bunker, checkpoint, danger_area, naval_base, range, training_area
natural	peak, tree, grassland, vegetation, water-related, geology-related…
office	accountant, company, government, adoption_agency…
place	administratively declared places; populated settlements, urban; populated settlements, urban and rural; other places; additional attributes
power	cable, catenary_mast, compensator, converter, generator, heliostat, insulator, line, minor_line, plant, pole, portal, substation, switch, tower, transformer, terminal
public_transport	platform, station, stop_area, stop_position…
railway	station, subway entrance, ventilation shaft, tracks, additional features, stations and stops, other railways…
route	bicycle, bus, canoe, detour, ferry, foot, hiking, horse, light_rail, mtb, pipeline, piste, power, railway, road, running, ski, train, tram
shop	alcohol, antiques, art, books, clothes, convenience, hairdresser, food, mall, charity, health and beauty, do-it-yourself, furniture and interior, electronics, outdoors and sport, stationery
sport	gym, yoga…
telecom	data_center, distribution_point, exchange, manhole, pole, service_device, street_cabinet
tourism	artwork, gallery, hotel, museum…
water	lake, pond, reservoir, river, stream, waterfall, well
waterway	basin, dock, lake, lagoon, oxbow, pond, reservoir, river, riverbank, stream, tidal_channel, waterfall, wetland
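
The classification of OSM elements into the 29 types of Table 1 can be sketched as a lookup on tag keys. The following is a minimal illustration, not the parsing code used in the study: it assumes each element is available as a dictionary of OSM tags, writes the type names as OSM keys (for example man_made and public_transport), and uses made-up sample elements. The names CATEGORY_KEYS and categories_of are hypothetical.

```python
# Hypothetical sketch: assign OSM elements to Table 1 types by tag key.
# The 29 keys follow Table 1; the sample elements below are made up.
CATEGORY_KEYS = {
    "aerialway", "aeroway", "amenity", "barrier", "boundary", "building",
    "craft", "emergency", "geological", "healthcare", "highway", "historic",
    "landuse", "leisure", "man_made", "military", "natural", "office",
    "place", "power", "public_transport", "railway", "route", "shop",
    "sport", "telecom", "tourism", "water", "waterway",
}


def categories_of(tags):
    """Return the sorted Table 1 types whose key appears in an element's tags."""
    return sorted(key for key in tags if key in CATEGORY_KEYS)


elements = [
    {"amenity": "pharmacy", "healthcare": "pharmacy", "name": "Example"},
    {"building": "house"},
    {"name": "feature without a Table 1 key"},
]
classified = [categories_of(tags) for tags in elements]
```

An element can carry several of these keys at once, so this sketch returns a list per element; whether the study counted such elements once or once per type is not stated.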

3.2 Feature Engineering for Data Processing

Feature engineering is the application of domain knowledge to transform raw data into meaningful features that enhance model performance, reduce complexity, and improve computational speed and accuracy. This process includes identifying the most relevant features, improving data quality through transformation, reducing dimensionality, and creating new features. In the context of this study, feature engineering is crucial for refining the dataset derived from OpenStreetMap (OSM), which includes 29 categories of Points of Interest (POI). Many of these POI categories are not pertinent to urban areas and could adversely affect clustering outcomes. Therefore, feature engineering is employed to select the POI categories most relevant to urban areas.

Feature selection is a pivotal step in feature engineering, involving techniques that rank variables and set thresholds to exclude less significant features (Chandrashekar and Sahin, 2014). Commonly used methods for feature selection include Pearson correlation coefficient, chi-square test, and Mutual Information (MI).

For this study, Mutual Information is chosen for its ability to measure the amount of information shared between a feature and the target variable, thereby identifying the features most representative of urban areas. Unlike the Pearson correlation, which only captures linear relationships, and the chi-square test, which is limited to categorical data, Mutual Information captures both linear and non-linear relationships between variables. This makes MI a robust method for feature selection in diverse and complex datasets. The formula for Mutual Information is as follows:

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y) \log \frac{P(x,y)}{P(x)\,P(y)}$$

where P(x,y) is the joint probability mass function of X and Y, and P(x) and P(y) are the marginal probability mass functions of X and Y (Kraskov et al. 2004).
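
As a worked illustration of this definition, the sketch below estimates Mutual Information for discrete variables by replacing P(x,y), P(x), and P(y) with empirical frequencies. It is a minimal plug-in estimator written for clarity, not the estimator used in the study; the function name mutual_information and the toy vectors are assumptions made for the example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X;Y) in nats from paired discrete samples (plug-in estimate)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Toy data (made up): 1 = grid cell is built-up, 0 = not built-up.
urban = [1, 1, 1, 1, 0, 0, 0, 0]
has_shop = [1, 1, 1, 1, 0, 0, 0, 0]   # matches the label exactly
has_tree = [1, 0, 1, 0, 1, 0, 1, 0]   # statistically independent of the label

mi_shop = mutual_information(has_shop, urban)   # equals log(2)
mi_tree = mutual_information(has_tree, urban)   # equals 0
```

A feature that determines the label reaches the maximum possible value for a balanced binary target, log 2, while an independent feature scores zero, which is why ranking features by MI separates informative POI categories from uninformative ones.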

3.3 DBSCAN Clustering for Defining Data-Driven City Boundaries

Density-based clustering methods, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), excel at identifying clusters by analyzing the density relationships among data points. Unlike other clustering algorithms like K-Means or Hierarchical Clustering, which may struggle with arbitrary-shaped clusters or require a predefined number of clusters, DBSCAN is well-suited for handling complex and irregularly spaced data without predefined structures. This makes it ideal for generating city boundaries from OpenStreetMap (OSM) data, where the number and shape of clusters cannot be predetermined, and noise within the data is a significant consideration.

DBSCAN, developed by Ester et al. in 1996, operates using two primary parameters: ϵ (the specified radius) and MinPts (minimum points). Understanding these concepts is essential for grasping how DBSCAN works. A core point is one that has at least MinPts points within its ϵ-neighborhood (in the original formulation the point itself is counted), satisfying the minimum density requirement for a cluster. Border points are those that lie within the ϵ-neighborhood of a core point but do not themselves have enough neighbors to be core points. Noise points are neither core points nor within the ϵ-neighborhood of any core point, and hence do not belong to any cluster. The concept of density-reachability is crucial in DBSCAN. A point p is density-reachable from a point q if there exists a chain of points starting at q and ending at p in which each point lies within distance ϵ of its predecessor and every point in the chain, except possibly p itself, is a core point.

As illustrated in Figure 1, the original data points are depicted in Figure 1(a). Figure 1(b) shows the number of points within a circle centered at point A with a radius of ϵ. Figure 1(c) demonstrates the process of iterating over all points and aggregating all points that are density-reachable from this central point into a single cluster. Within this illustration, the red points, such as point A, represent core points. The blue points, such as points B and C, are border points, while point N is identified as a noise point. The parameters ϵ and MinPts therefore determine the outcome of the DBSCAN clustering algorithm. Because both can be tuned, DBSCAN can be adapted to delineate city boundaries.

The DBSCAN algorithm can be summarized in a series of steps. Initially, all points are marked as unvisited. For each unvisited point, the algorithm checks whether it has at least MinPts neighbors within the ϵ radius. If not, the point is provisionally labeled as noise. If so, it is a core point and starts a new cluster: it is marked as visited, and all points in its ϵ-neighborhood are added to a candidate set for cluster expansion. Every point in this set is added to the cluster, including points previously labeled as noise, which become border points. If a candidate also has at least MinPts neighbors, it is itself a core point, and its neighbors are added to the candidate set. This process continues until there are no more points in the candidate set. The algorithm then proceeds to the next unvisited point and repeats the process until all points have been visited and classified either as part of a cluster or as noise.
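
The steps above can be sketched in a few lines of plain Python. This is a minimal, quadratic-time illustration of DBSCAN on 2-D points, not the implementation used in the study; it counts a point as a member of its own ϵ-neighborhood, and the sample coordinates and parameter values are made up.

```python
import math

NOISE = -1
UNVISITED = None


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # provisional: may become a border point
            continue
        cluster_id += 1                  # points[i] is a core point: new cluster
        labels[i] = cluster_id
        seeds = list(neighbors)          # candidate set for expansion
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point: joins, does not expand
                continue
            if labels[j] is not UNVISITED:
                continue                 # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Made-up points: two dense groups, one border point, one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1.0),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In this toy run the point (2.2, 1.0) has only two points in its neighborhood, so it is not a core point, but it lies within ϵ of the core point (1, 1) and is therefore absorbed as a border point; the isolated point (5, 5) remains noise.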

Generating city boundaries using OSM data with DBSCAN involves creating a grid overlay on the satellite imagery, typically in 10x10 meter segments. The POI data is then joined to the grid, counting the occurrences of each category within the grid cells. Feature selection is applied to identify the top 15% of features most indicative of urban areas from the 29 POI categories. This step filters out the categories that do not contribute significantly to defining urban boundaries. The DBSCAN algorithm is then run on the selected features, with the parameters ϵ and MinPts tuned to optimize cluster detection, considering the local density of POIs. The resulting clusters are analyzed to define the urban boundaries. Core points indicate high-density urban areas, while border points help outline the periphery. Noise points are disregarded as they do not contribute to meaningful urban areas.
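
The grid join described above can be illustrated with integer binning of projected coordinates. The sketch below assumes POIs are already in a metric projection such as EPSG:25832 and are given as (x, y, category) tuples; the coordinates are made up and shortened, and cell_of and count_by_cell are hypothetical helper names rather than functions from the study.

```python
from collections import Counter, defaultdict

CELL_SIZE = 10.0  # metres, matching the 10x10 m grid described in the text


def cell_of(x, y, cell_size=CELL_SIZE):
    """Map projected coordinates (metres) to an integer (column, row) index."""
    return (int(x // cell_size), int(y // cell_size))


def count_by_cell(pois):
    """pois: iterable of (x, y, category). Returns {cell: Counter(category -> n)}."""
    counts = defaultdict(Counter)
    for x, y, category in pois:
        counts[cell_of(x, y)][category] += 1
    return counts


# Made-up POIs; real coordinates in EPSG:25832 would be much larger numbers.
pois = [
    (3.0, 4.0, "building"),
    (8.5, 1.0, "shop"),
    (9.9, 9.9, "building"),
    (12.0, 4.0, "amenity"),
    (25.0, 31.0, "power"),
]
grid = count_by_cell(pois)
```

Each grid cell then carries one count per category, which is the per-cell feature vector that feature selection and clustering operate on.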

By leveraging DBSCAN's strengths in handling noise and detecting clusters of arbitrary shapes, this approach provides a data-driven solution to define city boundaries accurately, facilitating urban planning and analysis based on OSM data. As illustrated in the figures, DBSCAN effectively identifies clusters of urban areas by analyzing the density of relevant POIs, such as buildings, amenities, shops, power infrastructure, and emergency services. This method ensures a robust delineation of city boundaries, accounting for the complexity and variability inherent in urban data.

3.4 Validating Urban Clustering: Nighttime Light Data and Zipf's Law

In this study, we employ two distinct methodological approaches—nighttime light data and Zipf’s law—to perform a multifaceted validation of our clustering results. These methods are integral to enhancing the reliability and applicability of our findings within the domain of urban studies. The utilization of nighttime light data, which reflects socio-economic activities through the lens of artificial lighting visible from space, combined with the theoretical underpinnings of Zipf’s law that describes city size distribution, provides a comprehensive framework for assessing the accuracy of our clustering methodology.

3.4.1 Evaluation Using Nighttime Light Data

First, we employ nighttime light data to evaluate the robustness and scientific validity of our proposed clustering methodology. Nighttime light imagery, indicative of human activities and urban development, serves as a robust indicator for socio-economic dynamics, reflecting the intensity of artificial lighting observed via satellite (Mahtta et al., 2019). The procedure begins with the acquisition of nighttime light datasets from NASA's Earth Data portal (https://www.earthdata.nasa.gov/), followed by meticulous geographic calibration specific to Germany's contours. We establish a threshold for nighttime light intensity at 3.5 (Cao et al., 2023), which facilitates the demarcation of urban and non-urban regions. This delineation allows for an empirical comparison between the urban clusters derived from our clustering algorithm and the illuminated areas identified via satellite imagery, thus providing a quantitative measure for validating our clustering results.
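
One way to turn this comparison into numbers is to threshold the nighttime light grid at 3.5 and compare it cell by cell with the cluster mask. The sketch below is illustrative only: it assumes both layers have been resampled to the same grid and flattened to lists, treats values strictly above the threshold as lit, and reports overall agreement and intersection over union, which are metrics chosen for this example rather than ones named in the text.

```python
NTL_THRESHOLD = 3.5  # threshold stated in the text


def agreement(cluster_mask, radiance, threshold=NTL_THRESHOLD):
    """Compare a 0/1 cluster mask with a thresholded nighttime light grid."""
    lit = [1 if r > threshold else 0 for r in radiance]
    pairs = list(zip(cluster_mask, lit))
    both = sum(1 for c, l in pairs if c == 1 and l == 1)
    either = sum(1 for c, l in pairs if c == 1 or l == 1)
    same = sum(1 for c, l in pairs if c == l)
    return {
        "overall_agreement": same / len(pairs),
        "iou": both / either if either else 1.0,
    }


# Made-up values for six aligned grid cells.
cluster_mask = [1, 1, 1, 0, 0, 0]
radiance = [12.0, 5.1, 2.0, 0.3, 4.0, 0.0]
result = agreement(cluster_mask, radiance)
```

Here four of the six cells agree, and two of the four cells flagged by either layer are flagged by both, giving an overall agreement of 4/6 and an intersection over union of 0.5.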

3.4.2 Application of Zipf’s Law to Urban Studies

Furthermore, we apply Zipf’s law as a theoretical framework for validating the urban clusters identified through our study. It posits that the population size of a city is inversely proportional to its rank in the urban hierarchy, so that the largest city is approximately twice as large as the second largest city, three times as large as the third, and so on (Zipf, 1949). This statistical relationship can be leveraged to predict expected urban area sizes and compare them with the clusters identified through our analysis. By comparing the empirical data with the theoretical expectations, we can rigorously evaluate whether our clustering method accurately reflects the underlying urban structure. This comparison not only substantiates the reliability of our clustering outcomes but also aligns our findings with established economic models of urban distribution.
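
A common way to check this rank-size relationship is to regress the logarithm of size on the logarithm of rank: under Zipf's law the slope is close to -1. The sketch below is a minimal ordinary least squares fit on synthetic sizes that follow the law exactly; it is an assumption about how such a check could be run, not the procedure reported in the study, and zipf_exponent is a hypothetical name.

```python
import math


def zipf_exponent(sizes):
    """Least-squares slope of log(size) against log(rank), ranks starting at 1."""
    ordered = sorted(sizes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(size) for size in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic sizes following size = 1,000,000 / rank; not data from the study.
sizes = [1_000_000 / rank for rank in range(1, 11)]
slope = zipf_exponent(sizes)
```

For cluster populations derived from real data, the closer the fitted slope is to -1, the closer the delineated clusters follow the rank-size rule.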

4.1 Background

4.1.1 Study Area

This study is centered on Bavaria (Fig. 2), located in southern Germany, recognized for its rich geographical diversity, strong economic performance, varied administrative frameworks, and major urban centers. These elements collectively provide valuable insights into urban planning and development, relevant both within the local context and more broadly, potentially influencing policies and practices in similar regions worldwide.

Geographically, Bavaria spans approximately 70,550 square kilometers, characterized by a varied landscape that ranges from the Alpine mountains in the south to expansive plains in the north. This environmental diversity establishes a solid foundation for investigating the effects of ecological and topographical factors on urban development.

Economically, Bavaria is a significant contributor to Germany's GDP, accounting for about 18.5% of it, with robust industries and a gross regional product of €768.5 billion in 2023. The region's economic vitality is evident in the bustling activities across both its urban and rural areas, impacting urban expansion and infrastructure development.

Administratively, Bavaria comprises seven regions, 25 municipalities, and 71 counties, each with distinct governance structures and urban planning challenges. This administrative diversity provides a unique lens through which to examine the effects of various governance models on urban development and regional planning.

The major cities of Bavaria, including Munich, Nuremberg, and Augsburg, add further depth to this study. Munich, the state capital, acts as a cultural and economic nucleus with significant global influence and a comprehensive public transportation network. Nuremberg and Augsburg offer contrasting scenarios, showcasing rich historical contexts alongside modern urban challenges, enriching the understanding of urban dynamics in Bavaria.

4.1.2 Workflow

The workflow diagram presented in Fig. 3 outlines the methodological framework used in this study for urban area delineation through a three-step process: Data Cleaning & Feature Engineering, Clustering Generation, and Clustering Evaluation.

Step 1: Data Cleaning & Feature Engineering

The project begins by extracting data from OpenStreetMap (OSM), focusing specifically on Points of Interest (POIs) which are categorized into 29 distinct types. Through feature engineering, this is refined down to the five categories most relevant to urban areas, thereby enhancing the dataset for more targeted analysis. These selected features are then integrated with Sentinel-2 10m Land Use/Land Cover data, aligning the POIs with actual urban extents. This integration is crucial for accurate geographic data analysis. More specifically, a 10x10 meter grid is established around each satellite image point, and the POIs are spatially joined to these grids. This arrangement enables precise quantification of each category’s occurrence across the grids. Subsequently, a feature selection algorithm assesses these integrated data points to identify the top 15% with the highest Mutual Information (MI) values for clustering. This process highlights the most significant categories—buildings, amenities, shops, power, and emergency services—out of the original 29, which are critical for accurately representing urban areas. These selected features significantly enhance the clustering process and improve the overall efficacy of data-driven urban planning strategies.
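
To make the feature selection step concrete, the following minimal sketch ranks candidate OSM categories by their Mutual Information with a built-up label and keeps the top fraction. It is an illustration under stated assumptions, not the study's implementation: it uses a simple plug-in MI estimator on discretized per-cell counts, and the toy data, the category names, and the functions `mutual_information` and `select_top_features` are hypothetical. The estimator actually used in the study is not specified in this section.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_top_features(features, labels, fraction=0.15):
    """Rank features by MI with the labels and keep the top fraction (rounded up)."""
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: one value per grid cell.
# urban[i] is the built-up flag of cell i; each feature is a binned POI count.
urban = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "building": [1, 1, 1, 0, 0, 0, 1, 0],  # mirrors the label exactly
    "shop":     [1, 1, 0, 0, 0, 0, 1, 0],  # mostly informative
    "natural":  [0, 1, 0, 1, 0, 1, 0, 1],  # weakly informative
}

selected, scores = select_top_features(features, urban, fraction=0.5)
print(selected)
```

With 29 candidate categories and a 15% cutoff, rounding up in this way would retain five features, which is consistent with the five categories reported above, although the rounding rule itself is an assumption.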

Step 2: Clustering Generation

Clusters are generated using the DBSCAN algorithm, which is run repeatedly with the neighborhood radius ε varied from 10 m to 1000 m in increments of 10 m, employing a minimum threshold for point inclusion to filter noise and irrelevant data points. Concurrently, entropy is calculated for each clustering result to assess the precision of the clustering process, aiding in the identification of well-defined urban regions.
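
As an illustration of this step, the sketch below implements a brute-force DBSCAN in plain Python and runs it over the same ε range on a toy point set. The coordinates, the `min_pts` value, and the function names are illustrative assumptions rather than the study's code; a production run on OSM-scale data would need a spatial index instead of the quadratic neighbor search used here.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical toy coordinates in meters: two dense groups and one outlier.
points = [(0, 0), (5, 0), (0, 5), (5, 5),
          (500, 500), (505, 500), (500, 505),
          (1000, 0)]

labels = dbscan(points, eps=10, min_pts=3)

# Sweep eps from 10 m to 1000 m in 10 m steps, as in the workflow description.
eps_values = list(range(10, 1001, 10))
n_clusters = {eps: len({l for l in dbscan(points, eps, 3) if l != NOISE})
              for eps in eps_values}
print(labels, n_clusters[10], n_clusters[1000])
```

At the smallest radius the two groups stay separate and the isolated point is treated as noise, while at the largest radius all points merge into a single cluster, which is the kind of scale dependence the sweep is meant to expose.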

Step 3: Clustering Evaluation

In this phase, the clustering results are evaluated by calculating their joint entropy with the ground truth values from Sentinel-2 data, and the clustering with the lowest entropy is selected as optimal. Minimal entropy indicates the smallest discrepancy between the clustering results and the built-up area distribution observed in the Sentinel-2 imagery. Further validation is conducted through comparison with nighttime light data, providing empirical confirmation of the urban areas identified by the clustering process. Lastly, the clustering results are validated against Zipf's Law to check that the size distribution of the clusters follows the theoretical pattern expected in urban settings.

4.2 Results on Classification and analysis

Figure 4 presents a heatmap representing the traversal results of clustering with ε ranging from 10m to 1000m and a filtering threshold from 500 to 10,000. The color red indicates lower entropy, while blue signifies higher entropy. In our traversal, the optimal results occur at the bottom left with ε at 10m and a filtering threshold of 1500. Subsequent analyses will use these parameters to further validate the reliability of the clustering results.
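
The traversal behind such a heatmap can be sketched as a plain grid search over ε and the filtering threshold. The sketch below is illustrative only: `cluster_fn` and `score_fn` are hypothetical placeholders for the clustering and entropy computations, the precomputed toy labels stand in for real DBSCAN output, and the score used here is a simple mismatch count rather than the joint entropy used in the study.

```python
from collections import Counter

NOISE = -1


def filter_small_clusters(labels, min_size):
    """Relabel every cluster with fewer than min_size points as noise."""
    sizes = Counter(l for l in labels if l != NOISE)
    return [l if l != NOISE and sizes[l] >= min_size else NOISE for l in labels]


def best_parameters(cluster_fn, score_fn, eps_values, thresholds):
    """Return (score, eps, threshold) with the lowest score over the grid."""
    best = None
    for eps in eps_values:
        labels = cluster_fn(eps)  # cluster once per eps, then vary the filter
        for t in thresholds:
            score = score_fn(filter_small_clusters(labels, t))
            if best is None or score < best[0]:
                best = (score, eps, t)
    return best


# Hypothetical toy inputs: a reference urban mask and precomputed labels per eps.
truth = [1, 1, 1, 0, 0, 1, 1]
toy_results = {
    10: [0, 0, 0, 1, NOISE, 2, 2],
    20: [0, 0, 0, 0, 0, 0, 0],
}


def mismatch(labels):
    """Placeholder score: number of cells where cluster membership and truth differ."""
    return sum((l != NOISE) != bool(t) for l, t in zip(labels, truth))


best = best_parameters(toy_results.get, mismatch,
                       eps_values=[10, 20], thresholds=[1, 2, 3])
print(best)
```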

Figure 5 illustrates the distribution of cluster sizes after noise clusters have been filtered out, visualizing the spatial dynamics and scale of urbanization in Bavaria. A total of 652 clusters were identified. The majority contain fewer than 50,000 points each, reflecting the typical urban distribution in the region, which consists primarily of small to medium-sized clusters. Only three clusters exceed 100,000 points, corresponding to the major Bavarian cities of Munich, Nuremberg, and Augsburg, which are among the largest urban centers in the state. The scarcity of large clusters reflects concentrated urban development in a few key cities, distinguished by their size and comprehensive urban structures. This pattern informs not only about the distribution of urban activity across Bavaria but also about the effectiveness of the chosen DBSCAN parameters in distinguishing high-density urban cores from less densely populated areas. The declining frequency of larger cluster sizes further suggests that while most urban activity is localized, the clusters representing Munich, Nuremberg, and Augsburg act as central urban hubs with extensive socio-economic activity.

Figure 6 further illustrates the urban clustering in Bavaria, delineating areas into five distinct levels based on natural breakpoints in the area data. Table 2 provides a statistical summary including the number of urban clusters (N_urban), the total area (A_total) and population (P_total) of these clusters, alongside the area (A_largest) and population (P_largest) of the largest cluster in each level from Level 1 to Level 5.

Level 1 encompasses a single urban cluster representing Munich, Bavaria's largest city. It covers approximately 530 square kilometers (5.30 × 10⁸ m²) and hosts a population of 1.80 million. Traditionally noted for its dense urban core, this cluster stretches into the northeastern and southwestern peripheries, incorporating suburban and peri-urban regions typically absent in standard urban mappings. This expansion reflects Munich’s broad infrastructural and socio-economic influence, extending well beyond its official boundaries to include a more comprehensive metropolitan area.

Level 2, represented solely by Nuremberg, spans approximately 320 square kilometers (3.20 × 10⁸ m²) with a population of 650,000. The cluster captures a broader urban footprint than typically recognized, indicating areas of influence that extend into neighboring regions. This city stands as a critical economic and historical center, smaller in scale than Munich but equally significant in its regional influence. The urban structure here transitions from the dense metropolitan fabric of Munich to a slightly more dispersed arrangement, characteristic of secondary urban centers.

Level 3 diversifies with two urban clusters, Augsburg and Landsberg am Lech, covering a total area of approximately 360 square kilometers (3.60 × 10⁸ m²) with 500,000 inhabitants. The clustering suggests an urban expansion that integrates peripheral areas, serving as transitional zones between dense urban settings and suburban areas, thus redefining their urban scope and connectivity.

Level 4 features 28 clusters covering approximately 960 square kilometers (9.60 × 10⁸ m²) and a total population of 1.60 million. Schwabach and Regensburg stand out within this category; Schwabach, with the largest area, and Regensburg, with the highest population, illustrate the shift towards more granular urban spread. The clusters depict a broader urban footprint that encompasses not just the cities themselves but also their extended suburban and semi-rural contexts. This level illustrates the diffusion of urban characteristics into traditionally non-urban areas, blurring the lines between city and countryside.

Level 5, the most fragmented, consists of 620 clusters spreading across approximately 3,100 square kilometers (3.10 × 10⁹ m²) with a total population of 3.10 million. Freising and Erding exemplify the largest and most populous clusters, respectively, at this level. These areas represent suburban and peripheral urban forms, where urbanization is less intense but critically important for regional connectivity and development, suggesting these are primarily small communities or peripheral areas that complement the urban landscape.

Table 2

Statistics on the urban clusters from Level 1 to Level 5
Level	N_urban	A_total (m²)	A_largest (m²)	P_total (people)	P_largest (people)
1	1	5.30 × 10⁸	5.30 × 10⁸	1.80 × 10⁶	1.80 × 10⁶
2	1	3.20 × 10⁸	3.20 × 10⁸	6.50 × 10⁵	6.50 × 10⁵
3	2	3.60 × 10⁸	2.10 × 10⁸	5.00 × 10⁵	3.90 × 10⁵
4	28	9.60 × 10⁸	7.00 × 10⁷	1.60 × 10⁶	1.30 × 10⁵
5	620	3.10 × 10⁹	1.90 × 10⁷	3.10 × 10⁶	5.20 × 10⁴

4.3 Robustness test

4.3.1 Evaluation of Feature Engineering and Non-Feature Engineering Results

To validate the efficacy of our clustering approach in urban area delineation, we employed high-resolution Sentinel-2 10m Land Use/Land Cover imagery as a baseline. Remote sensing image interpretation is a prevalent method for urban detection, which leverages the discernible characteristics of impervious surfaces visible in the imagery to outline urban extents. By using these high-definition images as a ground truth reference, we could critically assess the enhancements brought about by our feature engineering techniques in accurately capturing the urban landscapes.

We adopted the Accuracy (ACC) metric, a widely used statistical measure in classification accuracy assessments, to evaluate the precision of our clustering method after implementing feature engineering. This metric can not only quantify the correctness of our classifications but also help to validate the improvements made by integrating feature engineering into the clustering analysis. ACC is calculated by the formula:

$$ACC=\frac{TP+TN}{TP+TN+FN+FP}$$

where "True Positives" (TP) denote areas where both our clustering method and the remote sensing imagery independently confirm the presence of urban development, and "True Negatives" (TN) refer to areas correctly identified as non-urban by both sources. "False Negatives" (FN) are instances where our clustering fails to detect urban areas that are evident in the imagery, and "False Positives" (FP) occur when our method mistakenly classifies non-urban areas as urban.
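
A minimal sketch of this calculation, assuming that both the clustering result and the Sentinel-2 reference have been rasterized to the same grid and flattened into binary sequences (the toy masks and function names below are hypothetical):

```python
def confusion_counts(predicted, reference):
    """Count TP, TN, FP, FN for two equally long binary urban masks."""
    tp = tn = fp = fn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1          # urban in both clustering and imagery
        elif not p and not r:
            tn += 1          # non-urban in both
        elif p and not r:
            fp += 1          # clustering says urban, imagery does not
        else:
            fn += 1          # imagery says urban, clustering missed it
    return tp, tn, fp, fn


def accuracy(predicted, reference):
    """ACC = (TP + TN) / (TP + TN + FN + FP)."""
    tp, tn, fp, fn = confusion_counts(predicted, reference)
    return (tp + tn) / (tp + tn + fn + fp)


# Hypothetical toy masks: 1 = urban cell, 0 = non-urban cell.
predicted = [1, 1, 0, 0, 1, 0, 0, 1]
reference = [1, 0, 0, 0, 1, 1, 0, 1]

print(confusion_counts(predicted, reference), accuracy(predicted, reference))
```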

Figure 7 uses heatmaps to analyze clustering accuracy, comparing the outcomes of feature engineering against traditional methods that do not employ feature selection. Figure 7(a) displays the Accuracy (ACC) values for clusters processed through feature engineering, where lighter colors indicate higher ACC values. Conversely, Fig. 7(b) shows ACC values for clusters incorporating all categories of OSM data without the application of feature engineering. During the testing, we systematically varied ε (the specified radius) and MinPts (minimum points) to observe how clustering outcomes changed with these hyperparameters. Notably, the heatmap in Fig. 7(a) is lighter than that in Fig. 7(b), demonstrating that feature engineering leads to higher ACC values. This suggests that feature engineering improves clustering results across a wide range of ε and MinPts settings compared to non-feature-engineered approaches, emphasizing its efficacy in refining urban area delineation.

Figure 8 compares the distribution of accuracy values for feature-engineered clustering with that of traditional non-feature-engineered approaches. In the plot, yellow represents the distribution of Accuracy (ACC) values for clusters analyzed with feature engineering, while red denotes the ACC distribution for clusters processed without feature engineering. The three dashed lines in each plot correspond to the 25th, 50th, and 75th percentiles of the ACC values, respectively. Notably, the ACC values at both the 25th and 50th percentiles are higher for the feature-engineered outcomes than for the non-engineered ones, indicating a marked improvement in clustering precision. The tail of the ACC distribution corresponds to settings in which no valid clusters formed because a small radius (ε) was combined with a high minimum point threshold (MinPts), a common challenge in parametric clustering methods. Overall, this pattern suggests that feature engineering shifts the distribution of ACC values upward, highlighting its efficacy in refining the clustering process.

Table 3 contrasts the average and maximum ACC values between clustering outcomes with and without feature engineering. The average accuracy improved from 0.663 to 0.714, and the maximum accuracy rose from 0.847 to 0.889, gains of 5.11 and 4.15 percentage points, respectively. This improvement underscores the efficacy of feature engineering in filtering out irrelevant or autocorrelated data, thus reducing noise exposure in DBSCAN clustering and enhancing the overall accuracy of the clustering results.

Table 3

Comparison of Clustering Performance with and without Feature Engineering
ACC	Clustering without FE (%)	Clustering with FE (%)	Difference (percentage points)
Avg	66.32	71.43	5.11
Max	84.74	88.89	4.15

4.3.2 Entropy-Based Evaluation of Clustering Results

During the DBSCAN clustering process, we iterated over the ϵ parameter, ranging from 10 meters to 1000 meters in increments of 10 meters, resulting in 100 different clustering outcomes. To determine the best clustering results, we used entropy as a measure to evaluate the clustering quality.

Shannon's entropy, introduced by Claude Shannon in 1948, measures the uncertainty of an information source. Entropy has since been applied across various fields, including biology, landscape ecology, and urban studies, as a measure of diversity and spatial dispersion (Cabral et al., 2013). In urban studies, entropy is used to describe land-use patterns, distinguishing between diverse and monofunctional areas (Cervero & Kockelman, 1997), and to define urban boundaries (Wei et al., 2009).

Entropy reaches its maximum when probabilities are evenly distributed and is zero when they are concentrated in a single location, making it a valuable metric for assessing spatial concentration or dispersion (Purvis et al., 2019). In the context of urban boundary definition, employing entropy relies on the assumption that higher entropy values indicate a closer resemblance to urban characteristics. Studies by Tannier et al. (2011), Arcaute et al. (2016), and Cao et al. (2020) have demonstrated that Shannon’s entropy values strongly correlate with urban areas. Liu et al. (2019) utilized the entropy method to compute optimal threshold values, aiding in the delineation of natural city limits through the aggregation of Voronoi polygons generated from POI service areas. However, this assumption may fail at certain scales or in specific cities due to the inherent complexity of urban environments (Cao et al., 2020; Cao et al., 2023).

In our study, while we also use entropy to delineate urban areas, we calculate the joint distribution entropy between Sentinel-2 data and clustering data, rather than relying on the assumption that higher entropy values indicate a closer resemblance to urban characteristics. According to the definition of entropy, lower entropy signifies a higher similarity between the two datasets. By incorporating Sentinel-2 data, we extend the application of entropy in urban definition, providing a more nuanced approach to assessing urban boundaries. The formula for entropy is:

$$H\left(X\right)=-{\sum }_{i=1}^{n}P\left({x}_{i}\right)\text{log}P\left({x}_{i}\right)$$

where P(x_i) is the probability of occurrence of state x_i. Lower entropy values indicate greater similarity and a more accurate representation of urban boundaries. By minimizing entropy, we ensure that the clusters formed reflect the complexity and variability of urban structures, leading to more precise city boundaries.
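
For illustration, the joint entropy between a clustering mask and a reference mask can be sketched as below, applying the formula above to the joint states (clustering value, reference value) of each grid cell. The toy masks, the function name `joint_entropy`, and the choice of base-2 logarithms are assumptions; the section does not specify the logarithm base used in the study.

```python
import math
from collections import Counter


def joint_entropy(xs, ys):
    """Shannon entropy (in bits) of the joint distribution of paired cell values."""
    n = len(xs)
    counts = Counter(zip(xs, ys))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical toy masks: 1 = urban cell, 0 = non-urban cell.
reference = [1, 1, 1, 1, 0, 0, 0, 0]
aligned   = [1, 1, 1, 1, 0, 0, 0, 0]   # clustering agrees with the reference
scattered = [1, 1, 0, 0, 1, 1, 0, 0]   # clustering unrelated to the reference

h_aligned = joint_entropy(aligned, reference)
h_scattered = joint_entropy(scattered, reference)
print(h_aligned, h_scattered)
```

In this toy case the aligned mask yields 1 bit and the unrelated mask 2 bits, so the lower value identifies the better match. Because joint entropy is unchanged when the labels of one mask are swapped, it measures how predictable one map is from the other, and is best read alongside an agreement measure such as ACC.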

4.3.3 Validation of Clustering Results with Nighttime Light Data

Figure 9 provides a visual comparison of urban clusters identified by our clustering method against urban areas delineated by nighttime light data. The red areas, highlighted by nighttime light observations, suggest regions of intense human activity and artificial lighting. The significant overlap between these areas and our clusters underscores the efficacy of our method in capturing the true extents of urban regions. Moreover, it highlights the importance of integrating multiple data sources and analytical techniques to refine the delineation of urban spaces, ensuring that both visible and functional urban characteristics are captured. The blue areas, however, represent urban clusters identified by our method that do not align with nighttime light-based urban areas, indicating functionally urban regions that may be active during the day but less brightly lit. This suggests that our clustering may correct some misinterpretations or omissions of the conventional nighttime light approach. Conversely, the yellow areas not covered by our clusters could indicate regions that, while brightly lit, do not exhibit the density or connectivity typically characteristic of urban clusters, such as commercial billboards along highways or large facilities with extensive lighting. This distinction underscores the intricate aspects of urban clustering, which aims to identify true urban fabric rather than merely areas of light intensity.

4.3.4 Zipf's Law Validation of Clustering Results

Figure 10 demonstrates the Complementary Cumulative Distribution Function (CCDF) of cluster sizes with a power-law fit, validating urban clustering through Zipf's law. The graph shows a close alignment between empirical data and the theoretical power-law distribution, with a scaling parameter α of 2.51. This value, higher than the typical urban range of 1 to 2, indicates a "lighter tail" suggesting fewer large cities and a predominance of smaller urban clusters. The significant deviations from the model primarily occur at larger cluster sizes, particularly in major Bavarian cities like Munich, Nuremberg, and Augsburg. This pattern indicates a prevalent occurrence of smaller clusters relative to larger ones, a common feature in urban structures, where numerous small entities exist alongside a few dominant urban hubs.

The high p-value of 0.99 indicates that the power-law hypothesis cannot be rejected, meaning that the data are consistent with a power law and hence with Zipf's theory. This suggests that while larger cities are less frequent, they have a significant influence on regional dynamics. This hierarchical pattern supports the effectiveness of the DBSCAN clustering and feature engineering methods used in the study, demonstrating their alignment with established urban development theories. This alignment enhances the credibility of the clustering approach and provides a robust basis for further studies aimed at exploring spatial dynamics and urban growth patterns.
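
The following sketch shows one common way to obtain such a scaling parameter and an empirical CCDF, using the standard continuous maximum-likelihood estimator α = 1 + n / Σ ln(x_i / x_min). It is a hedged illustration, not the study's procedure: the fitting method and the lower cutoff used for Fig. 10 are not stated in this section, and the synthetic sizes, the value `x_min = 1500`, and the function names are assumptions.

```python
import math


def fit_power_law_alpha(sizes, x_min):
    """Continuous maximum-likelihood estimate of alpha for sizes >= x_min."""
    tail = [s for s in sizes if s >= x_min]
    log_sum = sum(math.log(s / x_min) for s in tail)
    return 1.0 + len(tail) / log_sum


def empirical_ccdf(sizes):
    """Pairs (x, P(X >= x)) for sorted sizes; assumes no tied values."""
    ordered = sorted(sizes)
    n = len(ordered)
    return [(x, (n - i) / n) for i, x in enumerate(ordered)]


# Synthetic cluster sizes drawn deterministically from a power law with
# alpha = 2.5 via inverse-transform sampling at evenly spaced quantiles.
true_alpha, x_min, n = 2.5, 1500.0, 1000
sizes = [x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (true_alpha - 1.0))
         for i in range(n)]

alpha_hat = fit_power_law_alpha(sizes, x_min)
ccdf = empirical_ccdf(sizes)
print(round(alpha_hat, 3), ccdf[0], ccdf[-1])
```

On a log-log plot, the CCDF of such data falls on an approximately straight line with slope −(α − 1), which is the visual check that a figure of this kind provides.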

Our study introduces a novel approach to urban area delineation, employing a bottom-up, minimally subjective methodology using OpenStreetMap (OSM) data. This technique offers distinct advantages by reducing the reliance on predefined urban definitions, thus enabling a more organic and data-driven formation of urban clusters. In the context of Bavaria, our approach not only identified urban boundaries accurately but also aligned with key theoretical and empirical benchmarks. The validation of our results through nighttime light data confirms their reliability, while adherence to Zipf’s rank-order power law corroborates the scientific robustness of our method. This dual validation underscores the effectiveness of our method in capturing the true scope and scale of urbanization.

Our research significantly advances the field of urban delineation by extending beyond the conventional reliance on specific types of OpenStreetMap (OSM) data, such as road nodes, which has been common in many previous studies. For instance, while some approaches focus primarily on clustering road network nodes along with economic units to define urban boundaries (Caudillo-Cos et al., 2024; Cang et al., 2024), our methodology encompasses a more comprehensive set of OSM data. We implement feature selection to filter out noise and irrelevant data points, which allows for a more detailed and accurate representation of urban environments, akin to how Chen et al. (2022) employ spatial indexing to manage data thresholds before clustering, ensuring only pertinent data is analyzed.

Distinguishing our study further is the reduction of reliance on prior knowledge in the clustering process. Utilizing entropy to refine our methodology, we achieve a more autonomous and unbiased urban cluster generation. This aspect of our study is particularly innovative compared to methods like those of Arcaute et al. (2016), who determine minimum cluster sizes, or Cao et al. (2020), who filter out fragmented areas based on predetermined sizes. These conventional methods typically define urban areas by applying subjective thresholds on data points. Our fully data-driven approach ensures that our urban clustering not only adheres to the Zipfian distribution but also captures the unique urban growth patterns and granular city shapes seen across Bavaria.

This paper introduces a novel method for urban area delineation based on OpenStreetMap (OSM) data, applied within Bavaria to analyze urban clustering at various scales. Our methodology is twofold: first, leveraging feature engineering to effectively sift through and select city-related points of interest (POIs) that are most indicative of urban areas; second, utilizing the DBSCAN algorithm to autonomously generate urban structures from the bottom up. Our findings confirm that incorporating feature engineering significantly improves the outcome of clustering algorithms by aligning urban representations more closely with real-world urban patterns.

Despite the progress demonstrated, our study acknowledges several areas for further enhancement and investigation. First, while DBSCAN's flexibility in not requiring predefined cluster numbers is advantageous, the variability in its hyperparameters introduces an element of uncertainty. Optimizing these settings to balance efficiency with accuracy remains a challenge. Second, as cities expand and regional interconnections strengthen, understanding these relationships becomes crucial. Ignoring the intensity of these connections could lead to misinterpretations about the direction and scale of urban growth. Third, geographical heterogeneity, such as terrain and natural features, plays a critical role in shaping urban landscapes, suggesting that future research should also consider these elements to enhance the accuracy of urban delineation. By addressing these challenges, future research can refine the methodology to provide even more robust tools for urban planners and policymakers engaged in managing and understanding urban environments.

Acknowledgements

This study was conducted as part of the Ph.D. research program at the Professorship of Big Geospatial Data Management, Technical University of Munich (TUM). We extend our gratitude to the China Scholarship Council (No. 202108080306) for their financial support of these doctoral studies. Additionally, we express our appreciation to the experts, mentors, doctoral candidates and reviewers whose insightful discussions and constructive feedback significantly enhanced this paper.

Author Contribution

C. F.: Conceptualization, Formal analysis, Funding acquisition, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. L. Z.: Writing – review & editing. X. G.: Formal analysis, Validation, Visualization. X. L.: Data curation, Investigation, Visualization. M. W.: Project administration, Supervision, Writing – review & editing.

Data Availability

Data is provided within the manuscript or supplementary information files.

References

Agrawal, S. and Agrawal, J., 2015. Survey on anomaly detection using data mining techniques. Procedia Computer Science, 60, 708–713.
Arcaute, Elsa, et al. "Cities and regions in Britain through hierarchical percolation." Royal Society open science 3.4 (2016): 150691.
Basu, A., Garain, A., and Naskar, S.K., 2019. Word difficulty prediction using convolutional neural networks. In: TENCON 2019-2019 IEEE Region 10 Conference (TENCON), 1109–1112.
Batty, M., 2006. Rank clocks. Nature, 444 (7119), 592–596.
Breckenkamp, J., et al., 2017. Definitions of urban areas feasible for examining urban health in the European Union. The European Journal of Public Health, 27 (suppl 2), 19–24.
Brinkhoff, T., 2016. OpenStreetMap data as source for built-up and urban areas on global scale. The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 41, 557.
Cao, Wenpu, et al. "Constructing multi-level urban clusters based on population distributions and interactions." Computers, Environment and Urban Systems 99 (2023): 101897.
Calantone, R.J. and Di Benedetto, C.A., 2007. Clustering product launches by price and launch strategy. Journal of Business & Industrial Marketing.
Cang, Jun, Peipei Wu, and Shanlang Lin. "Redefining the boundaries of Chinese cities—Analysis based on multisource geographical big data." Cities 149 (2024): 104984.
Caudillo-Cos, Camilo Alberto, et al. "Defining urban boundaries through DBSCAN and Shannon's entropy: The case of the Mexican National Urban System." Cities 149 (2024): 104969.
Chandrashekar, G. and Sahin, F., 2014. A survey on feature selection methods. Computers & Electrical Engineering, 40 (1), 16–28.
Chen, Yanguang. "Defining urban and rural regions by multifractal spectrums of urbanization." Fractals 24.01 (2016): 1650004.
City, B.L. and Assessment, E., 2010. Urbanization and health. Bull World Health Organ, 88 (4), 245–246.
Cockx, L., Colen, L., and De Weerdt, J., 2018. From corn to popcorn? Urbanization and dietary change: Evidence from rural-urban migrants in Tanzania. World Development, 110, 140–159.
Cohen, B., 2004. Urban growth in developing countries: a review of current trends and a caution regarding existing forecasts. World development, 32 (1), 23–51.
Dacrema, M.F., Gasparin, A., and Cremonesi, P., 2018. Deriving item features relevance from collaborative domain knowledge. arXiv preprint arXiv:1811.01905.
Davies, D.L. and Bouldin, D.W., 1979. A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence, (2), 224–227.
de Araujo, A., do Valle, J.M., and Cacho, N., 2020. Geographic Feature Engineering with Points-of-Interest from OpenStreetMap. In: KDIR, 116–123.
Dobkins, L.H. and Ioannides, Y.M., 2001. Spatial interactions among US cities: 1900–1990. Regional science and urban Economics, 31 (6), 701–731.
Dong, Q., et al., 2022. A method to identify urban fringe area based on the industry density of POI. ISPRS International Journal of Geo-Information, 11 (2), 128.
Dzieżyc, M., et al., 2020. Can we ditch feature engineering? end-to-end deep learning for affect recognition from physiological sensor data. Sensors, 20 (22), 6535.
Eaton, J. and Eckstein, Z., 1997. Cities and growth: Theory and evidence from France and Japan. Regional science and urban Economics, 27 (4-5), 443–474.
Ester, M., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, Vol. 96, 226–231.
Fan, C., et al., 2019. Deep learning-based feature engineering methods for improved building energy prediction. Applied energy, 240, 35–45.
Fang, C. and Zhao, S., 2018. A comparative study of spatiotemporal patterns of urban expansion in six major cities of the Yangtze River Delta from 1980 to 2015. Ecosystem health and sustainability, 4 (4), 95–114.
Feng, R. and Wang, K., 2022. The direct and lag effects of administrative division adjustment on urban expansion patterns in Chinese mega-urban agglomerations. Land Use Policy, 112, 105805.
Fox, S., Bloch, R., and Monroy, J., 2018. Understanding the dynamics of Nigeria’s urban transition: A refutation of the ‘stalled urbanisation’ hypothesis. Urban Studies, 55 (5), 947–964.
Frate, F., Schiavon, G., and Solimini, C., 2004. Application of neural networks algorithms to QuickBird imagery for classification and change detection of urban areas. In: IGARSS 2004. 2004 IEEE International Geoscience and Remote Sensing Symposium, Vol. 2, 1091–1094.
Goldschen, A.J., Garcia, O.N., and Petajan, E.D., 1997. Continuous automatic speech recognition by lipreading. Springer.
Grimmond, S., 2007. Urbanization and global environmental change: local effects of urban warming. The Geographical Journal, 173 (1), 83–88.
Haghshenas, H., Vaziri, M., and Gholamialam, A., 2015. Evaluation of sustainable policy in urban transportation using system dynamics and world cities data: A case study in Isfahan. Cities, 45, 104–115.
Harris, R. and Lewis, R., 2001. The geography of North American cities and suburbs, 1900-1950: A new synthesis. Journal of Urban History, 27 (3), 262–292.
Hu, J. and Zhang, Y., 2013. Seasonal change of land-use/land-cover (LULC) detection using MODIS data in rapid urbanization regions: A case study of the pearl river delta region (China). IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 6 (4), 1913–1920.
Julisch, K., 2002. Data Mining for Intrusion Detection. Applications of data mining in computer security, 33–62.
Jun, Z., Xiao-Die, Y., and Han, L., 2021. The extraction of urban built-up areas by integrating night-time light and POI data—A case study of Kunming, China. Ieee Access, 9, 22417–22429.
Kameshwaran, K. and Malarvizhi, K., 2014. Survey on clustering techniques in data mining. International Journal of Computer Science and Information Technologies, 5 (2), 2272–2276.
Karypis, G., Han, E.H., and Kumar, V., 1999. Chameleon: Hierarchical clustering using dynamic modeling. computer, 32 (8), 68–75.
Khalid, S., Khalil, T., and Nasreen, S., 2014. A survey of feature selection and feature extraction techniques in machine learning. In: 2014 science and information conference, 372–378.
Khan, K., et al., 2014. DBSCAN: Past, present and future. In: The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014), 232–238.
Kim, D., et al., 2007. A music recommendation system with a dynamic k-means clustering algorithm. In: Sixth international conference on machine learning and applications (ICMLA 2007), 399–403.
Kraskov, A., Stögbauer, H., and Grassberger, P., 2004. Estimating mutual information. Physical review E, 69 (6), 066138.
Leung, S.H., Wang, S.L., and Lau, W.H., 2004. Lip image segmentation using fuzzy clustering incorporating an elliptic shape function. IEEE transactions on image processing, 13 (1), 51–62.
Li, H., et al., 2020. Exploration of OpenStreetMap missing built-up areas using twitter hierarchical clustering and deep learning in Mozambique. ISPRS Journal of Photogrammetry and Remote Sensing, 166, 41–51.
Li, W., et al., 2018. Economic performance of spatial structure in Chinese prefecture regions: Evidence from night-time satellite imagery. Habitat International, 76, 29–39.
Li, Y. and Zhao, X., 2012. An empirical study of the impact of human activity on long-term temperature change in China: A perspective from energy consumption. Journal of Geophysical Research: Atmospheres, 117 (D17).
Lin, L., et al., 2021. Remote Sensing of Urban Poverty and Gentrification. Remote Sensing, 13 (20), 4022.
Liu, X., Huang, Q., and Gao, S., 2019. Exploring the uncertainty of activity zone detection using digital footprints with multi-scaled DBSCAN. International Journal of Geographical Information Science, 33 (6), 1196–1223.
Liu, Z., 2021. Identifying urban land use social functional units: a case study using OSM data. International Journal of Digital Earth, 14 (12), 1798–1817.
Ma, L.J., 2005. Urban administrative restructuring, changing scale relations and local economic development in China. Political Geography, 24 (4), 477–497.
Mahtta, R., Mahendra, A., and Seto, K.C., 2019. Building up or spreading out? Typologies of urban growth across 478 cities of 1 million+. Environmental Research Letters, 14 (12), 124077.
Mitchell, B.S. and Mancoridis, S., 2001. Comparing the decompositions produced by software clustering algorithms using similarity measurements. In: 744–753.
Mullen, W.F., et al., 2015. Assessing the impact of demographic characteristics on spatial error in volunteered geographic information features. GeoJournal, 80, 587–605.
Pansombut, T., et al., 2019. Convolutional neural networks for recognition of lymphoblast cell images. Computational Intelligence and Neuroscience, 2019.
Parekh, J.R., et al., 2021. Automatic detection of impervious surfaces from remotely sensed data using deep learning. Remote Sensing, 13 (16), 3166.
Parr, J.B., 2007. Spatial definitions of the city: four perspectives. Urban studies, 44 (2), 381–392.
Potts, D., 2018. Urban data and definitions in sub-Saharan Africa: Mismatches between the pace of urbanisation and employment and livelihood change. Urban Studies, 55 (5), 965–986.
Qin, B. and Zhang, Y., 2014. Note on urbanization in China: Urban definitions and census data. China Economic Review, 30, 495–502.
Ren, L., et al., 2002. Impacts of human activity on river runoff in the northern area of China. Journal of Hydrology, 261 (1-4), 204–217.
Ribeiro, M.T., Singh, S., and Guestrin, C., 2016. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386.
Rozenfeld, H., et al., 2010. The area and population of cities: New insights from a different perspective on cities. arXiv preprint arXiv:1001.5289.
Sagayama, S., 1989. Phoneme environment clustering for speech recognition. In: International Conference on Acoustics, Speech, and Signal Processing, 397–400.
Satterthwaite, D., 2010. Urban myths and the mis-use of data that underpin them. 2010/28 WIDER working paper.
Schickel-Zuber, V. and Faltings, B., 2007. Using hierarchical clustering for learning the ontologies used in recommendation systems. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 599–608.
Seto, K.C., Parnell, S., and Elmqvist, T., 2013. A global outlook on urbanization. Urbanization, biodiversity and ecosystem services: challenges and opportunities: a global assessment, 1–12.
Shelke, N.M., Deshpande, S., and Thakre, V., 2012. Survey of techniques for opinion mining. International Journal of Computer Applications, 57 (13).
Shi, H., et al., 2015. Accurate urban area detection in remote sensing images. IEEE Geoscience and Remote Sensing Letters, 12 (9), 1948–1952.
Sinha, P., et al., 2016. Urban built-up area extraction and change detection of Adama municipal area using time-series Landsat images. Int. J. Adv. Remote Sens. GIS, 5 (8), 1886–1895.
Soo, K.T., 2005. Zipf’s Law for cities: a cross-country investigation. Regional science and urban Economics, 35 (3), 239–263.
Svirejeva-Hopkins, A., Schellnhuber, H.J., and Pomaz, V.L., 2004. Urbanised territories as a specific component of the Global Carbon Cycle. Ecological Modelling, 173 (2-3), 295–312.
Tu, X., et al., 2022. DBSCAN Spatial Clustering Analysis of Urban “Production–Living–Ecological” Space Based on POI Data: A Case Study of Central Urban Wuhan, China. International Journal of Environmental Research and Public Health, 19 (9), 5153.
Viana, C.M., Encalada, L., and Rocha, J., 2019. The value of OpenStreetMap historical contributions as a source of sampling data for multi-temporal land use/cover maps. ISPRS International Journal of Geo-Information, 8 (3), 116.
Vlahov, D. and Galea, S., 2002. Urbanization, urbanicity, and health. Journal of Urban Health, 79, S1–S12.
Waring, J., Lindvall, C., and Umeton, R., 2020. Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artificial intelligence in medicine, 104, 101822.
Wei, W., et al., 2021. Towards integration of domain knowledge-guided feature engineering and deep feature learning in surface electromyography-based hand movement recognition. Computational Intelligence and Neuroscience, 2021.
Wineman, A., Alia, D.Y., and Anderson, C.L., 2020. Definitions of “rural” and “urban” and understandings of economic transformation: Evidence from Tanzania. Journal of rural studies, 79, 254–268.
Xu, Y., et al., 2017. Urban morphology detection and computation for urban climate research. Landscape and urban planning, 167, 212–224.
Xu, Z. and Gao, X., 2016. A novel method for identifying the boundary of urban built-up areas with POI data. Acta Geogr. Sin, 71 (06), 928–939.
Xue, B., et al., 2020. Analysis of spatial economic structure of Northeast China cities based on points of interest big data. Scientia Geographica Sinica, 40 (5), 691–700.
Yadav, J. and Sharma, M., 2013. A Review of K-mean Algorithm. Int. J. Eng. Trends Technol, 4 (7), 2972–2976.
Yang, Z., et al., 2021. Using nighttime light data to identify the structure of polycentric cities and evaluate urban centers. Science of the Total Environment, 780, 146586.
Ye, Y., et al., 2019. Measuring daily accessed street greenery: A human-scale approach for informing better urban planning practices. Landscape and Urban Planning, 191, 103434.
Yixing, Z. and Yulong, S., 1995. Toward establishing the concept of physical urban area in China. The Journal of Chinese Geography, 5 (4), 1–15.
Yu, C., et al., 2013. Web map-based POI visualization for spatial decision support. Cartography and Geographic Information Science, 40 (3), 172–182.
Zhang, Q. and Seto, K.C., 2013. Can night-time light data identify typologies of urbanization? A global assessment of successes and failures. Remote Sensing, 5 (7), 3476–3494.
Zhang, W., et al., 2020. Measuring megaregional structure in the Pearl River Delta by mobile phone signaling data: A complex network approach. Cities, 104, 102809.
Zhao, W., et al., 2019. Exploring semantic elements for urban scene recognition: Deep integration of high-resolution imagery and OpenStreetMap (OSM). ISPRS Journal of Photogrammetry and Remote Sensing, 151, 237–250.
Zheng, A. and Casari, A., 2018. Feature engineering for machine learning: principles and techniques for data scientists. O’Reilly Media, Inc.
Zipf, G.K., 1949. Human behavior and the principle of least effort: An introduction to human ecology.
Zipf, G.K., 2016. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books.

No competing interests reported.
