3.1 Data collection
This research integrates four datasets to support a comprehensive urban area analysis: (a) OpenStreetMap (OSM) data, (b) Sentinel-2 10m Land Use/Land Cover data, (c) nighttime light data, and (d) LandScan data, each contributing uniquely to the urban clustering and validation process.
(a) The OSM data was utilized primarily for its rich categorization of Points of Interest (POIs), which are essential for urban clustering. This dataset, sourced from Geofabrik (https://download.geofabrik.de/) for the year 2021, comprises elements such as nodes, ways, relations, and areas. The study focused on parsing and classifying over 20 million POI entries across 29 predefined categories, as detailed in Table 1, using Python 3.7.0, with further geographical projection into EPSG:25832 - ETRS89 / UTM zone 32N for precise spatial analysis.
(b) The Sentinel-2 10m Land Use/Land Cover data, retrieved from https://www.arcgis.com/home/item.html?id=fc92d38533d440078f17678ebc20e8e2, offers detailed land cover types crucial for distinguishing urban from non-urban areas. This 2021 dataset, developed by interpreting ESA Sentinel-2 imagery, categorizes land into nine types: water, trees, flooded vegetation, crops, built areas, bare ground, snow/ice, clouds, and rangeland. The built areas are utilized as ground truth for urban extent in our feature engineering and entropy-based cluster selection, aiding in more accurately delineating urban boundaries.
(c) Nighttime light data, crucial for validating the accuracy and relevance of our clustering results, was sourced from the Suomi National Polar-orbiting Partnership (Suomi NPP) satellite's Visible Infrared Imaging Radiometer Suite (VIIRS). This dataset, from June 2021 and available at https://eogdata.mines.edu/products/vnl/, captures artificial lighting indicative of human activities. Extensive processing, including data correction, resampling, and cropping, was conducted to ensure alignment with the urban clusters identified in our study.
(d) LandScan data, used to analyze population distribution within the urban clusters identified in our study, was sourced from the Oak Ridge National Laboratory (https://landscan.ornl.gov/). This high-resolution dataset provides population density estimates per square kilometer, which are instrumental in verifying how demographic distributions align with the spatial patterns determined through our clustering approach, thereby substantiating the urban delineations.
Table 1. Classification and Description of Points of Interest (POI) Types in OpenStreetMap
OSM type name | Description of POI types
aerialway | station, pylon, cabin
aeroway | aerodrome, runway, terminal, apron
amenity | arts center, atm, bank, bar, bench, bicycle parking, bicycle rental, fountain, sustenance, education, transportation, financial, healthcare, public service, facilities, waste management…
barrier | bollard, gate, block, linear barriers, access control on highways…
boundary | boundary
building | apartments, building, hotel, house, accommodation, commercial, religious, civic/amenity, agricultural/plant production, sports, storage, cars, power/technical buildings, other buildings, additional attributes…
craft | beekeeper, blacksmith, boatbuilder, brewery, carpenter, clockmaker, electronics_repair, embroiderer, goldsmith, handicraft, hvac, jeweller, locksmith, painter, photographer, plumber, pottery, roofer, shoemaker, stonemason, tailor, tiler, watchmaker, winery, wickerwork
emergency | fire hydrant, defibrillator, ambulance station, emergency_ward_entrance, medical rescue, firefighters, lifeguards, assembly point, other structures…
geological | outcrop, glacier, palaeontological_site, volcano, geothermal, geological_fault
healthcare | alternative, birthing_center, blood_donation, clinic, dentist, doctor, laboratory, midwife, optometrist, pharmacy, physiotherapist, rehabilitation, sample_collection, speech_therapist, vaccination_centre
highway | bus stop, crossing, motorway junction, roads, link roads, special road types, paths, sidewalk/crosswalk, cycleway, lifecycle, attributes, other highway features…
historic | memorial, monument…
landuse | commercial, industrial, farmland, forest, meadow, developed land, rural and agricultural land, waterbody…
leisure | park, picnic table, playground, swimming pool…
man made | antenna, flagpole, monitoring station, tower…
military | airfield, barracks, bunker, checkpoint, danger_area, naval_base, range, training_area
natural | peak, tree, grassland, vegetation, water-related, geology-related…
office | accountant, company, government, adoption_agency…
place | administratively declared places; populated settlements, urban; populated settlements, urban and rural; other places; additional attributes
power | cable, catenary_mast, compensator, converter, generator, heliostat, insulator, line, minor_line, plant, pole, portal, substation, switch, tower, transformer, terminal
public transport | platform, station, stop_area, stop_position…
railway | station, subway entrance, ventilation shaft, tracks, additional features, stations and stops, other railways…
route | bicycle, bus, canoe, detour, ferry, foot, hiking, horse, light_rail, mtb, pipeline, piste, power, railway, road, running, ski, train, tram
shop | alcohol, antiques, art, books, clothes, convenience, hairdresser, food, mall, charity, health and beauty, do-it-yourself, furniture and interior, electronics, outdoors and sport, stationery
sport | gym, yoga…
telecom | data_center, distribution_point, exchange, manhole, pole, service_device, street_cabinet
tourism | artwork, gallery, hotel, museum…
water | lake, pond, reservoir, river, stream, waterfall, well
waterway | basin, dock, lake, lagoon, oxbow, pond, reservoir, river, riverbank, stream, tidal_channel, waterfall, wetland
3.2 Feature Engineering for Data Processing
Feature engineering is the application of domain knowledge to transform raw data into meaningful features that enhance model performance, reduce complexity, and improve computational speed and accuracy. This process includes identifying the most relevant features, improving data quality through transformation, reducing dimensionality, and creating new features. In the context of this study, feature engineering is crucial for refining the dataset derived from OpenStreetMap (OSM), which includes 29 categories of Points of Interest (POI). Many of these POI categories are not pertinent to urban areas and could adversely affect clustering outcomes. Therefore, feature engineering is employed to select the POI categories most relevant to urban areas.
Feature selection is a pivotal step in feature engineering, involving techniques that rank variables and set thresholds to exclude less significant features (Chandrashekar and Sahin, 2014). Commonly used methods for feature selection include the Pearson correlation coefficient, the chi-square test, and Mutual Information (MI).
For this study, Mutual Information is chosen for its ability to measure the amount of information shared between a feature and the target variable, thereby identifying the features most representative of urban areas. Unlike Pearson correlation, which only captures linear relationships, and the chi-square test, which is limited to categorical data, Mutual Information is particularly advantageous because it captures both linear and non-linear relationships between variables. This makes MI a robust method for feature selection in diverse and complex datasets. The formula for Mutual Information is as follows:
$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y) \, \log \frac{P(x,y)}{P(x)\,P(y)}$$
where P(x,y) is the joint probability mass function of X and Y, and P(x) and P(y) are the marginal probability mass functions of X and Y (Kraskov et al., 2004).
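As an illustration of how this quantity can be estimated in practice, the following minimal sketch computes mutual information from paired discrete observations using empirical frequencies. The function name, the natural-log base, and the toy per-cell data are assumptions made for illustration; they are not taken from the study's actual pipeline.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X;Y) in nats from paired discrete observations."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each observed (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


# Hypothetical toy data: one value per grid cell.
built = [1, 1, 0, 0, 1, 0, 1, 0]      # 1 = cell is labeled as built area
has_shop = [1, 1, 0, 0, 1, 0, 1, 0]   # matches `built` exactly
has_peak = [1, 0, 1, 0, 1, 1, 0, 0]   # statistically independent of `built` here

mi_shop = mutual_information(has_shop, built)
mi_peak = mutual_information(has_peak, built)
```

In this toy sample the shop indicator carries the maximum possible information about the built-area label for a balanced binary target, while the peak indicator carries none, which is the kind of contrast a ranking by MI is meant to expose.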
3.3 DBSCAN Clustering for Defining Data-Driven City Boundaries
Density-based clustering methods, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), excel at identifying clusters by analyzing the density relationships among data points. Unlike other clustering algorithms like K-Means or Hierarchical Clustering, which may struggle with arbitrary-shaped clusters or require a predefined number of clusters, DBSCAN is well-suited for handling complex and irregularly spaced data without predefined structures. This makes it ideal for generating city boundaries from OpenStreetMap (OSM) data, where the number and shape of clusters cannot be predetermined, and noise within the data is a significant consideration.
DBSCAN, developed by Ester et al. in 1996, operates using two primary parameters: ϵ (the neighborhood radius) and MinPts (the minimum number of points required to form a dense region). Three point types follow from these parameters. A core point has at least MinPts points within its ϵ-neighborhood (the point itself is counted), satisfying the minimum density requirement for a cluster. A border point lies within the ϵ-neighborhood of a core point but does not have enough neighbors to be a core point itself. A noise point is neither a core point nor within the ϵ-neighborhood of any core point, and hence does not belong to any cluster. The concept of density-reachability is crucial in DBSCAN. A point p is density-reachable from a point q if there exists a chain of points from q to p in which each point lies within distance ϵ of its predecessor and every point in the chain, except possibly p itself, is a core point.
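The three point types can be made concrete with a short sketch. It assumes two-dimensional Euclidean coordinates and counts a point as a member of its own ϵ-neighborhood, which is one common convention; the function name and the sample coordinates are hypothetical.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each 2-D point as 'core', 'border', or 'noise'.

    Assumption: a point's eps-neighborhood includes the point itself.
    """
    def neighbors(i):
        xi, yi = points[i]
        return [
            j for j, (xj, yj) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= eps
        ]

    neighborhoods = [neighbors(i) for i in range(len(points))]
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}

    labels = []
    for i, nb in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")   # near a core point, but not dense itself
        else:
            labels.append("noise")    # not dense and not near any core point
    return labels


# Hypothetical sample: a dense square, one point on its fringe, one outlier.
sample = [(0, 0), (1, 0), (0, 1), (1, 1), (2.5, 1), (10, 10)]
roles = classify_points(sample, eps=1.5, min_pts=4)
```

Here the four corners of the unit square are core points, the point at (2.5, 1) is a border point because it lies within ϵ of the core point (1, 1), and the point at (10, 10) is noise.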
As illustrated in Figure 1, the original data points are depicted in Figure 1(a). Figure 1(b) shows the number of points within a circle centered at point A with a radius of ϵ. Figure 1(c) demonstrates the process of iterating over all points and aggregating all points that are density-reachable from this central point into a single cluster. Within this illustration, the red points, such as point A, represent core points. The blue points, such as points B and C, are border points, while point N is identified as a noise point. The parameters ϵ and MinPts therefore directly determine the outcome of the DBSCAN clustering algorithm. Because both can be tuned to the density of the data, DBSCAN is a feasible tool for generating city boundaries.
The DBSCAN algorithm can be summarized in a series of steps. Initially, all points are marked as unvisited. For each unvisited point, the algorithm checks whether it has at least MinPts neighbors within the ϵ radius. If not, the point is provisionally labeled as noise. If so, it is a core point and starts a new cluster: it is marked as visited, and all points in its ϵ-neighborhood are added to a candidate set for cluster expansion. Every point in the candidate set is added to the cluster; if a candidate is itself a core point, its neighbors are also added to the candidate set, whereas a border point joins the cluster without expanding it. A point previously labeled as noise that is reached in this way is relabeled as a border point of the cluster. This process continues until the candidate set is empty. The algorithm then proceeds to the next unvisited point and repeats the process until all points have been visited and classified either as part of a cluster or as noise.
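The procedure above can be sketched in a few lines of plain Python. This is a didactic version with a brute-force neighbor search, not the implementation used in the study; the function names, the label conventions (-1 for noise, None for unvisited), and the sample points are assumptions made for illustration.

```python
import math
from collections import deque

NOISE = -1  # label for points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of point i (point i included)."""
    xi, yi = points[i]
    return [
        j for j, (xj, yj) in enumerate(points)
        if math.hypot(xi - xj, yi - yj) <= eps
    ]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, ... or NOISE)."""
    labels = [None] * len(points)   # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region_query(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE       # provisional: may become a border point
            continue
        cluster_id += 1             # point i is a core point: new cluster
        labels[i] = cluster_id
        candidates = deque(seeds)
        while candidates:
            j = candidates.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # former noise becomes a border point
            if labels[j] is not None:
                continue                 # already assigned, do not expand
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                candidates.extend(j_neighbors)   # only core points expand
    return labels


# Hypothetical sample: two compact groups and one isolated point.
sample_points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
cluster_labels = dbscan(sample_points, eps=1.5, min_pts=3)
```

The brute-force `region_query` makes the sketch quadratic in the number of points; for data at the scale of millions of POIs, a spatial index would be needed in practice.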
Generating city boundaries using OSM data with DBSCAN involves creating a grid overlay on the satellite imagery, typically in 10x10 meter segments. The POI data is then joined to the grid, counting the occurrences of each category within the grid cells. Feature selection is applied to identify the top 15% of features most indicative of urban areas from the 29 POI categories. This step filters out the categories that do not contribute significantly to defining urban boundaries. The DBSCAN algorithm is then run on the selected features, with the parameters ϵ and MinPts tuned to optimize cluster detection, considering the local density of POIs. The resulting clusters are analyzed to define the urban boundaries. Core points indicate high-density urban areas, while border points help outline the periphery. Noise points are disregarded as they do not contribute to meaningful urban areas.
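The gridding and selection steps can be illustrated as follows. The sketch assumes POI coordinates are already projected to a metric CRS (such as the EPSG:25832 projection mentioned in Section 3.1) and that the top 15% is obtained by rounding up, which gives five of the 29 categories; the source does not state its rounding rule, and the function names and sample values are hypothetical.

```python
import math
from collections import Counter, defaultdict


def count_pois_per_cell(pois, cell_size=10.0):
    """Count POIs per grid cell and category.

    pois: iterable of (x, y, category) with x, y in projected meters.
    Returns a mapping {(col, row): Counter({category: count})}.
    """
    counts = defaultdict(Counter)
    for x, y, category in pois:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell][category] += 1
    return counts


def top_fraction(scores, fraction=0.15):
    """Return the highest-scoring categories (rounding up, at least one)."""
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical POIs in projected coordinates (meters).
sample_pois = [
    (3.0, 4.0, "shop"),
    (9.9, 0.1, "shop"),
    (12.0, 4.0, "amenity"),
    (-0.5, 4.0, "building"),
]
cell_counts = count_pois_per_cell(sample_pois)

# Hypothetical relevance scores for 29 categories (e.g. mutual information).
sample_scores = {"cat%d" % i: float(i) for i in range(29)}
selected = top_fraction(sample_scores)
```

Using `math.floor` rather than integer truncation keeps cells consistent for negative coordinates, so a point at x = -0.5 falls in column -1 rather than column 0.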
By leveraging DBSCAN's strengths in handling noise and detecting clusters of arbitrary shapes, this approach provides a data-driven solution to define city boundaries accurately, facilitating urban planning and analysis based on OSM data. As illustrated in the figures, DBSCAN effectively identifies clusters of urban areas by analyzing the density of relevant POIs, such as buildings, amenities, shops, power infrastructure, and emergency services. This method ensures a robust delineation of city boundaries, accounting for the complexity and variability inherent in urban data.
3.4 Validating Urban Clustering: Nighttime Light Data and Zipf's Law
In this study, we employ two distinct approaches, nighttime light data and Zipf’s law, to validate our clustering results from complementary angles. Nighttime light data reflects socio-economic activity through the artificial lighting visible from space, while Zipf’s law describes the expected distribution of city sizes. Together they provide an empirical and a theoretical basis for assessing the accuracy of our clustering methodology.
3.4.1 Evaluation Using Nighttime Light Data
First, we employ nighttime light data to evaluate the validity of our proposed clustering methodology. Nighttime light imagery records the intensity of artificial lighting observed via satellite and therefore serves as an indicator of human activity, urban development, and socio-economic dynamics (Mahtta et al., 2019). The procedure begins with the acquisition of nighttime light datasets from NASA's Earth Data portal (https://www.earthdata.nasa.gov/), followed by geographic alignment and cropping to Germany's extent. We set a nighttime light intensity threshold of 3.5 (Cao et al., 2023) to separate urban from non-urban regions. This delineation allows for an empirical comparison between the urban clusters derived from our clustering algorithm and the illuminated areas identified via satellite imagery, thus providing a quantitative measure for validating our clustering results.
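One way such a comparison could be quantified is sketched below. The text does not specify which agreement measure is used, so overall accuracy and intersection over union (IoU) are assumptions chosen for illustration, as are the function names, the inclusive treatment of the threshold, and the small sample grids.

```python
NTL_THRESHOLD = 3.5  # intensity threshold stated in the text


def urban_mask(ntl_grid, threshold=NTL_THRESHOLD):
    """Mark cells whose nighttime light value reaches the threshold.

    Assumption: the threshold is inclusive (value >= threshold).
    """
    return [[value >= threshold for value in row] for row in ntl_grid]


def agreement(cluster_mask, ntl_mask):
    """Compare two boolean grids of identical shape cell by cell."""
    tp = fp = fn = tn = 0
    for cluster_row, ntl_row in zip(cluster_mask, ntl_mask):
        for in_cluster, is_lit in zip(cluster_row, ntl_row):
            if in_cluster and is_lit:
                tp += 1          # urban in both
            elif in_cluster:
                fp += 1          # clustered as urban, but dark
            elif is_lit:
                fn += 1          # lit, but not clustered as urban
            else:
                tn += 1          # non-urban in both
    total = tp + fp + fn + tn
    union = tp + fp + fn
    return {
        "overall_accuracy": (tp + tn) / total,
        "iou": tp / union if union else 1.0,
    }


# Hypothetical 2 x 3 grids already resampled to a common resolution.
sample_ntl = [[0.2, 4.0, 6.1],
              [1.0, 3.5, 0.0]]
sample_clusters = [[False, True, True],
                   [True, False, False]]

metrics = agreement(sample_clusters, urban_mask(sample_ntl))
```

Both grids must share the same extent and cell size before this comparison is meaningful, which is the purpose of the resampling and cropping described in Section 3.1.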
3.4.2 Application of Zipf’s Law to Urban Studies
Furthermore, we apply Zipf’s law as a theoretical framework for validating the urban clusters identified through our study. Zipf’s law posits that a city's population size is inversely proportional to its rank in the urban hierarchy, so that the largest city is approximately twice as large as the second largest, three times as large as the third, and so on (Zipf, 1949). This rank-size relationship can be used to predict expected urban area sizes, which are then compared with the sizes of the clusters identified through our analysis. Agreement between the empirical data and the theoretical expectation indicates that our clustering method reflects the underlying urban structure. This comparison not only substantiates the reliability of our clustering outcomes but also aligns our findings with established economic models of urban distribution.
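A simple way to check the rank-size relationship is to regress log size on log rank and inspect the slope, which should be close to -1 if Zipf's law holds. The sketch below uses ordinary least squares for this purpose; this estimator, the function name, and the synthetic sizes are assumptions for illustration, and the study may use a different fitting procedure.

```python
import math


def zipf_exponent(sizes):
    """Estimate the rank-size exponent alpha in size_r ~ size_1 / r**alpha.

    Fits log(size) = c - alpha * log(rank) by ordinary least squares.
    """
    ranked = sorted((s for s in sizes if s > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive sizes")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance   # sign flipped so that alpha is positive


# Synthetic cluster sizes that follow Zipf's law exactly (alpha = 1).
synthetic_sizes = [1200.0 / rank for rank in range(1, 11)]
alpha = zipf_exponent(synthetic_sizes)
```

An estimated exponent near 1 for the cluster populations or areas would be consistent with Zipf's law; least squares on log-log data is a convenient diagnostic, although it can be sensitive to the smallest clusters in the tail.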