Calculating the 3-30-300 index for multiple cities at the resolution of single buildings was undertaken in six steps. First, suitable open data was assembled, which also served to determine which cities were included in the study. Second, a generic process was prepared by which the data could be analysed to determine achievement of the 3-30-300 benchmark. Third, this process was tailored to each city’s particular input data. Fourth, recognising that full achievement of the metric is rare, we experimented with scoring systems to offer a clearer sense of how close each building might be to achieving the standard. Fifth, we examined correlations between our results for each of the tests (3, 30 and 300) to understand how independent they are, and therefore how necessary each rule is to include in the metric. Sixth, we determined the distribution of the age and planting density of trees in Seattle and New York to understand why the ‘3’ and ‘30’ tests seem so unrelated in the cities in our study.
This section describes each of these steps, concluding with a discussion of the data and processing limitations encountered.
1. Identifying cities with appropriate data
We used ‘opentrees.org’ as an initial point of departure for assembling open data for each city, as it aggregates and maps freely-accessible tree inventory datasets from dozens of cities around the world. Browsing the map, we identified candidate cities, aiming to ensure representation of most continents; however, the key determinant of which cities formed part of this study was the completeness of their data. We aimed to secure data for at least one, and up to three, cities per continent. To be included in the study, the following datasets needed to be freely available for download in easily-used formats (e.g. csv, shapefile, geojson):
For the ‘3’ rule:
- Tree inventory data (i.e. point data showing tree location)
- Building footprints (i.e. the outline shape of the structure, as seen from above)
For the ‘30’ rule:
- Tree canopy data showing tree cover (not just green cover from NDVI)
- ‘Neighbourhoods’ – often census tracts were used as a substitute
For the ‘300’ rule:
- Building footprints (same as for the 3 trees rule)
- Roads (centreline data, not polygons)
- Parks
While a few cities had adequate open data (e.g. Seattle, Melbourne), most cities we considered had one or two important datasets missing; useful canopy data in particular was scarce, though missing tree inventories and even missing building footprints also led to the rejection of some candidates. Figure 5 summarises our search process.
Easily-accessed data was prioritised, with data for some large US cities proving relatively simple to find. Australian data was less comprehensive, with successful searches limited to only the small central-city municipalities of Melbourne and Sydney (each being a small fraction of the wider city); even in these relatively well-resourced state capitals, some augmentation of local data with state datasets was necessary. Assembly of data to ensure inclusion of a city from Asia, South America and Europe required significantly more effort. Open datasets in many European cities proved particularly difficult to navigate. This was firstly due to unique data structures (e.g. instead of a single citywide tree inventory, three unlinked datasets titled ‘trees in parks’, ‘trees in flower beds’, ‘trees in streets’ might exist, perhaps each broken into sets of 4 due to file size). Secondly, data was often stored in the local language and only searchable in that language; even with the aid of digital translation and careful assembly of these disparate datasets, datasets were usually found to be incomplete. By accessing a combination of municipal, national and EU data, the necessary components were assembled for Amsterdam, though some clipping and file type conversion was required to deliver useable datasets. Similar creative searching was required to prepare the components for an Asian and South American city.
At the completion of this search, data was assembled for New York, Seattle, Denver, Melbourne, Sydney, Amsterdam, Buenos Aires and Singapore. Regrettably, no suitable data for cities in Africa could be identified by our process, nor for a city in mainland Asia or the Middle East. Supplementary 5 details the individual datasets used. Generally, municipal open data was the main source for each city, supplemented either by data provided by higher levels of government (e.g. national or state level canopy data) or by large international datasets (e.g. an EU dataset of building footprints, roads and parks from OpenStreetMap, or large building footprint datasets generated by Microsoft and Google using machine learning tools). In the case of New York, a very detailed tree inventory dataset arising from a recent study was used 39. In Singapore and Buenos Aires, we relied on averaging ‘green view index’ data (calculated by MIT by processing images from Google Street View) within neighbourhoods as a substitute for canopy cover 40,41. In all cases, datasets were trusted to be reasonably error-free; we did not systematically validate each input dataset beyond checking their coverage and completeness.
This study’s use of open data means it has not produced a perfect ‘apples-with-apples’ comparison between cities. Calculations, input data and processing methods are broadly similar, but vary between cities. The 3-30-300 index is a heuristic, and our intent is to examine its application at large spatial scales; the somewhat different components used in each city do not detract fundamentally from the broad insights that this study aims to derive. However, different datasets may introduce a bias towards over- or under-estimation of achievement of the metric, and the specific dataset traits that served as limitations to this study are discussed at the end of Methods.
2. Development of generic analytical workflow
The work of Browning et al. (2024) was drawn on to establish our approach to each of the three tests that form the ‘3-30-300’ rule; the experts involved in that study assessed the suitability of a range of options for calculating each criterion. The methods employed in this paper generally follow this expert guidance, though deviations were necessary in a few instances. In all cases, geospatial software with a graphical user interface (GUI) was used; while the analysis started out using ArcMap 10.6.1 and QGIS 3.32, it was moved to FME Workbench 2021 due to its relative stability and processing efficiency 42. The description that follows is generic and can be executed in any spatial software package, or using any number of coding approaches.
a. Testing whether three trees are in view of each building
This step of the analysis used only building footprints and tree inventory data. The workflow used in FME Workbench is shown at Supplementary 6, as is a figure showing the product of the analysis, with ‘lines of sight’ between trees and buildings demonstrated.
To determine whether each building was likely to have a view of three or more trees, each building footprint polygon was first converted into a set of vertices (or points) at five metre intervals. This served to represent potential window locations on each building façade, assuming that most unobstructed facades will have one or more windows at reasonable intervals, and that obstructed facades could be identified in subsequent analytical steps. Second, a ‘nearest neighbour’ test was used to draw lines between each ‘window’ vertex and each tree location; the closest three trees within 30m of each window vertex were identified by this mechanism. The resulting lines represent sightlines from a notional window location at every 5m along the building façade. Two clean-up steps were required to derive meaningful insights. First, the ‘sightlines’ were intersected with a building layer, and any sightline that crossed a building was deleted, because buildings between a given tree and window represent visual obstructions. Second, as the conversion of building footprints into notional ‘window’ vertices created multiple sightlines from each building to each tree, the risk of double counting needed to be managed: the sightlines were sorted by length, then checked for duplicates to eliminate all but the shortest sightline to each tree for a given building. A simple statistical summary counting the number of trees per building was then conducted, and joined back to the building polygon layer. An estimated visible tree count per building was thus derived.
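For readers working in code rather than GUI-based tools, the following is a minimal sketch of this sightline test in Python using GeoPandas, Shapely 2 and SciPy. The file names, the assumption of single-part footprints, and the use of a projected (metre-based) coordinate system are illustrative assumptions rather than the exact configuration used in this study.

```python
# A minimal sketch of the '3 trees' sightline test (GeoPandas, Shapely 2, SciPy).
# Assumes projected (metre-based) layers and single-part building footprints;
# file names are placeholders.
import geopandas as gpd
import numpy as np
from shapely.geometry import LineString, Point
from shapely.strtree import STRtree
from scipy.spatial import cKDTree

buildings = gpd.read_file("buildings.gpkg")   # polygon footprints
trees = gpd.read_file("trees.gpkg")           # point inventory

# 1. Densify each footprint boundary into notional 'window' vertices every 5 m.
windows = []
for bid, geom in zip(buildings.index, buildings.geometry):
    boundary = geom.exterior.segmentize(5.0)            # add vertices at <= 5 m spacing
    windows += [(bid, Point(xy)) for xy in boundary.coords]
win_ids, win_pts = zip(*windows)

# 2. For each window vertex, find up to three candidate trees within 30 m.
tree_xy = np.array([(p.x, p.y) for p in trees.geometry])
win_xy = np.array([(p.x, p.y) for p in win_pts])
dists, idxs = cKDTree(tree_xy).query(win_xy, k=3, distance_upper_bound=30.0)

# 3. Build sightlines, drop any that cross a building, keep the shortest per building-tree pair.
bldg_tree = STRtree(list(buildings.geometry))
best = {}                                                # (building id, tree position) -> length
for w, (drow, irow) in enumerate(zip(dists, idxs)):
    for d, t in zip(drow, irow):
        if not np.isfinite(d):
            continue                                     # fewer than three trees within 30 m
        line = LineString([win_pts[w], trees.geometry.iloc[t]])
        if len(bldg_tree.query(line, predicate="crosses")) > 0:
            continue                                     # sightline obstructed by a building
        key = (win_ids[w], t)
        best[key] = min(best.get(key, np.inf), d)

# 4. Count distinct visible trees per building and join back to the footprints.
counts = {}
for bid, _t in best:
    counts[bid] = counts.get(bid, 0) + 1
buildings["visible_trees"] = buildings.index.to_series().map(counts).fillna(0).astype(int)
buildings["passes_3_test"] = buildings["visible_trees"] >= 3
```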
b. Testing whether each neighbourhood has 30% canopy cover
This was a relatively simple step of the analysis, using only the polygon layers representing canopy cover and neighbourhood boundaries (or census tract boundaries). We first calculated the area of each neighbourhood in the study, then erased (clipped) the canopy polygons from the neighbourhood layer. The area of the resulting clipped polygons – representing land that does not fall under canopy – was then calculated. The difference for each neighbourhood gives its canopy area, which was expressed as a simple percentage of the original neighbourhood area.
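A minimal scripted equivalent of this step is sketched below in Python using GeoPandas. The layer and column names are placeholders, and an intersection is used here rather than the erase-and-difference described above; both yield the same percentage.

```python
# A minimal sketch of the '30' canopy cover test, assuming projected polygon layers
# and a hypothetical neighbourhood id column 'nbhd_id'.
import geopandas as gpd

nbhd = gpd.read_file("neighbourhoods.gpkg")
canopy = gpd.read_file("canopy.gpkg")

nbhd["nbhd_area"] = nbhd.geometry.area

# Intersect canopy with neighbourhoods, then sum the canopy area within each neighbourhood.
pieces = gpd.overlay(nbhd[["nbhd_id", "geometry"]], canopy[["geometry"]], how="intersection")
pieces["canopy_area"] = pieces.geometry.area
canopy_by_nbhd = pieces.groupby("nbhd_id")["canopy_area"].sum()

nbhd["canopy_pct"] = 100 * nbhd["nbhd_id"].map(canopy_by_nbhd).fillna(0) / nbhd["nbhd_area"]
nbhd["passes_30_test"] = nbhd["canopy_pct"] >= 30
```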
c. Testing whether each property is within a 300m walk of a park
This is a standard application of network analysis tools, available in many spatial software packages, and is the approach recommended by Browning et al. (2024). The analysis requires building, parks and road centreline layers. A routing algorithm is used to calculate and record the shortest distance between each property and its closest park; this is not conceptually complex, but usually requires some setup and is relatively computationally demanding at the city scale. The many intermediate steps in this analysis are shown in Supplementary 6 as an annotated export directly from the GUI-based processing software used (FME Workbench).
In the case of the software used in this study, the most important tool was called ‘ShortestPath’ and required two inputs: a road network, and a ‘from-to’ line essentially linking each property to its closest park (as the crow flies). The tool calculates the shortest path along the network from the start of the line to its end. Setup therefore involved deriving ‘from-to’ lines using nearest neighbour analysis, which in turn required all origins and destinations to be single vertices (points) rather than polygons. Accordingly, each building was converted into a centroid, and each park into a set of points at 2m intervals. The nearest neighbour analysis used the building centroid as a base and identified only the nearest park point, thereby avoiding duplication. With each building centroid now carrying coordinate attributes for its nearest park, the ‘from-to’ line could be generated. The ‘from-to’ line and the road layer were then fed into the ‘shortest path’ analysis for routing, which effectively measured a walking route along the road network for each from-to pair. However, when testing a prototype of the analysis, we found that the tool calculates distances between intersections, sometimes routing a walker to an intersection some distance from their actual park destination. A ‘chopper’ tool was therefore created to break every road into 10m segments, ensuring that calculated routes start and finish as close as possible to the relevant park and building points. Even with this improvement, the shortest path tool struggled at times to find routes between buildings and parks where one end of the trip was slightly distant from a road (as shown at Figure X), so we added a ‘snapping distance’ to the shortest path calculator which automatically brought origin and destination vertices onto the network at a maximum distance of 100m. This introduces a level of approximation to estimates of walking distance, but avoids producing a high proportion of ‘path not found’ responses. Even with the snapping tool active, a portion of legitimate paths was not solved by the tool, and on inspection these were often in locations where buildings closely abutted parks with no intervening road – an example of this problem is included at Supplementary 6.
This posed a risk of underestimating the accessibility of parks to these properties: recording the ‘300’ test result for these properties as ‘path not found’ would fail to recognise that these are often places with very convenient park access. Accordingly, we added a final step which took any ‘path not found’ properties and determined whether they lay within a crow-flies distance of 50m of a park. For these buildings, the ‘path not found’ attribute was replaced with the measured crow-flies distance, thereby recognising these locations as likely ‘pass’ results for the 300 test.
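The following is a minimal sketch of an equivalent scripted approach to the ‘300’ test using OSMnx and NetworkX rather than the FME workflow described above. The place name, file names and column names are placeholders; park centroids stand in for the 2m boundary points used in this study, and snapping each trip end to its nearest graph node plays a role loosely analogous to the ‘snapping distance’ described above.

```python
# A minimal sketch of the '300' walk-distance test (OSMnx, NetworkX, GeoPandas).
# Assumes projected (metre-based) building and park layers; names are placeholders.
import geopandas as gpd
import networkx as nx
import osmnx as ox

buildings = gpd.read_file("buildings.gpkg").reset_index(drop=True)
parks = gpd.read_file("parks.gpkg").reset_index(drop=True)

# Walkable street graph for the study area, projected to match the building layer.
G = ox.graph_from_place("Seattle, Washington, USA", network_type="walk")
G = ox.project_graph(G, to_crs=buildings.crs)

bldg_pts = buildings.geometry.centroid
park_pts = parks.geometry.representative_point()   # the study used park boundary points at 2 m intervals

# Pair each building with its nearest park as the crow flies (the 'from-to' pairing).
joined = gpd.sjoin_nearest(gpd.GeoDataFrame(geometry=bldg_pts, crs=buildings.crs),
                           gpd.GeoDataFrame(geometry=park_pts, crs=parks.crs))
nearest_park = joined.groupby(joined.index)["index_right"].first()

# Snap both trip ends to their nearest network nodes, then route along the network.
b_nodes = ox.distance.nearest_nodes(G, bldg_pts.x.values, bldg_pts.y.values)
p_nodes = ox.distance.nearest_nodes(G, park_pts.x.values, park_pts.y.values)

walk_m = []
for i, b_node in enumerate(b_nodes):
    try:
        d = nx.shortest_path_length(G, b_node, p_nodes[nearest_park.iloc[i]], weight="length")
    except nx.NetworkXNoPath:
        d = float("inf")   # 'path not found'; the study fell back to crow-flies distance here
    walk_m.append(d)

buildings["walk_to_park_m"] = walk_m
buildings["passes_300_test"] = buildings["walk_to_park_m"] <= 300
```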
3. Tailoring the generic analysis to each city
New York was used to iteratively develop a prototype of the analysis. As New York is a large city with over 1 million buildings and over 6 million trees, standard desktop computers quickly proved inadequate for analysing the city as a whole. To reduce processing demands, the city was broken into its component boroughs. This enabled the ‘3’ and ‘300’ analyses, but the spatial intersection of millions of tree canopy polygons with the city’s census districts proved excessively demanding (processing crashed frequently, and forecast processing times often exceeded one week). The tree inventory data for the city included an ‘area’ field representing the canopy area of each tree, which enabled a much simpler analysis: summing this area within each census district. Otherwise, the analysis followed the generic workflow.
Melbourne, Sydney, Seattle and Denver proceeded smoothly through the generic analysis with minimal modification, though Seattle, Melbourne and Sydney had some inputs from larger datasets which required the data be clipped to the city’s boundaries. The small municipalities of central Melbourne and Sydney processed in a matter of hours, whereas the full extent of Denver and Seattle took some days, with network analysis and canopy analysis both proving quite computationally demanding.
Neither Amsterdam nor Singapore offered comprehensive open data layers for their road networks, which required extraction and processing of OpenStreetMap (OSM) data, as well as an unsuccessful attempt to exploit national-level data provided by Microsoft offering machine-learned road networks 43. While promising, these files proved unsuitable: they were very large, unusually formatted (.tsv files) and required extraction using coding approaches (as distinct from the GUI-based approach employed in this paper). Network analyses with OSM data required additional processing to ensure road segments were linked, but even after multiple runs and inspections they continued to produce anomalous results at unacceptable rates. Ultimately, network analysis was abandoned in these cities, and crow-flies distances between parks and buildings were calculated instead by buffering each building by 300m and testing whether a park intersected each buffer. We note the work of Jafari et al. (2022), which offers detailed instructions on network setup using OSM data in the R programming language for city-scale analysis of walk distances; as this required coding, it fell outside the scope of this study.
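A minimal sketch of this crow-flies fallback in Python using GeoPandas follows, assuming projected (metre-based) layers; file names are placeholders.

```python
# A minimal sketch of the crow-flies fallback used where network analysis failed.
import geopandas as gpd

buildings = gpd.read_file("buildings.gpkg")
parks = gpd.read_file("parks.gpkg")

# Buffer each footprint by 300 m and test whether any park polygon intersects the buffer.
buffers = gpd.GeoDataFrame(geometry=buildings.geometry.buffer(300), crs=buildings.crs)
hits = gpd.sjoin(buffers, parks[["geometry"]], how="left", predicate="intersects")

buildings["passes_300_crowflies"] = hits.groupby(hits.index)["index_right"].first().notna()
```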
Buenos Aires and Singapore do not provide canopy datasets, but both were analysed as part of the ‘Treepedia’ study conducted by the MIT Senseable City Lab 41,45. This free dataset uses computer vision to quantify street-level canopy at observation points placed at regular intervals along city streets. While park canopy is excluded from that dataset, it gives a good sense of streetscape canopy cover, and with simple averaging within neighbourhood polygons it serves as a proxy for polygon-based canopy analysis in these cities. Amsterdam also did not have an immediately available canopy dataset, but the Dutch government offers one for the entire country; this was in a raster format, so processing was required to ensure raster values were appropriately averaged within neighbourhood polygons in Amsterdam. Sydney’s canopy came pre-measured, as part of a dataset provided by the state government of New South Wales 46. This data was at a sub-neighbourhood scale (‘modified mesh blocks’, often as small as just one side of the road on a city block), so it required aggregation to a larger unit, the ‘SA1’, which approximates a census district or neighbourhood and usually includes 200-800 residents 47.
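For the Amsterdam case, averaging a raster canopy layer within neighbourhood polygons is a standard zonal-statistics operation; a minimal sketch using the rasterstats package is shown below. The file names, the assumption that the layers share a coordinate system, and the assumption that raster values already express percentage cover are placeholders rather than properties of the Dutch dataset.

```python
# A minimal sketch of averaging a canopy raster within neighbourhood polygons.
import geopandas as gpd
from rasterstats import zonal_stats

nbhd = gpd.read_file("neighbourhoods.gpkg")                       # same CRS as the raster assumed
stats = zonal_stats(nbhd, "canopy_cover.tif", stats=["mean"])     # mean raster value per polygon

nbhd["canopy_pct"] = [s["mean"] for s in stats]                   # assumes values already express % cover
nbhd["passes_30_test"] = nbhd["canopy_pct"] >= 30
```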
Buenos Aires and Singapore also required careful clipping of country-scale building datasets as part of data setup.
4. Quantifying internal correlations between 3, 30 and 300 tests
The Spearman correlation coefficient was used to assess the strength and direction of relationships between the variables of interest. The Spearman correlation is particularly suitable for analysing non-normally distributed data and does not assume linearity. Correlation coefficients were calculated for three pairs of variables: canopy cover vs. tree views, tree views vs. distance to parks, and canopy cover vs. distance to parks. The magnitude of the coefficient reflects the strength of the association and the sign reflects its direction, with values closer to +1 indicating a stronger positive relationship and values closer to -1 indicating a stronger negative relationship.
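A minimal sketch of these pairwise correlations using SciPy is shown below; the table and column names are placeholders for the per-building results generated by the preceding steps.

```python
# A minimal sketch of the pairwise Spearman correlations; column names are hypothetical.
from itertools import combinations
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("building_results.csv")   # e.g. columns: canopy_pct, visible_trees, walk_to_park_m

for a, b in combinations(["canopy_pct", "visible_trees", "walk_to_park_m"], 2):
    rho, p = spearmanr(df[a], df[b], nan_policy="omit")
    print(f"{a} vs {b}: rho={rho:.2f}, p={p:.3g}")
```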
5. Examining New York and Seattle to understand differences between ‘3’ and ‘30’ results
Tree point data for New York and Seattle was of a standard that enabled analysis of the tree populations lying within the viewing distance used for the ‘3’ test (30m, as defined in this study) of buildings. By subsetting each dataset spatially, we isolated trees in this range, then for each census district we calculated the median canopy area per tree and the planting density of trees per hectare. This enabled us to visualise the distribution of tree sizes and planting density within the subset of each tree population that sits within 30m of buildings.
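A minimal sketch of this subsetting and per-district summary in Python using GeoPandas follows. The column and file names are placeholders, and planting density is computed here over the full census district area, which is one plausible reading of this step.

```python
# A minimal sketch of the tree subsetting and per-tract summaries; assumes projected
# layers and placeholder column names ('canopy_area_m2', 'tract_id').
import geopandas as gpd

trees = gpd.read_file("trees.gpkg")            # point inventory with a canopy-area field
buildings = gpd.read_file("buildings.gpkg")
tracts = gpd.read_file("census_tracts.gpkg")

# Keep only trees within 30 m of any building (the '3' test viewing distance).
near_zone = gpd.GeoDataFrame(geometry=buildings.geometry.buffer(30), crs=buildings.crs)
near = gpd.sjoin(trees, near_zone, predicate="within", how="inner")
near = near[~near.index.duplicated()].drop(columns="index_right")

# Assign each retained tree to a census tract and summarise per tract.
near = gpd.sjoin(near, tracts[["tract_id", "geometry"]], predicate="within", how="inner")
summary = near.groupby("tract_id").agg(median_canopy_m2=("canopy_area_m2", "median"),
                                       n_trees=("canopy_area_m2", "size"))
tract_ha = tracts.set_index("tract_id").geometry.area / 10_000
summary["trees_per_ha"] = summary["n_trees"] / tract_ha
```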
6. Limitations
Two key potential areas of limitation should be noted in this study’s methods, especially given that we seek to interrogate the difficulty and viability of calculating this heuristic metric within municipal teams. The first limitation relates to a set of flaws in the input data, which were inevitably incomplete and/or asynchronous given their diverse sources. The second key area of limitation is in the approaches we took to processing the data, which have introduced margins of error through their inherent assumptions and imperfect tools. Supplementary 7 summarises the issues observed through the course of this study, and offers a detailed description of the limitations in processing and data.