A novel spatiotemporal prediction approach to fill air pollution data gaps using mobile sensors, machine learning and citizen science techniques

doi:10.21203/rs.3.rs-4667713/v1

This work is licensed under a CC BY 4.0 License

You are reading this latest preprint version

Particulate Matter (PM) air pollution poses significant threats to public health. Existing models for predicting PM levels range from Chemical Transport Models to statistical approaches, with Machine Learning (ML) tools showing superior performance due to their ability to capture highly non-linear atmospheric responses. This research introduces a novel methodology leveraging ML tools to predict PM_2.5 levels at a fine spatial resolution of 30 metres and temporal scale of 10 seconds. The methodology aims to demonstrate its proficiency in estimating missing PM_2.5 measurements in urban areas that lack direct observational data. A hybrid dataset was curated from an intensive aerosol campaign in Selly Oak, Birmingham, UK, utilizing citizen scientists and low-cost Optical Particle Counters (OPCs) strategically placed in both static and mobile settings. Spatially resolved proxy variables, meteorological parameters, and aerosol properties were integrated, enabling a fine-grained analysis of PM_2.5 distribution along road segments. Calibration involved three approaches: Standard Random Forest Regression, Sensor Transferability Evaluation, and Road Transferability Evaluation. Results demonstrated high predictive accuracy (R² = 0.85, MAE = 1.60 µg m⁻³) for the standard RF model. Sensor and road transferability evaluations exhibited robust generalization capabilities across different sensors (best R² = 0.65, MAE = 2.76 µg m⁻³) and road types (R² = 0.71, MAE = 2.46 µg m⁻³), respectively. This methodology has the potential to significantly enhance spatial resolution beyond regulatory monitoring infrastructure, thereby refining air quality predictions and improving exposure assessments. The findings underscore the importance of ML-based approaches in advancing our understanding of PM_2.5 dynamics and their implications for public health. 
The paper has important implications for citizen science initiatives, as it suggests that the contributions of a small number of participants can significantly enhance our understanding of local air quality patterns for many thousands of residents.


Particulate Matter (PM) air pollution has a considerable negative influence on human health, especially with respect to the cardiovascular and pulmonary systems. According to the European Environment Agency (EEA, 2023)¹, 97% of the urban population in Europe is exposed to fine particulate matter (PM_2.5, i.e. PM with a diameter of 2.5 µm or smaller) concentrations above the World Health Organization's (WHO) 2021 recommended annual average of 5 µg m⁻³. Within the EU, air pollution is estimated to have led to 238 000 premature deaths in 2021 and is the largest environmental health risk in Europe¹. PM_2.5 is a critical air pollutant: primary PM_2.5 originates mainly from combustion processes, while secondary PM_2.5 forms from reactions of organic or inorganic gaseous compounds and can contribute more than 50% of PM_2.5 mass, depending on the season and the location². PM₁₀ (PM with a diameter of 10 µm or smaller) is also a critical air pollutant, with coarse particles, i.e. between 2.5 µm and 10 µm, resulting mainly from mechanical processes³.

PM_2.5 has become the leading environmental contributor to the global burden of disease, a substantial change from 1990, when it ranked fifth among environmental risk factors⁴. Studies have shown that spending a substantial amount of time in areas with even low ambient PM_2.5 levels can have adverse effects on human health⁵. The health impact of air pollution is critical in urban areas, where most of the world's population resides; rapid reduction strategies are therefore required. For these strategies to be successful, they need to be targeted, and hence an accurate description of the spatiotemporal variability of PM is required⁷. Urban areas exhibit high heterogeneity in PM concentrations due to the diversity of emission sources, the variability in land use patterns, and the interaction between meteorological factors and the urban canopy, which influences the dispersion of air pollutants⁸. This spatial and temporal variability poses challenges for exposure assessment and air quality management⁹.

Regulatory monitoring networks, such as the UK's Automatic Urban and Rural Network (AURN), serve as the main UK infrastructure for ensuring compliance with ambient air pollution standards. Nevertheless, the acquisition and maintenance costs of regulatory-grade instruments are high, and the sparsely distributed station network fails to capture the small-scale spatial variations observed in pollutant concentrations in urban areas, as indicated by numerous studies¹⁰,¹¹. These localized variations contribute to differences in human pollutant exposures, ultimately influencing associated health impacts¹².

To detect and quantify the fine-scale spatial fluctuations in pollutant concentrations, there is growing interest in utilizing low-cost sensor (LCS) networks. This interest is attributed to the improved capabilities of sensor technologies and the development of innovative methods for sensor calibration¹³,¹⁴,¹⁵. However, challenges remain in optimizing sensor placement strategies¹⁶,¹⁷, in data quality assurance due to, e.g., LCS drift or sensitivity to meteorological variables¹⁸,¹⁹, and in interpreting LCS data in the context of regulatory air quality standards²⁰. To accurately estimate population exposure, monitoring at a high spatial and temporal resolution should be pursued. Mobile low-cost sensors provide a cost-effective solution for monitoring air quality in areas with limited existing infrastructure, owing to their compact size and portability. Examples include PM_2.5 measurements performed by citizen-operated mobile sensors mounted on bikes²², sensors deployed on routine vehicle fleets such as trash trucks²³, a tram-based mobile sensor network in Zurich²⁴, and taxi motorcycles in Kampala²⁵. However, sampling every location continuously throughout a given geographic area is an unattainable goal.

A diverse array of models is used to predict PM levels. Some are based on atmospheric processes and emissions, e.g., Chemical Transport Models (CTMs) or Lagrangian particle dispersion models (LPDMs). These models play a crucial role in simulating and understanding the intricate dynamics of air pollutants, incorporating factors like atmospheric chemistry, emission sources, and dispersion patterns. For instance, Sokhi et al.²⁶ evaluated four regional chemistry transport models with a horizontal resolution of approximately 20 km, which systematically underestimated PM₁₀ and PM_2.5 concentrations in Europe by 10–60%, varying with model and season, when compared with the European Monitoring and Evaluation Programme (EMEP) measurements. Zhang et al.²⁷ employed a simplified Lagrangian particle dispersion model with Bayesian-RAT (multiplicative ratio correction optimization) to enhance regional PM concentration predictions, demonstrating superior accuracy compared to other models (WRF-Chem and CAMx) and showcasing the LPDM's advantage in forecasting PM and potentially other pollutants. However, both CTMs and LPDMs may encounter challenges in accurately predicting PM_2.5 concentrations due to the small size of the dataset, low predictive performance for small areas, high computational cost, and the difficulty of achieving sufficient spatial and temporal resolution²⁸. Other prediction approaches include statistical models based on meteorological variables and emission proxies²⁹.

Data-driven models, in contrast to physically-driven models, have garnered significant attention due to their ease of implementation³⁰. Machine learning (ML) models have been shown to be highly effective for PM prediction, showcasing robust performance with non-linear variables and flexible modelling³¹. Supervised learning includes tree-based algorithms (random forest, extreme gradient boosting, light gradient boosting, etc.) and vector-based algorithms (k-nearest neighbour, support vector regression, etc.), which learn from labelled data through classifiers or regressors³². Nevertheless, classifier methods proved to be less suitable for PM prediction than regressor methods, and, in general, vector-based algorithms exhibited lower predictive power than their tree-based counterparts³³. Hence, tree-based machine learning algorithms, known for their low computational costs and high prediction accuracy, are extensively employed in PM prediction research³⁴,³⁵.

The existing literature reveals a notable research gap concerning the limitations of current air pollution prediction models, particularly in the context of fine-scale spatial and temporal variations. To address this gap, which demands innovative and cost-effective solutions for enhanced spatial resolution, especially in densely populated urban settings, our study proposes a novel methodology that leverages ML tools, particularly tree-based models, to predict PM_2.5 levels at a spatial resolution of 30 metres and a temporal resolution of 10 seconds.

While our research maintains a broad scope, we conducted a comprehensive testing phase within a measurement campaign in Selly Oak, Birmingham, United Kingdom, where we deployed a combination of static and mobile Optical Particle Counters. Our primary objective is to build predictive models using tree-based ML algorithms that excel in estimating PM levels. To achieve this, we harness a hybrid dataset, curated to integrate information from both static and mobile low-cost sensors and from diverse ancillary datasets. Our focus extends beyond scenarios with active mobile sensor deployment, aiming to create models that can reliably predict PM concentrations even when the mobile sensors are not in operation.

2.1 Study Area

The study area is a block of approximately 1 km × 1 km in Selly Oak, situated approximately 3 km south-west of the city centre of Birmingham, UK, a major city with a population of 1.14 million. Selly Oak lies immediately south of the main Edgbaston campus of the University of Birmingham, which has around 38,000 students and 8,000 staff, and the community is strongly shaped by the university, serving as a prominent residential area for students. The northern and southern boundaries of the block are delineated by two busy roads, namely Bristol Road and Raddlebarn Road, respectively (Fig. 1). The area has a railway station to the west of the block that connects it to the city centre and other parts of the West Midlands.

2.2 Low-cost Optical Particle Counters

During the measurement campaign from April 15th to June 20th, 2023, Alphasense optical particle counters (OPC-N3, Alphasense Ltd), which cost approximately 350 GBP per unit, were employed in both stationary and mobile configurations. The OPC-N3 employs a Class 1 laser (wavelength ~ 658 nm) to detect, size, and count particles within the range of 0.35–40 µm, distributed across 24 size bins. An embedded algorithm translates this size distribution into estimated mass concentrations within the PM₁, PM_2.5, and PM₁₀ size fractions. Employing an internal fan, the OPC-N3 draws the air sample through the detection region, with a total flow rate of 5.5 L min⁻¹. For this instrument, the manufacturer declares a sensitivity of < 1 µg m⁻³ and a measurement range spanning from 0 to 2000 µg m⁻³.

The OPC-N3 was configured with the default settings for particle refractive index and particle density, with values set at 1.5 and 1.65 g cm⁻³, respectively. Measurements were stored at 10-second intervals and included the following parameters: date, number size distribution, flow rate, relative humidity (RH), temperature and estimated concentrations for PM₁, PM_2.5, and PM₁₀. In static configurations, the sensor drew power from a car battery housed within a robust all-weather box (Fig. 2a), providing reliable operation for a span of 18–20 days. All four static sensors were strategically positioned to capture comprehensive data across the study area. Battery replacement, conducted as part of routine maintenance, was needed to sustain the continuous functionality of the sensors in environments where access to the mains was not feasible. Additionally, four mobile sensors were deployed at street level, using a user-friendly backpack setup that was both lightweight (< 1 kg) and easy to handle (Fig. 2b). These sensors ran on a mobile power bank (5000 mAh), which required recharging every two days to avoid interruptions. Using a citizen science approach, local businesses and schools actively contributed to data collection from static sensors throughout the two-month period, while university students similarly engaged in the collection of data from mobile sensors over a one-month period (15th May–20th June 2023). Within the enclosed box, a Bosch BME-280 sensor measured the relative humidity and temperature. Additionally, a GSM module connected to the microcontroller of the OPC-N3 served as a real-time clock and transmitted the measurements to a dedicated cloud server.

2.3 Birmingham Air Quality Supersite (BAQS)

Birmingham Air Quality Supersite (BAQS) is one of three highly instrumented air quality stations in the United Kingdom. The regulatory station, characterised as urban background, is located in the grounds of the University of Birmingham (52.45° N, 1.93° W), about 3 km south-west of the city centre³⁶. The reference instrument at BAQS was the PALAS Fidas 200, which provides continuous and simultaneous measurements of PM₁, PM_2.5, PM₄, PM₁₀, TSP (PM_tot) and the particle number concentration with a time resolution of 1 minute.

2.4 Calibration

All sensors were calibrated by collocation with the research-grade instruments at the BAQS. Two collocation periods, before and after the campaign, were performed for a total of 4 days, in which the inlets of the sensors were placed next to the inlets of the research-grade instruments at BAQS for simultaneous measurements. For each OPC-N3, after removing the outliers (about 5% of the measurements), which occurred mainly under extreme RH conditions, an exponential model was fitted to the ratio between the PM_2.5 measured by the low-cost sensor and by the Fidas as a function of atmospheric RH. Using this model, we calibrated the sensors to account for variable RH conditions. The precision of low-cost optical particle counters is known to be greatly affected by atmospheric conditions, especially RH. As a result, an overestimation of the PM_2.5 and PM₁₀ concentrations from the sensor is generally observed at higher RH due to PM hygroscopicity effects¹³,³⁷,³⁸. While high RH is a common feature of the meteorological conditions in the United Kingdom, the RH during the campaign was rather low, reducing the discrepancy between the measurements of the sensors and the reference instruments. Nevertheless, the calibration greatly improved the precision of the low-cost OPC estimates, especially for PM_2.5 and PM₁₀, with the Pearson correlation coefficient (r) for PM₁ = 0.81 to 0.84, PM_2.5 = 0.63 to 0.75, PM₁₀ = 0.32 to 0.57.
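The RH correction described above can be illustrated with a minimal sketch. It assumes the exponential model has the form ratio = a · exp(b · RH) and fits it by least squares on the logarithm of the ratio; the exact functional form and fitting procedure used in the study are not stated, and the function names and the synthetic collocation values below are hypothetical.

```python
import math

def fit_exponential_ratio(rh, ratio):
    """Fit ratio = a * exp(b * rh) by ordinary least squares on log(ratio)."""
    n = len(rh)
    log_ratio = [math.log(r) for r in ratio]
    mean_x = sum(rh) / n
    mean_y = sum(log_ratio) / n
    sxx = sum((x - mean_x) ** 2 for x in rh)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rh, log_ratio))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def correct_pm(pm_sensor, rh, a, b):
    """Divide the raw sensor reading by the fitted sensor/reference ratio."""
    return pm_sensor / (a * math.exp(b * rh))

# Synthetic, noise-free collocation data (assumed values, not from the study).
rh_obs = [40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
ratio_obs = [0.6 * math.exp(0.012 * x) for x in rh_obs]  # sensor PM / Fidas PM

a_hat, b_hat = fit_exponential_ratio(rh_obs, ratio_obs)
corrected = correct_pm(10.0 * ratio_obs[3], rh_obs[3], a_hat, b_hat)
```

With real collocation data the ratio is noisy, so outlier removal (as done in the study) matters before taking logarithms.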

2.5 Hybrid dataset and ML model

Building upon the comprehensive understanding of the study area and the deployment of a set of low-cost OPC-N3 detailed in Sections 2.1 and 2.2, our methodology proceeds to the construction of a hybrid dataset and the implementation of machine learning (ML) models.

2.5.1 Hybrid Dataset Preparation

To address the challenge of predicting PM_2.5 levels at fine spatial and temporal scales, we compiled a hybrid dataset incorporating additional information along with that gathered by both static and mobile low-cost sensors. The road network of the whole study area was ranked into primary, secondary, and tertiary segments based on traffic volume characteristics, using telematics data collected for the area³⁹,⁴⁰. Furthermore, continuous traffic data were captured through a traffic monitoring camera next to a supermarket on Bristol Road, a key thoroughfare in the study area. These data were provided by the Birmingham City Council and included vehicle count and vehicle speed differentiated by vehicle type (motorbike, passenger car, trailer, rigid, Heavy Goods Vehicle (HGV), bus). The hybrid dataset also integrated demographic (census) data and meteorological variables from BAQS, including wind speed, wind direction, atmospheric pressure, RH, and atmospheric temperature. PM_2.5 from the Fidas at BAQS was also incorporated into the dataset, resulting in a comprehensive and multi-faceted analytical framework to investigate the area. VSP (vehicle specific power)³⁹, representing the instantaneous total power demand per vehicle mass, was initially included in the hybrid dataset from the telematics data but later excluded due to its low importance as determined by the machine learning model (see Fig. S5).

For spatial analysis and integration, the mobile sensor dataset was converted into a spatial object and reprojected to the British National Grid. To allow for a detailed analysis of the spatial distribution of PM_2.5 along the road network, the reprojected road network was divided into 30-metre-long segments, and for each segment the position of the corresponding centroid was computed (Fig. 3). The centroids served as focal points for the assignment of all spatially resolved proxy variables (e.g. demographic data, average traffic counts) and of the PM_2.5 measured by pedestrian mapping. The spatial analysis in this study was conducted using the open-source software QGIS. Details of the input data used can be found in Table 1.
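The segmentation step can be sketched as follows. This is a pure-Python illustration, not the QGIS workflow used in the study: it assumes a road is a polyline in projected metric coordinates (such as British National Grid eastings and northings), uses the arc-length midpoint of each 30 m piece as its representative point (which coincides with the centroid for straight pieces), and assigns a GPS fix to the nearest representative point. All function names and coordinates are hypothetical.

```python
import math

def cumulative_lengths(points):
    """Cumulative arc length (metres) at each vertex of a polyline."""
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    return cum

def point_at_distance(points, cum, d):
    """Interpolate the point located d metres along the polyline."""
    for i in range(1, len(points)):
        if d <= cum[i]:
            (x0, y0), (x1, y1) = points[i - 1], points[i]
            t = (d - cum[i - 1]) / (cum[i] - cum[i - 1])
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return points[-1]

def segment_midpoints(points, seg_len=30.0):
    """Representative point of each seg_len-long piece (last piece may be shorter)."""
    cum = cumulative_lengths(points)
    total = cum[-1]
    n_seg = math.ceil(total / seg_len)
    mids = []
    for k in range(n_seg):
        start = k * seg_len
        end = min(start + seg_len, total)
        mids.append(point_at_distance(points, cum, (start + end) / 2))
    return mids

def nearest_segment(pt, mids):
    """Index of the segment whose representative point is closest to pt."""
    return min(range(len(mids)), key=lambda i: math.dist(pt, mids[i]))

# Hypothetical L-shaped road, 135 m long, in metric coordinates.
road = [(0.0, 0.0), (90.0, 0.0), (90.0, 45.0)]
mids = segment_midpoints(road)
seg_id = nearest_segment((44.0, 3.0), mids)
```

Each mobile measurement then inherits the proxy variables attached to the segment it falls in.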

Table 1 – Input parameters used in the ML model
Context feature	Parameter	Data source
Meteorology	Temperature, relative humidity, wind speed, wind direction, atmospheric pressure	Birmingham Air Quality Supersite (BAQS)
Road network	Road type	OpenStreetMap
Traffic data	Traffic count by vehicle type (motorbikes, cars, trailers, rigids, HGVs, buses); frequency distribution of vehicle speed	Birmingham City Council
Population	Population	Office for National Statistics
Air quality	PM_2.5	BAQS, static sensors
Categorical	Mobile sensor ID	Mobile sensors

2.5.2 ML Model Development and Calibration Approaches

Random Forest (RF) is a powerful ensemble supervised learning algorithm, introduced by Breiman (2001)⁴¹, that builds on the classification and regression trees (CART) algorithm for prediction. RF employs a bagging approach: each decision tree within the forest is grown on a bootstrap sample drawn with replacement from the training data, with node splitting performed on random subsets of variables (set by the mtry hyperparameter). The forest is expanded using a specified number of trees (hyperparameter ntree) to mitigate bias error⁴¹. The final predicted result is derived by aggregating and averaging the predictions from the individual trees (Fig. 4). Random Forest simulations were conducted to select and optimize the hyperparameters. The final model configuration used 1000 trees, limited each split to a maximum of 5 randomly selected features (mtry = 5), and set a minimum node size of 5. In this study, RF modelling was performed using the “randomForestSRC” package in R. The modelling process took place on the dedicated High-Performance Computing (HPC) ARIES platform of the University of Modena and Reggio Emilia.
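The bagging-and-averaging idea can be illustrated with a deliberately simplified sketch. It is not the randomForestSRC implementation used in the study: each "tree" here is a single-split regression stump, the random feature subset is drawn once per stump rather than at every node, and the toy data and all names are hypothetical. The parameters `n_trees`, `mtry` and `min_node` only mirror the roles of ntree, mtry and the minimum node size.

```python
import random

def fit_stump(X, y, features, min_node=5):
    """Best single split (least squared error) over the candidate features."""
    best = None
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            if len(left) < min_node or len(right) < min_node:
                continue  # respect the minimum node size
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # no admissible split: predict the sample mean
        mean = sum(y) / len(y)
        return lambda row: mean
    _, j, thr, ml, mr = best
    return lambda row: ml if row[j] <= thr else mr

def fit_bagged_stumps(X, y, n_trees=50, mtry=1, min_node=5, seed=0):
    """Grow stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap, with replacement
        feats = rng.sample(range(p), mtry)          # random feature subset
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats, min_node))
    return lambda row: sum(tree(row) for tree in trees) / len(trees)

# Toy data: the target steps from 0 to 10 with feature 0; feature 1 is noise.
X = [[float(i), float((i * 3) % 7)] for i in range(40)]
y = [0.0 if i < 20 else 10.0 for i in range(40)]
model = fit_bagged_stumps(X, y)
low, high = model([5.0, 3.0]), model([35.0, 3.0])
```

Averaging over many bootstrap-trained learners is what stabilises the ensemble prediction; a full random forest replaces each stump with a deep tree.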

The calibration of predictive models for PM_2.5 concentrations involved the implementation of three specific approaches: 1) Standard Random Forest Regression (RF): This method employed the standard RF regression technique with an 80–20 train-test split. The model was trained on 80% of the hybrid dataset (95,072 data points) and tested on the remaining 20% (23,769 data points) of the total (118,841 observations) to evaluate its predictive performance based on the provided input features. 2) Sensor Transferability Evaluation: This approach involved calibrating the RF model using the hybrid dataset but with only one of the four OPC-N3 units involved in the pedestrian mapping. The performance of this approach was then assessed on an independent OPC-N3 sensor. This evaluation aims to gauge the model's ability to generalize across different OPC-N3 sensors. 3) Road Transferability Evaluation: In this calibration approach, the model was calibrated using the hybrid dataset but including only one road, and its performance was evaluated on a different road. This evaluation explores the capability of the model to generalize across distinct roads in the area.
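The three data partitions can be sketched as below. The record layout and the field names `sensor_id` and `road` are hypothetical, and the random split is only one way to obtain the 95,072 / 23,769 partition reported above (the study does not state how rows were drawn).

```python
import random

def random_split(n_rows, train_fraction=0.8, seed=0):
    """Approach 1: shuffle row indices and cut at the train fraction."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_train = int(train_fraction * n_rows)
    return idx[:n_train], idx[n_train:]

def group_split(rows, key, train_group, test_group):
    """Approaches 2 and 3: train on one group (sensor or road), test on another."""
    train = [r for r in rows if r[key] == train_group]
    test = [r for r in rows if r[key] == test_group]
    return train, test

# Approach 1 on the dataset size reported in the text.
train_idx, test_idx = random_split(118841)

# Approaches 2 and 3 on a few made-up records.
rows = [
    {"sensor_id": "OPC1", "road": "Bristol Road", "pm25": 6.1},
    {"sensor_id": "OPC1", "road": "Raddlebarn Road", "pm25": 5.4},
    {"sensor_id": "OPC2", "road": "Bristol Road", "pm25": 7.0},
    {"sensor_id": "OPC2", "road": "Raddlebarn Road", "pm25": 5.9},
]
sensor_train, sensor_test = group_split(rows, "sensor_id", "OPC1", "OPC2")
road_train, road_test = group_split(rows, "road", "Bristol Road", "Raddlebarn Road")
```

Unlike the random split, the group splits keep every observation from the held-out sensor or road out of training, which is why they are the stricter test of generalization.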

For each approach, the individual contributions of the variables to the model prediction were analysed and assessed using the variable importance (VIMP). This is achieved by quantifying the change in model error when a single variable is permuted, i.e., randomly shuffled.

This technique is widely adopted and has been extensively utilized across the machine learning literature⁴¹,⁴²,⁴³,⁴⁴,⁴⁵.
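A minimal sketch of permutation importance follows. It assumes mean squared error as the error measure and evaluates on the supplied data, whereas random forest packages typically compute VIMP on out-of-bag samples; the toy model, data and function names are hypothetical.

```python
import random

def mse(model, X, y):
    """Mean squared error of a callable model over rows X and targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_features, seed=0):
    """Increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between feature j and the target
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(mse(model, X_perm, y) - baseline)
    return importances

def toy_model(row):
    # The prediction depends on feature 0 only; feature 1 is ignored.
    return 2.0 * row[0]

X = [[float(i), float((i * 7) % 10)] for i in range(20)]
y = [toy_model(row) for row in X]
vimp = permutation_importance(toy_model, X, y, n_features=2)
```

A feature the model never uses gets an importance of zero, while shuffling an informative feature raises the error.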

To validate these models, various performance metrics were computed. The coefficient of determination (R²) was computed for the linear least-squares regression of predicted concentrations versus observed concentrations. The Spearman rank correlation coefficient (ρ) was also computed to assess the non-parametric, rank-based (monotonic) correlation between model estimates and observations. Additionally, the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) were calculated to assess the average deviation between predicted and observed concentrations, and the Mean Bias Error (MBE) to indicate the overall model bias. Detailed equations for each metric are provided in the SI for clarity.
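For orientation, the metrics can be computed as in the sketch below. It uses the standard textbook definitions, with MBE taken as the mean of predicted minus observed (positive values indicating overestimation) and R² as the squared Pearson correlation, which equals the R² of a simple linear regression; the equations in the SI may differ in detail, and the sample values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def metrics(obs, pred):
    """R², MAE, MBE, RMSE and Spearman rho for paired observations and predictions."""
    n = len(obs)
    err = [p - o for p, o in zip(pred, obs)]
    return {
        "R2": pearson(obs, pred) ** 2,
        "MAE": sum(abs(e) for e in err) / n,
        "MBE": sum(err) / n,
        "RMSE": math.sqrt(sum(e * e for e in err) / n),
        "rho": pearson(average_ranks(obs), average_ranks(pred)),
    }

# Made-up example: predictions are perfectly correlated but biased by +1.
obs = [2.0, 4.0, 6.0, 8.0]
pred = [3.0, 5.0, 7.0, 9.0]
m = metrics(obs, pred)
```

The example shows why the error metrics are reported alongside R²: a constant offset leaves R² and ρ at 1 while MBE, MAE and RMSE all equal the offset.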

The following section presents the outcomes of our study, which focused on predicting PM_2.5 levels using the three distinct ML calibration approaches. The agreement between predicted and observed data points was also assessed in terms of frequency distribution density plots for each calibration approach, to highlight areas of convergence or divergence between predicted and actual PM_2.5 levels. The values of the performance metrics are given in Table 2.

3.1 Calibration Approach 1: Standard Random Forest Regression

The Standard Random Forest Regression, utilizing an 80–20 train-test split, demonstrated strong predictive performance at 10-second temporal resolution. The R² score was 0.85 and the MAE was 1.60 µg m⁻³; the corresponding RMSE of 2.40 µg m⁻³ complements these results. The density plot in Fig. 5a shows a peak around the central values, indicating a strong alignment between predicted and actual PM_2.5 levels. Upon removing outliers (PM_2.5 > 40 µg m⁻³), the slope moves slightly closer to 1, suggesting a marginally stronger correlation at lower concentration levels. Notably, the intercept is close to the stated sensitivity threshold of the OPC-N3 sensors (< 1 µg m⁻³), indicating that the model's predictions align with the lower detection limit of the instruments. This calibration approach represents the best achievable performance within the entire setup and serves as the benchmark against which we measure the effectiveness of the other calibration approaches.

3.2 Calibration Approach 2: Sensor Transferability Evaluation

In the Sensor Transferability Evaluation, our focus shifted to the model’s ability to generalize across different sensors. Specifically, the Random Forest model was calibrated on one mobile OPC-N3 unit and subsequently evaluated on an independent mobile OPC-N3 unit. When employing a single-sensor train-and-test configuration, the model demonstrated good generalization capabilities. The R² score of 0.65 signifies a moderate yet substantial level of explained variance, while the MAE of 2.76 µg m⁻³ and a positive (overestimating) mean bias error of just 0.43 µg m⁻³ reflect the model’s accuracy in predicting PM_2.5 levels across different sensors. Additionally, the KDP (Fig. 6a) indicates a notable pattern, with predicted PM_2.5 levels exhibiting a density peak at 0.14, while the actual PM_2.5 levels have a density peak at 0.7, suggesting a systematic tendency of the model to slightly underestimate PM_2.5 levels in this scenario. However, it is crucial to note that not all sensor calibration/validation pairs perform equally under this approach. The variability in sensor pair performance reveals how challenging it is for the model to adapt to specific OPC units. The performance reported above is the best among all pairs; swapping the roles of this same calibration/validation pair results in a significant decrease in model performance (MBE = -2.64 µg m⁻³, MAE = 4.45 µg m⁻³ and R² = 0.25). When data from multiple sensors are combined, performance is also lower than in the best single-sensor configuration (MBE = 2.47 µg m⁻³, MAE = 4.05 µg m⁻³, and R² = 0.21). These findings underscore the importance of considering sensor-specific behaviours and configurations.
The model’s adaptability across different sensors is evident, but the intricacies of individual sensor performance necessitate careful consideration in the interpretation of results. Factors such as differing time-of-mapping among sensors and variations in traffic patterns or PM_2.5 concentrations at specific locations contribute to the observed variability in model performance. It should be noted that the results presented throughout the paper are based on a 10-second temporal resolution. However, it is interesting that after averaging the dataset into 5–15 minute intervals, the R² score showed improvement up to 0.4–0.5. This suggests a potential temporal sensitivity and highlights the trade-off between temporal resolution and model performance, given the corresponding reduction in data points.
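The temporal averaging mentioned above can be sketched as a simple block average. This is an illustration only: the timestamps are assumed to be seconds since an arbitrary start, the 300 s window corresponds to the 5-minute case, and the sample series is synthetic.

```python
from collections import defaultdict

def block_average(samples, window_s=300):
    """Average (timestamp_s, value) pairs within consecutive windows of window_s seconds."""
    sums = defaultdict(lambda: [0.0, 0])
    for t, value in samples:
        window_start = (t // window_s) * window_s
        sums[window_start][0] += value
        sums[window_start][1] += 1
    return [(start, total / count) for start, (total, count) in sorted(sums.items())]

# Ten minutes of synthetic 10-second data: 0.0 in the first window, 1.0 in the second.
samples = [(t, 1.0 if t >= 300 else 0.0) for t in range(0, 600, 10)]
averaged = block_average(samples)
```

Observed and predicted series would both need to be averaged over the same windows before recomputing R²; sixty 10-second values collapse to two points here, which illustrates the reduction in data points noted in the text.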

Calibration Approach 3: Road Transferability Evaluation

For the Road Transferability Evaluation, the model was calibrated on one road and evaluated on a different one. The results indicated a satisfactory ability to generalize across different road stretches, with an R² value of 0.71 and an MAE of 2.46 µg m⁻³ (Fig. 7; trained on Bristol Road and tested on Raddlebarn Road). Notably, the model’s consistent performance across different roads within the area suggests a capacity to adapt to varying environmental conditions and road characteristics. This observed generalization may be attributed to the similarity in atmospheric and pollution background conditions across the roads, showcasing the model’s adaptability to diverse but locally consistent factors. For instance, using the road ranking described in Section 2.5.1, when the model was calibrated on a secondary road (Raddlebarn Road) and tested on a tertiary road (Bournbrook Road), the results were comparably successful (R² = 0.66, MBE = 0.21 µg m⁻³ and MAE = 2.20 µg m⁻³), as they were for other combinations such as training and testing on tertiary and primary roads. This versatility is particularly valuable in urban environments where road types can change rapidly, and atmospheric conditions may exhibit subtle variations.

Table 2 – Performance metrics under various calibration configurations
Calibration approach	R²	MAE (µg m⁻³)	MBE (µg m⁻³)	RMSE (µg m⁻³)	ρ
Standard Random Forest	0.85	1.60	-0.01	2.40	0.92
Sensor Transferability Evaluation	0.65	2.76	0.43	4.11	0.83
Road Transferability Evaluation	0.71	2.46	-1.14	3.22	0.81

3.3 Variable importance (VIMP) of the input features

Figure 8 shows the relative VIMP from the standard RF regression model for PM_2.5 predictions. It highlights the high importance assigned to the 'Mobile Sensors' variable, which is a categorical variable representing sensor ID. The model attributes considerable weight to the sensor ID, indicating that specific sensors consistently capture important patterns or localized pollution sources. However, it is important to acknowledge the potential introduction of sampling biases due to the choice of routes taken by individuals carrying the sensors. Routes through busy intersections or areas with construction activities could result in higher variability in the measured PM_2.5 concentrations, thus influencing the perceived importance of certain sensors. Figure 8 also underscores the significance of static sensors such as 'Static Sensor 2', 'Static Sensor 1', and so forth. Unlike mobile sensors, static sensors offer continuous and consistent measurements at specific locations. The importance of these static sensors may be attributed to their strategic placement in areas that serve as key indicators of the overall air quality. 'Static Sensor 2', for instance, was in a place with high vehicular traffic, while 'Static Sensor 1' was positioned near an urban background location. Despite its proximity to the primary road, 'Static Sensor 4' exhibited lower importance in the model, potentially due to its position in a densely populated area with local emission sources such as nearby vehicular traffic and a pub, creating micro-environmental conditions that correlate poorly with the PM_2.5 levels expected from the regression model. The high PM_2.5 concentrations measured by 'Static Sensor 4' suggest that the local emission sources in that specific area introduce complexities that are not fully captured by the model's inputs.
Consequently, the model may have difficulty accurately associating the PM_2.5 measurements from 'Static Sensor 4' with the predictor variables, leading to its lower importance in the overall model. The model likely identifies the static sensors as indicators of the average pollution conditions in the area, and the mobile units as effective indicators of localized pollution sources and patterns. Moreover, the consistent nature of data collection from static sensors provides a stable reference for understanding baseline pollution levels in specific regions. PM_2.5 measurements from BAQS also emerged as a crucial predictor variable across all calibration models. Situated within the university campus at an urban background site less affected by roadside emission peaks, BAQS is expected to predominantly reflect the regional background.

The model (Standard RF regression) highlights the pivotal role of traffic-related variables in predicting PM_2.5 concentrations, with buses emerging as the most influential factor. This underscores their substantial contribution to urban air pollution, given their diesel-fuelled engine emissions and consistent routes (see Fig. S6). Notably, the presence of seven bus stops along the primary road further amplifies the impact of bus-related traffic on the local air quality. 'Vehicle Speed' has significant importance, emphasizing the influence of traffic flow dynamics on air quality: higher speeds contribute to increased pollutant levels. Additionally, the variable 'Total Vehicles', which includes all the vehicles in the area, is an important predictor, suggesting that areas with a higher traffic volume exhibit larger PM_2.5 concentrations. Meteorological variables also contributed to the prediction of PM_2.5 concentrations, confirming the role of meteorology in atmospheric dynamics and in the variability of particulate matter. While relative humidity can influence the physical properties and measurement of particulate matter through aerosol hygroscopicity, this effect was accounted for during the calibration of the PM_2.5 data used in the model. Atmospheric pressure shapes general atmospheric dispersion patterns, impacting the vertical mixing of pollutants, and is reflected in the VIMP rankings. Temperature and relative humidity, acting as proxies for time of day and boundary layer height, influence chemical reactions, particle volatility, and pollution dispersion dynamics. Meanwhile, wind direction and speed are critical in determining pollutant transport pathways, revealing potential sources of PM_2.5. The population density exhibited a relatively small contribution as a predictor variable, despite its typical association with anthropogenic emission sources impacting particulate matter levels.
This low importance of population could potentially be due to the model capturing the impacts of population-related emissions through other more direct predictor variables, such as mobile/static sensor data, traffic counts, etc.
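As an illustration of how a variable importance ranking of this kind can be obtained, the following minimal sketch computes permutation importance, one common VIMP measure; this section does not state which measure was used in the study, so the choice is an assumption. The sketch is model-agnostic: `predict` stands in for any fitted model, and the names `permutation_importance`, `toy_predict`, `buses` and `population` are hypothetical. The toy model and data are invented purely for illustration and are not the study's model or measurements.

```python
import random
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(
    predict: Callable[[Row], float],
    rows: List[Row],
    y: Sequence[float],
    features: Sequence[str],
    n_repeats: int = 10,
    seed: int = 0,
) -> Dict[str, float]:
    """Average increase in MAE when one feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, [predict(r) for r in rows])
    scores: Dict[str, float] = {}
    for name in features:
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column, breaking its link with the target.
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
            mae = mean_absolute_error(y, [predict(r) for r in permuted])
            increases.append(mae - baseline)
        scores[name] = sum(increases) / n_repeats
    return scores


def toy_predict(row: Row) -> float:
    """Hypothetical stand-in for a fitted model: ignores 'population'."""
    return 5.0 + 0.5 * row["buses"]


# Synthetic rows, invented for illustration only.
rows = [{"buses": float(i % 7), "population": float(i)} for i in range(50)]
y = [toy_predict(r) for r in rows]
scores = permutation_importance(toy_predict, rows, y, ["buses", "population"])
print(scores)
```

A feature that the model does not use, or whose information is already carried by other predictors, shows little or no increase in error when shuffled, which is one way a variable such as population density can receive a low ranking.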

Figure 9 displays the Partial Dependence Plots (PDPs) for the Standard RF regression of PM_2.5 with important variables from the input dataset. PDPs illustrate the relationship between a specific feature and the model's predictions for the target variable⁴⁶,⁴⁷. These plots depict how changes in each feature influence the average predicted PM_2.5 while holding other variables constant, providing insight into the impact of individual factors on the model's outcomes. Among the PDPs generated, the plot for the 'Mobile Sensors' variable reveals significant fluctuations in PM_2.5 predictions, indicating that different sensor IDs affect the model's predictions differently. This suggests that certain sensors consistently capture crucial patterns or localized pollution sources, as evidenced by the discernible fluctuations in the plot. However, it is important to note that OPC 4 consistently records high PM levels, while OPC 1 and OPC 2 exhibit similar patterns, and OPC 3 falls in between. This observation raises questions about the transferability of sensor data from OPC 1 or OPC 2 to OPC 4, suggesting potential limitations in predictive capability across different sampling conditions and in the model's adaptability to specific OPC units. Nonetheless, the unexpectedly high importance attributed to sensor ID underscores the significance of the location of individual sensors in predicting PM_2.5 levels. A consistent pattern was seen among the static sensors, where changes in the static sensor data led to uniform shifts in the model's predicted PM_2.5 levels. This suggests that the static sensors behave similarly in capturing PM_2.5 levels, despite potential differences in their exact locations or environmental conditions.
From the PDP of 'Total Vehicles', we observe the largest contribution to PM_2.5 at the highest traffic volumes, the lowest PM_2.5 at low-to-moderate traffic volumes (300–400 vehicles hour⁻¹), and an increase in PM_2.5 at the lowest traffic volumes. This is consistent with the relation between vehicular emissions and vehicle speed: during traffic peaks, PM_2.5 is highest because of queues and jams, whereas lower PM_2.5 is associated with moderate traffic flow, with vehicles moving at the optimal cruise speed in terms of fuel consumption and emissions. Finally, the increase in PM_2.5 associated with low traffic volume is due to the higher vehicle speeds in these traffic conditions, which are also associated with higher emission rates. The pattern seen in the PDP of wind speed is typical and consistent with a study conducted in London⁴⁸: the high concentrations detected at both low and high wind speeds are indicative of resuspension phenomena, whereas the decrease in concentrations at moderate wind speeds suggests effective dilution. Across the meteorological variables, the differences in predicted concentrations appear relatively small. Although these differences are as modest as ±0.4 µg m⁻³ for certain variables, they should be interpreted relative to the range of our data, where seemingly small numerical changes may still have practical implications. Such subtle changes may therefore remain relevant to air quality assessments, health implications, and other applications.
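To make the construction of a PDP concrete, the following minimal sketch computes a partial dependence curve for an arbitrary fitted model: for each grid value, the feature of interest is set to that value in every row, and the predictions are averaged over the rows. The `predict` callable stands in for the fitted RF, and the names `partial_dependence`, `toy_predict`, `total_vehicles` and `wind_speed` are hypothetical. The toy model and data are invented for illustration only and do not reproduce the curves in Figure 9.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]


def partial_dependence(
    predict: Callable[[Row], float],
    rows: List[Row],
    feature: str,
    grid: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return (grid value, average prediction) pairs for one feature."""
    curve = []
    for value in grid:
        # Overwrite the feature in every row, keep all other columns as observed.
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve


def toy_predict(row: Row) -> float:
    """Hypothetical stand-in for a fitted model (additive, for simplicity)."""
    return 2.0 * row["total_vehicles"] + row["wind_speed"]


# Synthetic rows, invented for illustration only.
rows = [
    {"total_vehicles": 1.0, "wind_speed": 1.0},
    {"total_vehicles": 2.0, "wind_speed": 2.0},
    {"total_vehicles": 3.0, "wind_speed": 3.0},
]
curve = partial_dependence(toy_predict, rows, "total_vehicles", [0.0, 1.0, 2.0])
for value, avg_pred in curve:
    print(value, avg_pred)
```

Because the average is taken over the observed values of all other predictors, the vertical range of a PDP reflects only the marginal effect of the chosen feature, which is why individual curves can span a much narrower range than the measured concentrations.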

The study has several limitations, including the relatively short duration of the campaign (one month) and the relatively small study area in Selly Oak, Birmingham. The short duration and limited geographical scope may limit the capture of the full range of seasonal variations and diverse atmospheric conditions that can impact PM_2.5 levels. To enhance the generalizability of the model, future studies should incorporate data collected over more extended periods and from diverse locations, spanning different seasons and weather conditions. Additionally, the citizen science approach employed in data collection, while fostering community engagement, introduces certain challenges. The variability in data collection frequency and routes taken by individuals carrying mobile sensors may introduce biases. Moreover, participant behaviours such as smoking can also introduce biases that affect the measurements. Future research should explore strategies to standardize data collection procedures and address potential biases associated with the citizen science aspect.

This research addresses a critical gap in air pollution monitoring by introducing a novel spatiotemporal prediction approach to fill missing data in low-cost particulate matter sensors, specifically focusing on PM_2.5 concentrations. Our primary objective was to predict PM_2.5 concentrations at fine spatial and temporal scales, even in areas lacking direct observations. One key aspect of our contribution lies in the ability of our methodology to predict missing PM_2.5 measurements even when mobile sensors were not actively collecting data or, in other words, no one was walking around with a sensor in the area. This feature is crucial for extending the applicability of the model to scenarios where continuous data collection may not be feasible or practical. The robust performance of the RF model, demonstrated through calibration and evaluation processes, emphasizes its reliability in estimating PM_2.5 concentrations beyond active sensor deployment.

The hybrid dataset integrated meteorological parameters, aerosol properties, and spatially resolved proxy variables, contributing to the model's accuracy. Our approach not only demonstrated the ability to predict PM_2.5 levels but also showcased the adaptability of the model across different sensors and road types. The importance of the study lies in its potential to significantly enhance spatial resolution beyond regulatory monitoring infrastructure, providing refined air quality predictions and improving exposure assessments. The findings underscore the performance of machine learning-based approaches, particularly tree-based algorithms, in advancing our understanding of PM_2.5 dynamics and their implications for public health.

Our methodology also distinguishes itself through its remarkable computational efficiency and speed. The utilization of low-cost Optical Particle Counters in both static and mobile configurations offers a cost-effective solution for air quality monitoring and significantly reduces the computational burden. In contrast to more complex models, our approach excels in delivering high prediction accuracy while maintaining relatively lower computational costs. Notably, the second and the third ML calibration approaches showcase that relatively few mobile monitors can effectively characterize air pollution levels across an entire area at different times of day when measurements are made. This finding has profound implications for citizen science initiatives, as it suggests that the contributions of a small number of participants can significantly enhance our understanding of local air quality patterns. The efficiency of the RF algorithm ensures swift model training and prediction, making it an ideal choice for real-time air quality assessments. This computational efficiency enhances the speed of our predictions and makes the methodology more accessible for widespread implementation. The developed approach achieves a favourable balance between prediction accuracy and computational performance, rendering it a practical and readily deployable solution for air quality monitoring across diverse urban landscapes. The demonstrated success of the machine learning approach encourages further exploration and application in urban air quality assessments, ultimately contributing to improved public health outcomes.

Competing interests

All authors declare no financial or non-financial competing interests. Author Roy M. Harrison is Co-Editor-in-Chief of npj Climate and Atmospheric Science and was not involved in the journal’s review of, or decisions related to, this manuscript.

Author contributions

A.B. and F.D.P. conceived the idea for this paper. A.B. led the processing of the data and wrote the first draft. F.D.P. and A.B.¹ supervised the project. S.D. and D.B. helped in data collection. D.B. performed the sensor calibrations. A.B.¹ helped with the refinement of the methodology. D.B., S.D., A.B.¹, G.G., O.G., R.M.H. and F.D.P. provided help and feedback on the analysis and edited the final draft.

Acknowledgements

This paper and related research have been conducted during and with the support of the Italian national inter-university PhD course in Sustainable Development and Climate change (link: www.phd-sdc.it). The materials and study were funded by the European Commission as part of the RI-URBANS project (grant no. 1010362450) and the UKRI QUANT project (grant number NE/T001968/1).

Data availability

The datasets used and/or analysed during the current study will be available on the open access repository of the University of Birmingham Institutional Research Archive (UBIRA) (currently being validated).

Code availability

All the relevant code can be obtained upon reasonable request from the corresponding author.

References

European Environment Agency. Europe’s Air Quality Status 2021. (Publications Office, LU, 2022).
Gelencsér, A. et al. Source apportionment of PM2.5 organic aerosol over Europe: Primary/secondary, natural/anthropogenic, and fossil/biogenic origin. J. Geophys. Res. Atmospheres 112, 2006JD008094 (2007).
Pöschl, U. Atmospheric Aerosols: Composition, Transformation, Climate and Health Effects. Angew. Chem. Int. Ed. 44, 7520–7540 (2005).
Murray, C. J. L. et al. Global burden of 87 risk factors in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. The Lancet 396, 1223–1249 (2020).
Liu, C. et al. Ambient Particulate Air Pollution and Daily Mortality in 652 Cities. N. Engl. J. Med. 381, 705–715 (2019).
Pappa, A. & Kioutsioukis, I. Forecasting Particulate Pollution in an Urban Area: From Copernicus to Sub-Km Scale. Atmosphere 12, 881 (2021).
Southerland, V. A. et al. Global urban temporal trends in fine particulate matter (PM2·5) and attributable health burdens: estimates from global datasets. Lancet Planet. Health 6, e139–e146 (2022).
Minguillón, M. C., Querol, X., Baltensperger, U. & Prévôt, A. S. H. Fine and coarse PM composition and sources in rural and urban sites in Switzerland: Local or regional pollution? Sci. Total Environ. 427–428, 191–202 (2012).
Jerrett, M. et al. A review and evaluation of intraurban air pollution exposure models. J. Expo. Sci. Environ. Epidemiol. 15, 185–204 (2005).
Li, H. Z., Dallmann, T. R., Gu, P. & Presto, A. A. Application of mobile sampling to investigate spatial variation in fine particle composition. Atmos. Environ. 142, 71–82 (2016).
Baruah, A., Zivan, O., Bigi, A. & Ghermandi, G. Evaluation of low-cost gas sensors to quantify intra-urban variability of atmospheric pollutants. Environ. Sci. Atmospheres 3, 830–841 (2023).
Di, Q. et al. Air Pollution and Mortality in the Medicare Population. N. Engl. J. Med. 376, 2513–2522 (2017).
Crilley, L. R. et al. Evaluation of a low-cost optical particle counter (Alphasense OPC-N2) for ambient air monitoring. Atmospheric Meas. Tech. 11, 709–720 (2018).
Zimmerman, N. et al. A machine learning calibration model using random forests to improve sensor performance for lower-cost air quality monitoring. Atmospheric Meas. Tech. 11, 291–313 (2018).
Bigi, A., Mueller, M., Grange, S. K., Ghermandi, G. & Hueglin, C. Performance of NO, NO2 low cost sensors and three calibration approaches within a real world application. Atmospheric Meas. Tech. 11, 3717–3735 (2018).
Morawska, L. et al. Applications of low-cost sensing technologies for air quality monitoring and exposure assessment: How far have they gone? Environ. Int. 116, 286–299 (2018).
Fattoruso, G. et al. Site Suitability Analysis for Low Cost Sensor Networks for Urban Spatially Dense Air Pollution Monitoring. Atmosphere 11, 1215 (2020).
Zheng, T. et al. Field evaluation of low-cost particulate matter sensors in high- and low-concentration environments. Atmospheric Meas. Tech. 11, 4823–4846 (2018).
Badura, M., Batog, P., Drzeniecka-Osiadacz, A. & Modzel, P. Evaluation of Low-Cost Sensors for Ambient PM2.5 Monitoring. J. Sens. 2018, 1–16 (2018).
Castell, N. et al. Can commercial low-cost sensor platforms contribute to air quality monitoring and exposure estimates? Environ. Int. 99, 293–302 (2017).
Sayahi, T., Butterfield, A. & Kelly, K. E. Long-term field evaluation of the Plantower PMS low-cost particulate matter sensors. Environ. Pollut. 245, 932–940 (2019).
Hassani, A., Castell, N., Watne, Å. K. & Schneider, P. Citizen-operated mobile low-cost sensors for urban PM2.5 monitoring: field calibration, uncertainty estimation, and application. Sustain. Cities Soc. 95, 104607 (2023).
deSouza, P. et al. Air quality monitoring using mobile low-cost sensors mounted on trash-trucks: Methods development and lessons learned. Sustain. Cities Soc. 60, 102239 (2020).
Mueller, M. D., Hasenfratz, D., Saukh, O., Fierz, M. & Hueglin, C. Statistical modelling of particle number concentration in Zurich at high spatio-temporal resolution utilizing data from a mobile sensor network. Atmos. Environ. 126, 171–181 (2016).
Singh, A. et al. Air quality assessment in three East African cities using calibrated low-cost sensors with a focus on road-based hotspots. Environ. Res. Commun. 3, 075007 (2021).
Prank, M. et al. Evaluation of the performance of four chemical transport models in predicting the aerosol chemical composition in Europe in 2005. Atmospheric Chem. Phys. 16, 6041–6070 (2016).
Guo, L., Chen, B., Zhang, H. & Zhang, Y. A new approach combining a simplified FLEXPART model and a Bayesian-RAT method for forecasting PM10 and PM2.5. Environ. Sci. Pollut. Res. 27, 2165–2183 (2020).
Ma, J., Yu, Z., Qu, Y., Xu, J. & Cao, Y. Application of the XGBoost Machine Learning Method in PM2.5 Prediction: A Case Study of Shanghai. Aerosol Air Qual. Res. 20, 128–138 (2020).
Taheri Shahraiyni, H. & Sodoudi, S. Statistical Modeling Approaches for PM10 Prediction in Urban Areas; A Review of 21st-Century Studies. Atmosphere 7, 15 (2016).
Bai, L., Liu, Z. & Wang, J. Novel hybrid extreme learning machine and multi-objective optimization algorithm for air pollution prediction. Appl. Math. Model. 106, 177–198 (2022).
Yang, G., Lee, H. & Lee, G. A Hybrid Deep Learning Model to Forecast Particulate Matter Concentration Levels in Seoul, South Korea. Atmosphere 11, 348 (2020).
Kim, B.-Y., Cha, J. W., Chang, K.-H. & Lee, C. Estimation of the Visibility in Seoul, South Korea, Based on Particulate Matter and Weather Data, Using Machine-learning Algorithm. Aerosol Air Qual. Res. 22, 220125 (2022).
Tella, A. & Balogun, A.-L. GIS-based air quality modelling: spatial prediction of PM10 for Selangor State, Malaysia using machine learning algorithms. Environ. Sci. Pollut. Res. 29, 86109–86125 (2022).
Kim, B.-Y., Lim, Y.-K. & Cha, J. W. Short-term prediction of particulate matter (PM10 and PM2.5) in Seoul, South Korea using tree-based machine learning algorithms. Atmospheric Pollut. Res. 13, 101547 (2022).
Barthwal, A., Acharya, D. & Lohani, D. Prediction and analysis of particulate matter (PM2.5 and PM10) concentrations using machine learning techniques. J. Ambient Intell. Humaniz. Comput. 14, 1323–1338 (2023).
Alam, M. S. et al. Diurnal variability of polycyclic aromatic compound (PAC) concentrations: Relationship with meteorological conditions and inferred sources. Atmos. Environ. 122, 427–438 (2015).
Crilley, L. R. et al. Effect of aerosol composition on the performance of low-cost optical particle counter correction factors. Atmospheric Meas. Tech. 13, 1181–1193 (2020).
Khreis, H., Johnson, J., Jack, K., Dadashova, B. & Park, E. S. Evaluating the Performance of Low-Cost Air Quality Monitors in Dallas, Texas. Int. J. Environ. Res. Public. Health 19, 1647 (2022).
Ghaffarpasand, O. & Pope, F. D. Telematics data for geospatial and temporal mapping of urban mobility: Fuel consumption, and air pollutant and climate-forcing emissions of passenger cars. Sci. Total Environ. 894, 164940 (2023).
Ghaffarpasand, O. & Pope, F. D. Telematics data for geospatial and temporal mapping of urban mobility: New insights into travel characteristics and vehicle specific power. J. Transp. Geogr. 115, 103815 (2024).
Breiman, L. Random Forests. Mach. Learn. 45, 5–32 (2001).
Louppe, G., Wehenkel, L., Sutera, A. & Geurts, P. Understanding variable importances in forests of randomized trees. in Advances in Neural Information Processing Systems (eds. Burges, C. J., Bottou, L., Welling, M., Ghahramani, Z. & Weinberger, K. Q.) vol. 26 (Curran Associates, Inc., 2013).
Ishwaran, H. Variable importance in binary regression trees and forests. Electron. J. Stat. 1, (2007).
Genuer, R., Poggi, J.-M. & Tuleau-Malot, C. Variable selection using random forests. Pattern Recognit. Lett. 31, 2225–2236 (2010).
Williamson, B. D., Gilbert, P. B., Carone, M. & Simon, N. Nonparametric variable importance assessment using machine learning techniques. Biometrics 77, 9–22 (2021).
Greenwell, B. M., Boehmke, B. C. & McCarthy, A. J. A Simple and Effective Model-Based Variable Importance Measure. (2018) doi:10.48550/ARXIV.1805.04755.
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, (2001).
Jones, A. M., Harrison, R. M. & Baker, J. The wind speed dependence of the concentrations of airborne particulate matter and NOx. Atmos. Environ. 44, 1682–1690 (2010).


SupplementaryInformation.docx


Editorial decision: revise
11 Sep, 2024
Review #1 received at journal
08 Sep, 2024
Review #2 received at journal
06 Aug, 2024
Reviewer #3 agreed at journal
02 Aug, 2024
Reviewer #2 agreed at journal
29 Jul, 2024
Reviewer #1 agreed at journal
08 Jul, 2024
Reviewers invited by journal
08 Jul, 2024
Editor assigned by journal
04 Jul, 2024
Submission checks completed at journal
04 Jul, 2024
First submitted to journal
01 Jul, 2024
