Particulate matter (PM) air pollution has a considerable negative influence on human health, especially on the cardiovascular and pulmonary systems. According to the European Environment Agency (EEA, 2023)1, 97% of the urban population in Europe is exposed to concentrations of fine particulate matter (PM2.5, i.e. PM with a diameter of 2.5 µm or smaller) above the World Health Organization's (WHO) 2021 guideline level of 5 µg m⁻³ for the annual average. Within the EU, air pollution is estimated to have led to 238 000 premature deaths in 2021 and is the largest environmental health risk in Europe1. PM2.5 is a critical air pollutant: primary PM2.5 originates mainly from combustion processes, whereas secondary PM2.5 forms through reactions of organic or inorganic gaseous compounds and can contribute up to more than 50% of PM2.5 mass, depending on the season and the location2. PM10 (PM with a diameter of 10 µm or smaller) is also a critical air pollutant, with coarse particles, i.e. those between 2.5 µm and 10 µm, resulting mainly from mechanical processes3.
PM2.5 has become the leading environmental contributor to the global burden of disease, a marked rise from 1990, when it ranked fifth among environmental risk factors4. Studies have shown that spending a substantial amount of time even in areas with low ambient PM2.5 levels can have adverse effects on human health5. The health impact of air pollution is critical in urban areas, where most of the world population resides; rapid reduction strategies are therefore required. For these strategies to be successful they need to be targeted, and hence an accurate description of the spatio-temporal variability of PM is required7. Urban areas exhibit high heterogeneity in PM concentrations owing to the diversity of emission sources, the variability in land use patterns, and the interaction between meteorological factors and the urban canopy, all of which influence the dispersion of air pollutants8. This spatial and temporal variability poses challenges for exposure assessment and air quality management9.
Regulatory monitoring networks, such as the UK's Automatic Urban and Rural Network (AURN), serve as the main UK infrastructure for ensuring compliance with ambient air pollution standards. Nevertheless, the acquisition and maintenance costs of regulatory-grade instruments are high, and the sparsely distributed station network fails to capture the small-scale spatial variations observed in pollutant concentrations in urban areas, as indicated by numerous studies10,11. These localized variations contribute to differences in human pollutant exposures, ultimately influencing associated health impacts12.
To detect and quantify the fine-scale spatial fluctuations in pollutant concentrations, there is growing interest in using low-cost sensor (LCS) networks. This interest is attributed to the improved capabilities of sensor technologies and the development of innovative methods for sensor calibration13,14,15. However, challenges remain in optimizing sensor placement strategies16,17, in data quality assurance (owing, for example, to LCS drift or sensitivity to meteorological variables18,19) and in interpreting LCS data in the context of regulatory air quality standards20. To accurately estimate population exposure, monitoring at a high spatial and temporal resolution should be pursued. Mobile low-cost sensors provide a cost-effective solution for monitoring air quality in areas with limited existing infrastructure, owing to their compact size and portability. Examples include PM2.5 measurements from citizen-operated mobile sensors mounted on bikes22, sensors deployed on routine vehicle fleets such as trash trucks23, a tram-based mobile sensor network in Zurich24 and sensors carried by taxi motorcycles in Kampala25. However, sampling every location continuously throughout a given geographic area is an unattainable goal, so measurements need to be complemented by predictive models.
A diverse array of models is used to predict PM levels. Some are based on atmospheric processes and emissions, e.g. Chemical Transport Models (CTMs) or Lagrangian particle dispersion models (LPDMs). These models play a crucial role in simulating and understanding the intricate dynamics of air pollutants, incorporating factors such as atmospheric chemistry, emission sources, and dispersion patterns. For instance, Sokhi et al.26 evaluated four regional chemistry transport models with a horizontal resolution of approximately 20 km and found that they systematically underestimated PM10 and PM2.5 concentrations in Europe by 10–60%, depending on the model and season, when compared with the European Monitoring and Evaluation Programme (EMEP) measurements. Zhang et al.27 employed a simplified LPDM with Bayesian-RAT (multiplicative ratio correction optimization) to enhance regional PM concentration predictions and reported higher accuracy than other models (WRF-Chem and CAMx), suggesting an advantage of the LPDM for forecasting PM and potentially other pollutants. However, both CTMs and LPDMs may struggle to predict PM2.5 concentrations accurately owing to small datasets, low predictive performance over small areas, high computational cost and the difficulty of achieving sufficient spatial and temporal resolution28. Other prediction approaches include statistical methods based on meteorological variables and emission proxies29.
Data-driven models, in contrast to physically driven models, have garnered significant attention owing to their ease of implementation30. Machine learning (ML) models have been shown to be highly effective for PM prediction, handling non-linear relationships between variables and allowing flexible modelling31. Supervised learning approaches include tree-based algorithms (random forest, extreme gradient boosting, light gradient boosting, etc.) and vector-based algorithms (k-nearest neighbour, support vector regression, etc.), which learn from labelled data as classifiers or regressors32. However, classifier methods have proved less suitable for PM prediction than regressor methods and, in general, vector-based algorithms have exhibited lower predictive power than their tree-based counterparts33. Hence, tree-based machine learning algorithms, known for their low computational cost and high prediction accuracy, are extensively employed in PM prediction research34,35.
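To make the distinction between the two algorithm families concrete, the following is a minimal sketch, in plain Python, of one regressor from each: a greedy regression tree and a k-nearest-neighbour regressor. It is an illustration only and not the models used in this study; the function names, the two features (wind speed and hour of day) and the synthetic PM2.5 values are all hypothetical.

```python
def knn_predict(train_X, train_y, x, k=3):
    """k-nearest-neighbour regression: average the targets of the k
    training points closest to x (squared Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(X, y, depth=0, max_depth=3, min_leaf=2):
    """Fit a regression tree by greedily choosing the feature/threshold
    split that minimises the summed squared error of the two children."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return {"leaf": sum(y) / len(y)}
    best = None
    for f in range(len(X[0])):
        for thr in sorted(set(row[f] for row in X)):
            left = [i for i, row in enumerate(X) if row[f] <= thr]
            right = [i for i, row in enumerate(X) if row[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, thr, left, right)
    if best is None:  # no admissible split: stop and predict the mean
        return {"leaf": sum(y) / len(y)}
    _, f, thr, left, right = best
    return {
        "feature": f,
        "threshold": thr,
        "left": fit_tree([X[i] for i in left], [y[i] for i in left],
                         depth + 1, max_depth, min_leaf),
        "right": fit_tree([X[i] for i in right], [y[i] for i in right],
                          depth + 1, max_depth, min_leaf),
    }


def tree_predict(node, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


# Hypothetical synthetic data: features are [wind speed (m/s), hour of day];
# the assumed PM2.5 target is high in calm conditions and low otherwise.
X = [[float(w), float(h)] for w in (1, 2, 4, 5) for h in (0, 6, 12, 18)]
y = [20.0 if w < 3 else 8.0 for w, _ in X]

tree = fit_tree(X, y)
tree_low_wind = tree_predict(tree, [1.5, 9.0])
tree_high_wind = tree_predict(tree, [4.5, 9.0])
knn_estimate = knn_predict(X, y, [1.0, 0.0], k=1)
```

The tree partitions the feature space with axis-aligned thresholds, so it is unaffected by the different scales of the two features, whereas the distance-based neighbour search depends on how the features are scaled. This toy example is not intended to reproduce the performance differences reported in the cited studies.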
The existing literature reveals a notable research gap: current air pollution prediction models are limited in their ability to resolve fine-scale spatial and temporal variations. Addressing this gap calls for innovative and cost-effective solutions that enhance spatial resolution, especially in densely populated urban settings. Our study therefore proposes a novel methodology that leverages ML tools, particularly tree-based models, to predict PM2.5 levels at fine spatial and temporal resolution.
While our research maintains a broad scope, we tested the methodology comprehensively using a measurement campaign in Selly Oak, Birmingham, United Kingdom, where we deployed a combination of static and mobile Optical Particle Counters. Our primary objective is to build predictive models, based on tree-based ML algorithms, that accurately estimate PM levels. To achieve this, we use a hybrid dataset curated to integrate information from static and mobile low-cost sensors together with diverse ancillary datasets. Our focus extends beyond scenarios with active mobile sensor deployment: we aim to create models that can reliably predict PM concentrations even when the mobile sensors are not in operation.
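As an illustration of what integrating static and mobile measurements into one dataset can involve, the sketch below pairs each mobile reading with the static reading nearest in time and discards pairs that are too far apart. It is a simplified, assumed procedure rather than the pipeline used in this study: the function names, the 5-minute tolerance, the record layout, the coordinates and all values are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_static_reading(static_times, static_values, t, max_gap):
    """Return the static value closest in time to t, or None if the
    closest reading is more than max_gap away. static_times is sorted."""
    i = bisect_left(static_times, t)
    candidates = []
    if i < len(static_times):
        candidates.append(i)      # first reading at or after t
    if i > 0:
        candidates.append(i - 1)  # last reading before t
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(static_times[j] - t))
    if abs(static_times[best] - t) > max_gap:
        return None
    return static_values[best]


def build_hybrid_rows(static, mobile, max_gap=timedelta(minutes=5)):
    """Combine the two sources into one table of rows.

    static: list of (time, pm25) tuples sorted by time
    mobile: list of (time, lat, lon, pm25) tuples
    """
    static_times = [t for t, _ in static]
    static_values = [v for _, v in static]
    rows = []
    for t, lat, lon, pm25 in mobile:
        ref = nearest_static_reading(static_times, static_values, t, max_gap)
        if ref is None:
            continue  # no static reading close enough in time
        rows.append({
            "time": t,
            "lat": lat,
            "lon": lon,
            "hour": t.hour,          # simple temporal feature
            "static_pm25": ref,      # predictor from the static sensor
            "mobile_pm25": pm25,     # target measured by the mobile sensor
        })
    return rows


# Hypothetical readings (values and coordinates are made up).
t0 = datetime(2023, 1, 1, 8, 0)
static = [
    (t0, 10.0),
    (t0 + timedelta(minutes=10), 12.0),
    (t0 + timedelta(minutes=20), 11.0),
]
mobile = [
    (t0 + timedelta(minutes=2), 52.440, -1.940, 14.0),
    (t0 + timedelta(minutes=9), 52.441, -1.941, 15.5),
    (t0 + timedelta(minutes=40), 52.442, -1.942, 9.0),  # no static match
]

rows = build_hybrid_rows(static, mobile)
```

Each resulting row carries the location and time of the mobile reading, the co-occurring static concentration as a predictor and the mobile concentration as the target; ancillary variables such as meteorology could be joined on time in the same way.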