A pruned feed-forward neural network (pruned-FNN) approach to measure air pollution exposure

doi:10.21203/rs.3.rs-2322627/v1


Research Article



This work is licensed under a CC BY 4.0 License

Journal Publication

published 11 Sep, 2023

Read the published version in Environmental Monitoring and Assessment →

You are reading this latest preprint version

Environmental epidemiology studies require accurate estimation of exposure intensities to air pollution. The process from air pollutant emission to individual exposure is however complex and nonlinear, which poses significant modeling challenges. This study aims to develop an exposure assessment model that can strike a balance between accuracy, complexity, and usability. In this regard, neural networks offer one possible approach. This study employed a custom-designed pruned feed-forward neural network (pruned-FNN) approach to calculate the air pollution exposure index based on emission time and rates, terrain factors, meteorological conditions, and proximity measurements. The model performance was evaluated by cross validating the estimated exposure indexes with ground-based monitoring records. The pruned-FNN can predict pollution exposure indexes (PEIs) that are highly and stably correlated with the monitored air pollutant concentrations (Spearman rank correlation coefficients for 10-fold cross validation (mean ± standard deviation: 0.906 ± 0.028), for random cross validation (0.913 ± 0.024)). The predicted values are also close to the ground truth in most cases (95.5% of the predicted PEIs have relative errors smaller than 10%) when the training datasets are sufficiently large and well-covered. The pruned-FNN method can make accurate exposure estimations using a flexible number of variables and less extensive data in a less money/time-consuming manner. Compared to other exposure assessment models, the pruned-FNN is an appropriate and effective approach for exposure assessment that covers a large geographic area over a long period of time.

Keywords: Air pollution; Exposure assessment; Pruned feed-forward neural network (pruned-FNN); Machine learning; GIS; Spatial modeling

The quality of the air is crucial to human health. Air pollution is linked to a range of health conditions, including respiratory diseases, vascular disorders, heart attacks, lung cancer, mental health issues, and even increased mortality rates (Mabahwi et al., 2014; Mannucci et al., 2015; Murray et al., 2001; Sarris et al., 2019), which has raised public concerns. The term air pollution exposure refers to the process by which the human body is exposed to a certain concentration of a pollutant via breathing, skin absorption, diet, etc., over a certain period of time (B. Han et al., 2017). To perform epidemiological analyses on the association between air pollution exposure and health outcomes, we must first accurately estimate an individual's exposure to air pollutants. Although different exposure assessment methods have been developed and applied in the literature (Table 1), all the methods have limitations under certain circumstances, particularly when exposure assessment is conducted over a large geographic area and over a long period of time (Gong et al., 2016; Nieuwenhuijsen et al., 2006; Zou et al., 2009a). The direct measurement of individual exposure is accurate, but data collection is time-consuming and expensive; ambient monitor-based methods rely on dense networks of monitoring stations, which are not always available; air dispersion models incorporate various factors (such as meteorological conditions, air temperature, topography, and road type) in the dispersion equations to provide accurate estimates, but the extensive input data requirements make them difficult to implement on a large scale; proximity models simplify the dispersion process using proximity as a proxy, but exposure misclassifications may occur as a result of oversimplification (de Ferreyro Monticelli et al., 2021; Forastiere & Galassi, 2005; Gong et al., 2016; Zou et al., 2009a). These methods are discussed in more detail in the following section.
Consequently, we need a method that can achieve a balance between accuracy, complexity, and usability, which means that it can make accurate exposure estimations using a flexible number of variables and less extensive data in an efficient manner. Because of the self-learning, fast-converging, and fault-tolerant characteristics of neural networks (Bishop, 2006), they are able to simulate the complex and non-linear relationship from air pollutant emission to individual exposure. This study aims to examine the feasibility of developing an exposure assessment model using the neural network approach.

Particulate matter (PM) is the generic term used to describe a mixture of air pollutants characterized by particles of similar sizes; PM$_{2.5}$ and PM$_{10}$ refer to particles with aerodynamic diameters of less than 2.5 µm and 10 µm, respectively (Pöschl, 2005). The composition of PM varies, as the particles can absorb and transfer a variety of pollutants. Specific PM composites, such as mercury, cadmium, nickel, vanadium, chromium, and manganese, can be transported by air and constitute one category of PM’s major components (Kampa & Castanas, 2008). As by-products of industrial and/or manufacturing processes, those composites in PM are emitted to the atmosphere on a consistent basis in the United States (U.S. EPA, 2020a). A major problem with these PM composites is that they tend to accumulate in the human body and become toxic. As has been demonstrated in numerous studies, exposure to metal composites of air pollution is associated with adverse effects on various systems, including the respiratory system (Tager et al., 2005), the cardiovascular system (Ghio, 2006), and the nervous system (Ratnaike, 2003). While the monitoring network for PM as a whole is relatively dense, the monitoring records for each specific composite within PM are limited. As a result, these specific PM composites have received less attention in environmental health research than criteria air pollutants (CAPs) (Gong et al., 2016). This study therefore focuses on exposure assessment for the composites of PM emitted from industrial facilities.

To address the literature gap regarding air pollution exposure assessments, this study implemented a customized pruned feed-forward neural network (pruned-FNN) approach to calculate the air pollution exposure index based on emission time and rates, terrain factors, meteorological conditions, and proximity measurements. An evaluation of the model performance was conducted by cross validating the estimated exposure indexes with monitoring data of PM composites in New Mexico during 2007–2017. A significant advantage of the model is that it balances accuracy, complexity, and usability in assessing exposure to air pollution in a large geographic area over a prolonged period.

Table 1

Comparison of air pollution exposure assessment methods

(Modified from Gong et al. (2016))
Categories	Methods		Examples	Key advantages	Key disadvantages	Appropriate for estimating exposure in large scale with less misclassification
Direct methods	Individual monitoring		• Direct-reading devices • Pumped or passive samplers	• Most accurate estimations • Useful as benchmark	• Expensive • Time-consuming • Labor intensive • Sample selection bias	No
Indirect methods	Ambient monitor-based methods	Direct surrogate models	• Data from the closest monitors as a surrogate for exposure	• Simple • Directly use the monitoring data	• Rely on the availability of monitoring data • Require proper neighborhood size	No
		Interpolation models	• IDW • Spline • Kriging/Cokriging	• Easy to operate • Generate continuous surface which can simulate sources and sinks of air pollution concentration	• Require extensive monitoring datasets • Introduce uncertainties by ignoring spatial variabilities	No
		Land Use Regression (LUR) models	• LUR	• Cover extensive geographic areas efficiently	• Require extensive monitoring datasets • Need model calibrations to apply in another area	No
	Pollution source-based methods	Air dispersion models (Self-prepared)	• AERMOD • CALPUFF • BLP • CALINE3 • ISC3	• Consider various factors and different types of pollution sources • More extensive spatial coverage than ambient monitor-based methods	• Extensive data requirement • Time consuming • Hard to implement in large geographic areas over multiple years	No
		Air dispersion models (Public datasets)	• National-Scale Air Toxics Assessment (NATA) • Risk-Screening Environmental Indicators (RSEI)	• Consider various factors and different types of pollution sources • Ready for further analysis, easy to acquire, avoid long computing time	• Limited temporal coverage • Restricted spatial resolution	No
		Proximity models	• Traditional proximity models (TMPs) • Emission Weighted Proximity Model (EWPM) • Gaussian weighting function-aided proximity model (GWFPM)	• Simple to implement • Low data requirement • Appropriate for exploratory analysis prior to more sophisticated investigations	• Exposure misclassification	No
		Machine Learning methods	• Decision tree • Support vector machine • Gradient boosting • Neural Network	• Data driven, don’t need extensive expert knowledge of the phenomenon • Consider flexible number of predictor variables • Require less extensive input data • Effective • Fault tolerance	• Highly dependent on training data quality • No explicit functions for interpreting the universal dispersion process	Yes

Related Work

The assessment of air pollution exposure is an essential component of any study on the relationship between air pollution and human health. Researchers have developed and applied various types of exposure assessment methods in the literature, as illustrated in Table 1. The two major categories are direct and indirect methods (de Ferreyro Monticelli et al., 2021). The direct method of estimating individual air pollution exposure is to use personal exposure monitors that can be carried around by the user throughout the day. Air pollution sensors are incorporated into these monitors to measure the pollutants that individuals are exposed to every day (Park et al., 2020; Steinle et al., 2013). Devices like these can record an individual's movements and activities in real-time for accurate exposure assessment; however, collecting data for a large number of individuals is both expensive and time-consuming (Forastiere & Galassi, 2005; Khan et al., 2019). An alternative category of method is to use indirect measurements that determine the exposure index via logical or empirical constructs based upon the available input data (Gong et al., 2016).

Indirect methods can be classified into two subcategories: ambient monitor-based methods and pollution source-based methods. Ambient monitor-based methods assess air pollution exposure using data gathered from ambient air monitoring (Bell, 2006). Direct surrogate models, interpolation models, and land use regression (LUR) models are three commonly used ambient monitor-based methods. A direct surrogate model uses data from the closest monitors as a surrogate for exposure at a given location (Gong & Zhan, 2022; Xie et al., 2017). Even though this model is easy to use, it may produce inaccurate results when the monitor network is not dense enough and the nearest monitor is located far away. Based on monitoring data collected at sample locations, interpolation models typically use deterministic and stochastic mathematical techniques to create a continuous surface of air pollution concentrations (Jerrett et al., 2005). Inverse distance weighting (IDW), spline, and cokriging are important interpolation models used in estimating air pollution exposure intensities (Hvidtfeldt et al., 2021; H. C. Li et al., 2017; L. Li et al., 2012). Interpolation models are easy to operate, but they require extensive monitoring datasets and introduce additional uncertainties by ignoring spatial variability (Jerrett et al., 2005; Xie et al., 2017). The Land Use Regression (LUR) model utilizes monitoring data as dependent variables and other data sources such as traffic, land use, and terrain around monitoring stations as independent variables in order to formulate multiple regression equations, which can be used to predict air pollution intensity in unmonitored areas (Luo et al., 2021; Zhang et al., 2021).
LUR is advantageous for its ability to cover extensive geographic areas efficiently; however, it requires a large number of monitoring sites in order to build a model, and the model needs to be re-calibrated before being applied to another area (Morley & Gulliver, 2018).

The pollution source-based methods include air dispersion models and proximity models. Air dispersion models use mathematical formulations (e.g., the Gaussian model) to simulate the dispersion process of a pollutant from a source to a receptor (Gulliver & Briggs, 2011). Examples of the air dispersion models include AERMOD, CALPUFF, BLP, CALINE3, and ISC3 (U.S. EPA, 2021). In the modeling process, air dispersion models can incorporate various factors (e.g., meteorological factors, air temperature, topography, and road type) and different types of pollution sources (point, line, and area sources), which can improve the spatial-temporal coverage of areas with limited monitoring (Kousa et al., 2002; Zou et al., 2009b). Nevertheless, it is often difficult in practice to comply with the extensive data requirements of air dispersion models, especially if the study requires environmental data covering a large geographical area over a long period of time (de Ferreyro Monticelli et al., 2021; Ragettli et al., 2014). Moreover, some pre-calculated exposure datasets using air dispersion models are also available for public use, e.g., the U.S. EPA’s National-Scale Air Toxics Assessment (NATA) datasets (U.S. EPA, 2022a) and Risk-Screening Environmental Indicators (RSEI) datasets (U.S. EPA, 2022b). However, the estimated air pollution exposures in NATA and RSEI are limited in terms of their spatial resolution and coverage of time and pollutants. For instance, the RSEI data are reported as annual averages in predefined 810-meter grid cells in the United States; NATA datasets are only available every third year, and the finest spatial resolution is at the census tract level. The proximity model is another category of pollution source-based method, which is based on the premise that exposure intensities decrease with increasing distance from a pollution source.
In proximity models, the distances between pollution sources and receptors are used as proxy values for estimating exposure (Brender et al., 2014). Traditional proximity models (TMPs) (Zou et al., 2009a), emission weighted proximity model (EWPM) (Gong et al., 2016, 2018b; Zou et al., 2009b), modified EWPM (Gong et al., 2018a), and Gaussian weighting function-aided proximity model (GWFPM) (Zou et al., 2016) are some examples of this category. Although proximity models are simple to implement and good for exploratory analysis, they can also suffer from exposure misclassifications due to the oversimplified assumption about the dispersion process used in the model (Zou et al., 2016).

Because the relationship from air pollutant emission to individual exposure is complex and non-linear in nature, machine learning (ML) methods can be used to model this relationship and predict air pollution exposures without requiring extensive expert knowledge of the phenomenon (Razavi-Termeh et al., 2021; Stingone et al., 2017). ML methods first build a model between predictor variables and target variable(s) based on a training dataset with limited samples, without being explicitly programmed to do so, and then use the model to predict target variable(s) in an independent dataset (Koza et al., 1996). Neural networks are some of the most universal and efficient ML algorithms (Bishop, 2006). A neural network is made up of a series of node layers, including an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to another and has an associated weight and threshold (Adams & Kanaroglou, 2016). Neural networks have great advantages in data regression and classification tasks because of their powerful self-learning properties and fast-converging algorithms (Cao et al., 2018). Their fault tolerance also makes them useful when it is challenging to obtain unbiased data without errors (Bolt, 1993). Neural networks also have some advantages over other ML methods in air pollution exposure assessment. The decision tree algorithm is unstable, in that a small change in the training data can lead to a large change in the predictions; the gradient boosting algorithm is sensitive to noise and tends to overfit easily; the support vector machine is computationally intensive for large and high-dimensional training data and is also sensitive to noise (Bishop, 2006; Xi et al., 2015).
Compared with other exposure assessment methods, neural network models can consider a flexible number of predictor variables beyond proximity alone, require less extensive monitoring samples as input, and are less time- and money-consuming to implement. These features suit air pollution exposure assessment in studies involving large areas and long time periods. Additionally, fully connected neural networks have been used for exposure assessments in many studies (Banerjee et al., 2022; Blanes-Vidal et al., 2017; Ebrahimi Ghadi et al., 2019; Gan et al., 2020). Fully connected neural networks have been shown to be capable of learning (Karnn, 1990); however, they are generally computationally expensive and may produce some unrealistic function relations. The idea behind a pruned neural network is to remove parts of the network that are unrealistic or not useful without affecting the performance of the network (S. Han et al., 2015). A pruned neural network is more efficient to train because the search space for solutions is significantly reduced. For example, if we are aware that certain variables should not interact, we can simply prune the edges between them. In this way, we are instructing the neural network to avoid exploring a portion of the search space. Therefore, a pruned neural network can be trained faster and at lower computational cost, without sacrificing accuracy, and pruning also eliminates less plausible function relations (Bishop, 2006). As a result, this study intends to develop a pruned neural network-based exposure assessment approach and evaluate the model performance accordingly.

The present study incorporated a custom-designed pruned feed-forward neural network (pruned-FNN) to simulate the complex and non-linear relationships from air pollution emission to individual exposure. When predicting exposure indexes using the model, emission time and emission rate of air pollutants, terrain factors, meteorological conditions, and proximity measurements are used as input variables (Hill et al., 2019; Prabhakaran et al., 2020; Ren et al., 2020). Monitoring data are treated as the ground truth to train, calibrate, and cross-validate the pruned-FNN.

Data

We used four datasets covering the state of New Mexico and surrounding areas during 2007–2017: (1) the air emission data of air pollutants compiled by the U.S. EPA Toxic Release Inventory (TRI) program (U.S. EPA, 2022c); (2) the air monitoring data from the U.S. EPA Air Quality System (AQS) DataMart (U.S. EPA, 2020b); (3) the climate data, including wind data, temperature data, and humidity data from the North American Regional Reanalysis (NARR) by NOAA (NARR, 2022); and (4) the terrain data from the United States Geological Survey (USGS, 2019). All variables within the datasets are potentially related to the complex process of air pollution dispersion. Because wind is taken into consideration in the dispersion process, the air emission data, climate data, and terrain data cover not only the main study area (the state of New Mexico) but also its surrounding areas (latitude range: 30.5 to 38.5; longitude range: -110.7 to -101.5).

The U.S. EPA TRI program is a mandatory program created under Section 313 of the Emergency Planning and Community Right-to-Know Act (EPCRA) to facilitate emergency planning and inform the public about releases of toxic chemicals in their communities (U.S. EPA, 2022c). Under the TRI program, U.S. industrial facilities must annually report information about their locations, types of chemicals released, and estimated quantities of chemicals released into the environment. We selected the TRI emission data for New Mexico and the surrounding areas for 2007–2017 from the national TRI database. A total of 139 types of pollutants are released by 369 factories during this time period (Fig. 1). For every pollutant from a given factory, the sum of annual emissions from both the stack and fugitive pathways is used as the total emission amount in the exposure assessment model.

Air monitoring data were obtained from the U.S. EPA AQS DataMart, which is a database that contains every measured value as well as associated aggregate values (8-hour, daily, annual, etc.) collected through the national ambient air monitoring program (U.S. EPA, 2020b). In New Mexico, 63 active monitoring sites were available to record the concentrations of 255 unique chemicals between 2007 and 2017 (Fig. 1). The study used the annual average concentrations of each pollutant from the monitors to match the temporal scale of the air emission data.

North American Regional Reanalysis (NARR) is an extension of the Global Reanalysis produced by NOAA's National Centers for Environmental Prediction (NCEP), which provides a long-term, consistent, high-resolution climate dataset for North America (Mesinger et al., 2006). This study extracted the daily mean values of temperature, humidity, zonal wind velocity (horizontal wind towards east), and meridional wind velocity (horizontal wind towards north) for New Mexico and surrounding areas during 2007–2017. The spatial resolution of the climate data grid is approximately 32 kilometers. For each year in each grid cell, we calculated the mean temperature, the mean humidity, the average wind speed, and the prevailing wind direction so that these variables would match the annual scale of other variables. Combining the wind speed and wind direction, we calculated a wind index value ${W}_{i,j}$ between an emission source j and any given location i in New Mexico during 2007–2017 using Eq. (1):

$$W_{i,j}=\begin{cases} v\cos\left((\theta -180)-\alpha\right), & \cos\left((\theta -180)-\alpha\right)\ge 0\\ 0, & \cos\left((\theta -180)-\alpha\right)<0\end{cases}$$

where $\theta$ is the wind angle at the location i; $\alpha$ is the direction from the emission source j to the location i; and v is the wind speed at location i. The wind index generally assigns larger values to places that are downwind of the emission sources and experience higher wind speeds.
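As an illustration of Eq. (1), the following minimal Python sketch computes the wind index for one source-receptor pair. It is not the authors' code. It assumes that all angles are in degrees, that $\theta$ is the direction the wind blows from (so that $\theta - 180$ is the direction it blows towards), and that $\alpha$ is a compass bearing from the source to the receptor; the function names `wind_from_uv` and `wind_index` are hypothetical. The conversion from zonal and meridional components to speed and direction is one common convention and is likewise an assumption here.

```python
import math


def wind_from_uv(u, v):
    """Convert zonal (u, towards east) and meridional (v, towards north)
    wind components into (speed, direction the wind blows FROM, in degrees
    clockwise from north). The convention is an assumption of this sketch."""
    speed = math.hypot(u, v)
    direction_from = math.degrees(math.atan2(-u, -v)) % 360.0
    return speed, direction_from


def wind_index(speed, theta_deg, alpha_deg):
    """Eq. (1): speed scaled by the cosine of the angle between the downwind
    direction (theta - 180) and the source-to-receptor bearing alpha,
    clipped to zero when the receptor is upwind of the source."""
    c = math.cos(math.radians((theta_deg - 180.0) - alpha_deg))
    return speed * c if c >= 0 else 0.0


# A westerly wind of 5 m/s (blowing towards the east).
speed, theta = wind_from_uv(5.0, 0.0)
downwind = wind_index(speed, theta, 90.0)   # receptor due east of the source
upwind = wind_index(speed, theta, 270.0)    # receptor due west of the source
```

In this example the receptor east of the source receives the full wind speed as its index, while the receptor to the west receives zero.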

The 30-meter DEM data for New Mexico and surrounding areas were obtained from the National Map Data Download and Visualization Services of the United States Geological Survey (USGS, 2019). The data are used to calculate elevation differences from a specific location to different emission sources, which serve as one of the model input variables.

Methods

Data pre-processing

Because we estimate the exposure using the air emission data as inputs and use the air monitoring data as the ground truth for training and cross-validating the pruned-FNN model, we need to identify the chemicals shared between emission and monitoring records. Twelve chemicals among the PM$_{2.5}$ composites were selected for further analysis, including arsenic, aluminum, cadmium, chlorine, chromium, copper, lead, manganese, nickel, selenium, vanadium, and zinc, which contribute a total of 1113 annual average monitoring records of pollutant concentrations in New Mexico during 2007–2017. We excluded two negative monitoring records. We also excluded 221 positive records with no emission found in our data, because these records violate the assumption of the pruned-FNN model that the monitor reading is the sum of contributions from nearby factories. Therefore, a total of 916 monitoring records were used as ground truth for model training. Due to the limited quantity of samples for each pollutant (no more than 89 samples for a single chemical), we assumed that all PM$_{2.5}$ composites share a roughly similar dispersion pattern. This assumption is a compromise imposed by the limited samples available for model training.

The monitor readings have a heavily skewed distribution, with a mean of 0.00319 µg/m³ and a standard deviation of 0.0127 µg/m³ across all pollution concentrations. If the readings were used directly as training data, the pruned-FNN would mainly fit the large pollution concentration values. To overcome this issue, we add an offset value of 0.0003 µg/m³ (the median of all pollution concentrations) to each monitoring record and perform a natural log transformation on the training data before feeding it into the pruned-FNN. After model training and prediction, we obtain an exposure index value for pollution concentration by applying the exponential (inverse) transformation to the pruned-FNN’s output and then subtracting the offset from the result. Because of the aforementioned assumptions, the small sample sizes, and the skewed distributions in the training data, the model might not predict the real concentrations of air pollutants directly. Instead, the estimated results can be interpreted as pollution exposure indexes (PEIs), which are proportional to the real concentrations of air pollutants.
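The offset-and-log transformation and its inverse can be written out as follows. This is a minimal sketch rather than the authors' code; the function names `to_log_space` and `from_log_space` are hypothetical, and only the offset of 0.0003 µg/m³ is taken from the text.

```python
import math

OFFSET = 0.0003  # µg/m³, the median concentration quoted in the text


def to_log_space(concentration, offset=OFFSET):
    """Training target: natural log of (monitor reading + offset)."""
    return math.log(concentration + offset)


def from_log_space(log_value, offset=OFFSET):
    """Inverse step applied to the network output: exponentiate, then
    subtract the offset to return to the concentration scale."""
    return math.exp(log_value) - offset


reading = 0.00319                      # mean concentration quoted in the text
target = to_log_space(reading)         # value the network would be trained on
recovered = from_log_space(target)     # round trip back to µg/m³
```

The offset keeps the logarithm finite for readings of zero, and the log scale compresses the long right tail so that large readings do not dominate the fit.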

Neural network structure

The PEI of a given chemical k at a given location i is calculated by Eq. (2):

$$PEI_{i,k}=\sum_{j=1}^{n} g_{j}\left(T_{i}, H_{i}, EM_{j,k}, D_{i,j}, ED_{i,j}, W_{i,j}\right)$$

In this expression, ${PEI}_{i, k}$ is the predicted exposure index of chemical k at the location i; j represents one of the n total emission sources (factories). ${PEI}_{i, k}$ is the sum of all $g_j$ (j = 1, 2, …, n), where each $g_j$ represents the contribution to the PEI at location i from one single emission source j. The $g_j$ is generated by the complex air pollutant dispersion process, which can be depicted by six independent variables as follows. ${T}_{i}$ is the temperature at the location i; ${H}_{i}$ is the humidity at the location i; ${EM}_{j,k}$ represents the sum of fugitive and stack emissions of chemical k from emission source j; ${D}_{i,j}$ is the distance from location i to the emission source j; ${ED}_{i,j}$ represents the elevation difference between emission source j and location i; and ${W}_{i,j}$ is the calculated wind index between emission source j and location i.
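The additive structure of Eq. (2) can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the names `pei` and `toy_g`, the dictionary keys, and especially the closed-form `toy_g` are hypothetical placeholders. In the study each $g_j$ is learned by the pruned-FNN rather than written down, and each source has its own substructure, whereas this sketch reuses a single stand-in function for brevity.

```python
def toy_g(temperature, humidity, emission, distance, elevation_diff, wind_index):
    """Hypothetical stand-in for a learned contribution g_j. NOT the study's
    model: it only mimics the qualitative idea that more emission, a stronger
    downwind index, and a shorter distance raise the contribution."""
    return emission * (1.0 + wind_index) / (1.0 + distance)


def pei(temperature, humidity, sources, g=toy_g):
    """Eq. (2): sum the contribution of every emission source at one
    receptor location for one chemical."""
    return sum(
        g(temperature, humidity, s["emission"], s["distance"],
          s["elevation_diff"], s["wind_index"])
        for s in sources
    )


# Two made-up emission sources seen from one receptor location.
sources = [
    {"emission": 10.0, "distance": 4.0, "elevation_diff": 50.0, "wind_index": 1.0},
    {"emission": 2.0, "distance": 1.0, "elevation_diff": -20.0, "wind_index": 0.0},
]
total = pei(15.0, 40.0, sources)
```

Because the sum runs over sources, adding or removing a factory changes only the number of terms, which is the property the substructure design relies on.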

The pruned-FNN is designed to fit the function $g_j$ through machine learning on the training dataset. In the training dataset, the monitoring sites are considered as exposure receptors of pollutant emissions. The target variable of the training dataset is the monitoring records of pollutant concentrations; the predictor variables of the training dataset include the distances, elevation differences, and annual wind index from each monitor to each of the 369 factories (${D}_{i,j}$, ${ED}_{i,j}$, and ${W}_{i,j}$), the annual average temperature and humidity at each monitoring site (${T}_{i}$ and ${H}_{i}$), and the annual emission records from all 369 factories (${EM}_{j,k}$) in New Mexico and surrounding areas during 2007–2017.

As shown in Fig. 2, and given our training dataset, the pruned-FNN has 1478 inputs in the input layer, 1107 neurons in the first hidden layer, 369 neurons in the second hidden layer, and 1 neuron in the output layer, which represents the total PEI contribution from all emission sources. The 1478 inputs are not fully connected in the model. There are 369 substructures in the pruned-FNN, each of which represents a single emission source’s contribution (see Fig. S1). Each substructure contains the six input variables in Eq. (2), two hidden layers, and one output. Because the temperature (T) and humidity (H) at the receptor location are independent of the emission sources, they are shared by all substructures. In the pruned-FNN, the first and second neurons in the input layer represent the temperature (T) and humidity (H); these two neurons are fully connected with all 1107 neurons in the first hidden layer. The other four variables (distance (D), elevation difference (ED), annual wind index (W), and annual emission record (EM)) are connected only within their corresponding substructure. Therefore, in the entire pruned-FNN (Fig. 2), neurons 3 to 6 in the input layer represent these four input features of the first emission source, and neurons (4i - 1) to (4i + 2) in the input layer represent the input features of the i-th (i = 1, 2, …, 369) emission source; here i indexes emission sources, which are denoted by j in Eq. (2). The input neurons of emission source i are fully connected with neurons $a_{[1]}^{3i-2}$, $a_{[1]}^{3i-1}$, and $a_{[1]}^{3i}$ in the first hidden layer, which are in turn fully connected with neuron $a_{[2]}^{i}$ in the second hidden layer. All neurons in the second hidden layer are fully connected with the output layer, which contains only one neuron. Compared with a fully connected neural network having more than 100,000 weights to tune, this study uses only 4,428 weights as tuning parameters.
This reduces the size of the search space for solutions significantly and improves the performance of the model training process. It is worth noting that the number of neurons in the current model is tied to the 369 factories in the training dataset. The number of substructures and corresponding neurons can be changed flexibly based on the number of emission sources. The hyper-parameters of our neural network, such as the learning rate, activation functions, and number of layers, were selected using a heuristic approach. The goal was to incrementally adjust these hyper-parameters until we reached an acceptable balance between prediction accuracy and training time.
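The connectivity described above can be made concrete by enumerating which neuron pairs keep an edge after pruning. The sketch below is a reading of the text, not the authors' code: the function name `pruned_fnn_edges` and the dictionary keys are hypothetical, indices are 1-based to match the description, and biases are not counted.

```python
N_SOURCES = 369  # number of factories in the training dataset


def pruned_fnn_edges(n_sources=N_SOURCES):
    """Enumerate the edges kept by the pruned architecture, as (from, to)
    pairs of 1-based neuron indices for each pair of adjacent layers."""
    n_hidden1 = 3 * n_sources
    # Inputs 1 and 2 (temperature, humidity) reach every first-layer neuron.
    shared = [(inp, h) for inp in (1, 2) for h in range(1, n_hidden1 + 1)]
    specific, hidden1_to_hidden2 = [], []
    for i in range(1, n_sources + 1):
        inputs = range(4 * i - 1, 4 * i + 3)   # input neurons 4i-1 .. 4i+2
        hidden = range(3 * i - 2, 3 * i + 1)   # first-layer neurons 3i-2 .. 3i
        specific += [(inp, h) for inp in inputs for h in hidden]
        hidden1_to_hidden2 += [(h, i) for h in hidden]
    hidden2_to_output = [(i, 1) for i in range(1, n_sources + 1)]
    return {
        "shared_input_to_hidden1": shared,
        "source_input_to_hidden1": specific,
        "hidden1_to_hidden2": hidden1_to_hidden2,
        "hidden2_to_output": hidden2_to_output,
    }


edges = pruned_fnn_edges()
counts = {name: len(pairs) for name, pairs in edges.items()}
n_inputs = max(inp for inp, _ in edges["source_input_to_hidden1"])

# Edge count of a fully connected network with the same layer sizes.
fully_connected = n_inputs * (3 * N_SOURCES) + (3 * N_SOURCES) * N_SOURCES + N_SOURCES
```

Under this reading, the source-specific input connections number 12 × 369 = 4,428, the same figure as the weight count quoted in the text; the sketch reports the shared temperature and humidity connections and the later layers separately, since the text does not state how they were counted.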

Method validation

We used two methods to validate the effectiveness of the pruned-FNN: 10-fold cross validation and random cross validation. In the 10-fold cross validation, the data were first shuffled randomly and then divided into 10 groups of equal size, each of which is called a fold. Nine folds were used as the training dataset to predict the values for the remaining fold, and the predicted values were then compared with the true values in the remaining fold. The process was repeated 10 times so that each fold was validated once. In this study, we repeated the 10-fold cross validation 10 times, for 100 validations in total. We also ran random cross validation 100 times. In each random cross validation, we randomly chose 20% of the data entries as the validation set and the other 80% as the training set.
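The two splitting schemes can be sketched with the Python standard library as follows. This is a minimal illustration, not the authors' code; the function names `k_fold_splits` and `random_split` and the seeds are hypothetical. Because 916 records do not divide evenly by 10, the folds in this sketch differ in size by at most one record.

```python
import random


def k_fold_splits(n_records, k=10, seed=None):
    """Shuffle record indices, cut them into k folds, and return one
    (training, validation) pair per held-out fold."""
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)
    folds = [indices[start::k] for start in range(k)]
    splits = []
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx for f, fold in enumerate(folds)
                    if f != held_out for idx in fold]
        splits.append((training, validation))
    return splits


def random_split(n_records, validation_fraction=0.2, seed=None):
    """Randomly hold out a fraction of the records for validation."""
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)
    n_val = round(n_records * validation_fraction)
    return indices[n_val:], indices[:n_val]


# 10 repetitions of 10-fold cross validation: 100 validation runs.
repeated_k_fold = [pair for rep in range(10)
                   for pair in k_fold_splits(916, k=10, seed=rep)]
# 100 independent random 80/20 splits.
random_runs = [random_split(916, 0.2, seed=rep) for rep in range(100)]
```

Each (training, validation) pair would then be used to fit the model on the training indices and score it on the validation indices.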

In each validation run, the correlations and errors between the predicted values and monitored values were recorded. Since there was no guarantee that the air pollutant concentrations would meet the normal distribution assumption, the non-parametric Spearman rank correlation coefficient was used to measure correlation. We also recorded the relative absolute error (RAE) as the error measure, which is defined in Eq. (3):

$$RAE= \frac{{\sum }_{i=1}^{k}|{R}_{i}-{P}_{i}|}{{\sum }_{i=1}^{k}\left|{R}_{i}\right|}$$

where $R_i$ is the i-th monitor’s reading value; $P_i$ is the predicted value at the location of the i-th monitor; and k is the total number of readings in the validation set.
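Both validation measures are simple to compute. The sketch below implements the Spearman rank correlation (as the Pearson correlation of average ranks, which handles ties) and the RAE of Eq. (3) in plain Python. It is an illustration under those stated choices, not the authors' code, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(readings, predictions):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(readings), average_ranks(predictions))


def relative_absolute_error(readings, predictions):
    """Eq. (3): sum of |R_i - P_i| divided by sum of |R_i|."""
    return (sum(abs(r - p) for r, p in zip(readings, predictions))
            / sum(abs(r) for r in readings))


# Made-up readings and predictions, for illustration only.
readings = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 4.0]
rho = spearman(readings, predictions)
rae = relative_absolute_error(readings, predictions)
```

Here the predictions preserve the ordering of the readings, so the rank correlation is 1 even though the RAE is non-zero, which illustrates why the two measures are reported together.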

For the 10 repetitions of the 10-fold cross validation, Fig. 3a shows the distribution of Spearman rank correlation coefficients. All Spearman rank correlation coefficients are higher than 0.800. The correlation coefficients have a high mean value (0.906) and a small standard deviation (0.028). The results demonstrate that after splitting the dataset into 10 folds, the predicted PEIs have a high and stable correlation with the monitored air pollutant concentrations.

For the 100 runs of random cross validation, all Spearman rank correlation coefficients are also higher than 0.800 (Fig. 3b). The Spearman rank correlation coefficients have a mean value of 0.913 and a standard deviation of 0.024, which also shows a consistently high correlation between predicted PEIs and monitoring records. The random cross validations further indicate that the quantity of data on the various pollutants in the training set is large enough to predict a reliable PEI.

To further demonstrate the model performance, we built an 810 m × 810 m grid system over New Mexico and surrounding areas, then used the trained pruned-FNN model to predict the PEIs of the twelve PM_2.5 composites at each grid cell. Figure 4 shows the PEIs of manganese in 2017 in the study area, where darker colors represent higher predicted PEIs from the model. The emission sources are shown as red circles whose sizes are proportional to the annual emission amount, while the monitoring sites are shown as green squares with a dot whose size is proportional to the pollutant concentration. The high predicted PEIs are located in Northwest, Central, and Eastern New Mexico, which is consistent with the trend shown by the monitoring records.

This study designed a neural network-based exposure assessment approach (pruned-FNN) that considers emission time and rates, terrain factors, meteorological conditions, and proximity measurements. The model cross validation results show that the pruned-FNN model predictions (PEIs) have high and stable correlations with the monitoring readings, which are usually considered the ground truth. The results support the validity of the pruned-FNN model in predicting the relative relationship of air pollutant exposures in different locations of a study area.

The six input variables of the pruned-FNN (distances (D), elevation differences (ED), annual wind index (W), annual emission records (EM), temperature (T), and humidity (H)) might differ in their importance for determining the predicted PEI. Therefore, sensitivity analyses were performed to evaluate the model performance with different variable combinations. For each combination, we removed some of the variables from the model, retrained the pruned-FNN, and ran the 10-fold cross validation 10 times to record the Spearman rank correlation coefficients. As shown in Table S1, after eliminating variables, the mean correlation coefficient across the model combinations ranges from 0.469 to 0.869, the median ranges from 0.422 to 0.852, and the standard deviation ranges from 0.031 to 0.097. Except for two combinations (T, H, EM, D, ED; and EM, D), which have smaller standard deviations of correlation coefficients than the original model, the original model outperformed all other variable combinations in the performance statistics. This shows that all six input variables of the original pruned-FNN are necessary for model training and prediction. The models that include both emission records (EM) and distances (D) also significantly outperformed the ones that do not include both variables (Table S1), which indicates the dominant importance of emission records (EM) and distances (D) in the exposure assessment model.
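The candidate reduced models can be enumerated mechanically. The sketch below lists every non-empty proper subset of the six variables; it is an illustration only, the function name is an assumption, and which of these combinations were actually evaluated is given in Table S1, not by this code.

```python
from itertools import combinations

# Variable abbreviations as used in the text.
VARIABLES = ("D", "ED", "W", "EM", "T", "H")


def reduced_variable_sets(variables=VARIABLES):
    """Yield every non-empty proper subset of the variables, largest first."""
    for size in range(len(variables) - 1, 0, -1):
        for subset in combinations(variables, size):
            yield subset


if __name__ == "__main__":
    subsets = list(reduced_variable_sets())
    print(len(subsets))   # number of candidate reduced models
    print(subsets[0])     # first five-variable combination
```

Each subset would then go through the same retraining and repeated 10-fold cross validation as the original six-variable model.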

To investigate whether the dispersion process is stable across years in the pruned-FNN, we also conducted cross validations based on the time periods of the data. We used data from one of the eleven years (2007 to 2017) as the validation dataset and data from the other ten years as the training dataset, then repeated the process for each year from 2007 to 2017. The Spearman rank correlation coefficients of the cross validation for the eleven years are stable (range: 0.766 to 0.932, standard deviation: 0.058) (Fig. S2), indicating that there is no significant difference between years in the dispersion process of pollutants. Therefore, it is reasonable to apply a pruned-FNN trained on existing years of data to predict the PEIs for a new year.
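The year-based cross validation described above amounts to a leave-one-year-out split, sketched below. The record layout (a dictionary with a `"year"` key) and the function name are assumptions made for illustration; the example records are made up.

```python
def leave_one_year_out(records, years=range(2007, 2018)):
    """Yield (held_out_year, train, val) with one year held out at a time."""
    for year in years:
        val = [r for r in records if r["year"] == year]
        train = [r for r in records if r["year"] != year]
        yield year, train, val


if __name__ == "__main__":
    # Hypothetical records; real entries would also carry the model inputs.
    records = [{"year": y, "id": i}
               for i, y in enumerate([2007, 2007, 2010, 2017])]
    for year, train, val in leave_one_year_out(records):
        print(year, len(train), len(val))
```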

The pruned-FNN is a data-driven method, which means that the data quantity could affect the prediction performance. To investigate this influence, we conducted further sensitivity analyses by randomly selecting five different portions of the entire 916 input data instances: 100%, 80%, 60%, 40%, and 20%. For each portion, we repeated the model training and the 10-fold cross validation process used for the original model. As the data quantity decreases, the mean correlation coefficient decreases and the standard deviation of the correlation coefficients increases (Fig. S3), indicating a lower prediction accuracy with higher uncertainty. This observation is consistent with a common phenomenon in data-driven methods: the more high-quality training and validation data are available, the more likely a data-driven model is to make accurate predictions (James et al., 2013).
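The random selection of data portions can be sketched as follows, assuming simple sampling without replacement and rounding to the nearest whole instance; neither detail is stated in the text, and the function name is hypothetical.

```python
import random


def subsample(n_instances, fraction, seed=0):
    """Return sorted indices of a random portion of the data instances."""
    k = round(n_instances * fraction)
    return sorted(random.Random(seed).sample(range(n_instances), k))


if __name__ == "__main__":
    n = 916  # number of data instances stated in the text
    for fraction in (1.0, 0.8, 0.6, 0.4, 0.2):
        print(fraction, len(subsample(n, fraction)))
```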

Although the model has high accuracy in predicting the relative relationship (correlation measurement) of air pollutant exposures, the RAEs of the 10-fold cross validations and random cross validations both have a relatively large mean value (36.7% and 28.7%, respectively) and a high standard deviation (12.5% and 12.8%, respectively) (Fig. S4). However, we observed that most of the predicted PEIs are quite close to the true monitoring values, and only some outliers in the monitoring records have large prediction errors. Fig. S5 shows one of the 10-fold cross validation results. Among the 916 instances of pollution concentrations, 875 (95.5%) of the predicted PEIs have relative errors smaller than 10%; the large errors come from the remaining 41 (4.5%) predicted PEIs, which correspond to extremely large monitoring readings. These outliers usually imply a lack of training data, which results in poor predictions. Therefore, if we want to interpret predicted PEIs as real pollution concentrations, we need to bear in mind that large errors might occur in predicting extreme values. A more complete training dataset that covers the extreme monitoring values would improve the overall RAE of the model.
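A small numerical sketch shows how a high aggregate RAE can coexist with small errors on most predictions. It assumes the per-prediction relative error is |R_i - P_i| / |R_i| with non-zero readings, which the text does not state explicitly; the function names are hypothetical and the numbers are made up for illustration.

```python
def relative_errors(readings, predictions):
    """Per-prediction relative error |R_i - P_i| / |R_i| (readings non-zero)."""
    return [abs(r - p) / abs(r) for r, p in zip(readings, predictions)]


def share_within(readings, predictions, tolerance=0.10):
    """Fraction of predictions whose relative error is below the tolerance."""
    errors = relative_errors(readings, predictions)
    return sum(e < tolerance for e in errors) / len(errors)


def aggregate_rae(readings, predictions):
    """RAE over the whole set: sum |R_i - P_i| divided by sum |R_i|."""
    num = sum(abs(r - p) for r, p in zip(readings, predictions))
    return num / sum(abs(r) for r in readings)


if __name__ == "__main__":
    # 19 ordinary readings predicted within 5%, plus one extreme reading
    # that the model badly underestimates (all values hypothetical).
    readings = [1.0] * 19 + [50.0]
    predictions = [1.05] * 19 + [20.0]
    print(share_within(readings, predictions))
    print(aggregate_rae(readings, predictions))
```

Because the RAE weights each error by the size of the reading, a single extreme value that is poorly predicted dominates the aggregate, even when 95% of the individual predictions are within 10%.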

To compare the pruned-FNN's efficiency with that of a fully connected neural network, we constructed a fully connected neural network with the same hyperparameters but different connections. We repeated the training process 100 times on a machine equipped with an Intel Core i7-4770HQ processor, 32 GB of memory, and Linux kernel version 4.5. The pruned-FNN has a mean training time of 107.5 seconds with a standard deviation of 13.3 seconds, while the fully connected neural network has a mean training time of 189.4 seconds with a standard deviation of 12.8 seconds. A one-tailed hypothesis test rejects the null hypothesis that a fully connected neural network takes the same amount of time to train as a pruned neural network (p-value less than 0.01). In other words, pruned-FNNs require less time to train given the same hyperparameters and training set.
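The size of this difference can be checked from the reported summary statistics alone. The text does not name the test that was used, so the sketch below assumes a Welch two-sample t statistic with a normal approximation for the upper-tail p-value, and assumes that both networks were trained 100 times; the function names are hypothetical.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t statistic for the difference mean_b - mean_a."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_b - mean_a) / standard_error


def upper_tail_p_normal(t):
    """Upper-tail p-value under a standard normal approximation."""
    return 0.5 * math.erfc(t / math.sqrt(2))


if __name__ == "__main__":
    # Summary statistics reported in the text (seconds); n = 100 runs each
    # is an assumption for the fully connected network.
    t = welch_t(107.5, 13.3, 100, 189.4, 12.8, 100)
    print(round(t, 2), upper_tail_p_normal(t) < 0.01)
```

With a mean difference of about 82 seconds and a standard error below 2 seconds, the statistic is far beyond any conventional critical value, which agrees with the reported p-value of less than 0.01.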

The pruned-FNN has many advantages. The algorithm is data-driven and can be fault tolerant. Therefore, with more high-quality data, we can obtain a better regression result without analytical work on the dispersion process equations; even if part of the data contains errors, e.g., some unreported fugitive emissions or biased monitor readings, the algorithm can still generate acceptable predictions (Tohma & Iwata, 1999). Pruned-FNNs have better convergence performance than fully connected FNNs. Also, the pruned-FNN model can consider more independent variables and depict a more complicated dispersion process than the proximity model, as long as related data can be found. Meanwhile, the pruned-FNN has less extensive data requirements than air dispersion models because it takes advantage of existing datasets from different sources. All these data-related advantages offer a less time- and money-consuming way to predict pollution concentrations over a large area than direct monitoring.

However, this study is not without limitations. First, although the data-driven neural network approach is fault tolerant, the model performance is not independent of data quality. Due to the limited monitoring network in New Mexico, the monitoring records, which are used as the validation dataset, are small in quantity and may contain errors and noise. The study had to assume that all PM_2.5 composites share roughly similar dispersion patterns in order to deal with the limited samples. This compromise is not ideal for exposure assessment, but it does not devalue the feasibility of the pruned-FNN model. Future studies can consider increasing the sample size by including a larger study area and/or a longer time period, so that each chemical can be modelled individually through the pruned-FNN.

Second, most existing monitoring sites in this study are located in urban areas, so the validation dataset does not represent suburban and rural areas well. Therefore, it is possible that the predicted exposure includes misclassifications in those areas. One way to create an evenly distributed validation dataset is to interpolate existing monitoring data. However, interpolation methods, such as Kriging, could also introduce more uncertainties into the validation dataset and the model predictions (Whitworth et al., 2011). Narrowing down the study area to only urban areas with monitoring sites could be another solution, but it is not feasible in this study because reducing the already small quantity of monitoring records would negatively affect the model performance. Therefore, we call for a more comprehensive monitoring network with a balanced geographical distribution in the future.

Third, this study only considered industrial facilities as emission sources and used emissions through stack and fugitive pathways in model training. There are, however, other sources of pollutant emissions, such as nearby transportation and residential emissions, that may affect the monitor readings. In most cases, it is very difficult to distinguish between the contributions of different emission sources to the monitor readings. To deal with the potential missing emissions in the data, this study relied on the fault tolerance property of the neural network. Future studies could include more emission source types in exposure assessment, such as areal and mobile sources. In addition, the TRI air emission data are self-reported by the industrial facilities, which might introduce uncertainties in data quality. It might be helpful for future studies to quantify this uncertainty by comparing the reported emissions with monitoring records close to the reporting facilities.

Fourth, the air emission data, air monitoring data, climate data, and terrain data all have different temporal scales. Therefore, we had to aggregate all variables to the coarsest common temporal scale (the annual level in this study) for model training and prediction. To assess exposure at finer temporal scales, such as daily, monthly, or seasonal exposure, all data would need to have temporal scales at least as fine as the target scale. This is challenging, especially because the TRI air emissions are annual data.

Finally, unlike the air dispersion models, the pruned-FNN cannot provide an explicit equation of pollution dispersion from emission source to receptor. Although we included as many variables as possible in the training dataset through literature review and assumptions, we may still have missed some important features, and it may even be impossible to obtain a full perspective on all the features that determine the pollution dispersion process. The pruned-FNN needs to be retrained for each study case using data from the study area and time period, which has the advantage of achieving a better fit to local conditions and the disadvantage of lacking insight into the universal dispersion process. As a result, additional research would be helpful to combine the data-driven pruned-FNN method with other air dispersion models in the field of air pollution exposure assessment.

Research on the health effects of air pollution requires an air pollution exposure assessment method with an appropriate balance between accuracy, complexity, and usability. We designed a pruned feed-forward neural network (pruned-FNN) approach to simulate the complex and non-linear relationship between air pollutant emission and individual exposure. The pruned-FNN model considers emission time and rate, terrain factors, meteorological conditions, and proximity measurements in predicting PEIs. The monitoring data are used as the ground truth to train, calibrate, and cross-validate the pruned-FNN model.

The pruned-FNN can predict PEIs that are highly and stably correlated with the monitoring records. The predicted values are also close to the ground truth in most cases (except for some extreme values) when the training datasets are sufficiently large and well-covered. The model is flexible and suitable for modeling non-linear dispersion processes through the self-learning, fast convergence, and fault tolerant characteristics of neural networks. The pruned-FNN can infer the dispersion process from limited samples. It has less extensive data requirements than the air dispersion models but can achieve comparable accuracy in exposure assessments. Compared to the proximity model, the pruned-FNN can take into account a flexible number of input variables besides proximity, depending on data availability. The pruned-FNN is also less money- and time-consuming than individual monitoring methods, which makes it suitable for studies involving large areas and/or long time periods. These features show that the pruned-FNN is a valid method for making accurate exposure estimations using a flexible number of variables and less extensive data in an efficient manner.

The pruned-FNN can also be used in similar studies where obtaining an explicit analytical mathematical relationship is difficult or impossible but obtaining the relevant data is easier. In addition, we will develop this model into a toolbox for automating the training and prediction process, and eventually make it open source.

All authors have read, understood, and have complied as applicable with the statement on "Ethical responsibilities of Authors" as found in the Instructions for Authors.

Competing interests

The authors have no relevant financial or non-financial interests to disclose.

Funding

This work was supported by the University of New Mexico Center for Metals in Biology and Medicine (UNM CMBM) through National Institutes of Health (NIH) National Institute of General Medical Sciences (NIGMS) grant (#P20GM130422); the University of New Mexico Office of the Vice President for Research WeR1 Faculty Success Program and Research Allocations Committee (RAC) awards (#8oh6a4x35h, #gvvrxwyj64); the University of New Mexico, A&S Interdisciplinary Science Cooperative through the Office of Research (Award #TA-1003); the National Institute on Minority Health and Health Disparities (NIMHD) of the NIH under award number P50MD015706; and the National Institute of Environmental Health Sciences (NIEHS) of NIH under award number P42 ES025589. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the funding sources.

Ethics approval and Consent to participate

Not applicable. This study only uses secondary data from public agencies without identifiable information for human subjects. For this type of study, ethics approval and consent to participate are not required.

Consent to publish

Not applicable. This study only uses secondary data from public agencies without identifiable information for human subjects. For this type of study, formal consent is not required.

Availability of data and material

The original data used in the analyses were obtained from United States Environmental Protection Agency (U.S. EPA), North American Regional Reanalysis (NARR) by National Oceanic and Atmospheric Administration (NOAA), and United States Geological Survey (USGS). The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Author Contributions

Xi Gong: Conceptualization, Methodology, Writing - Original Draft, Supervision, Funding acquisition. Lin Liu: Methodology, Software, Data Curation, Writing - Original Draft. Yanhong Huang: Validation, Data Curation, Writing - Original Draft, Visualization. Bin Zou: Writing - Review & Editing, Validation. Yeran Sun: Writing - Review & Editing, Validation. Li Luo: Writing - Review & Editing. Yan Lin: Conceptualization, Writing - Review & Editing, Funding acquisition.

Adams, M. D., & Kanaroglou, P. S. (2016). Mapping real-time air pollution health risk for environmental management: Combining mobile and stationary air pollution monitoring with neural network models. Journal of Environmental Management, 168, 133–141. https://doi.org/10.1016/j.jenvman.2015.12.012
Banerjee, K., Bali, V., Nawaz, N., Bali, S., Mathur, S., Mishra, R. K., & Rani, S. (2022). A Machine-Learning Approach for Prediction of Water Contamination Using Latitude, Longitude, and Elevation. Water (Switzerland), 14(5). https://doi.org/10.3390/w14050728
Bell, M. L. (2006). The use of ambient air quality modeling to estimate individual and population exposure for human health research: A case study of ozone in the Northern Georgia Region of the United States. Environment International, 32(5), 586–593. https://doi.org/10.1016/j.envint.2006.01.005
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. In EAI/Springer Innovations in Communication and Computing. https://doi.org/10.1007/978-3-030-57077-4_11
Blanes-Vidal, V., Cantuaria, M. L., & Nadimi, E. S. (2017). A novel approach for exposure assessment in air pollution epidemiological studies using neuro-fuzzy inference systems: Comparison of exposure estimates and exposure-health associations. Environmental Research, 154(January), 196–203. https://doi.org/10.1016/j.envres.2016.12.028
Bolt, G. R. (1993). Fault tolerance in artificial neural networks: Are neural networks inherently fault tolerant? ProQuest Dissertations and Theses, November, 230.
Brender, J. D., Shinde, M. U., Benjamin Zhan, F., Gong, X., & Langlois, P. H. (2014). Maternal residential proximity to chlorinated solvent emissions and birth defects in offspring: A case-control study. Environmental Health: A Global Access Science Source, 13(1). https://doi.org/10.1186/1476-069X-13-96
Cao, W., Wang, X., Ming, Z., & Gao, J. (2018). A review on neural networks with random weights. Neurocomputing, 275, 278–287. https://doi.org/10.1016/j.neucom.2017.08.040
de Ferreyro Monticelli, D., Santos, J. M., Goulart, E. V., Mill, J. G., da Silva Corrêa, J., dos Santos, V. D., & Reis, N. C. (2021). Comparison of methods for assessment of children exposure to air pollution: dispersion model, ambient monitoring, and personal samplers. Air Quality, Atmosphere and Health, 0123456789. https://doi.org/10.1007/s11869-021-01123-6
Ebrahimi Ghadi, M., Qaderi, F., & Babanezhad, E. (2019). Prediction of mortality resulted from NO 2 concentration in Tehran by Air Q + software and artificial neural network. International Journal of Environmental Science and Technology, 16(3), 1351–1368. https://doi.org/10.1007/s13762-018-1818-4
Forastiere, F., & Galassi, C. (2005). Self report and GIS based modelling as indicators of air pollution exposure: is there a gold standard? Occupational and Environmental Medicine, 62(8), 508–509. https://doi.org/10.1136/oem.2005.020560
Gan, D., Huang, D., Yang, J., Zhang, L., Ou, S., Feng, Y., Peng, Y., Peng, X., Zhang, Z., & Zou, Y. (2020). Assessment of kitchen emissions using a backpropagation neural network model based on urinary hydroxy polycyclic aromatic hydrocarbons. Environmental Pollution, 265, 114915. https://doi.org/10.1016/j.envpol.2020.114915
Ghio, Y.-C. T. H. and A. J. (2006). Vascular Effects of Ambient Pollutant Particles and Metals. In Current Vascular Pharmacology (Vol. 4, Issue 3, pp. 199–203). https://doi.org/10.2174/157016106777698351
Gong, X., Lin, Y., Bell, M. L., & Zhan, F. B. (2018a). Associations between maternal residential proximity to air emissions from industrial facilities and low birth weight in Texas, USA. Environment International, 120(March), 181–198. https://doi.org/10.1016/j.envint.2018.07.045
Gong, X., Lin, Y., & Zhan, F. B. (2018b). Industrial air pollution and low birth weight: a case-control study in Texas, USA. Environmental Science and Pollution Research, 25(30), 30375–30389. https://doi.org/10.1007/s11356-018-2941-y
Gong, X., & Zhan, F. B. (2022). A method for identifying critical time windows of maternal air pollution exposures associated with low birth weight in offspring using massive geographic data. Environmental Science and Pollution Research, 29(22), 33345–33360. https://doi.org/10.1007/s11356-021-17762-2
Gong, X., Zhan, F. B., Brender, J. D., Langlois, P. H., & Lin, Y. (2016). Validity of the Emission Weighted Proximity Model in estimating air pollution exposure intensities in large geographic areas. Science of the Total Environment, 563–564, 478–485. https://doi.org/10.1016/j.scitotenv.2016.04.088
Gulliver, J., & Briggs, D. (2011). STEMS-Air: A simple GIS-based air pollution dispersion model for city-wide exposure assessment. Science of the Total Environment, 409(12), 2419–2429. https://doi.org/10.1016/j.scitotenv.2011.03.004
Han, B., Hu, L. W., & Bai, Z. (2017). Human exposure assessment for air pollution. In Advances in Experimental Medicine and Biology (Vol. 1017). https://doi.org/10.1007/978-981-10-5657-4_3
Han, S., Pool, J., Tran, J., & Dally, W. J. (2015). Learning both weights and connections for efficient neural networks. Advances in Neural Information Processing Systems, 2015-Janua, 1135–1143.
Hill, L. D., Pillarisetti, A., Delapena, S., Garland, C., Pennise, D., Pelletreau, A., Koetting, P., Motmans, T., Vongnakhone, K., Khammavong, C., Boatman, M. R., Balmes, J., Hubbard, A., & Smith, K. R. (2019). Machine-learned modeling of PM 2.5 exposures in rural Lao PDR. Science of the Total Environment, 676, 811–822. https://doi.org/10.1016/j.scitotenv.2019.04.258
Hvidtfeldt, U. A., Severi, G., Andersen, Z. J., Atkinson, R., Bauwelinck, M., Bellander, T., Boutron-Ruault, M. C., Brandt, J., Brunekreef, B., Cesaroni, G., Chen, J., Concin, H., Forastiere, F., van Gils, C. H., Gulliver, J., Hertel, O., Hoek, G., Hoffmann, B., de Hoogh, K., … Fecht, D. (2021). Long-term low-level ambient air pollution exposure and risk of lung cancer – A pooled analysis of 7 European cohorts. Environment International, 146. https://doi.org/10.1016/j.envint.2020.106249
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.
Jerrett, M., Arain, A., Kanaroglou, P., Beckerman, B., Potoglou, D., Sahsuvaroglu, T., Morrison, J., & Giovis, C. (2005). A review and evaluation of intraurban air pollution exposure models. Journal of Exposure Analysis and Environmental Epidemiology, 15(2), 185–204. https://doi.org/10.1038/sj.jea.7500388
Kampa, M., & Castanas, E. (2008). Human health effects of air pollution. Environmental Pollution, 151(2), 362–367. https://doi.org/10.1016/j.envpol.2007.06.012
Karnn, E. (1990). A simple procedure for pruning back-propagation trained neural networks.
Khan, J., Kakosimos, K., Raaschou-Nielsen, O., Brandt, J., Jensen, S. S., Ellermann, T., & Ketzel, M. (2019). Development and performance evaluation of new AirGIS – A GIS based air pollution and human exposure modelling system. Atmospheric Environment, 198(May 2018), 102–121. https://doi.org/10.1016/j.atmosenv.2018.10.036
Kousa, A., Kukkonen, J., Karppinen, A., Aarnio, P., & Koskentalo, T. (2002). A model for evaluating the population exposure to ambient air pollution in an urban area. Atmospheric Environment, 36(13), 2109–2119. https://doi.org/10.1016/S1352-2310(02)00228-5
Koza, J. R., Iii, F. H. B., & Andre, D. (1996). Automated Design Of Both The Topology And Sizing Of Analog Electrical Circuits Using Genetic Programming. Artificial Intelligence in Design ’96, September. https://doi.org/10.1007/978-94-009-0279-4
Li, H. C., Chiueh, P. Te, Liu, S. P., & Huang, Y. Y. (2017). Assessment of different route choice on commuters’ exposure to air pollution in Taipei, Taiwan. Environmental Science and Pollution Research, 24(3), 3163–3171. https://doi.org/10.1007/s11356-016-8000-7
Li, L., Wu, J., Wilhelm, M., & Ritz, B. (2012). Use of generalized additive models and cokriging of spatial residuals to improve land-use regression estimates of nitrogen oxides in Southern California. Atmospheric Environment, 55, 220–228. https://doi.org/10.1016/j.atmosenv.2012.03.035
Luo, D., Kuang, T., Chen, Y. X., Huang, Y. H., Zhang, H., & Xia, Y. Y. (2021). Air pollution and pregnancy outcomes based on exposure evaluation using a land use regression model: A systematic review. Taiwanese Journal of Obstetrics and Gynecology, 60(2), 193–215. https://doi.org/10.1016/j.tjog.2021.01.004
Mabahwi, N. A. B., Leh, O. L. H., & Omar, D. (2014). Human Health and Wellbeing: Human Health Effect of Air Pollution. Procedia - Social and Behavioral Sciences, 153, 221–229. https://doi.org/10.1016/j.sbspro.2014.10.056
Mannucci, P. M., Harari, S., Martinelli, I., & Franchini, M. (2015). Effects on health of air pollution: a narrative review. Internal and Emergency Medicine, 10(6), 657–662. https://doi.org/10.1007/s11739-015-1276-7
Mesinger, F., DiMego, G., Kalnay, E., Mitchell, K., Shafran, P. C., Ebisuzaki, W., Jović, D., Woollen, J., Rogers, E., Berbery, E. H., Ek, M. B., Fan, Y., Grumbine, R., Higgins, W., Li, H., Lin, Y., Manikin, G., Parrish, D., & Shi, W. (2006). North American regional reanalysis. Bulletin of the American Meteorological Society, 87(3), 343–360. https://doi.org/10.1175/BAMS-87-3-343
Morley, D. W., & Gulliver, J. (2018). A land use regression variable generation, modelling and prediction tool for air pollution exposure assessment. Environmental Modelling and Software, 105, 17–23. https://doi.org/10.1016/j.envsoft.2018.03.030
Murray, F., McGranahan, G., & Kuylenstierna, C.I., J. (2001). Assessing Health Effects of Air Pollution in Developing Countries. Water, Air, and Soil Pollution, 378(8), 43.
NARR. (2022). NCEP North American Regional Reanalysis. NOAA Physical Sciences Laboratory. https://psl.noaa.gov/data/gridded/data.narr.monolevel.html
Nieuwenhuijsen, M., Paustenbach, D., & Duarte-Davidson, R. (2006). New developments in exposure assessment: The impact on the practice of health risk assessment and epidemiological studies. Environment International, 32(8), 996–1009. https://doi.org/10.1016/j.envint.2006.06.015
Park, J., Ryu, H., Kim, E., Choe, Y., Heo, J., Lee, J., Cho, S. H., Sung, K., Cho, M., & Yang, W. (2020). Assessment of PM2.5 population exposure of a community using sensor-based air monitoring instruments and similar time-activity groups. Atmospheric Pollution Research, 11(11), 1971–1981. https://doi.org/10.1016/j.apr.2020.08.010
Pöschl, U. (2005). Atmospheric aerosols: Composition, transformation, climate and health effects. Angewandte Chemie - International Edition, 44(46), 7520–7540. https://doi.org/10.1002/anie.200501122
Prabhakaran, P., Jaganathan, S., Walia, G. K., Wellenius, G. A., Mandal, S., Kumar, K., Kloog, I., Lane, K., Nori-Sarma, A., Rosenqvist, M., Dahlquist, M., Reddy, K. S., Schwartz, J., Prabhakaran, D., & Ljungman, P. L. S. (2020). Building capacity for air pollution epidemiology in India. Environmental Epidemiology, 4(5), e117. https://doi.org/10.1097/ee9.0000000000000117
Ragettli, M. S., Tsai, M. Y., Braun-Fahrländer, C., de Nazelle, A., Schindler, C., Ineichen, A., Ducret-Stich, R. E., Perez, L., Probst-Hensch, N., Künzli, N., & Phuleria, H. C. (2014). Simulation of population-based commuter exposure to NO2 using different air pollution models. International Journal of Environmental Research and Public Health, 11(5), 5049–5068. https://doi.org/10.3390/ijerph110505049
Ratnaike, R. N. (2003). Acute and chronic arsenic toxicity. Postgraduate Medical Journal, 79(933), 391–396. https://doi.org/10.1136/pmj.79.933.391
Razavi-Termeh, S. V., Sadeghi-Niaraki, A., & Choi, S. M. (2021). Effects of air pollution in Spatio-temporal modeling of asthma-prone areas using a machine learning model. Environmental Research, 200(May), 111344. https://doi.org/10.1016/j.envres.2021.111344
Ren, X., Mi, Z., & Georgopoulos, P. G. (2020). Comparison of Machine Learning and Land Use Regression for fine scale spatiotemporal estimation of ambient air pollution: Modeling ozone concentrations across the contiguous United States. Environment International, 142(January), 105827. https://doi.org/10.1016/j.envint.2020.105827
Sarris, J., De Manincor, M., Hargraves, F., & Tsonis, J. (2019). Harnessing the four elements for mental health. Frontiers in Psychiatry, 10(APR), 1–9. https://doi.org/10.3389/fpsyt.2019.00256
Steinle, S., Reis, S., & Sabel, C. E. (2013). Quantifying human exposure to air pollution-Moving from static monitoring to spatio-temporally resolved personal exposure assessment. Science of the Total Environment, 443, 184–193. https://doi.org/10.1016/j.scitotenv.2012.10.098
Stingone, J. A., Pandey, O. P., Claudio, L., & Pandey, G. (2017). Using machine learning to identify air pollution exposure profiles associated with early cognitive skills among U.S. children. Environmental Pollution, 230, 730–740. https://doi.org/10.1016/j.envpol.2017.07.023
Tager, I. B., Balmes, J., Lurmann, F., Ngo, L., Alcorn, S., & Künzli, N. (2005). Chronic Exposure to Ambient Ozone and Lung Function in Young Adults. Epidemiology, 16(6), 751–759.
Tohma, Y., & Iwata, T. (1999). Fault-tolerant neural networks with higher functionality. Systems and Computers in Japan, 30(10), 22–33. https://doi.org/10.1002/(SICI)1520-684X(199909)30:10<22::AID-SCJ3>3.0.CO;2-D
U.S. EPA. (2020a). Technical Air Pollution Resources. https://www.epa.gov/technical-air-pollution-resources
U.S. EPA. (2020b). U.S. EPA AQS Data Mart. http://www.epa.gov/ttn/airs/aqsdatamart
U.S. EPA. (2021). Air Quality Dispersion Modeling - Preferred and Recommended Models. https://www.epa.gov/scram/air-quality-dispersion-modeling-preferred-and-recommended-models
U.S. EPA. (2022a). National Air Toxics Assessment. https://www.epa.gov/national-air-toxics-assessment
U.S. EPA. (2022b). Risk-Screening Environmental Indicators (RSEI) Model. https://www.epa.gov/rsei
U.S. EPA. (2022c). What is the Toxics Release Inventory? https://www.epa.gov/toxics-release-inventory-tri-program/what-toxics-release-inventory
USGS. (2019). USGS. https://www.usgs.gov/the-national-map-data-delivery
Whitworth, K. W., Symanski, E., Lai, D., & Coker, A. L. (2011). Kriged and modeled ambient air levels of benzene in an urban environment: An exposure assessment study. Environmental Health: A Global Access Science Source, 10(1), 1–11. https://doi.org/10.1186/1476-069X-10-21
Xi, X., Wei, Z., Xiaoguang, R., Yijie, W., Xinxin, B., Wenjun, Y., & Jin, D. (2015). A comprehensive evaluation of air pollution prediction improvement by a machine learning method. 10th IEEE Int. Conf. on Service Operations and Logistics, and Informatics, SOLI 2015 - In Conjunction with ICT4ALL 2015, 176–181. https://doi.org/10.1109/SOLI.2015.7367615
Xie, X., Semanjski, I., Gautama, S., Tsiligianni, E., Deligiannis, N., Rajan, R. T., Pasveer, F., & Philips, W. (2017). A review of urban air pollution monitoring and exposure assessment methods. ISPRS International Journal of Geo-Information, 6(12), 1–21. https://doi.org/10.3390/ijgi6120389
Zhang, L., Tian, X., Zhao, Y., Liu, L., Li, Z., Tao, L., Wang, X., Guo, X., & Luo, Y. (2021). Application of nonlinear land use regression models for ambient air pollutants and air quality index. Atmospheric Pollution Research, 12(10), 101186. https://doi.org/10.1016/j.apr.2021.101186
Zou, B., Wilson, J. G., Zhan, F. B., & Zeng, Y. (2009a). Air pollution exposure assessment methods utilized in epidemiological studies. Journal of Environmental Monitoring, 11(3), 475–490.
Zou, B., Wilson, J. G., Zhan, F. B., & Zeng, Y. (2009b). An emission-weighted proximity model for air pollution exposure assessment. Science of the Total Environment, 407(17), 4939–4945. https://doi.org/10.1016/j.scitotenv.2009.05.014
Zou, B., Zheng, Z., Wan, N., Qiu, Y., & Wilson, J. G. (2016). An optimized spatial proximity model for fine particulate matter air pollution exposure assessment in areas of sparse monitoring. International Journal of Geographical Information Science, 30(4), 727–747. https://doi.org/10.1080/13658816.2015.1095921

No competing interests reported.

SupplementaryMaterialTablesandFigures11282022.docx

Editorial decision: Major revision
29 Aug, 2023
Reviews received at journal
27 Aug, 2023
Reviewers agreed at journal
08 Aug, 2023
Reviewers agreed at journal
11 Jan, 2023
Reviewers invited by journal
12 Dec, 2022
Editor assigned by journal
01 Dec, 2022
Submission checks completed at journal
01 Dec, 2022
First submitted to journal
28 Nov, 2022
