The present study incorporated a custom-designed pruned feed-forward neural network (pruned-FNN) to simulate the complex and non-linear relationships from air pollution emission to individual exposure. When predicting exposure indexes using the model, emission time and emission rate of air pollutants, terrain factors, meteorological conditions, and proximity measurements are used as input variables (Hill et al., 2019; Prabhakaran et al., 2020; Ren et al., 2020). Monitoring data is considered as the ground truth to train, calibrate, and cross-validate the prune-FNN.
Data
We used four datasets in the state of New Mexico and surrounding areas during 2007–2017 in the study, including: (1) the air emission data of air pollutants compiled by the U.S. EPA Toxic Release Inventory (TRI) program (U.S. EPA, 2022c); (2) the air monitoring data from the U.S. EPA Air Quality System (AQS) DataMart (U.S. EPA, 2020b); (3) the climate data, including wind data, temperature data, and humidity data from the North American Regional Reanalysis (NARR) by NOAA (NARR, 2022), and (4) the terrain data from the United States Geological Survey (USGS, 2019). All variables within the datasets are potentially related to the complex process of air pollution dispersion. Due to the fact that wind is taken into consideration in the dispersion process, air emission data, climate data, and terrain data include not only the main study area (the state of New Mexico), but also its surrounding areas (latitude range: 30.5 ~ 38.5; longitude range: -110.7~-101.5).
The U.S. EPA TRI program is a mandatory program created under Section 313 of the Emergency Planning and Community Right-to-Know Act (EPCRA) to facilitate emergency planning and inform the public about releases of toxic chemicals in their communities (U.S. EPA, 2022c). Under the TRI program, U.S. industrial facilities must annually report information about their locations, types of chemicals released, and estimated quantities of chemicals released into the environment. We selected the TRI emission data for New Mexico and the surrounding areas for 2007–2017 from the national TRI database. A total of 139 types of pollutants are released by 369 factories during this time period (Fig. 1). For every pollutant from a given factory, the sum of annual emissions from both the stack and fugitive pathways is used as the total emission amount in the exposure assessment model.
Air monitoring data were obtained from the U.S. EPA AQS DataMart, which is a database that contains every measured value as well as associated aggregate values (8-hour, daily, annual, etc.) collected through the national ambient air monitoring program (U.S. EPA, 2020b). In New Mexico, 63 active monitoring sites were available to record the concentrations of 255 unique chemicals between 2007 and 2017 (Fig. 1). The study used the annual average concentrations of each pollutant from the monitors to match the temporal scale of the air emission data.
North American Regional Reanalysis (NARR) is an extension of NOAA's National Center for Atmospheric Prediction (NCEP) Global Reanalysis, which provides a long-term, consistent, high-resolution climate dataset for North America (Mesinger et al., 2006). This study extracted the daily mean values of temperature, humidity, zonal wind velocity (horizontal wind towards east), and meridional wind velocity (horizontal wind towards north) for New Mexico and surrounding areas during 2007–2017. The spatial resolution of the climate data grid is approximately 32 kilometers. For each year in each grid, we calculated the mean temperature, the mean humidity, the average wind speed, and the prevailing wind direction so that these variables would match the annual scale of other variables. Combining the wind speed and wind direction, we calculated a wind index value \({W}_{i,j}\)between an emission source j and any given location i in New Mexico during 2007–2017 using the Eq. (1):
$${W}_{i,j}=\left\{\begin{array}{c}v\text{*cos}(\left(\theta -180\right)-\alpha ), \text{cos}(\left(\theta -180\right)-\alpha )\ge 0\\ 0, \text{cos}\left(\left(\theta -180\right)-\alpha \right)<0\end{array}\right.$$
1
Where \(\theta\) is the wind angle at the location i; \(\alpha\) is the direction from the emission source j to the location i; and v is the wind speed at location i. The wind index generally assigns larger values to places downwind of the emission sources with higher wind speed.
The 30-meter DEM data for New Mexico and surrounding areas is obtained from the National Map Data Download and Visualization Services of the United States Geological Survey (USGS, 2019). The data are used to calculate elevation differences from a specific location to different emission sources, which are one of the model input variables.
Methods
Data pre-processing
Because we estimate the exposure using the air emission data as inputs and use the air monitoring data as the ground truth for training and cross-validating the prune-FNN model, we need to identify the chemicals shared between emission and monitoring records. Twelve chemicals of PM2.5 composites were selected for further analysis, including arsenic, aluminum, cadmium, chlorine, chromium, copper, lead, manganese, nickel, selenium, vanadium, and zinc, which contribute a total of 1113 annual average monitoring records of pollutant concentrations in New Mexico during 2007 to 2017. We excluded two negative monitoring records. We also excluded 221 positive records with no emission found in our data, because these records violate the assumption of the pruned-FNN model that the monitor reading is additively contributed by each factory nearby. Therefore, a total of 916 monitoring records were used as ground truth for model training. Due to the limited quantity of samples for each pollutant (no more than 89 samples for a single chemical), we made an assumption that all PM2.5 composites share roughly similar dispersion pattern. This assumption is a compromise over the limited samples for model training.
The monitor readings have a heavily skewed distribution, with the mean value of all pollution concentrations as 0.00319 microgram/m3 and the standard deviation as 0.0127 microgram/m3. If use the readings directly as training data, the pruned-FNN will be primarily able to fit the large pollution concentration values. To overcome this issue, we add an offset value of 0.0003 microgram/m3 (median of all pollution concentrations) to each monitoring records and perform a natural log transformation on the training data before feeding it into the pruned-FNN. After model training and prediction, we can obtain an exposure index value for pollution concentration by performing an exponential reverse transformation on pruned-FNN’s output and then subtracting the offset from the result. Because of the aforementioned assumptions, the small sample sizes, and skewed distributions in the training data, the model might not predict the real concentrations of air pollutants directly. Instead, the estimated results can be interpreted as pollution exposure indexes (PEIs) which is proportional to the real concentrations of air pollutants.
Neural network structure
The PEI of a given chemical k at a given location i is calculated by the Eq. (2):
$${PEI}_{i, k}=\sum _{j=1}^{n}{g}_{j}({T}_{i},{H}_{i},{EM}_{j,k},{D}_{i,j},{ED}_{i,j},{W}_{i,j })$$
2
In this expression, \({PEI}_{i, k}\) is the predicted exposure index of chemical k at the location i; j represents one of the n total emission sources (factories). \({PEI}_{i, k}\) is the sum of all gj (j = 1,2, …, n), where each gj represents the contribution of PEI at location i from one single emission source j. The gj is generated by the complex air pollutant dispersion process, which can be depicted by six independent variables as follows. \({T}_{i}\) is the temperature at the location i; \({H}_{i}\) is the humidity at the location i; \({EM}_{j,k}\) represents the sum of fugitive and stack emissions of chemical k from emission source j; \({D}_{i,j}\) is the distance from location i to the emission source j; \({ED}_{i,j}\) represents the elevation difference between emission source j and location i; and \({W}_{i,j}\) is the calculated wind index between emission source j and location i.
The pruned-FNN is designed to fit the function gj through machine learning in the training dataset. In the training dataset, the monitoring sites are considered as exposure receptors of pollutant emissions. The target variable of the training dataset is the monitoring records of pollutant concentrations; the prediction variables of the training dataset include the distances, elevation differences, the annual wind index from each monitor to each of the 369 factories (\({D}_{i,j}\), \({ED}_{i,j}\), and \({W}_{i,j}\)), annual average temperature and humidity at each monitoring site (\({T}_{i}\) and \({H}_{i}\)), and annual emission records from all 369 factories (\({EM}_{j,k}\)) in New Mexico and surrounding areas during 2007–2017.
As shown in Fig. 2, the pruned-FNN has 1478 inputs in the input layer, 1107 neurons in the first hidden layer, 369 neurons in the second hidden layer, and 1 neuron in the output layer according to our training dataset, which represent the entire contribution of PEI from all emission sources. The entire 1478 inputs are not fully connected in the model. There are 369 substructures in the pruned-FNN, each of which represents a single emission source’s contribution (see Fig. S1). Each substructure contains six input variables mentioned in Eq. (2), two hidden layers, and one output. Because the temperature (T) and humidity (H) at the receptor location are independent from the emission sources, they are shared by all substructures of the emission sources. In the pruned-FNN, the first and second neurons in the input layer represent the temperature (T) and humidity (H); these two neurons are fully connected with all 1107 neurons in the first hidden layer. For the other four variables (distances (D), elevation differences (ED), the annual wind index (W), and annual emission records (EM)), they are only connected in their corresponding substructure. Therefore, in the entire pruned-FNN (Fig. 2), the neuron number 3 to number 6 in the input layer represent these four input features of the first emission source; and neuron number (4i-1) to number (4i + 2) in the input layer represent the input features of the i-th (i = 1,2,…, 369) emission source. The input neurons of a single emission source i will be fully connected with neuron a[1]3i−2, a[1]3i−1, and a[1]3i in the first hidden layer, and further fully connected with the neuron a[2]i in the second hidden layer. All neurons in the second hidden layer will be fully connect with the output layer, which contains only one neuron. Compared with a fully connected neural network having more than 100,000 weights to tune, this study uses only 4,428 weights as tuning parameters. This reduces the size of the search space for solutions significantly and improves the performance of the model training process. It is worth noting that the number of neurons in the current model is related to the 369 factories in the training dataset. The quantity of substructures and corresponding neurons can be changed flexibly based on the number of emission sources. The hyper-parameters of our neural network, such as the learning rate, activation functions, and layer number, are selected using a heuristic approach. The goal is to incrementally adjust these hyper-parameters until we reach an acceptable balance between prediction accuracy and training time.
Method validation
We used two methods to validate the effectiveness of the pruned-FNN: 10-fold cross validation and random cross validation. In the 10-fold cross validation, the data were first shuffled randomly, then divided into 10 groups of equal size, each of which is called a fold. Nine folds are used as training datasets to predict the values for the remaining fold; then the predicted values are compared with the true values in the remaining fold. The process continues 10 times to validate each unique fold. In this study, we did 10-fold cross validation for 10 times which has 100 validations in total. We also run random cross validation for 100 times. In one random cross validation, we randomly chose 20% data entries as validation set and the other 80% as the training set.
In each time of validation, correlations and errors between the predicted values and monitored values were recorded. Since there was no guarantee that the normal distribution assumption could be met in the air pollutant concentrations, the non-parametric Spearman-rank correlation coefficient is used for correlation measurement. We also recorded the relative absolute error (RAE) for error measurement, which is defined as Eq. (3):
$$RAE= \frac{{\sum }_{i=1}^{k}|{R}_{i}-{P}_{i}|}{{\sum }_{i=1}^{k}\left|{R}_{i}\right|}$$
3
where Ri is the i-th monitor’s reading value; Pi is the predicted value for at the location of the i-th monitor; and k is the total number of readings in the validation set.