2.1 Data:
The datasets used (Kaggle, Reference 5) are as follows:
1. Air Quality Data in India (2015-2020): The Kaggle dataset contains air quality data and the AQI (Air Quality Index) at hourly and daily resolution for various stations across multiple cities in India. The columns are 'city', 'datetime', 'PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI', 'AQI_Bucket'. We restrict the data to the AQI records for Delhi only, to reduce the space and time complexity of training the models.
2. COVID-19 in India: The Kaggle dataset has state-wise and district-wise details of the total number of coronavirus cases, tests carried out, positivity rate based on current population, and other metrics. We again restrict the data to the total number of cases reported daily in Delhi from June 2020 to June 2021.
3. Delhi Weather Data: Obtained from Wunderground through its API, this dataset comprises temperature (average, minimum, and maximum), humidity, precipitation, and other details of Delhi weather conditions from 1990 to 2016. Further weather conditions and daily mean temperatures up to 2020 were obtained by scraping AccuWeather forecasts for Delhi.
2.2 Implementing Machine Learning:
We first use the AQI dataset to compute the correlation between PM2.5 concentration and AQI, which averages 0.8 over the data from 2015 to 2020, indicating a strong correlation between the two. We then train ensemble regressors, namely the Random Forest Regressor and the Gradient Boosting Regressor, on this data and obtain AQI predictions for the Indirapuram and Vasundhara localities from their outdoor PM2.5 concentrations.
Ensemble modeling is a process in which multiple diverse base models are combined to predict an outcome, with the combination acting as a single model. The motivation is to reduce the generalization error of the prediction; the approach draws on the wisdom of crowds. Most practical data science applications use ensemble modeling techniques. Following Leo Breiman's work (Breiman 2001), random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error of a forest converges almost surely to a limit as the number of trees becomes large. Random forests are therefore a suitable choice for predicting AQI values close to the actual conditions in Indirapuram and Vasundhara.
The Gradient Boosting Machine (GBM) algorithm is used for supervised machine learning and produces an ensemble of weak learners (Garcia de Oliveira, 2019). The most widely used implementations of GBM are LightGBM (Ke et al., 2017) and the XGBoost library (Chen and Guestrin, 2016). Despite being a collection of weak learners, GBM often outperforms other ensemble models once its hyperparameters are tuned. Accordingly, of the two models chosen for prediction, the Gradient Boosting Regressor, which is based on the GBM algorithm, performs better than the Random Forest Regressor.
The models predict the AQI values with test R2 scores of 0.774 (Random Forest Regressor) and 0.800 (Gradient Boosting Regressor), respectively. The AQI predictions were highest (mean of 375 with a standard deviation of 25) in November and December 2020, indicating high PM2.5 concentrations in these localities during that time of the year. Further, according to the Hindustan Times newspaper, Indirapuram and Vasundhara had their highest COVID-19 caseloads during November and December 2020. Although this does not indicate causality, it points to a correlation between particulate matter and COVID-19 infections, subject to external validation. In support of our findings, Nor, N.S.M., Yip, C.W., Ibrahim, N. et al. (2020) reported that particulate matter of diameter less than or equal to 2.5 µm could be a potential SARS-CoV-2 carrier. No correlation was found between the virus concentration and the diameter of the particulate matter (Marcazzan 2001). However, positive correlations between PM2.5 and other respiratory viruses, such as the influenza virus, have been reported previously, supporting the possibility that particulate matter acts as a transport carrier for SARS-CoV-2 (Xing 2016).
Table-1. Machine Learning Models Used

| Models | Hyperparameter Tuning | Cross-Validation | Test R2 score |
|---|---|---|---|
| Random Forest Regressor | 'max_depth': range(3,9); 'n_estimators': range(100,200,20); 'max_features': [3,4,5,6]; 'bootstrap': [True]; 'criterion': ['mse'] | Random Search CV with 80 iterations, random state of 1 and verbose of 2 | 0.77423 |
| Gradient Boosting Regressor | 'learning_rate': [0.1, 0.01]; 'max_depth': [3, 8]; 'min_samples_leaf': [3, 5]; 'max_features': [0.2, 0.6]; 'loss': ['huber'] | Grid Search CV with 3 folds instead of 5 (default) | 0.80018 |
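To give a sense of the size of the two search spaces in Table 1, the following minimal sketch counts the hyperparameter combinations using only the Python standard library. The variable and function names (`rf_space`, `gbr_space`, `count_candidates`) are illustrative and not taken from the original work; the value lists are copied from the table.

```python
from itertools import product

# Search spaces copied from Table 1.
rf_space = {
    "max_depth": list(range(3, 9)),             # 3, 4, ..., 8
    "n_estimators": list(range(100, 200, 20)),  # 100, 120, ..., 180
    "max_features": [3, 4, 5, 6],
    "bootstrap": [True],
    "criterion": ["mse"],
}
gbr_space = {
    "learning_rate": [0.1, 0.01],
    "max_depth": [3, 8],
    "min_samples_leaf": [3, 5],
    "max_features": [0.2, 0.6],
    "loss": ["huber"],
}


def count_candidates(space):
    """Number of distinct hyperparameter combinations in a search space."""
    return sum(1 for _ in product(*space.values()))


rf_candidates = count_candidates(rf_space)    # size of the full Random Forest grid
gbr_candidates = count_candidates(gbr_space)  # size of the full Gradient Boosting grid

rf_sampled = min(80, rf_candidates)  # the random search draws 80 settings
gbr_fits = gbr_candidates * 3        # the grid search fits every setting on 3 folds

print(rf_candidates, rf_sampled, gbr_candidates, gbr_fits)
```

The random search therefore evaluates 80 of the 120 possible Random Forest settings, while the grid search evaluates all 16 Gradient Boosting settings, giving 48 cross-validation fits. If the table refers to scikit-learn (an assumption, since the text does not name the library), note that recent releases spell the 'mse' criterion as 'squared_error'.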
The AQI dataset is further used to find the correlation of PM2.5 concentration with temperature and weather conditions. The temperature and weather conditions (i.e., humidity) for Delhi are obtained from the weather dataset described earlier. We merge the AQI and weather datasets on their common dates and compute the correlations. The correlation coefficient of PM2.5 with humidity is 0.076, and with temperature it is -0.41. Thus, in Delhi, temperature has a significant negative correlation and humidity an insignificant positive correlation with PM2.5 and, by extension, with the rate of COVID-19 infections. According to Yang Lv et al. (Yang Lv 2017), the prevalence of fog and haze seriously affects indoor air quality because it affects outdoor air quality, and indoor air quality is correlated with and highly influenced by outdoor air quality (Braniš 2005, Kim 2010). Research in Daqing, China, showed a significant positive correlation between indoor particle concentration and outdoor particle concentration, temperature, and humidity (p<0.05), although different building types showed clear differences. Temperature and humidity are important factors affecting the concentration of indoor particulate matter: the influence of indoor and outdoor temperature is greater for offices and classrooms with glass exterior walls, whereas relative humidity is the main factor for other buildings with concrete wall structures. In the Indian setting of Delhi, however, weather conditions and temperature had contrasting effects on particulate matter concentration, implying that the correlations differ not only from building to building but also from setting to setting.
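The merge-then-correlate step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the sample values are hypothetical, each series is assumed to be a mapping from date string to daily value, and the merge is taken to be an inner join on the dates present in both datasets.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlate_on_common_dates(series_a, series_b):
    """Inner-join two {date: value} mappings on date, then correlate them."""
    common = sorted(set(series_a) & set(series_b))
    return pearson([series_a[d] for d in common],
                   [series_b[d] for d in common])


# Hypothetical daily values, for illustration only (not the study data).
pm25 = {"2016-01-01": 310.0, "2016-01-02": 280.0,
        "2016-01-03": 150.0, "2016-01-04": 90.0}
temperature = {"2016-01-02": 12.0, "2016-01-03": 18.0,
               "2016-01-04": 25.0, "2016-01-05": 27.0}

r = correlate_on_common_dates(pm25, temperature)
print(round(r, 3))
```

Only the three dates present in both mappings enter the calculation, which mirrors merging the AQI and weather datasets on their common dates before computing the coefficient.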
This indicates the presence of other external factors, such as casual behaviour of citizens, a low testing rate, a slow vaccination drive, inadequate measures, and a lack of strict lockdowns and restrictions. Hence, although the correlation between COVID-19 infections and PM2.5 concentration (or the predicted AQI) is strong and positive, the Pearson correlation coefficient is estimated at 0.68 because of these other cofactors. This is known as external validity in research [31].
According to the Health Effects Institute's 2019 report, particulate matter (PM) pollution was the third most important cause of death in 2017, with the rate being highest in India. Air pollution was estimated to cause over 1.1 million premature deaths in India in 2017 (HEI 2019), of which 56% were due to exposure to outdoor PM2.5 and 44% were attributed to indoor air pollution. As per WHO (2016), one death in nine in 2012 was attributed to air pollution, of which around three million deaths were solely due to outdoor air pollution. According to an article (Emily Henderson 2020), 1.67 million deaths occurred in India due to air pollution in 2019. Assuming a population of about 1.3 billion, this corresponds to a mortality rate associated with PM2.5 exposure in 2019 of roughly 1.28 deaths per 1,000 people (12.846 per 10,000). Given the pandemic and the increasing pollution in India despite several efforts by the government, it is reasonable to assume that the mortality rate due to PM2.5 exposure has increased in the last two years.
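The rate implied by the 2019 death count can be checked with a short calculation. The population figure of 1.3 billion is an assumption made here for illustration; it is not stated in the text, and the variable names are hypothetical.

```python
deaths_2019 = 1.67e6  # deaths attributed to air pollution in India in 2019 (from the text)
population = 1.3e9    # ASSUMPTION: approximate population of India

deaths_per_person = deaths_2019 / population
rate_per_1000 = deaths_per_person * 1_000
rate_per_10000 = deaths_per_person * 10_000

print(f"{rate_per_1000:.4f} per 1,000 people")
print(f"{rate_per_10000:.3f} per 10,000 people")
```

Under this assumption the figures give about 1.28 deaths per 1,000 people, which is the same as 12.846 deaths per 10,000 people.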
Beixi Jia et al. (2021) found that the estimated PM2.5-related mortality in India increased at an annual rate of 2.7% during 1998-2015. The article further states that aggressive air pollution control strategies should be adopted in North India because of the current health risks. Based on this assumption, we use the formula given in NCBI's report on mortality due to indoor PM2.5 exposure (Ji W 2015):
![](https://myfiles.space/user_files/58894_9946feeafa4c1df7/58894_custom_files/img1629867628.JPG)
where Δlog M_all,j is the increase in mortality due to the jth outcome associated with total PM exposure for each 10 μg/m3 increase in outdoor PM10 or PM2.5; j represents three major health outcomes: all-cause, cardiovascular, and respiratory mortality.
ΔC_out is the increase in outdoor PM10 or PM2.5 concentration, which is set to 10 μg/m3.
ΔC_out-in is the increase in outdoor-originated PM10 or PM2.5 concentration found in the indoor environment.
t_out is the duration of direct exposure to outdoor PM pollution.
t_in is the duration of indoor exposure to PM of outdoor origin.
Δlog M_in,j is the increase in mortality due to the jth outcome associated with indoor exposure to outdoor-origin PM for each 10 μg/m3 increase in PM10 or PM2.5. Here, we use the PM2.5 concentration change explicitly.
Using this formula, we obtain a ratio of 3:7 between Δlog M_in,j and Δlog M_all,j. That is, for an increase of 7 units in mortality due to the jth outcome associated with total PM exposure for each 10 μg/m3 increase in outdoor PM2.5, there is an increase of 3 units in mortality due to the jth outcome associated with indoor exposure to outdoor-origin PM. The calculations assume a time span of 24 hours and a ΔC_out-in of 7.5 μg/m3, because according to Leung Dennis Y.C. (2015), approximately 75% of the variation in indoor air pollutant concentration is due to variation in outdoor air pollutant concentration. Previously, Douglas W. Dockery et al., based on a survey model, had estimated that the mean infiltration rate of outdoor fine particulates was approximately 70%, and that full air conditioning of a building reduced the infiltration of outdoor fine particulates by about one half while preventing dilution and purging of internally generated pollutants. For the Delhi suburban setting, however, we see that the infiltration rate, although within the 95% range of a normal distribution with a mean of 70%, tends to be on the higher side because of the high pollution levels in India. Further, according to Chun Chen et al. (2011), indoor/outdoor ratios vary considerably because of differences in size-dependent indoor particle emission rates, the geometry of the cracks in building envelopes, and air exchange rates. Thus, it is difficult to draw uniform conclusions. For our case study, however, we find that the indoor environment is highly influenced by the outdoor environment: when the outdoor PM2.5 concentration influences the indoor concentration by 75%, a 70% increase in mortality due to outdoor PM2.5 is accompanied by a 30% increase in mortality due to the rise in indoor PM2.5.
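The arithmetic behind the indoor increment can be sketched as below. It assumes, as an interpretation of the text, that the quoted percentages act as a simple multiplicative infiltration fraction on the 10 μg/m3 outdoor increment; the function and variable names are illustrative and do not come from the cited works.

```python
DELTA_C_OUT = 10.0  # μg/m3, outdoor PM2.5 increment used in the text


def outdoor_origin_indoor_increment(delta_c_out, infiltration_fraction):
    """Indoor increment of outdoor-origin PM for a given outdoor increment.

    ASSUMPTION: infiltration is modelled as a plain multiplicative fraction.
    """
    return infiltration_fraction * delta_c_out


# 75% is the value used in this study; 70% is the mean reported by Dockery et al.
delta_c_out_in_study = outdoor_origin_indoor_increment(DELTA_C_OUT, 0.75)
delta_c_out_in_dockery = outdoor_origin_indoor_increment(DELTA_C_OUT, 0.70)
# Full air conditioning roughly halves infiltration (Dockery et al.).
delta_c_out_in_air_conditioned = outdoor_origin_indoor_increment(DELTA_C_OUT, 0.70 * 0.5)

# Reported 3:7 ratio between the indoor-attributed and total mortality increase.
indoor_share_of_total = 3 / 7

print(delta_c_out_in_study, delta_c_out_in_dockery, delta_c_out_in_air_conditioned)
print(round(indoor_share_of_total, 3))
```

With the 75% figure the sketch reproduces the ΔC_out-in of 7.5 μg/m3 used in the calculations, and the 3:7 ratio corresponds to an indoor share of about 43% of the total mortality increase.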
Further, based on a study published in PNAS (Z. Bazant 2021), the concentration of pathogen C(r,t) suspended in droplets of radius r at 25℃, exhaled by an infected person in a room with a healthy person in the vicinity, can be quantified as:
Rate of change of C(r,t) = Production rate from exhalation − L_r    (2)
where L_r is the loss rate of pathogens from ventilation, filtration, sedimentation, and deactivation.
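The balance in equation (2) can be illustrated with a small numerical sketch. It assumes, for illustration only, a constant production rate P and a loss rate proportional to the concentration (loss = λ·C), so that dC/dt = P − λ·C. The parameter values and names are hypothetical and are not taken from the cited study.

```python
def simulate_concentration(production_rate, loss_rate_constant, hours, dt=0.001):
    """Forward-Euler integration of dC/dt = P - lambda * C, starting from C = 0.

    ASSUMPTION: the loss term is first order in the concentration.
    """
    concentration = 0.0
    steps = int(round(hours / dt))
    for _ in range(steps):
        concentration += dt * (production_rate - loss_rate_constant * concentration)
    return concentration


# Hypothetical values, for illustration only.
P = 50.0      # production rate from exhalation, quanta per m3 per hour
LAMBDA = 2.0  # combined loss rate constant (ventilation, filtration, etc.), per hour

c_after_5h = simulate_concentration(P, LAMBDA, hours=5.0)
c_steady = P / LAMBDA  # steady state, where production balances loss

print(round(c_after_5h, 3), c_steady)
```

Under these assumptions the concentration rises towards the steady-state value P/λ, so a larger loss rate (for example, better ventilation or filtration) lowers the level that the room air eventually reaches.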
For SARS-CoV-2, Buonanno et al. (2020) estimated a Cq range of 10.5 to 1,030 quanta/m3, based on the estimated infectivity ci = 0.01 to 0.1 of SARS-CoV-2 and the reported viral loads in sputum, although the precise value depends strongly on the infected person's respiratory activity. Here, Cq is the concentration of infection quanta exhaled by an infectious individual. Hence, it becomes very important to implement air purification and ventilation, along with proper maintenance of the 6 ft rule, even within households. When the indoor PM2.5 concentration increases, the probability of being infected by these suspended pathogenic droplets increases, given that the virus can use particulate matter as a carrier. This explains the increase in the probability of indoor mortality when there is an increase in outdoor particulate matter concentration.