3.2 Identification of Possible Sources of Air Pollution
In this study, five datasets were created and named PC2018, PC2019, PC2020, PC2021, and PC2022. To confirm that the datasets were suitable for PCA, preliminary assessments were conducted using Bartlett’s test of sphericity and the Kaiser-Meyer-Olkin (KMO) test. These tests evaluate the correlation between variables and the sampling adequacy of each dataset, respectively. Table 5 shows that, for every dataset, the observed chi-square value far exceeds the critical value of 24.996 (DF = 15) and the p-value is < 0.0001, which is below the alpha level of 0.05. The null hypothesis of sphericity (that the variables are uncorrelated) is therefore rejected, indicating statistically significant correlations among the variables and supporting the use of PCA.
Table 5
Results of Bartlett’s test of sphericity for each dataset.
| Statistic | PC2018 | PC2019 | PC2020 | PC2021 | PC2022 |
|---|---|---|---|---|---|
| Chi-square (Observed value) | 63372.774 | 83159.321 | 48866.259 | 48722.757 | 44591.538 |
| Chi-square (Critical value) | 24.996 | 24.996 | 24.996 | 24.996 | 24.996 |
| DF | 15 | 15 | 15 | 15 | 15 |
| p-value (Two-tailed) | < 0.0001 | < 0.0001 | < 0.0001 | < 0.0001 | < 0.0001 |
| alpha | 0.050 | 0.050 | 0.050 | 0.050 | 0.050 |
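As an illustration of how the statistics in Table 5 arise, the following is a minimal sketch of Bartlett's test of sphericity. It assumes NumPy is available and uses synthetic data as a stand-in for the six pollutant series; the function name `bartlett_sphericity` and the generated data are illustrative assumptions, not the software or data used in this study. With six variables the degrees of freedom are p(p - 1)/2 = 15, which matches the DF row of Table 5.

```python
import numpy as np

def bartlett_sphericity(data):
    """Return (chi_square, degrees_of_freedom) for Bartlett's test of sphericity.

    `data` is an (n_samples, n_variables) array. The null hypothesis is that
    the correlation matrix is an identity matrix (uncorrelated variables).
    """
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)        # p x p correlation matrix
    _, log_det = np.linalg.slogdet(corr)          # log of the determinant of R
    chi_square = -(n - 1 - (2 * p + 5) / 6) * log_det
    dof = p * (p - 1) // 2
    return chi_square, dof

# Synthetic stand-in for six correlated pollutant series (NOT the study data).
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))                # hypothetical common driver
data = shared + 0.8 * rng.normal(size=(500, 6))

chi_square, dof = bartlett_sphericity(data)
critical_value = 24.996                           # chi-square critical value from Table 5
print(dof, chi_square > critical_value)
```

An observed chi-square above the critical value leads to rejection of the null hypothesis, which is the decision reported for all five datasets.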
The results of the KMO test in Table 6 show overall values ranging from 0.626 (PC2019) to 0.659 (PC2022), with 0.650 for PC2018, which indicates the degree of correlation between variables and the suitability of the data for PCA. According to Rencher (2003) and Shrestha (2021), a value above 0.6 is rated mediocre but is acceptable for further analysis.
Table 6
Results of the Kaiser-Meyer-Olkin (KMO) test for each dataset.
| Statistic | PC2018 | PC2019 | PC2020 | PC2021 | PC2022 |
|---|---|---|---|---|---|
| PM10 (µg/m3) | 0.611 | 0.604 | 0.617 | 0.600 | 0.615 |
| PM2.5 (µg/m3) | 0.609 | 0.595 | 0.606 | 0.596 | 0.602 |
| SO2 (ppm) | 0.789 | 0.711 | 0.643 | 0.644 | 0.846 |
| NO2 (ppm) | 0.681 | 0.616 | 0.702 | 0.724 | 0.842 |
| O3 (ppm) | 0.582 | 0.718 | 0.479 | 0.480 | 0.449 |
| CO (ppm) | 0.717 | 0.692 | 0.733 | 0.789 | 0.804 |
| KMO | 0.650 | 0.626 | 0.644 | 0.638 | 0.659 |
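The per-variable and overall KMO values in Table 6 compare squared correlations with squared partial correlations. The sketch below shows one common way to compute them from a correlation matrix, assuming NumPy; the function name `kmo` and the small 3 x 3 example matrix are illustrative assumptions and are not taken from the study data.

```python
import numpy as np

def kmo(corr):
    """Return (per_variable_kmo, overall_kmo) for a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    inv = np.linalg.inv(corr)
    # Partial correlations from the inverse: -inv_ij / sqrt(inv_ii * inv_jj).
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    # Only off-diagonal elements enter the statistic.
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.where(off_diagonal, corr ** 2, 0.0)
    q2 = np.where(off_diagonal, partial ** 2, 0.0)
    per_variable = r2.sum(axis=0) / (r2.sum(axis=0) + q2.sum(axis=0))
    overall = r2.sum() / (r2.sum() + q2.sum())
    return per_variable, overall

# Hypothetical example: three variables with pairwise correlation 0.5.
example = np.array([[1.0, 0.5, 0.5],
                    [0.5, 1.0, 0.5],
                    [0.5, 0.5, 1.0]])
per_variable, overall = kmo(example)
print(round(overall, 3))  # 0.692
```

Values closer to 1 mean that partial correlations are small relative to the raw correlations, which is why a higher KMO indicates better suitability for PCA.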
PCA was performed on each dataset, which comprised six variables: O3, CO, NO2, SO2, PM2.5, and PM10. For all datasets, the analysis yielded two PCs with eigenvalues exceeding 1.0 (Kim and Mueller, 1987). The scree plots in Fig. 2(i)-(v) show the eigenvalues plotted against the component number in descending order, so that the first components account for the largest share of the variance in the data.
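The eigenvalue-greater-than-one rule can be sketched as follows, assuming NumPy. The correlation matrix here is a hypothetical one built from two independent groups of three correlated variables; it is chosen only so that the outcome is easy to verify by hand and does not represent the study data. The function name `kaiser_retained` is likewise an assumption.

```python
import numpy as np

def kaiser_retained(corr):
    """Eigenvalues (descending), percent variance, and count of eigenvalues > 1."""
    corr = np.asarray(corr, dtype=float)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]      # sorted in descending order
    explained = eigenvalues / eigenvalues.sum() * 100.0
    n_retained = int((eigenvalues > 1.0).sum())
    return eigenvalues, explained, n_retained

# Hypothetical 6 x 6 correlation matrix: two blocks with within-block r = 0.6.
block = np.full((3, 3), 0.6)
np.fill_diagonal(block, 1.0)
zeros = np.zeros((3, 3))
corr = np.block([[block, zeros], [zeros, block]])

eigenvalues, explained, n_retained = kaiser_retained(corr)
print(n_retained, np.round(eigenvalues, 2))
```

For this example the eigenvalues are 2.2, 2.2, and four values of 0.4, so two components are retained; plotting `eigenvalues` against the component number gives the scree plot.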
Varimax rotation was performed on the two retained PCs, and the resulting factors (VF1 and VF2) are displayed in Table 7. Variables with a factor loading greater than 0.75 in absolute value were selected as major pollutants. Figure 3(i)-(v) illustrates the percentage of variance after varimax rotation.
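For readers unfamiliar with the rotation step, the sketch below implements a commonly used SVD-based varimax iteration, assuming NumPy. The unrotated loadings are hypothetical values and the function name `varimax` is an assumption; this is not the routine used by the statistical software in this study. The percentage of variance per rotated factor is computed as the sum of squared loadings divided by the number of variables, which is consistent with the Variability (%) column of Table 7.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a (variables x factors) loading matrix using the varimax criterion."""
    loadings = np.asarray(loadings, dtype=float)
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Matrix whose SVD yields the next orthogonal rotation (raw varimax).
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        current = s.sum()
        if previous != 0.0 and current < previous * (1 + tol):
            break                                   # criterion no longer improving
        previous = current
    return loadings @ rotation

# Hypothetical unrotated loadings for six variables on two components.
unrotated = np.array([[0.80, 0.45], [0.82, 0.43], [0.50, -0.10],
                      [0.70, -0.55], [0.20, 0.75], [0.68, -0.50]])
rotated = varimax(unrotated)

# Percent of total variance carried by each rotated factor.
variability = (rotated ** 2).sum(axis=0) / unrotated.shape[0] * 100
print(np.round(rotated, 3), np.round(variability, 3))
```

Because the rotation is orthogonal, the communality of each variable and the cumulative variance are unchanged; only the distribution of variance between the factors changes.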
Table 7
Factor loading after varimax rotation.
| Dataset | Factor | PM10 (µg/m3) | PM2.5 (µg/m3) | SO2 (ppm) | NO2 (ppm) | O3 (ppm) | CO (ppm) | Variability (%) | Cumulative (%) |
|---|---|---|---|---|---|---|---|---|---|
| PC2018 | VF1 | 0.848 | 0.866 | 0.213 | 0.212 | 0.710 | 0.219 | 35.199 | 35.199 |
| PC2018 | VF2 | 0.350 | 0.361 | 0.522 | 0.881 | -0.333 | 0.824 | 34.858 | 70.058 |
| PC2019 | VF1 | 0.893 | 0.899 | 0.357 | 0.191 | 0.642 | 0.437 | 39.578 | 39.578 |
| PC2019 | VF2 | 0.296 | 0.299 | -0.039 | 0.853 | -0.523 | 0.798 | 30.245 | 69.823 |
| PC2020 | VF1 | 0.919 | 0.934 | 0.097 | 0.665 | 0.256 | 0.636 | 43.986 | 43.986 |
| PC2020 | VF2 | -0.093 | -0.101 | 0.019 | 0.571 | -0.833 | 0.515 | 21.749 | 65.735 |
| PC2021 | VF1 | 0.877 | 0.911 | 0.188 | 0.732 | 0.128 | 0.689 | 44.359 | 44.359 |
| PC2021 | VF2 | 0.223 | 0.201 | 0.287 | -0.410 | 0.875 | -0.315 | 20.088 | 64.447 |
| PC2022 | VF1 | 0.889 | 0.914 | 0.237 | 0.743 | 0.123 | 0.636 | 44.256 | 44.256 |
| PC2022 | VF2 | 0.184 | 0.196 | 0.299 | -0.293 | 0.881 | -0.370 | 19.330 | 63.586 |
For dataset PC2018, VF1 and VF2 explained 35.199% and 34.858% of the variance, respectively, for a cumulative 70.058%. The two highest positive factor loadings in VF1 were from PM10 (0.848) and PM2.5 (0.866), while those in VF2 were from NO2 (0.881) and CO (0.824).
In PC2019, the cumulative variance was 69.823%, with VF1 contributing 39.578% and VF2 contributing 30.245%. The major pollutants in VF1 were PM10 and PM2.5, with positive factor loadings of 0.893 and 0.899, respectively, while VF2 had its highest positive factor loadings from NO2 (0.853) and CO (0.798).
PC2020 showed a cumulative variance of 65.735%, with VF1 contributing 43.986% and VF2 contributing 21.749%. The two highest positive loadings in VF1 were from PM10 (0.919) and PM2.5 (0.934), while VF2 was dominated by a strong negative loading from O3 (-0.833).
In PC2021, VF1 contributed 44.359% and VF2 contributed 20.088%, for a cumulative variance of 64.447%. PM10 and PM2.5 were identified as the major pollutants in VF1 and O3 in VF2, with factor loadings of 0.877, 0.911, and 0.875, respectively.
For PC2022, the cumulative variance after varimax rotation was 63.586%, obtained from VF1 (44.256%) and VF2 (19.330%). PM10 and PM2.5 were the major pollutants in VF1, with positive factor loadings of 0.889 and 0.914, respectively, while O3 in VF2 had a positive factor loading of 0.881.
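The selection of major pollutants can be reproduced from the rotated loadings in Table 7. The following sketch copies those loadings and applies the 0.75 threshold to the absolute value of each loading; the use of the absolute value is an assumption made here so that the negative O3 loading in PC2020 is captured, and the names `LOADINGS` and `strong_variables` are illustrative.

```python
POLLUTANTS = ["PM10", "PM2.5", "SO2", "NO2", "O3", "CO"]

# Rotated loadings copied from Table 7: dataset -> (VF1, VF2).
LOADINGS = {
    "PC2018": ([0.848, 0.866, 0.213, 0.212, 0.710, 0.219],
               [0.350, 0.361, 0.522, 0.881, -0.333, 0.824]),
    "PC2019": ([0.893, 0.899, 0.357, 0.191, 0.642, 0.437],
               [0.296, 0.299, -0.039, 0.853, -0.523, 0.798]),
    "PC2020": ([0.919, 0.934, 0.097, 0.665, 0.256, 0.636],
               [-0.093, -0.101, 0.019, 0.571, -0.833, 0.515]),
    "PC2021": ([0.877, 0.911, 0.188, 0.732, 0.128, 0.689],
               [0.223, 0.201, 0.287, -0.410, 0.875, -0.315]),
    "PC2022": ([0.889, 0.914, 0.237, 0.743, 0.123, 0.636],
               [0.184, 0.196, 0.299, -0.293, 0.881, -0.370]),
}

def strong_variables(loadings, threshold=0.75):
    """Names of pollutants whose absolute loading exceeds the threshold."""
    return [name for name, value in zip(POLLUTANTS, loadings)
            if abs(value) > threshold]

selected = {dataset: strong_variables(vf1) + strong_variables(vf2)
            for dataset, (vf1, vf2) in LOADINGS.items()}
for dataset, names in selected.items():
    print(dataset, names)
```

Under this rule, PC2018 and PC2019 select PM10, PM2.5, NO2, and CO, while PC2020, PC2021, and PC2022 select PM10, PM2.5, and O3.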
Based on the results, the major pollutants were similar across the five-year period. PM10 and PM2.5 showed strong positive loadings in every model, ranging from 0.848 to 0.919 and from 0.866 to 0.934, respectively. According to Rahman et al. (2015), PM10 and PM2.5 stem from motor vehicles, factories, power generators, construction sites, quarries, and incinerators, which collectively contribute to atmospheric pollutant levels. Furthermore, Malaysia experienced a haze event characterized by the highest recorded intensity of PM2.5, attributable to both open burning and transboundary haze from Sumatera and Kalimantan in neighboring Indonesia (Latif et al., 2018; Liyana Zakri et al., 2018; Ab. Rahman et al., 2022).
NO2 showed strong positive loadings in 2018 and 2019, with values of 0.881 and 0.853, respectively. The presence of NO2 in ambient air is mainly caused by industrial operations and heavy traffic (Isiyaka & Azid, 2015; Ismail et al., 2017), and Dominick et al. (2012) likewise concluded that NO2 is a product of traffic congestion and manufacturing activities. From 2020 to 2022, the NO2 loadings fell below the 0.75 threshold, which can be attributed to the decrease in industrial and commercial activities during the COVID-19 pandemic (Mazlan et al., 2022).
O3 showed strong positive factor loadings in 2021 (0.875) and 2022 (0.881) but a strong negative loading in 2020 (-0.833). According to Mazlan et al. (2022), O3 levels rose during the post-MCO period as industries resumed operations, road traffic increased, and people resumed their activities. The negative loading in 2020 indicates an inverse relationship; in this context, the decrease in VOCs and NOx during the pandemic is expected to have resulted in a reduction of ozone (Tavella & da Silva Júnior, 2021).
CO showed strong positive loadings in 2018 and 2019, with values of 0.824 and 0.798, respectively. High concentrations of CO are primarily associated with incomplete fuel combustion in automobiles, making it an important indicator of atmospheric contamination in this region (Dominick et al., 2012; Angatha and Mehar, 2020).
3.2.3 Forecasting Air Quality Using ANN
The models for forecasting the API were developed by combining the results obtained from PCA with ANN. The analysis was conducted using the MLP-FF network in JMP10 software, and the resulting models were named ANN-PC2018, ANN-PC2019, ANN-PC2020, ANN-PC2021, and ANN-PC2022.
The R2 and RMSE results of 10 network structures were compared for each model. R2 ranges from zero to one, with higher values indicating greater explanatory power, so the best-performing model is the one with the highest R2 and the lowest RMSE (Chenard & Caissie, 2008; Nasir, 2011; Azid et al., 2013). According to Rumsey (2011), an R2 greater than 0.90 is considered significant with perfect linear regression, 0.70–0.89 significant with strong linear regression, and 0.50–0.69 significant with moderate linear regression, whereas 0.30–0.49 and 0.00–0.29 are considered not significant, with a weak linear relationship and no linear relationship, respectively.
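The two performance metrics and the interpretation bands can be written out as a short sketch. It assumes that R2 is computed as one minus the ratio of the residual to the total sum of squares, which may differ in detail from the definition used by the software in this study, and it treats the interpretation bands as contiguous so that values between 0.89 and 0.90 fall in the strong category. The actual and predicted API values are hypothetical.

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2_category(r2):
    """Interpretation band for R2 (bands assumed contiguous)."""
    if r2 > 0.90:
        return "perfect"
    if r2 >= 0.70:
        return "strong"
    if r2 >= 0.50:
        return "moderate"
    if r2 >= 0.30:
        return "weak"
    return "none"

# Hypothetical actual and predicted API values.
actual = [50, 60, 70, 80]
predicted = [52, 58, 71, 79]
score = r_squared(actual, predicted)
error = rmse(actual, predicted)
print(round(score, 4), round(error, 4), r2_category(score))
```

Applied to the validation R2 values reported in this section, which range from 0.7586 to 0.8356, `r2_category` returns "strong" in every case.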
Table 8 shows the network structures and their performance on the training and validation sets. Each structure is written as [number of input nodes, number of hidden nodes, number of output nodes]. The input parameters of each model are based on the PCA results: ANN-PC2018 and ANN-PC2019 used PM10, PM2.5, NO2, and CO, while ANN-PC2020, ANN-PC2021, and ANN-PC2022 used PM10, PM2.5, and O3. These are the major pollutants identified in each dataset. Figure 4(i)-(x) depicts scatter plots of the predicted API versus the actual API for both the training and validation datasets.
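To make the network structures concrete, the sketch below evaluates a feed-forward network with one hidden layer, such as [4,8,1]. It is only an illustration of the architecture: the tanh hidden activation, the linear output, the random untrained weights, and the input values are all assumptions, and the training procedure of the software used in this study is not reproduced.

```python
import math
import random

def init_network(n_inputs, n_hidden, seed=0):
    """Random (untrained) weights for an [n_inputs, n_hidden, 1] network."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": 0.0,
    }

def forward(network, inputs):
    """One forward pass: tanh hidden layer followed by a linear output node."""
    hidden = [math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
              for weights, bias in zip(network["w_hidden"], network["b_hidden"])]
    return sum(w * h for w, h in zip(network["w_out"], hidden)) + network["b_out"]

def n_parameters(n_inputs, n_hidden):
    """Weights plus biases of an [n_inputs, n_hidden, 1] network."""
    return n_inputs * n_hidden + n_hidden + n_hidden + 1

# [4,8,1]: four standardized inputs (hypothetical values), eight hidden nodes.
network = init_network(n_inputs=4, n_hidden=8)
output = forward(network, [0.3, -0.1, 0.8, 0.5])
print(n_parameters(4, 8), n_parameters(3, 9))  # 49 46
```

Increasing the number of hidden nodes from 1 to 10 therefore changes only the middle entry of the structure and the number of free parameters.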
ANN-PC2018 performed best with eight hidden nodes, giving R2 = 0.7905 and RMSE = 5.4684 for training and R2 = 0.7826 and RMSE = 5.5446 for validation, which makes the model significant with strong linear regression. ANN-PC2019 also achieved its highest R2 with eight hidden nodes (R2 = 0.8612 and RMSE = 7.7467 for training; R2 = 0.8356 and RMSE = 7.7990 for validation), again indicating significance with strong linear regression. ANN-PC2020 performed best with nine hidden nodes, with R2 = 0.7384 and RMSE = 6.3382 for training and R2 = 0.7586 and RMSE = 5.9427 for validation, so this model is also significant with strong linear regression. ANN-PC2021 obtained its best result with nine hidden nodes, with R2 = 0.8230 and RMSE = 5.9020 for training and R2 = 0.8270 and RMSE = 5.9010 for validation, and is categorized as significant with strong linear regression. ANN-PC2022 is likewise significant with strong linear regression, with its best performance at five hidden nodes being R2 = 0.8057 and RMSE = 5.9613 for training and R2 = 0.8042 and RMSE = 6.0240 for validation. All models are therefore acceptable, and based on their R2 values they rank as follows:
ANN-PC2019 > ANN-PC2021 > ANN-PC2022 > ANN-PC2018 > ANN-PC2020
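This ordering can be reproduced by sorting the best structure of each model by its validation R2. The sketch below uses the best-performing rows reported in Table 8; the dictionary name `best` and the choice of validation R2 as the sort key are illustrative assumptions (sorting by training R2 gives the same order).

```python
# Best structure per model, with R2 values taken from Table 8.
best = {
    "ANN-PC2018": {"structure": "[4,8,1]", "train_r2": 0.7905, "val_r2": 0.7826},
    "ANN-PC2019": {"structure": "[4,8,1]", "train_r2": 0.8612, "val_r2": 0.8356},
    "ANN-PC2020": {"structure": "[3,9,1]", "train_r2": 0.7384, "val_r2": 0.7586},
    "ANN-PC2021": {"structure": "[3,9,1]", "train_r2": 0.8230, "val_r2": 0.8270},
    "ANN-PC2022": {"structure": "[3,5,1]", "train_r2": 0.8057, "val_r2": 0.8042},
}

# Sort model names from highest to lowest validation R2.
ranking = sorted(best, key=lambda model: best[model]["val_r2"], reverse=True)
print(" > ".join(ranking))
```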
The prediction models also point to a new approach: eliminating the least significant pollutants from observation. Models ANN-PC2018 and ANN-PC2019 identified SO2 and O3 as the least important pollutants, whereas models ANN-PC2020, ANN-PC2021, and ANN-PC2022 identified SO2, NO2, and CO. Currently, the API is determined by the highest of the sub-index readings for O3, CO, NO2, SO2, PM2.5, and PM10. For that reason, the DOE collects data on all of these pollutants, regardless of their significance to the overall API value, and this information is crucial for accurately determining the API and monitoring air quality in Malaysia. The findings nevertheless suggest considering the removal of the least important pollutants from the list of major pollutants when determining API readings. Moreover, air pollution trends in Malaysia are predominantly influenced by PM2.5 and PM10 (DOE, 2018–2021; Sentian et al., 2019; Ab. Rahman et al., 2022), consistent with the findings of this study. Therefore, this study strongly suggests that model ANN-PC2019 is the most suitable for forecasting the API. Although models ANN-PC2020, ANN-PC2021, and ANN-PC2022 were also significant, they are the least recommended because they were built on data from the COVID-19 pandemic and post-pandemic periods, when air quality improved (Zahid et al., 2022).
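The effect of dropping pollutants on the API can be illustrated with the maximum-of-sub-indices rule described above. In the sketch below the sub-index values are hypothetical, and the function name `api_from_subindices` is an assumption; the conversion from concentrations to sub-indices is not shown.

```python
def api_from_subindices(subindices):
    """Return (API, dominant pollutant); the API is the highest sub-index."""
    dominant = max(subindices, key=subindices.get)
    return subindices[dominant], dominant

# Hypothetical sub-index values for a single hour (not measured data).
full = {"PM10": 48, "PM2.5": 72, "SO2": 5, "NO2": 21, "O3": 33, "CO": 12}

# Reduced set using only the inputs of ANN-PC2018 and ANN-PC2019.
reduced = {name: value for name, value in full.items()
           if name in ("PM10", "PM2.5", "NO2", "CO")}

api, dominant = api_from_subindices(full)
print(api, dominant)                       # 72 PM2.5
print(api_from_subindices(reduced))        # (72, 'PM2.5')
```

Whenever particulate matter has the highest sub-index, as in this hypothetical hour, omitting SO2 and O3 leaves the API unchanged; the API would differ only in hours when an omitted pollutant has the highest sub-index.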
Table 8
Forecasting performance of the ANN-PCA models.
| Model | Network Structure | Training R2 | Training RMSE | Validation R2 | Validation RMSE |
|---|---|---|---|---|---|
| ANN-PC2018 | [4,1,1] | 0.7872 | 5.5114 | 0.7791 | 5.5888 |
| ANN-PC2018 | [4,2,1] | 0.7856 | 5.5316 | 0.7786 | 5.5944 |
| ANN-PC2018 | [4,3,1] | 0.7866 | 5.5185 | 0.7797 | 5.5807 |
| ANN-PC2018 | [4,4,1] | 0.7866 | 5.5185 | 0.7797 | 5.5807 |
| ANN-PC2018 | [4,5,1] | 0.7880 | 5.5007 | 0.7811 | 5.5636 |
| ANN-PC2018 | [4,6,1] | 0.7890 | 5.4882 | 0.7818 | 5.5541 |
| ANN-PC2018 | [4,7,1] | 0.7897 | 5.4787 | 0.7810 | 5.5641 |
| ANN-PC2018 | [4,8,1] | 0.7905 | 5.4684 | 0.7826 | 5.5446 |
| ANN-PC2018 | [4,9,1] | 0.7889 | 5.4889 | 0.7820 | 5.5521 |
| ANN-PC2018 | [4,10,1] | 0.7885 | 5.4936 | 0.7804 | 5.5721 |
| ANN-PC2019 | [4,1,1] | 0.8600 | 7.7791 | 0.8345 | 7.8245 |
| ANN-PC2019 | [4,2,1] | 0.8605 | 7.7658 | 0.8353 | 7.8064 |
| ANN-PC2019 | [4,3,1] | 0.8606 | 7.7620 | 0.8348 | 7.8181 |
| ANN-PC2019 | [4,4,1] | 0.8610 | 7.7532 | 0.8354 | 7.8033 |
| ANN-PC2019 | [4,5,1] | 0.8597 | 7.7882 | 0.8355 | 7.8013 |
| ANN-PC2019 | [4,6,1] | 0.8591 | 7.8037 | 0.8349 | 7.8159 |
| ANN-PC2019 | [4,7,1] | 0.8600 | 7.7787 | 0.8351 | 7.8105 |
| ANN-PC2019 | [4,8,1] | 0.8612 | 7.7467 | 0.8356 | 7.7990 |
| ANN-PC2019 | [4,9,1] | 0.8608 | 7.7567 | 0.8352 | 7.8074 |
| ANN-PC2019 | [4,10,1] | 0.8600 | 7.7788 | 0.8351 | 7.8111 |
| ANN-PC2020 | [3,1,1] | 0.7329 | 6.4048 | 0.7502 | 6.0452 |
| ANN-PC2020 | [3,2,1] | 0.7339 | 6.3924 | 0.7518 | 6.0251 |
| ANN-PC2020 | [3,3,1] | 0.7376 | 6.3482 | 0.7572 | 5.9592 |
| ANN-PC2020 | [3,4,1] | 0.7329 | 6.4047 | 0.7500 | 6.0478 |
| ANN-PC2020 | [3,5,1] | 0.7367 | 6.3589 | 0.7574 | 5.9575 |
| ANN-PC2020 | [3,6,1] | 0.7332 | 6.4013 | 0.7506 | 6.0396 |
| ANN-PC2020 | [3,7,1] | 0.7365 | 6.3606 | 0.7534 | 6.0060 |
| ANN-PC2020 | [3,8,1] | 0.7377 | 6.3461 | 0.7575 | 5.9555 |
| ANN-PC2020 | [3,9,1] | 0.7384 | 6.3382 | 0.7586 | 5.9427 |
| ANN-PC2020 | [3,10,1] | 0.7380 | 6.3424 | 0.7578 | 5.9523 |
| ANN-PC2021 | [3,1,1] | 0.8199 | 5.9542 | 0.8254 | 5.9293 |
| ANN-PC2021 | [3,2,1] | 0.8195 | 5.9602 | 0.8248 | 5.9397 |
| ANN-PC2021 | [3,3,1] | 0.8174 | 5.9953 | 0.8234 | 5.9627 |
| ANN-PC2021 | [3,4,1] | 0.8206 | 5.9417 | 0.8267 | 5.9071 |
| ANN-PC2021 | [3,5,1] | 0.8220 | 5.9193 | 0.8269 | 5.9042 |
| ANN-PC2021 | [3,6,1] | 0.8224 | 5.9117 | 0.8267 | 5.9070 |
| ANN-PC2021 | [3,7,1] | 0.8227 | 5.9079 | 0.8263 | 5.9145 |
| ANN-PC2021 | [3,8,1] | 0.8203 | 5.9470 | 0.8258 | 5.9219 |
| ANN-PC2021 | [3,9,1] | 0.8230 | 5.9020 | 0.8270 | 5.9010 |
| ANN-PC2021 | [3,10,1] | 0.8216 | 5.9248 | 0.8260 | 5.9191 |
| ANN-PC2022 | [3,1,1] | 0.8025 | 6.0095 | 0.7967 | 6.1394 |
| ANN-PC2022 | [3,2,1] | 0.8035 | 5.9950 | 0.7991 | 6.1026 |
| ANN-PC2022 | [3,3,1] | 0.8043 | 5.9824 | 0.8001 | 6.0878 |
| ANN-PC2022 | [3,4,1] | 0.8044 | 5.9810 | 0.8000 | 6.0894 |
| ANN-PC2022 | [3,5,1] | 0.8057 | 5.9613 | 0.8042 | 6.0240 |
| ANN-PC2022 | [3,6,1] | 0.8034 | 5.9959 | 0.7987 | 6.1085 |
| ANN-PC2022 | [3,7,1] | 0.8040 | 5.9860 | 0.7996 | 6.0946 |
| ANN-PC2022 | [3,8,1] | 0.8031 | 6.0000 | 0.7987 | 6.1094 |
| ANN-PC2022 | [3,9,1] | 0.8046 | 5.9774 | 0.8017 | 6.0634 |
| ANN-PC2022 | [3,10,1] | 0.8036 | 5.9931 | 0.7991 | 6.1032 |