Methodological approach.
The data approach proposed in this research to support public health in coping with pandemics, epidemics and endemic diseases is shown in the sequential diagram in Figure 1. Association rules are also useful for discovering patterns among the variables involved, and can be used when statistical analysis has not yet revealed those patterns. In the application to São Paulo, the statistical analyses were conclusive: factor analysis and linear correlation revealed the associations between the variables. Association rules can also support the choice of model input variables when the cluster analyses (k-means and dendrograms) do not make that choice clear.
Relevant variables.
The linear correlation coefficient showed a correlation between cases of COVID-19 and climatic variables (Table 1). The same did not happen with air quality variables. Air temperature was inversely correlated with new confirmed cases and deaths caused by the virus, in line with several earlier studies. The climatic variables related to humidity, however, were the most prominent in this analysis. Agricultural drought and the total amount of water available in the soil were the main highlights, and actual evapotranspiration also proved to be important.
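As a minimal sketch of the linear correlation coefficient discussed above, the following Python function computes Pearson's r for two series. The series `t_min` and `total_conf` hold made-up values chosen for illustration only; they are not the São Paulo data, and the significance test (p < 0.05) reported in Table 1 is not reproduced here.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations from the mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Made-up daily values, for illustration only (not the study data).
t_min = [19.0, 18.0, 17.0, 16.0, 15.0, 14.0]
total_conf = [100, 220, 310, 450, 520, 640]
r = pearson_r(t_min, total_conf)
print(f"r = {r:.2f}")
```

A negative r, as in this toy series, is what an inversely proportional relationship between temperature and case counts looks like numerically.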
Consistent with the linear correlation coefficient, the principal component analysis (PCA) showed that exactly the same climatic variables are grouped in the principal component that contains the new confirmed cases and deaths by COVID-19 (Table 1). The factor loadings of these variables also have the greatest correlation with the principal components, as can be seen in the values highlighted in gray in Table 1. Factor 1 therefore corresponds to the principal component of the COVID-19 variables, and in this component the most important climatic variables have significant factor loadings, greater than or equal to 0.6. The principal component that corresponds to factor 2, in turn, is represented by climatic and air quality variables.
Table 1. Factor load coefficients and linear correlation.
| Variable (loadings < -0.6 or > +0.6 marked) | Factor 1 | Factor 2 | Factor 3 | Variable pair (environmental vs. epidemiological, p < 0.05) | Coefficient r |
|---|---|---|---|---|---|
| pm25 | 0.00 | 0.37 | 0.46 | tMin X total_conf | -0.68 |
| PM10∗∗ | -0.29 | 0.84 | 0.13 | tMin X conf_per100k_inh | -0.67 |
| O3 | 0.32 | 0.60 | 0.50 | tAvg X total_conf | 0.65 |
| NO2 | -0.22 | 0.75 | -0.34 | tAvg X total_deaths | -0.64 |
| tMin | 0.79 | 0.21 | -0.41 | dry_drought X total_conf | 0.97 |
| tAvg | 0.75 | 0.58 | 0.13 | dry_drought X conf_per100k_inh | 0.94 |
| tMax | 0.58 | 0.74 | 0.08 | dry_drought X NC | 0.70 |
| drought | 0.37 | 0.38 | 0.09 | dry_drought X total_deaths | 0.98 |
| dry_drought | 0.97 | 0.067 | -0.06 | dry_drought X ND | 0.64 |
| urMin | 0.05 | -0.69 | -0.55 | dry_drought X death_rate | 0.64 |
| urMax | 0.14 | -0.19 | -0.12 | dry_drought X NC1 | 0.69 |
| pot_eva | 0.57 | 0.70 | 0.19 | dry_drought X NC2 | 0.70 |
| wind_spe | -0.01 | 0.54- | -0.10 | dry_drought X NC3 | 0.71 |
| dew_poitMin | 0.58 | -0.16 | 0.59 | dry_drought X NC4 | 0.72 |
| dew_poitMax | 0.71 | 0.11 | -0.44 | dry_drought X NC5 | 0.76 |
| pres_atmMin | -0.44 | 0.00 | 0.27 | dry_drought X NC6 | 0.76 |
| pres_atmMax | -0.48 | 0.03 | 0.24 | dry_drought X NC7 | 0.72 |
| real_evapo | 0.85 | 0.14 | -0.09 | dry_drought X ND1 | 0.65 |
| soil_wat_avail | 0.97 | 0.13 | 0.01 | dry_drought X ND2 | 0.67 |
| total_conf | 0.94 | 0.01 | -0.10 | dry_drought X ND3 | 0.67 |
| conf_per100k_inh | 0.91 | 0.09 | -0.05 | dry_drought X ND4 | 0.68 |
| NC | 0.70 | 0.13 | 0.51 | dry_drought X ND5 | 0.71 |
| total_deaths | 0.94 | 0.04 | -0.10 | dry_drought X ND6 | 0.69 |
| ND | -0.67 | 0.21 | -0.45 | real_evapo X death_rate | -0.79 |
| isol_avg_index | -0.12 | -0.17 | 0.36 | real_evapo X total_deaths | -0.68 |
| death_rate | -0.77 | 0.10 | 0.09 | soil_water_avail X total_conf | 0.89 |
| NC1 + | 0.71 | 0.20 | 0.32 | soil_water_avail X conf_per100k_inh | -0.86 |
| NC2 | 0.71 | 0.23 | 0.12 | soil_water_avail X NC | -0.69 |
| NC3 | 0.70 | 0.10 | 0.36 | soil_water_avail X total_deaths | 0.90 |
| NC4 | -0.72 | 0.03- | 0.33 | soil_water_avail X ND | -0.68 |
| NC5 | 0.74 | -0.18 | 0.10 | soil_water_avail X death_rate | 0.81 |
| NC6 | -0.75 | 0.13 | -0.26 | soil_water_avail X NC1 | 0.70 |
| NC7 | -0.72 | 0.06 | 0.51 | soil_water_avail X NC2 | 0.70 |
| ND1 | -0.68 | 0.18 | -0.14 | soil_water_avail X NC3 | -0.68 |
| ND2 | -0.68 | 0.20 | 0.26 | soil_water_avail X NC4 | -0.68 |
| ND3 | -0.67 | 0.14 | 0.31 | soil_water_avail X NC5 | 0.70 |
| ND4 | -0.69 | 0.07 | 0.23 | soil_water_avail X NC6 | 0.70 |
| ND5 | 0.73 | 0.13 | 0.01 | soil_water_avail X NC7 | -0.68 |
| ND6 | 0.70 | 0.02 | -0.30 | soil_water_avail X ND1 | -0.68 |
| ND7 | 0.61 | 0.22 | -0.45 | soil_water_avail X ND2 | -0.69 |
| | | | | soil_water_avail X ND3 | -0.69 |
| | | | | soil_water_avail X ND4 | -0.69 |
| | | | | soil_water_avail X ND5 | 0.71 |
| | | | | soil_water_avail X ND6 | 0.70 |
| | | | | soil_water_avail X ND7 | -0.64 |
The two cluster analyses complement the characterization of the data begun with the statistical analyses and can indicate the best input variables for generating the models. In the similarity dendrogram, built from the Euclidean distance between the variables, it is possible to identify four clusters, represented by the dashed lines in Figure 2. Cluster 3, which groups total deaths, minimum and maximum atmospheric pressure, new cases (NC) and the forecasts of new confirmed cases for the seven consecutive days (NC1, NC2, NC3, NC4, NC5, NC6 and NC7), shows the input variables for the NC forecasting models. However, the focal variables for deaths (ND, ND1, ND2, ND3, ND4, ND5, ND6 and ND7) fall in a grouping of the dendrogram (cluster 4) that is difficult to separate visually.
In the k-means method, which is iterative and assigns the variables to a fixed number of groups according to the distance between them, the variables were divided into five groups instead of four. The variables in cluster 4 (Figure 2) were split into two groups: one formed by pm25, maximum relative humidity, isolation index, new deaths (ND) and the death forecasts for the seven days (ND1, ND2, ND3, ND4, ND5, ND6 and ND7), and another containing the remaining variables. Groupings 1, 2 and 3 of the dendrogram (Figure 2) remained identical under k-means. It was thus possible to identify that, to predict new confirmed cases, the model's input variables may include, in addition to NC, the total number of deaths and the minimum and maximum atmospheric pressure. Likewise, to predict deaths by COVID-19, the isolation index, maximum relative humidity and pm25 can be used as input variables in addition to ND.
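The grouping of variables by k-means can be illustrated with a minimal sketch of Lloyd's algorithm, in which each variable is a point whose coordinates are its observations and points are compared by Euclidean distance. Everything here is an assumption made for illustration: the standardized toy series, the choice of k = 2 and the seeding of centroids with the first k points. The study's own software, data and settings are not reproduced.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical standardized series, one row per variable (not the study data).
series = {
    "NC": [-1.2, -0.4, 0.3, 1.3],
    "NC1": [-1.1, -0.5, 0.4, 1.2],
    "ND": [1.0, 0.6, -0.5, -1.1],
    "ND1": [1.1, 0.5, -0.4, -1.2],
}
names = list(series)
labels = kmeans([series[n] for n in names], k=2)
groups = {}
for name, lab in zip(names, labels):
    groups.setdefault(lab, []).append(name)
print(groups)
```

In this toy case the case-count series and the death-count series end up in separate groups, which is the kind of separation used above to choose input variables for each model.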
Predictive models.
Figure 3 shows the accuracy of the two modeling tools used to predict new cases and deaths by COVID-19 over seven consecutive days in São Paulo (SP). The forecasts identify the NC and ND of the following days in terms of the categorical intervals of the tertiles. It is thus possible to know whether the number of new cases (patients) and deaths in the next few days will correspond to the low (33%), medium (66%) or high (100%) value of the tertiles (NC: 115, 650 and 3500 patients; ND: 10, 45 and 150 deaths). Overall, the performance of CBA is slightly higher than that of J48. For NC, the two models have the same accuracy (85%) only on the 4th day; on the other days CBA performs better. For ND, the two modeling tools have equivalent accuracy (89%) only on the 5th day.
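A minimal sketch of the tertile classes and of the accuracy measure is given below. The cut points (115 and 650 for NC, 10 and 45 for ND) are the ones quoted in the text; assigning a value that falls exactly on a cut point to the lower class is an assumption, and the observed and predicted series are hypothetical.

```python
def to_tertile(value, low_cut, mid_cut):
    """Map a count to 'low', 'medium' or 'high' using two cut points."""
    if value <= low_cut:  # boundary handling is an assumption
        return "low"
    if value <= mid_cut:
        return "medium"
    return "high"

NC_CUTS = (115, 650)  # new-case cut points quoted in the text
ND_CUTS = (10, 45)    # new-death cut points quoted in the text

def accuracy(predicted, observed):
    """Share of days whose predicted class matches the observed class."""
    hits = sum(p == o for p, o in zip(predicted, observed))
    return hits / len(observed)

# Hypothetical daily new-case counts and model outputs, for illustration only.
observed_nc = [80, 300, 700, 120, 2000]
observed_classes = [to_tertile(v, *NC_CUTS) for v in observed_nc]
predicted_classes = ["low", "medium", "high", "low", "high"]
print(observed_classes)
print(accuracy(predicted_classes, observed_classes))
```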
The CBA generated 166 classifiers for predicting deaths within a week and 152 classifiers for predicting new cases of COVID-19. These rules were assessed for their support, their accuracy and the environmental and epidemiological coherence of the relationships they express. The choice of classifiers also prioritized those that combined variables related to COVID-19 with environmental variables (climatic and air quality), especially the input variables pointed out by the cluster analysis (atmospheric pressure, relative humidity, isolation index and pm25).
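The support of an IF/THEN rule can be sketched as the share of all records that satisfy both the condition and the conclusion. The example below also computes confidence, the usual companion measure for class association rules; whether it matches the accuracy criterion applied in this study is an assumption. The records are hypothetical, and the rule is modelled on the first 1st-day rule of Table 2, reading its operators as > and <.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    covered = [r for r in records if antecedent(r)]
    correct = [r for r in covered if consequent(r)]
    # Support: records matching both sides, over all records.
    support = len(correct) / len(records)
    # Confidence: records matching both sides, over records matching the IF part.
    confidence = len(correct) / len(covered) if covered else 0.0
    return support, confidence

# Hypothetical daily records (not the study data).
records = [
    {"wind_spe": 2.5, "NC": 90, "NC1": 100},
    {"wind_spe": 2.1, "NC": 110, "NC1": 130},
    {"wind_spe": 1.0, "NC": 95, "NC1": 105},
    {"wind_spe": 3.0, "NC": 400, "NC1": 420},
]

# Rule: IF wind_spe > 2 and NC < 115 THEN NC1 < 115.
support, confidence = rule_metrics(
    records,
    antecedent=lambda r: r["wind_spe"] > 2 and r["NC"] < 115,
    consequent=lambda r: r["NC1"] < 115,
)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```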
In selecting the rules in Table 2, the output intervals were diversified so that the three tertiles (low, medium and high) were well represented. A total of thirty-eight models were selected to predict new cases and deaths up to seven days ahead (Table 2). All selected predictive rules can be used as a decision support tool by managers and authorities in the city of São Paulo.
Table 2. Predictive models generated by the CBA.
| Day | Target | Rule | S% |
|---|---|---|---|
| 1st | New confirmed (NC) | IF wind_spe > 2 and NC < 115 THEN NC1 < 115 | 16.5 |
| 1st | New deaths (ND) | IF 928 < pres_atmMax __ 930 and total_conf __ 1044 THEN ND1 < 10 | 19.6 |
| 1st | New deaths (ND) | IF total_conf < 1044 and wind_spe > 2 THEN ND1 < 10 | 17.5 |
| 1st | New confirmed (NC) | IF 928 < pres_atmMax __ 930 and 1044 __ total_conf __ 18000 THEN 115 < NC1 __ 650 | 10.3 |
| 1st | New deaths (ND) | IF 115 < NC __ 650 and pm25 __ 43 and 10 __ ND __ 45 THEN 10 < ND1 __ 45 | 11.3 |
| 1st | New confirmed (NC) | IF 45 < isol_avg_index __ 55 and conf_per100k_inh __ 16000000 and no2 > 12 THEN NC1 > 650 | 13.4 |
| 1st | New deaths (ND) | IF real_evapo < 1 and 15 __ tMin __ 17 and ND > 45 THEN ND1 > 45 | 10.3 |
| 2nd | New confirmed (NC) | IF pres_atmMin < 928.5 and soil_wat_avail > 60 and wind_spe > 2 and total_conf < 1044 THEN NC2 < 115 | 14.6 |
| 2nd | New deaths (ND) | IF 928 < pres_atmMax __ 930 and soil_wat_avail __ 60 and total_conf __ 1044 THEN ND2 < 10 | 16.7 |
| 2nd | New confirmed (NC) | IF pres_atmMin < 928.5 and 1044 __ total_conf __ 18000 and isol_avg_index __ 55 THEN 115 < NC2 __ 650 | 10.4 |
| 2nd | New confirmed (NC) | IF ND > 45 and 14 < dew_poitMax __ 16 THEN NC2 > 650 | 12.5 |
| 2nd | New deaths (ND) | IF 1044 < total_conf __ 18000 and dew_poitMin __ 12 THEN 10 < ND2 __ 45 | 8.3 |
| 3rd | New confirmed (NC) | IF conf_per100k_inh > 16000000 and dry_drought > 40 and dew_poitMax < 14 THEN NC3 > 650 | 13.7 |
| 3rd | New deaths (ND) | IF o3 < 23 and 1044 __ total_conf __ 18000 THEN 10 < ND3 __ 45 | 8.3 |
| 3rd | New confirmed (NC) | IF pres_atmMin < 928.5 and 1044 __ total_conf __ 18000 and dew_poitMax __ 16 and tAvg __ 22 THEN 115 < NC3 __ 650 | 8.4 |
| 3rd | New confirmed (NC) | IF total_conf < 1044 and pres_atmMax __ 928 THEN NC3 < 115 | 12.6 |
| 3rd | New deaths (ND) | IF pres_atmMax > 930 and urMin < 33 and 115 < NC __ 650 THEN ND3 > 45 | 8.3 |
| 4th | New confirmed (NC) | IF 1044 < total_conf __ 18000 and tMin __ 17 THEN 115 < NC4 __ 650 | 8.5 |
| 4th | New deaths (ND) | IF 928 < pres_atmMax __ 930 and soil_wat_avail __ 60 and total_conf __ 1044 THEN ND4 < 10 | 17 |
| 4th | New confirmed (NC) | IF 115 < NC __ 650 and dew_poitMin __ 7.8 and urMin __ 33 and wind_spe __ 1.5 THEN NC4 > 650 | 8.5 |
| 4th | New deaths (ND) | IF pres_atmMax > 930 and 115 < NC __ 650 and dew_poitMin __ 7.8 THEN ND4 > 45 | 9.6 |
| 5th | New confirmed (NC) | IF no2 < 8 and conf_per100k_inh < 4000000 and tAvg > 22 THEN 115 < NC5 __ 650 | 8.6 |
| 5th | New deaths (ND) | IF 928 < pres_atmMax __ 930 and soil_wat_avail __ 60 and NC __ 115 THEN ND5 < 10 | 16.1 |
| 5th | New confirmed (NC) | IF conf_per_100k_inh > 16000000 and NC < 115 THEN NC5 < 115 | 21.5 |
| 5th | New deaths (ND) | IF no2 < 8 and 115 __ NC __ 650 and conf_per100k_inh < 4000000 THEN 10 < ND5 __ 45 | 8.6 |
| 5th | New confirmed (NC) | IF soil_wat_avail > 60 and o3 > 30 and NC < 115 THEN NC5 < 115 | 11.9 |
| 5th | New deaths (ND) | IF 115 < NC __ 650 and dew_poitMin __ 7.8 and tAvg < 20 THEN ND5 > 45 | 9.7 |
| 6th | New confirmed (NC) | IF 928 < pres_atmMax __ 930 and soil_wat_avail __ 60 and NC < 115 THEN NC6 < 115 | 16.3 |
| 6th | New deaths (ND) | IF soil_wat_avail > 60 and wind_spe > 2 and NC < 115 THEN ND6 < 10 | 16.3 |
| 6th | New confirmed (NC) | IF 0 < soil_wat_avail __ 60 and 115 __ NC __ 650 and urMin __ 33 THEN 115 < NC6 __ 650 | 10.9 |
| 6th | New deaths (ND) | IF 7.8 < dew_poitMin __ 12 and pm25 __ 55 and ND > 45 THEN ND6 > 45 | 12 |
| 6th | New confirmed (NC) | IF 7.8 < min_dew_poit __ 12 and pm25 __ 55 and ND > 45 THEN NC6 > 650 | 12 |
| 7th | New confirmed (NC) | IF conf_per100k_inh > 16000000 and NC < 115 THEN NC7 < 115 | 22 |
| 7th | New deaths (ND) | IF conf_per_100k_inh > 16000000 and soil_wat_avail > 60 THEN ND7 < 10 | 22 |
| 7th | New confirmed (NC) | IF 928 < pres_atmMax __ 930 and soil_wat_avail __ 60 and NC __ 115 THEN NC7 < 115 | 16.5 |
| 7th | New deaths (ND) | IF drought < 10 and 10 __ ND __ 45 and 115 __ NC __ 650 THEN 10 < ND7 __ 45 | 16.5 |
| 7th | New confirmed (NC) | IF 15 < tMin __ 17 and no2 __ 8 and 115 __ NC __ 650 and 14 __ dew_poitMax __ 16 THEN 115 < NC7 __ 650 | 8.8 |
| 7th | New deaths (ND) | IF soil_wat_avail < 30 and ND > 45 THEN ND7 > 45 | 19.8 |
Fourteen decision trees were generated by J48, seven for new cases and seven for new deaths, for up to seven days ahead. Considering the accuracy, support and consistency of the classification rules, two decision trees were selected: one to predict new cases (NC1) and one to predict deaths (ND1) (Table 3). The support of each rule in a decision tree is given in parentheses after the output interval of that rule.
Table 3. Predictive models generated by the J48.
New confirmed (NC)
(1) IF total_conf < 1044
(2) NC < 115
(3) THEN NC1 < 115 (29.90)
(2) 115 < NC < 650
(3) THEN 115 < NC1 < 650 (2.06)
(2) NC > 650
(3) THEN NC1 < 115 (0)
(1) IF 1044 < total_conf < 18000
(2) ND < 10
(3) THEN 115 < NC1 < 650 (1.03)
(2) 10 < ND < 45
(3) THEN 115 < NC1 < 650 (17.53)
(2) ND > 45
(3) 23 < o3 < 30
(4) THEN NC1 > 650 (4.12)
(3) o3 < 23
(4) THEN NC1 > 650 (2.06)
(3) o3 > 30
(4) THEN 115 < NC1 < 650 (2.06)
(1) IF total_conf > 18000
(2) isol_avg_index < 45
(3) THEN NC1 > 650 (0)
(2) isol_avg_index > 55
(3) THEN 115 < NC1 < 650 (2.06)
(2) 45 < isol_avg_index < 55
(3) THEN NC1 > 650 (23.71)

New deaths (ND)
(1) IF total_conf < 1044
(2) THEN ND1 < 10 (32.99)
(1) IF 1044 < total_conf < 18000
(2) pm10 < 17
(3) THEN 10 < ND1 < 45 (9.28)
(2) 17 < pm10 < 25
(3) tMax > 27
(4) THEN 10 < ND1 < 45 (5.15)
(3) tMax < 25
(4) THEN 10 < ND1 < 45 (4.12)
(3) 25 < tMax < 27
(4) THEN ND1 > 45 (5.15)
(2) pm10 > 25
(3) THEN ND1 > 45 (6.19)
(1) IF total_conf > 18000
(2) isol_avg_index < 45
(3) THEN ND1 > 45 (0)
(2) isol_avg_index > 55
(3) THEN 10 < ND1 < 45 (2.06)
(2) 45 < isol_avg_index < 55
(3) THEN ND1 > 45 (22.68)
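As an illustration of how a J48 tree can be applied, the new deaths (ND1) tree of Table 3 can be transcribed as nested conditions. This is a sketch under stated assumptions: the operators in the table are read as strict < and >, values that fall exactly on a threshold are assigned to the middle interval of that split, and the function name, argument names and class labels are hypothetical.

```python
def predict_nd1(total_conf, pm10, t_max, isol_avg_index):
    """Return the ND1 class ('<10', '10-45' or '>45') following the Table 3 tree."""
    if total_conf < 1044:
        return "<10"
    if total_conf <= 18000:  # middle interval of total_conf (boundaries assumed)
        if pm10 < 17:
            return "10-45"
        if pm10 > 25:
            return ">45"
        # pm10 between 17 and 25: split on maximum temperature.
        if 25 <= t_max <= 27:
            return ">45"
        return "10-45"  # t_max below 25 or above 27
    # total_conf above 18000: split on the isolation index.
    if isol_avg_index > 55:
        return "10-45"
    return ">45"  # both remaining branches (below 45, and 45 to 55) predict > 45

# Hypothetical inputs, for illustration only.
print(predict_nd1(total_conf=500, pm10=30, t_max=30, isol_avg_index=50))
print(predict_nd1(total_conf=5000, pm10=20, t_max=26, isol_avg_index=50))
print(predict_nd1(total_conf=20000, pm10=20, t_max=26, isol_avg_index=60))
```

Writing the tree this way makes it easy to see that, once total confirmed cases exceed 18000, only the isolation index changes the predicted class.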