3.1 Study area and data sets
Acute respiratory diseases (ARD) are prevalent communicable illnesses, primarily attributed to widespread exposure to harmful inhalational factors in the environment, workplace, and personal behaviours [55, 56]. In 2019, China reported that 1.1 million children under the age of 5, constituting 10% of this age group, were diagnosed with ARD (GBD, 2019). Notably, children are more vulnerable to the effects of air pollution due to their higher breathing rates, narrower airways, developing lungs and immune systems, and increased outdoor exposure [3, 57]. Earlier studies have shown that the hospitalisation rate for acute respiratory illnesses in two-year-olds is approximately 12 times greater than in children aged 5–17 years without underlying high-risk conditions [58]. This research focuses on children aged 3 and under, an age group whose limited mobility largely confines them to their residential neighbourhoods, in contrast to school students and adults.
This study was conducted in Nanning City [59], which serves as the capital of the less-developed Guangxi Zhuang Autonomous Region (GZAR). In recent years, the air quality in Nanning has declined due to urban development and increasing emissions from motor vehicles. Considering the distribution of pollution sources, the study area is confined to Nanning's five traditional urban districts, excluding the five counties that fall under its jurisdiction. The location of the study area is illustrated in Fig. 1.
The daily data concerning childhood respiratory diseases were sourced from the Maternal and Children’s Health Hospital of Guangxi Province, covering the period from 1st January to 31st December 2016. The dataset includes information on gender, age, residential address, admission date, and a brief description of disease symptoms. It is important to note that all epidemiological data used in this study were ethically approved solely for academic research purposes, with strict measures in place to safeguard patient privacy by excluding any personal identifiers, such as names and ID card numbers, before releasing the data. Based on diagnoses provided by medical professionals, the childhood respiratory diseases were categorised as follows: Acute nasopharyngitis (C1), Breathing pneumonia (C2), Acute pharyngitis (C3), Bronchitis (C4), and Polyangiitis-bronchitis (C5).
To ensure the accuracy of the information, the data underwent several preprocessing steps. First, records lacking addresses, originating from outside the urban area, or associated with individuals older than 3 years were removed. Second, a geo-coding technique [60] was employed to convert the residential addresses of patients into coordinates (latitude and longitude). The geo-coding process achieved a matching rate of 80%. After these processing and filtering steps, the study included a total of 131,619 incidents involving children aged 3 and below residing in the five districts of Nanning in 2016. Each incident was represented as a point vector, denoted as Pi = (Location, Time, Disease category). The spatial distribution of these categorical incidents throughout the urban area of Nanning is illustrated in Fig. 2.
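The filtering rules and the point representation Pi = (Location, Time, Disease category) described above can be sketched as follows. This is a minimal illustration only: the `Record` fields, the district names, and the sample records are hypothetical and are not taken from the hospital dataset.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Record:
    # Hypothetical field names; the real dataset schema is not given in the text.
    age_years: float
    address: Optional[str]
    district: Optional[str]
    admitted: date
    category: str                 # one of "C1".."C5"
    lat: Optional[float] = None   # filled in by geo-coding, None if unmatched
    lon: Optional[float] = None

def keep(rec, urban_districts, max_age=3.0):
    """Apply the three removal rules plus the geo-coding match requirement."""
    if not rec.address:                      # record lacks an address
        return False
    if rec.district not in urban_districts:  # outside the urban study area
        return False
    if rec.age_years > max_age:              # older than 3 years
        return False
    return rec.lat is not None and rec.lon is not None  # geo-coding succeeded

def to_point(rec):
    """Represent an incident as (Location, Time, Disease category)."""
    return ((rec.lat, rec.lon), rec.admitted, rec.category)

# Hypothetical records, for illustration only.
urban = {"district_a", "district_b"}
records = [
    Record(1.5, "addr 1", "district_a", date(2016, 12, 19), "C1", 22.82, 108.32),
    Record(4.0, "addr 2", "district_a", date(2016, 12, 19), "C2", 22.81, 108.33),
    Record(2.0, None, "district_b", date(2016, 12, 20), "C3"),
    Record(0.8, "addr 4", "county_x", date(2016, 12, 21), "C4", 23.10, 108.90),
    Record(2.5, "addr 5", "district_b", date(2016, 12, 22), "C2"),  # not geo-coded
]
points = [to_point(r) for r in records if keep(r, urban)]
print(points)
```

Only the first hypothetical record survives all four checks, which mirrors how unmatched or out-of-scope records drop out before the spatial analysis.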
To facilitate the analysis of monthly patterns, the daily data were temporally aggregated on a monthly basis. The statistical characteristics of the disease dataset are illustrated in Fig. 3. Firstly, the dataset suffers from a class imbalance issue among the five disease categories. As shown in Fig. 3 (left), there is a significant disparity in the total number of incidents across the five categories. Acute nasopharyngitis (C1) accounts for about 44% of cases (57,314 of 131,619), while Polyangiitis-bronchitis (C5) represents only 1.28% of the cases. Secondly, the data exhibit pronounced temporal variations throughout the year 2016. As indicated in Fig. 3 (right), the highest incidence of respiratory diseases occurred during the spring months (March, April, May) and winter months (November, December, January), coinciding with periods of rapid temperature changes.
The air pollutant data were gathered from the national real-time municipal air quality platform of the China National Environmental Monitoring Centre (CNEMC). PM2.5 (particulate matter with a diameter of 2.5 micrometres or less) has been shown to be closely associated with childhood respiratory diseases [61, 62]. We selected December as the focus for monthly pattern analysis because it recorded the poorest air quality in 2016, as depicted in Fig. 4 (left). Additionally, we chose two specific days in December, the 19th and the 27th, for analysing daily patterns under different air conditions. Figure 4 (right) shows that the 19th had the highest PM2.5 level, while the 27th had the lowest.
3.2 GTWCLQ analysis
To validate the proposed approach, which addresses issues related to sample size, class imbalance, and temporal effects, we conducted an experiment using the GTWCLQ tool to examine the spatio-temporal patterns of disease incidents and identify high-risk co-locations, which is crucial for effective disease management.
Initially, we employed the GTWCLQ approach to assess both global and local spatial associations among the five types of diseases on a monthly scale. Next, we conducted a more detailed analysis to investigate spatial association patterns on a daily scale, taking into account different air conditions. Finally, we analysed the temporal effects across multiple temporal scales.
To comprehensively understand the spatial patterns at different temporal scales, we utilised various datasets to calculate the GTWCLQ, as outlined in Table 1. Using Equations (6) and (7), we determined the minimum sample size and imbalance ratios for each dataset. Specifically, the "GTWCLQ12" dataset was used for monthly pattern analysis, while "GTWCLQ31" was employed for daily pattern analysis in December. Additionally, the "GTWCLQ7" dataset was utilised for daily pattern analysis under different air conditions.
As shown in Table 1, all datasets met the minimum sample size requirements. The first dataset is considered a large dataset (i.e., containing more than 100,000 instances), while the remaining datasets are categorised as small datasets, containing thousands to tens of thousands of instances. Moreover, all datasets exhibited varying degrees of class imbalance. In the following sections, we demonstrate the application of the GTWCLQ approach to these datasets with their different imbalance ratios.
Table 1
Datasets processed for GTWCLQ analyses
Dataset | Temporal scale | Sample size | Scale (resolution and duration) | \(n_{\min}\) | IR |
GTWCLQ12 | 12 months | 131,619 | Monthly, Jan-Dec | 383 | 34 |
GTWCLQ31 | 31 days in Dec | 10,674 | Daily, 1st-31st | 370 | 29 |
GTWCLQ7 | Seven days in Dec | 1,064 | Daily, 13th-19th, 21st-27th | 282 | 20 |
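Equations (6) and (7) are not reproduced in this section, so the following is only an assumption about how the minimum sample sizes could be obtained. The \(n_{\min}\) values in Table 1 happen to be consistent with Cochran's sample size formula with a finite population correction (95% confidence, 5% margin of error, p = 0.5), rounded down. The function name and default parameters are illustrative.

```python
import math

def min_sample_size(population, z=1.96, p=0.5, margin=0.05):
    """Cochran's formula with finite population correction, rounded down.

    Assumption: this is NOT confirmed to be the paper's Equation (6);
    it is a standard formula that reproduces the Table 1 values.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2      # infinite-population size
    return math.floor(n0 / (1 + (n0 - 1) / population))

for name, n in [("GTWCLQ12", 131619), ("GTWCLQ31", 10674), ("GTWCLQ7", 1064)]:
    print(name, min_sample_size(n))
```

Under these assumptions the three datasets give 383, 370, and 282, matching the \(n_{\min}\) column.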
All experiments were conducted on a computer equipped with 32GB of RAM, an Intel Core i7 CPU, and running the Windows 10 operating system.
3.3 GTWCLQ12: Monthly spatial-temporal pattern in Nanning
The dataset "GTWCLQ12", comprising 12 months and a total of 131,619 records, was utilised for monthly pattern analysis. Its imbalance ratio was calculated as 34 by comparing the majority class C1 with the minority class C5 (IR = 57314⁄1688 ≈ 34) using formula (8). Note that "GTWCLQ12" is considered a large dataset, as it exceeds 100,000 instances.
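As a quick check of the arithmetic, the imbalance ratio and the class shares can be recomputed from the per-category sample sizes reported in Table 2. The helper names below are illustrative, and the ratio is taken as majority count over minority count, as described in the text.

```python
# Per-category sample sizes for GTWCLQ12, as reported in Table 2.
counts = {"C1": 57314, "C2": 48569, "C3": 8544, "C4": 15504, "C5": 1688}

def imbalance_ratio(class_counts):
    """Majority class size divided by minority class size."""
    return max(class_counts.values()) / min(class_counts.values())

def class_shares(class_counts):
    """Percentage of all incidents falling in each class."""
    total = sum(class_counts.values())
    return {c: 100.0 * n / total for c, n in class_counts.items()}

ir = imbalance_ratio(counts)
shares = class_shares(counts)
print(round(ir), {c: round(s, 2) for c, s in shares.items()})
```

The counts sum to 131,619, the ratio rounds to 34, and the minority class C5 accounts for about 1.28% of incidents.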
As detailed in section 2.2.3, the GTWCLQ method can be applied directly to this dataset without the need to address the class imbalance problem. December was selected as the current time frame for studying the monthly spatio-temporal pattern in 2016, so the temporal bandwidth was set to 12 (months) and the temporal period to "month". Furthermore, considering the uneven distribution of data across different areas, an adaptive spatial bandwidth with a K-nearest value of 362 (\(K=\sqrt{131619}\approx 362\)) was chosen, following formula (11).
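The square-root rule for K and the idea of an adaptive bandwidth can be illustrated with a short sketch. It assumes that K is the integer part of the square root of the sample size and that the adaptive bandwidth at a point is the distance to its K-th nearest neighbour; the function names and toy coordinates are hypothetical.

```python
import math

def k_from_sample_size(n):
    """K-nearest value as the integer part of sqrt(n)."""
    return math.isqrt(n)

def adaptive_bandwidth(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour (brute force).

    The bandwidth shrinks in dense areas and grows in sparse ones,
    which is the motivation for an adaptive rather than fixed bandwidth.
    """
    xi, yi = points[i]
    dists = sorted(
        math.hypot(xi - xj, yi - yj)
        for j, (xj, yj) in enumerate(points)
        if j != i
    )
    return dists[k - 1]

print(k_from_sample_size(131619))

# Toy planar coordinates, for illustration only.
toy = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
print(adaptive_bandwidth(toy, 0, 2))
```

For 131,619 records this yields K = 362, in line with the value used above.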
For the Monte Carlo significance test, the simulation sample size was set to 1,000 and the number of simulations to 1,000, with a significance level of 0.05, following previous literature [63, 64]. With these parameter settings in place, the global GTWCLQ results for all five categories are presented in Table 2 and Fig. 5.
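The logic of a Monte Carlo significance test can be sketched with a simplified statistic. The example below uses the plain, unweighted nearest-neighbour co-location quotient rather than the geographically and temporally weighted version used in this study, and it permutes category labels over fixed locations. The toy data, function names, and random seed are assumptions for illustration.

```python
import math
import random

def nearest_neighbours(points):
    """Index of the nearest other point for every point (brute force)."""
    nn = []
    for i, (xi, yi) in enumerate(points):
        best_j, best_d = -1, math.inf
        for j, (xj, yj) in enumerate(points):
            if i != j:
                d = (xi - xj) ** 2 + (yi - yj) ** 2
                if d < best_d:
                    best_j, best_d = j, d
        nn.append(best_j)
    return nn

def global_clq(labels, nn, a, b):
    """Unweighted CLQ of a towards b: observed share of a-points whose
    nearest neighbour is b, divided by the share expected by chance."""
    n = len(labels)
    n_a = labels.count(a)
    n_b = labels.count(b) - (1 if a == b else 0)
    n_ab = sum(1 for i in range(n) if labels[i] == a and labels[nn[i]] == b)
    return (n_ab / n_a) / (n_b / (n - 1))

def monte_carlo_p(labels, nn, a, b, n_sim=1000, seed=0):
    """One-sided pseudo p-value from random label permutations."""
    rng = random.Random(seed)
    observed = global_clq(labels, nn, a, b)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # locations stay fixed, labels are reassigned
        if global_clq(shuffled, nn, a, b) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_sim + 1)

# Toy data: ten tight A-B pairs spread along a line.
points, labels = [], []
for k in range(10):
    points += [(10.0 * k, 0.0), (10.0 * k + 0.1, 0.0)]
    labels += ["A", "B"]

nn = nearest_neighbours(points)
observed, p_value = monte_carlo_p(labels, nn, "A", "B")
print(observed, p_value)
```

In this toy layout every A point has a B point as its nearest neighbour, so the observed quotient is well above 1 and few random relabellings match it, giving a small pseudo p-value.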
Table 2
The global values of GTWCLQ12 in monthly pattern
Category | C1 | C2 | C3 | C4 | C5 | Sample size |
C1 | 0.957 | 1.079 | 0.995 | 0.908 | 1.076 | 57314 |
C2 | 0.856 | 1.189 | 0.943 | 0.950 | 1.199 | 48569 |
C3 | 0.880 | 1.049 | 1.659 | 0.914 | 1.147 | 8544 |
C4 | 0.811 | 1.090 | 0.929 | 1.441 | 1.140 | 15504 |
C5 | 0.804 | 1.123 | 0.997 | 1.036 | 3.791 | 1688 |
Note: all significant at 5% level |
Table 2 provides insights into the spatial patterns of disease categories. Specifically, four disease categories (C2–C5) exhibit same-category CLQ values greater than 1, which are highlighted in bold and italic in the table. These values along the diagonal indicate that these disease categories have a strong spatial autocorrelation, taking into account both spatial and temporal distance decay effects. However, Acute nasopharyngitis (C1) shows an approximately random distribution with respect to itself, as its same-category value (0.957) is slightly below 1. Among all the diseases, Polyangiitis-bronchitis (C5) exhibits the most significant clustering pattern with itself, as evidenced by the highest CLQ value of 3.791.
Furthermore, the global GTWCLQ analysis detects symmetry and asymmetry in dependence between the five categories, as depicted in Fig. 5. Two pairs of categories exhibit symmetrical association or co-location: Polyangiitis-bronchitis (C5) and Breathing pneumonia (C2), as well as Polyangiitis-bronchitis (C5) and Bronchitis (C4). This indicates that the diseases in each pair tend to co-occur in the same locations. Conversely, there are asymmetric associations between C1 and C2, C3 and C2, and C4 and C2. Specifically, Breathing pneumonia (C2) and Polyangiitis-bronchitis (C5) are the most dependent categories [59], as all other categories have GTWCLQ values greater than 1 in relation to them, but not vice versa. This suggests that Breathing pneumonia (C2) and Polyangiitis-bronchitis (C5) tend to attract many other diseases to occur in their vicinity.
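The symmetric and asymmetric pairs can be read directly off the values in Table 2. The sketch below assumes that a row category is "attracted to" a column category when the corresponding value exceeds 1, and that a pair is symmetric when both directions exceed 1; the function name and the threshold parameter are illustrative.

```python
CATS = ["C1", "C2", "C3", "C4", "C5"]

# Global values from Table 2 (row = source category, column = target category).
CLQ = [
    [0.957, 1.079, 0.995, 0.908, 1.076],
    [0.856, 1.189, 0.943, 0.950, 1.199],
    [0.880, 1.049, 1.659, 0.914, 1.147],
    [0.811, 1.090, 0.929, 1.441, 1.140],
    [0.804, 1.123, 0.997, 1.036, 3.791],
]

def classify_pairs(cats, clq, threshold=1.0):
    """Split category pairs into symmetric and asymmetric associations.

    Asymmetric pairs are returned as (attracted, attractor).
    """
    symmetric, asymmetric = [], []
    for i in range(len(cats)):
        for j in range(i + 1, len(cats)):
            forward = clq[i][j] > threshold
            backward = clq[j][i] > threshold
            if forward and backward:
                symmetric.append((cats[i], cats[j]))
            elif forward:
                asymmetric.append((cats[i], cats[j]))
            elif backward:
                asymmetric.append((cats[j], cats[i]))
    return symmetric, asymmetric

symmetric, asymmetric = classify_pairs(CATS, CLQ)
print("symmetric:", symmetric)
print("asymmetric:", asymmetric)
```

With the Table 2 values, the symmetric pairs are C2 with C5 and C4 with C5, and every asymmetric pair has C2 or C5 as the attractor.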
The global GTWCLQ provides an overall assessment of spatial associations between different disease types, while the local GTWCLQ enables a more granular analysis of spatio-temporal associations from one location to another. Figure 6 illustrates the local GTWCLQ values, highlighting associations between each disease and other disease types. Only local GTWCLQ values greater than one and statistically significant (p < 0.05) are considered, as these provide valuable insights for understanding the distribution patterns of diseases.
Firstly, from a spatial autocorrelation perspective, C5 and C3 exhibit different distribution patterns. As depicted in Fig. 6a, Polyangiitis-bronchitis (C5) is primarily concentrated in the city centre, where population and traffic density are higher compared to other areas. In contrast, Acute pharyngitis (C3) is prevalent on both sides of the river, particularly in the southern region (Fig. 6b).
Secondly, the spatio-temporal association with other diseases varies for each type. As mentioned earlier, all other disease types display asymmetric association with Polyangiitis-bronchitis (C5) in the city centre and residential areas, indicating that these diseases tend to cluster with the flow of people (Fig. 6c). Additionally, there is a symmetrical association between C2 and C5, as shown in Fig. 6d, with different patterns of high concentration. Notably, the inter-attraction association mainly occurs in densely populated areas, such as around the railway station (Fig. 6c). This could be attributed to higher levels of air pollution in these areas, which are more likely to trigger the occurrence of these two diseases.
3.4 GTWCLQ31: Daily spatial-temporal pattern in Nanning
The "GTWCLQ31" dataset, which covers the 31 days of December 2016, was used to examine the daily spatio-temporal pattern in Nanning. Accordingly, a temporal bandwidth of 31 days was chosen, with the temporal period set to "day". However, because "GTWCLQ31" is a small dataset with an imbalance ratio of 29, well above 10, it was deemed inappropriate to apply the GTWCLQ model directly, as outlined in the guidelines in section 2.2.
To address this issue, class C5 was merged into C3, as both exhibit similar symptoms. In December, C5 totalled only 151 cases (1.4% of the dataset), with only one case on the 17th. By combining C3 and C5, the imbalance ratio (IR) of "GTWCLQ31" was reduced to 5, meeting the conditions for statistical analysis. Additionally, an adaptive spatial bandwidth was applied to account for the uneven distribution of diseases. Conventionally, the K value could be set to 100 based on formula (12). However, considering the limited number of cases on the 11th (only 54) and the randomness of the Monte Carlo sampling, this study opted for a K value of 25, roughly half the size of the smallest daily sample, observed on the 11th. The simulation sample size was set to 50, and the number of simulation iterations was set to 1,000 for the Monte Carlo analysis.
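The effect of merging C5 into C3 on the imbalance ratio can be checked with a short sketch. The December counts for C1, C2, and C4 and the merged C3 count are taken from Table 3, and the C5 count from the text; the pre-merge C3 count (738) is an inference, obtained here as 889 minus 151, and is not stated in the source. The helper names are illustrative.

```python
def merge_classes(class_counts, source, target):
    """Return new counts with `source` folded into `target`."""
    merged = dict(class_counts)
    merged[target] = merged[target] + merged.pop(source)
    return merged

def imbalance_ratio(class_counts):
    """Majority class size divided by minority class size."""
    return max(class_counts.values()) / min(class_counts.values())

# December counts; C3 = 738 is inferred (889 - 151), not reported directly.
december = {"C1": 4200, "C2": 4396, "C3": 738, "C4": 1187, "C5": 151}

before = imbalance_ratio(december)
merged = merge_classes(december, source="C5", target="C3")
after = imbalance_ratio(merged)
print(round(before), round(after), merged)
```

Before the merge the ratio is about 29 (4396 to 151); afterwards the smallest class is the combined C3 with 889 cases and the ratio falls to about 5.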
Following the merging of classes and the specified parameter settings, the global GTWCLQ values at a significance level of 0.05, across all four categories, are presented in Table 3.
Table 3
Global values of GTWCLQ31 at daily level (time scale is 31 days)
Category | C1 | C2 | C3 | C4 | Sample size |
C1 | 1.004 | 1.032 | 0.690 | 1.101 | 4200 |
C2 | 0.723 | 1.305 | 0.420 | 1.286 | 4396 |
C3 | 0.488 | 1.030 | 2.670 | 1.453 | 889 |
C4 | 0.664 | 0.773 | 1.127 | 2.938 | 1187 |
Note: all significant at 5% level |
The local co-location values from GTWCLQ31 at the daily level are depicted in Fig. 7. Compared to the monthly pattern, the daily pattern observed in December exhibits stronger clustering within the same category. Among the disease categories, Bronchitis (C4) shows the most pronounced clustering within its category in the daily pattern (Fig. 7a-b), indicating a higher tendency for cases of Bronchitis to cluster together. In contrast, Acute Nasopharyngitis (C1) displays relatively weaker clustering within its category. Additionally, asymmetrical associations exist between certain categories. Specifically, Acute Nasopharyngitis (C1) is co-located with C2 and C4, but the inverse is not true. This suggests that most cases of Acute Nasopharyngitis are likely caused by localised disease outbreaks rather than transmission from other diseases. Furthermore, all diseases exhibit co-location with C4 (Fig. 7c), indicating that Bronchitis is consistently associated with all childhood respiratory diseases.
Moreover, a symmetrical co-location relationship was identified between Acute Pharyngitis (C3) and Bronchitis (C4), suggesting that these two diseases frequently cluster at the same locations and times (Fig. 7d). These findings support the relevance of classifying the pathogenesis of these diseases and confirm their spatio-temporal associations.
3.5 GTWCLQ7: Daily spatial-temporal pattern under different air conditions
The association between respiratory diseases and air quality in residential neighbourhoods is well-established. Using the GTWCLQ method, various spatio-temporal patterns can be explored under different air quality conditions. For this analysis, two specific days in December were chosen to represent contrasting air quality: the 19th, characterised by severe pollution, and the 27th, characterised by good air quality.
The "GTWCLQ7" dataset used in this experiment is small and suffers from a severe class imbalance (IR of 20). To address this, category C5 was merged into C3 based on disease characteristics. Previous studies have shown that respiratory diseases typically persist for 7 to 14 days [65]. Therefore, a temporal bandwidth of 7 days was selected, reflecting the course of the disease. Considering the impact area of air pollution, a fixed bandwidth was chosen for this experiment. Some studies have suggested that the influential buffer radius of air pollution ranges from 100 to 3,000 metres [66]. To assess the extent of pollution in Nanning, the neighbouring distance range in the "GTWCLQ7" dataset was calculated using formulas (8–9) and found to range from 1,000 to 3,000 metres, with a mean distance of 2,000 metres. Consequently, 2,000 metres was selected as the spatial bandwidth for this study. Additionally, the box kernel function was chosen as the spatial kernel function due to the nature of air pollution. Monte Carlo simulations were performed 1,000 times, with the simulation sample size set to 25. The results of the global GTWCLQ values and the local distribution are presented in Table 4 and Fig. 8.
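A fixed 2,000-metre box kernel combined with a 7-day temporal window can be sketched as below. This is an assumption-laden illustration: the way the spatial and temporal weights are combined (a simple product of two 0/1 indicators), the backward-looking window, and the use of the haversine great-circle distance are choices made here for clarity, not the formulas of the paper. The coordinates are hypothetical.

```python
import math
from datetime import date

EARTH_RADIUS_M = 6371000.0  # mean Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def box_weight(dist_m, bandwidth_m=2000.0):
    """Box kernel: full weight inside the bandwidth, zero outside."""
    return 1.0 if dist_m <= bandwidth_m else 0.0

def temporal_weight(reference_day, event_day, window_days=7):
    """1 if the event falls within the window ending on the reference day."""
    lag = (reference_day - event_day).days
    return 1.0 if 0 <= lag < window_days else 0.0

def st_weight(ref, other, bandwidth_m=2000.0, window_days=7):
    """Assumed combined weight: product of spatial and temporal indicators."""
    (lat1, lon1, day1), (lat2, lon2, day2) = ref, other
    d = haversine_m(lat1, lon1, lat2, lon2)
    return box_weight(d, bandwidth_m) * temporal_weight(day1, day2, window_days)

# Hypothetical incidents: (latitude, longitude, admission date).
ref = (22.80, 108.30, date(2016, 12, 19))
near_recent = (22.81, 108.30, date(2016, 12, 13))   # about 1.1 km away, 6 days earlier
near_old = (22.81, 108.30, date(2016, 12, 12))      # same place, 7 days earlier
far_recent = (22.90, 108.30, date(2016, 12, 18))    # about 11 km away

print(st_weight(ref, near_recent), st_weight(ref, near_old), st_weight(ref, far_recent))
```

With a reference day of the 19th, the window covers the 13th to the 19th, so only the first hypothetical neighbour receives a non-zero weight.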
The analysis on a 7-day temporal scale revealed generally weak spatial associations under both air conditions, with notable differences between the two. Within the same disease category, three types of diseases (all except C1) exhibited clustering on the polluted day. Conversely, on the clean day, only C2 had a same-category value above 1 (Table 4). This suggests that most childhood respiratory diseases tended to cluster under poorer air conditions. Regarding associations between different disease categories, C1 and C2 showed symmetric co-location under both air conditions. In addition, C1 and C4 exhibited inter-attraction on the polluted day, while C1 and C3 did so on the clean day. This indicates that these diseases often cluster at the same place and time. Finally, all diseases were attracted to Breathing Pneumonia (C2) on the polluted day, indicating that C2 tends to attract other diseases in its vicinity.
Table 4
The global values of GTWCLQ7 in different air conditions (7-day time scale)
Day | Category | C1 | C2 | C3 | C4 |
19th (max PM2.5) | C1 | 0.9248 | 1.0670 | 0.8692 | 1.0094 |
19th (max PM2.5) | C2 | 1.0245 | 1.0173 | 0.7990 | 0.9950 |
19th (max PM2.5) | C3 | 1.0502 | 1.0289 | 1.0264 | 0.7915 |
19th (max PM2.5) | C4 | 1.0172 | 1.0317 | 0.7064 | 1.0111 |
27th (min PM2.5) | C1 | 0.9592 | 1.0311 | 1.1054 | 0.9460 |
27th (min PM2.5) | C2 | 1.0180 | 1.0033 | 1.0026 | 0.9452 |
27th (min PM2.5) | C3 | 1.0669 | 0.9460 | 0.9156 | 1.0589 |
27th (min PM2.5) | C4 | 1.0402 | 1.0382 | 1.0483 | 0.7538 |
To illustrate the disparity between the two air quality conditions, the local co-location values of C1 and C2 under different air conditions are depicted in Fig. 8. It is clear that the symmetric co-location between Acute Nasopharyngitis (C1) and Breathing Pneumonia (C2) clusters more prominently on polluted days compared to clean days.
On the 19th, the clusters are primarily concentrated in the centre of Qingxiu and in rural areas of the Xingning district. These areas have a high density of catering businesses or are affected by straw burning, both of which contribute to higher pollution levels. In contrast, the cluster points on the 27th are less concentrated and more scattered, indicating that Acute Nasopharyngitis (C1) and Breathing Pneumonia (C2) exhibit a more random distribution pattern in a clean environment.
3.6 Temporal effects
The scaling effect has long been a central topic in Geographic Information Systems (GIS), with most studies primarily focused on spatial scales. However, the availability of big data, particularly high-velocity temporal data, has increasingly raised concerns related to temporal effects. This is particularly important for GTWCLQ analysis. In this case study, the temporal scales are determined by both resolution (daily or monthly) and duration (one year, one month, or one week). The starting time effect refers to the sensitivity of the spatio-temporal patterns to the initial time.
To capture these temporal effects, including scale, duration, and starting time, two indicators, the frequency and intensity of co-location patterns (as defined in Equations 12–13), were calculated and are presented in Table 5. These indicators help analyse the impact of various temporal factors on the observed spatio-temporal patterns.
Table 5
The frequency and intensity values between four scenarios
Index | GTWCLQ12 | GTWCLQ31 | GTWCLQ7 (19th) | GTWCLQ7 (27th) |
\(V_{frequency}\) | 0.456 | 0.364 | 0.462 | 0.495 |
\(V_{intensity}\) | 0.661 | 0.727 | 0.703 | 0.667 |
Resolution | Month | Day | Day | Day |
Duration | 1 year | 1 month | 1 week | 1 week |
Starting | January | 1st Dec | 13th Dec | 21st Dec |
Table 5 provides a comprehensive overview of the varying intensity and frequency of spatio-temporal associations across four scenarios. The key observations are as follows:
Frequency of Aggregation
The highest aggregation frequency is observed at the 7-day scale (both GTWCLQ7 scenarios), followed by the monthly scale (GTWCLQ12), with the lowest frequency found at the daily scale over one month (GTWCLQ31). This indicates that different types of respiratory diseases exhibit more pronounced spatial and temporal associations at the 7-day (one-week) scale.
Intensity Index
The daily scale generally demonstrates stronger intensity than the monthly scale (GTWCLQ12), and a longer duration (GTWCLQ31) results in higher intensity in the daily pattern.
Starting Day Effects
At the same temporal scale (resolution and duration), both frequency and intensity values vary with the starting day. The polluted-day scenario, GTWCLQ7 (19th), shows more intensive co-location but lower frequency, while the clean-day scenario, GTWCLQ7 (27th), shows lower intensity but higher frequency.
Overall, this analysis reveals that temporal effects significantly influence the frequency of co-location, with the monthly pattern frequency being greater than the daily pattern. Additionally, temporal duration has a substantial impact on the intensity of co-location relationships, as the intensity over a year is less than that over a month. Furthermore, the starting time also affects both the frequency and intensity index, highlighting the importance of considering starting time effects in spatio-temporal analyses.