Attentions Data
Daily counts of attentions (healthcare visits) for pneumonia were obtained from the individual reports provided by healthcare provision systems from January 2009 to December 2019. The daily data were grouped by epidemiological week to obtain weekly cumulative attentions. The study covered the five most populated cities in Colombia: Bogotá, Medellín, Cali, Barranquilla, and Cartagena.
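The weekly aggregation can be sketched as follows. This is a minimal Python illustration, not the code used in the study. It assumes the CDC/MMWR-style convention in which epidemiological weeks run from Sunday to Saturday and week 1 is the week containing 4 January; the function names and the daily counts are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def week1_start(year):
    """Sunday opening epidemiological week 1 (assumed: the week containing 4 January)."""
    jan4 = date(year, 1, 4)
    days_since_sunday = (jan4.weekday() + 1) % 7  # Python: Monday=0 ... Sunday=6
    return jan4 - timedelta(days=days_since_sunday)


def epiweek(day):
    """Return (epidemiological year, week number) for a calendar date."""
    year = day.year
    if day < week1_start(year):
        year -= 1  # early-January days can belong to the previous year's last week
    elif day >= week1_start(year + 1):
        year += 1  # late-December days can belong to the next year's week 1
    return year, (day - week1_start(year)).days // 7 + 1


def weekly_totals(daily_counts):
    """Sum daily counts {date: n} into weekly cumulative counts {(year, week): n}."""
    totals = defaultdict(int)
    for day, n in daily_counts.items():
        totals[epiweek(day)] += n
    return dict(totals)


# Made-up daily counts for one city, for illustration only.
daily = {
    date(2014, 12, 27): 3,
    date(2014, 12, 28): 5,
    date(2014, 12, 31): 2,
    date(2015, 1, 4): 4,
}
weekly = weekly_totals(daily)
```

Keying the totals by (year, week) rather than by week number alone keeps week 53 of one year separate from week 1 of the next.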
Aerosols Data
Data on air pollution-aerosols corresponded to aerosol optical depth measured by the Moderate Resolution Imaging Spectroradiometer, a space-borne instrument [16]. Daily data on air pollution-aerosols are available from the NASA product Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2) [17]. The air pollution-aerosols included in this study were Black Carbon Surface Mass Concentration (BCSMASS), Dimethylsulphide Surface Mass Concentration (DMSSMASS), Dust Surface Mass Concentration of 2.5 μm in diameter (DUSMASS25), SO4 Surface Mass Concentration (SO4SMASS), and Sea Salt Surface Mass Concentration of 2.5 μm in diameter (SSSMASS25). All data were converted to μg/m³. Daily data were grouped by epidemiological week to obtain weekly averages. Spatial matching between the air pollution-aerosol values and the cities included in the study was performed using the raster package of R [18].
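A minimal sketch of the unit conversion and weekly averaging follows. MERRA-2 distributes surface mass concentrations in kg/m³, so converting to μg/m³ amounts to multiplying by 10⁹. The Python below is illustrative only: the record layout, the function names, and the values are hypothetical, and the epidemiological week keys are assumed to have been assigned beforehand.

```python
from collections import defaultdict
from statistics import fmean

KG_TO_UG = 1e9  # 1 kg = 1e9 micrograms


def to_ug_per_m3(kg_per_m3):
    """Convert a mass concentration from kg/m^3 to ug/m^3."""
    return kg_per_m3 * KG_TO_UG


def weekly_mean(records):
    """Average daily values per week.

    records: iterable of ((year, week), concentration in kg/m^3).
    Returns {(year, week): mean concentration in ug/m^3}.
    """
    by_week = defaultdict(list)
    for key, value in records:
        by_week[key].append(to_ug_per_m3(value))
    return {key: fmean(values) for key, values in by_week.items()}


# Made-up daily values for one aerosol variable in one city.
records = [((2015, 1), 2.0e-9), ((2015, 1), 4.0e-9), ((2015, 2), 1.0e-9)]
weekly = weekly_mean(records)
```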
Meteorological Data
Daily data on rainfall and temperature were obtained from the Modern-Era Retrospective Analysis for Research and Applications, Version 2 [19]. The daily data were grouped by epidemiological week to obtain weekly cumulative rainfall and weekly average temperature. Spatial matching between the weekly rainfall and temperature values and the five cities was performed using the raster package [18] of R.
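The spatial matching step can be illustrated with a nearest-grid-cell lookup. The text does not say how the raster package was applied, so the following Python sketch is an assumption rather than the procedure used in the study. It assumes the regular MERRA-2 grid (0.5° in latitude, 0.625° in longitude, cell centres starting at 90°S and 180°W) and uses approximate city coordinates chosen for illustration.

```python
LAT0, DLAT = -90.0, 0.5     # assumed first cell centre and spacing in latitude (degrees)
LON0, DLON = -180.0, 0.625  # assumed first cell centre and spacing in longitude (degrees)

# Approximate city-centre coordinates (latitude, longitude), illustrative only.
CITIES = {
    "Bogotá": (4.71, -74.07),
    "Medellín": (6.24, -75.58),
    "Cali": (3.45, -76.53),
    "Barranquilla": (10.97, -74.80),
    "Cartagena": (10.39, -75.48),
}


def nearest_cell(lat, lon):
    """Row and column of the grid cell whose centre is closest to the point."""
    row = round((lat - LAT0) / DLAT)
    col = round((lon - LON0) / DLON)
    return row, col


def cell_centre(row, col):
    """Latitude and longitude of a grid cell centre."""
    return LAT0 + row * DLAT, LON0 + col * DLON


# Map each city to the grid cell from which its values would be read.
city_cells = {name: nearest_cell(lat, lon) for name, (lat, lon) in CITIES.items()}
```

With this mapping, a weekly value for a city is read from the gridded field at the matched row and column.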
We included lags of up to 4 weeks for the air pollution-aerosol and meteorological variables, which we considered sufficient to cover the incubation period of the disease, the time taken to visit a healthcare facility, and the reporting of a new pneumonia case.
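The construction of lagged predictors can be sketched as below. This is an illustrative Python example with hypothetical names and values. Whether the unlagged (lag 0) value was kept alongside lags 1 to 4 is not stated in the text, so the sketch exposes it as an option.

```python
def lagged_features(series, name, max_lag=4, include_current=True):
    """Build one feature row per week from a weekly series.

    Weeks without a complete lag history (the first `max_lag` weeks) are dropped.
    Returns a list of dicts such as {"temp_lag0": ..., "temp_lag1": ..., ...}.
    """
    first_lag = 0 if include_current else 1
    rows = []
    for t in range(max_lag, len(series)):
        rows.append(
            {f"{name}_lag{lag}": series[t - lag] for lag in range(first_lag, max_lag + 1)}
        )
    return rows


# Made-up weekly average temperatures for six consecutive weeks.
temperature = [10, 11, 12, 13, 14, 15]
rows = lagged_features(temperature, "temp")
```

Each row aligns the response in week t with predictor values from weeks t down to t-4, so four weeks at the start of the series are lost for each city.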
Admission Data
In the models, we included the following as admission data: the year (2009 to 2019); the epidemiological week (Epiweek), with values from 1 to 52 or 53 within each year; and the consecutive week (Consweek), with values from 1 to 573 across the entire study period.
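The three calendar variables can be generated as in the following Python sketch. It assumes a CDC/MMWR-style epidemiological calendar (weeks start on Sunday; week 1 contains 4 January), which the text does not confirm, and the function names are hypothetical. Under this assumption the years 2009 to 2019 contain 573 weeks, with 2014 the only 53-week year, which agrees with the Consweek range given above.

```python
from datetime import date, timedelta


def week1_start(year):
    """Sunday opening epidemiological week 1 (assumed: the week containing 4 January)."""
    jan4 = date(year, 1, 4)
    return jan4 - timedelta(days=(jan4.weekday() + 1) % 7)


def calendar_table(first_year, last_year):
    """List of (Year, Epiweek, Consweek) tuples covering the given years."""
    rows = []
    consweek = 0
    for year in range(first_year, last_year + 1):
        # Number of weeks in the epidemiological year: 52 or 53.
        n_weeks = (week1_start(year + 1) - week1_start(year)).days // 7
        for epiweek in range(1, n_weeks + 1):
            consweek += 1
            rows.append((year, epiweek, consweek))
    return rows


table = calendar_table(2009, 2019)
years_with_53_weeks = sorted({year for year, week, _ in table if week == 53})
```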
Machine-learning Methods
Four machine-learning methods were implemented to forecast the number of attentions for pneumonia in each city: Extreme Gradient Boosting (XGBoost), Random Forest (RF), Support Vector Machines (SVM), and Bayesian Additive Regression Trees (BART).
XGBoost implements gradient-boosted decision trees: new models are fitted to the residuals (errors) of prior models and then added together to make the final prediction [20]. RF combines several randomized decision trees and aggregates their predictions by averaging [21]. The objective of SVM is to find a hyperplane in an N-dimensional space that distinctly classifies the data points [22]; in its regression form, as required for a count response, it instead fits a function that keeps most observations within a tolerance margin. BART is a nonparametric Bayesian regression approach that uses dimensionally adaptive random basis elements [23].
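The residual-fitting idea behind boosting can be shown with a small example. The Python sketch below is plain gradient boosting for squared error with one-split trees (stumps) on a single predictor. It is not XGBoost itself, which adds regularization and other refinements, and all names and data are hypothetical.

```python
from statistics import fmean


def fit_stump(xs, residuals):
    """Best one-split tree (threshold, left mean, right mean) by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the maximum so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = fmean(left), fmean(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit each new stump to the residuals of the current ensemble."""
    base = fmean(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, x):
    """Final prediction: the base value plus the shrunken sum of all stumps."""
    total = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * total


# Made-up data: a curved relationship that a single stump cannot capture.
xs = list(range(10))
ys = [x ** 2 for x in xs]
model = boost(xs, ys)
```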
In each city, the response variable was the weekly number of attentions for pneumonia. The predictor variables were the air pollution-aerosol and meteorological variables with lags of up to 4 weeks, together with the year, the Epiweek, and the Consweek.
For each city, the dataset was partitioned by stratified random sampling into a training set (70%) and a test set (30%), and each machine-learning method was trained on the former and tested on the latter. Ten-fold cross-validation was applied to the training set. Additional file 1 shows the parameters of each machine-learning method implemented. Forecasting performance was evaluated with the R² metric. We used the caret package [24] of R to implement the machine-learning methods.
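The evaluation design can be sketched as follows. This Python example is illustrative and is not the caret implementation. It assumes that stratification is done on quantile bins of the response, and it uses the coefficient-of-determination form of R² (some software reports the squared correlation between observed and predicted values instead; the text does not say which was used). Function names, the number of bins, and the seed are hypothetical.

```python
import random


def stratified_split(y, train_frac=0.7, n_bins=5, seed=1):
    """Split row indices into train/test, sampling within quantile bins of y."""
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])
    bin_size = -(-len(order) // n_bins)  # ceiling division
    train, test = [], []
    for start in range(0, len(order), bin_size):
        bin_idx = order[start:start + bin_size]
        rng.shuffle(bin_idx)
        cut = round(train_frac * len(bin_idx))
        train += bin_idx[:cut]
        test += bin_idx[cut:]
    return sorted(train), sorted(test)


def k_folds(indices, k=10, seed=1):
    """Partition the training indices into k folds for cross-validation."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Made-up weekly counts for one city.
y = list(range(100))
train_idx, test_idx = stratified_split(y)
folds = k_folds(train_idx)
```

In each cross-validation round, one fold is held out for validation and the remaining nine are used for fitting; the 30% test set is touched only for the final R².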
Machine-learning Model Interpretation
We applied permutation feature importance and feature interaction to explain and analyze the behavior and forecasts of the best-performing machine-learning model in each city.
Permutation feature importance ranks each variable by how much the model error increases when its values are randomly shuffled (permuted). A variable is important if permuting its values increases the model error, because the model relies on that variable to forecast accurately. Conversely, if the model error does not change when the values are permuted, the variable does not contribute to the forecast [25]. The permutation feature importance was estimated with 500 iterations.
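A minimal Python sketch of the procedure is given below. It is not the iml implementation: the model is a stand-in function, the loss is assumed to be the mean absolute error, and importance is expressed as the ratio of permuted to original error, which is one common convention. All names and data are hypothetical.

```python
import random


def mean_abs_error(model, X, y):
    return sum(abs(model(row) - target) for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=500, seed=1):
    """Mean ratio of permuted to original error for each feature column."""
    rng = random.Random(seed)
    base_error = mean_abs_error(model, X, y)
    importances = []
    for j in range(len(X[0])):
        ratios = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the response
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            ratios.append(mean_abs_error(model, X_perm, y) / base_error)
        importances.append(sum(ratios) / n_repeats)
    return importances


# Toy setting: the stand-in model uses only the first feature.
def toy_model(row):
    return 2 * row[0]


X = [[i, (i * 7) % 10] for i in range(10)]
y = [2 * i + (1 if i % 2 else -1) for i in range(10)]  # small alternating error
importances = permutation_importance(toy_model, X, y)
```

A ratio of 1 means that shuffling the variable left the error unchanged, as happens here for the second feature, which the toy model ignores; ratios above 1 indicate that the model depends on the variable.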
Feature interaction measures how far the effect of one variable on the forecast depends on the values of other variables. A variable can therefore influence the machine-learning model not only on its own but also jointly with other variables, and these joint effects shape how the model makes its forecast [26].
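One common way to quantify interaction is Friedman's H-statistic, which compares the joint partial dependence of two variables with the sum of their individual partial dependences. The text does not state which statistic was computed, so the Python sketch below is an assumption offered for illustration, with hypothetical names and toy models.

```python
def centred(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def partial_dependence(model, X, fixed):
    """Centred partial dependence evaluated at each data point.

    For row i, the features in `fixed` keep row i's values while the
    remaining features are averaged over every row of the data.
    """
    out = []
    for xi in X:
        total = 0.0
        for xm in X:
            row = [xi[c] if c in fixed else xm[c] for c in range(len(xi))]
            total += model(row)
        out.append(total / len(X))
    return centred(out)


def h_squared(model, X, j, k):
    """Share of the joint effect of features j and k that is not additive."""
    pd_jk = partial_dependence(model, X, {j, k})
    pd_j = partial_dependence(model, X, {j})
    pd_k = partial_dependence(model, X, {k})
    numerator = sum((a - b - c) ** 2 for a, b, c in zip(pd_jk, pd_j, pd_k))
    denominator = sum(a ** 2 for a in pd_jk)
    return numerator / denominator


# Toy data and two stand-in models: one additive, one with a product term.
X = [[a, b, (a + b) % 2] for a in (-1, 0, 1) for b in (-1, 0, 1)]


def additive_model(row):
    return row[0] + 2 * row[1] + row[2]


def interacting_model(row):
    return row[0] * row[1] + row[2]


h_additive = h_squared(additive_model, X, 0, 1)
h_interacting = h_squared(interacting_model, X, 0, 1)
```

Values near 0 indicate that the two variables act additively, and values near 1 indicate that their joint effect comes almost entirely from the interaction.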
The iml package of R [25] was used to implement permutation feature importance and feature interaction.