Overview of the research framework
This study aimed to predict the peak demand days of CVDs admissions by adopting machine learning techniques. The block diagram of the classified prediction process is shown in Fig. 1. In brief, the time series dataset, comprising CVDs admissions, meteorological data and air quality data, was first pre-processed. Second, a generalized additive model (GAM) was built to choose the lag days of meteorological conditions and air pollutants for CVDs admissions. Then, six machine learning algorithms, namely LR, SVM, ANN, RF, XGBoost and LightGBM, were applied to construct the predictive models, and the models’ parameters were optimized with 10-fold cross-validation. After that, the predictive models were validated and their performances were compared. Finally, we predicted the peak demand days of CVDs admissions with the optimal machine learning model.
Fig. 1 Block Diagram of Classified Prediction Process
The details are discussed in the following sub-sections.
Data collection and preprocessing
Hospital admissions data
Data on the daily number of hospital admissions for patients with CVDs living in the urban areas of Chengdu were obtained from the Health Information Center of Sichuan Province, China. The data contain the aggregate daily number of admissions with a primary diagnosis of CVDs (International Classification of Diseases, 10th Revision codes: I00-I99) in all tertiary and secondary hospitals of Chengdu from 1 January 2015 to 31 December 2017, i.e. 1096 consecutive days of data.
Additionally, as we focused on the peak demand of CVDs admissions, a binary variable was generated from the daily number of CVDs admissions. In the absence of a known threshold for daily CVDs admissions, peak demand was defined using the 85th percentile threshold (304 hospital admissions per day), following previous studies [31, 33]. Specifically, days on which the daily number of CVDs admissions was equal to or above the 85th percentile threshold were defined as peak demand days. The resulting binary variable is highly imbalanced, with 931 non-peak demand samples and 165 peak demand samples. It served as the primary dependent variable in the analysis.
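As a concrete illustration, this labelling step could be implemented as follows; the file and column names (`cvd_admissions.csv`, `date`, `admissions`) are hypothetical, not taken from the source.

```python
import pandas as pd

# Hypothetical file and column names for the daily admissions series.
df = pd.read_csv("cvd_admissions.csv", parse_dates=["date"])

# 85th percentile threshold (304 admissions/day in this study's data).
threshold = df["admissions"].quantile(0.85)

# Days at or above the threshold are labelled peak demand (1), others 0.
df["peak"] = (df["admissions"] >= threshold).astype(int)

print(f"threshold={threshold:.0f}, "
      f"peak days={df['peak'].sum()}, non-peak days={(df['peak'] == 0).sum()}")
```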
Meteorological data and air quality data
Meteorological data, including temperature, relative humidity and rainfall, were derived from the Chengdu Meteorological Monitoring Database (http://data.cma.cn/).
Hourly data of air pollutants, including PM2.5 (particulate matter with aerodynamic diameter ≤ 2.5 µm), PM10 (particulate matter with aerodynamic diameter ≤ 10 µm), SO2, NO2, CO and O3, were obtained from the China National Environmental Monitoring Center (http://www.cnemc.cn/), which provides real-time monitoring of hourly concentrations of air pollutants to the general public. We computed daily 24-h mean concentrations for PM2.5, PM10, SO2, NO2 and CO, and daily maximum 8-h moving average concentrations for O3, averaged across the air quality monitoring stations interspersed among the urban areas of Chengdu. Concentrations of particulate matter with an aerodynamic diameter between 2.5 and 10 µm (PMC) were calculated by subtracting daily average concentrations of PM2.5 from those of PM10 [9, 34].
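A minimal sketch of this daily aggregation, assuming hourly station-averaged data with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical hourly pollutant data, already averaged across stations;
# file and column names are assumptions.
hourly = (pd.read_csv("hourly_pollutants.csv", parse_dates=["datetime"])
            .set_index("datetime"))

# Daily 24-h mean concentrations for PM2.5, PM10, SO2, NO2 and CO.
daily = hourly[["PM2.5", "PM10", "SO2", "NO2", "CO"]].resample("D").mean()

# Daily maximum of the 8-h moving average for O3.
daily["O3"] = hourly["O3"].rolling(window=8).mean().resample("D").max()

# PMC: coarse particles (2.5-10 µm) as the difference of PM10 and PM2.5.
daily["PMC"] = daily["PM10"] - daily["PM2.5"]
```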
Data preprocessing
Data on the daily number of hospital admissions for CVDs, meteorological data and air quality data were collected from different sources. We merged these three datasets by date to form a single time series dataset (1 January 2015 to 31 December 2017). Time series features were extracted from the date, including year, month (month of year), day (day of month), holiday (public holidays) and DOW (day of week).
During the study period, the percentages of missing values from the monitoring stations were 1.28% (14/1096) for meteorological conditions and 3.19% (35/1096) for air pollutants. Linear interpolation, which has acceptable performance and reliability, was used to fill in the missing values of meteorological conditions and air pollutants [35, 36].
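The merging, date-feature extraction and interpolation steps could look roughly like the following; all file and column names are assumed for illustration.

```python
import pandas as pd

# Hypothetical file and column names for the three daily data sources.
admissions = pd.read_csv("cvd_admissions.csv", parse_dates=["date"])
meteorology = pd.read_csv("meteorology.csv", parse_dates=["date"])
air_quality = pd.read_csv("air_quality.csv", parse_dates=["date"])

# Merge the three datasets into one time series keyed on date.
data = admissions.merge(meteorology, on="date").merge(air_quality, on="date")

# Time series features extracted from the date.
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month
data["day"] = data["date"].dt.day
data["DOW"] = data["date"].dt.dayofweek
holidays = pd.to_datetime(["2015-01-01", "2015-02-19"])  # illustrative subset
data["holiday"] = data["date"].isin(holidays).astype(int)

# Fill missing meteorological and pollutant values by linear interpolation.
cols = ["temperature", "humidity", "rainfall",
        "PM2.5", "PM10", "PMC", "SO2", "NO2", "CO", "O3"]
data[cols] = data[cols].interpolate(method="linear")
```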
Feature extraction
As illustrated in the sections above, the features for predicting the peak demand days of CVDs admissions included time series features, meteorological conditions features and air pollutants features. Accumulating epidemiological evidence suggests that the effect of meteorological conditions and air pollutants on CVDs admissions is delayed, and that the lag effect depends on the regional environment [8, 12, 37]. Hence, we employed an over-dispersed GAM with a quasi-Poisson distribution to analyze the lag effects of daily meteorological conditions and air pollutants on CVDs admissions, and chose the lag day with the minimum Generalized Cross-Validation (GCV) value, which measures model fit [5, 34]. Lag effects of single-day lags (lag0 to lag6) and cumulative-day lags (lag01 to lag06) were taken into consideration. Penalized spline approaches were applied to control for potential confounding by long-term trends, seasonality and meteorological effects [38]. Moreover, dummy variables for holiday and DOW were included as controls.
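A rough sketch of this lag-selection loop is shown below. pyGAM's `PoissonGAM` stands in for the over-dispersed quasi-Poisson GAM used here, and the smoothing setup, variable names and score handling are all assumptions, not the study's actual implementation.

```python
import pandas as pd
from pygam import PoissonGAM, s

def candidate_lags(x: pd.Series) -> dict:
    """Single-day lags lag0-lag6 and cumulative moving-average lags lag01-lag06."""
    lags = {f"lag{k}": x.shift(k) for k in range(7)}
    lags.update({f"lag0{k}": x.rolling(k + 1).mean() for k in range(1, 7)})
    return lags

def best_lag(exposure: pd.Series, y: pd.Series) -> str:
    """Fit one GAM per lag variant and return the name of the minimum-score lag."""
    scores = {}
    for name, lagged in candidate_lags(exposure).items():
        mask = lagged.notna()
        gam = PoissonGAM(s(0)).fit(lagged[mask].to_numpy().reshape(-1, 1), y[mask])
        # pyGAM reports both GCV and UBRE; fall back to UBRE if GCV is not
        # applicable for the fitted distribution.
        stats = gam.statistics_
        scores[name] = stats["GCV"] if stats.get("GCV") is not None else stats["UBRE"]
    return min(scores, key=scores.get)
```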
The results demonstrated that temperature, relative humidity, rainfall, PM2.5, PM10, PMC, SO2, NO2, CO and O3 were associated with CVDs admissions, with the minimum GCV values at lag04, lag06, lag06, lag3, lag3, lag3, lag0, lag0, lag0 and lag6, respectively.
Finally, the independent variables for forecasting the peak demand days of CVDs admissions included fifteen features, which are shown in Table 1.
Table 1 The features for prediction
| Feature category | Features | Description |
| --- | --- | --- |
| time series features | year | year of the date of hospital admission |
| | month | month of year |
| | day | day of month |
| | holiday | public holidays |
| | DOW | day of week |
| meteorological conditions features | Tem_lag04 | mean temperature for the moving average of current day and previous four days (lag04) |
| | RH_lag06 | relative humidity for the moving average of current day and previous six days (lag06) |
| | Rain_lag06 | rainfall for the moving average of current day and previous six days (lag06) |
| air pollutants features | PM2.5_lag3 | PM2.5 at the previous three days (lag3) |
| | PM10_lag3 | PM10 at the previous three days (lag3) |
| | PMC_lag3 | PMC at the previous three days (lag3) |
| | SO2_lag0 | SO2 at the current day (lag0) |
| | NO2_lag0 | NO2 at the current day (lag0) |
| | CO_lag0 | CO at the current day (lag0) |
| | O3_lag6 | O3 at the previous six days (lag6) |
Machine learning methods
In this study, six well-established machine learning algorithms, namely LR, SVM, ANN, RF, XGBoost and LightGBM, were applied to develop predictive models on the same feature set. These methods were selected for the following characteristics.
LR is a common and basic algorithm, widely used in disease risk prediction and epidemiology [39]. SVM is a discriminative classification technique that has been widely applied in medical diagnostics and other fields, especially with small sample sets [40]. ANN, inspired by biological neural networks, has a remarkable ability to capture the meaning and rules of complicated data [41, 42]. RF, an ensemble algorithm, applies a bootstrap algorithm to randomly extract multiple samples from the training set, and trains each sample with a weak classifier (i.e. a decision tree) [43]. The final result of RF is determined by a majority vote over all decision trees, thereby improving the predictive accuracy and preventing the model from over-fitting. XGBoost is a distributed gradient boosting algorithm that has gained wide popularity and attention in machine learning competitions [44, 45]. XGBoost chooses a weak classifier to facilitate efficient optimization, adds an L2 regularization term on leaf weights to achieve lower variance, and uses a second-order Taylor expansion of the cost function to retain more information about the target function, thereby improving the predictive accuracy. LightGBM is a distributed, high-performance gradient boosting framework based on decision trees, designed for fast computation, especially on very large datasets [46]. It utilizes two novel techniques, gradient-based one-side sampling and exclusive feature bundling, which deal with large numbers of samples and large numbers of features, respectively [47].
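One plausible instantiation of the six classifiers with scikit-learn, XGBoost and LightGBM is sketched below, using the tuned values later reported in Table 2 where available. The study's ANN is described with Keras-style parameters (kernel initializer, epochs, batch size), so scikit-learn's `MLPClassifier` here is only a stand-in, and its layer sizes are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = {
    "LR": LogisticRegression(penalty="l1", solver="liblinear"),
    "SVM": SVC(kernel="linear", C=5, probability=True),
    # Stand-in for the paper's Keras-style network; sizes are assumptions.
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), activation="relu", max_iter=300),
    "RF": RandomForestClassifier(n_estimators=695, max_depth=4, max_features=7),
    "XGBoost": XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=3,
                             gamma=0.6, subsample=0.7, colsample_bytree=0.6,
                             min_child_weight=2),
    "LightGBM": LGBMClassifier(learning_rate=0.1, n_estimators=100, max_depth=8,
                               num_leaves=10, min_child_samples=5),  # min_data_in_leaf
}
```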
All above-mentioned models were trained and tested on an 80/20 split of the dataset obtained by stratified random sampling. However, with imbalanced class data and unequal error costs, standard performance metrics do not reflect the true performance of a classifier. It is therefore necessary to account for the class imbalance; we adjusted weights inversely proportional to class frequencies in the input data when training the machine learning models.
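Continuing the sketch above, the stratified split and class weighting might be expressed as follows; `X`, `y` and the `models` dictionary are assumed from the preceding steps.

```python
from sklearn.model_selection import train_test_split

# Stratified 80/20 split preserving the peak/non-peak ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Weights inversely proportional to class frequencies: sklearn-style
# estimators accept class_weight="balanced"; XGBoost uses scale_pos_weight,
# roughly negatives / positives, about 931/165 ≈ 5.6 on the full data.
# (MLPClassifier has no class_weight; Keras takes class weights at fit time.)
for name in ("LR", "SVM", "RF", "LightGBM"):
    models[name].set_params(class_weight="balanced")
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
models["XGBoost"].set_params(scale_pos_weight=neg / pos)
```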
The parameters of the six predictive models were determined by grid search with 10-fold cross-validation on the training dataset. Specifically, we partitioned the training dataset into ten equally sized pieces, tuned the parameters by grid search on nine pieces, and used the remaining piece as the validation set. This process was repeated ten times, and the best parameters were those achieving the highest score averaged over the ten repetitions. Table 2 shows the values of the parameters in each model; a sketch of this tuning loop follows the table.
Table 2 Summary of parameter values in each model
| Models | Parameters | Values | Parameter meaning |
| --- | --- | --- | --- |
| LR | penalty | L1 | penalty function |
| SVM | kernel | linear | kernel function |
| | C | 5 | penalty parameter of the error term |
| ANN | kernel_initializer | uniform | kernel initializer function |
| | activation1 | relu | activation of hidden layer |
| | activation2 | sigmoid | activation of output layer |
| | optimizer | Adam | training optimization algorithm |
| | epochs | 300 | number of times the full training set is shown to the network |
| | batch_size | 20 | batch size |
| | dropout | 0.0 | dropout rate |
| RF | n_estimators | 695 | number of iterations |
| | max_depth | 4 | maximum depth of variable interactions |
| | max_features | 7 | number of features for the best split |
| XGBoost | learning_rate | 0.1 | learning rate |
| | n_estimators | 100 | number of iterations |
| | eta | 0.01 | control of learning rate |
| | max_depth | 3 | maximum depth of variable interactions |
| | gamma | 0.6 | minimum loss reduction required to make a further partition on a leaf node |
| | subsample | 0.7 | subsample ratio |
| | colsample_bytree | 0.6 | subsample ratio of columns when constructing each tree |
| | min_child_weight | 2 | minimum sum of instance weights needed in a leaf node |
| LightGBM | learning_rate | 0.1 | learning rate |
| | n_estimators | 100 | number of iterations |
| | max_depth | 8 | maximum depth of variable interactions |
| | num_leaves | 10 | number of leaves in each tree |
| | bagging_fraction | 0.7 | percentage of data sampled in each iteration |
| | feature_fraction | 0.9 | ratio of features used to build the tree in each iteration |
| | min_data_in_leaf | 5 | minimum number of records in a leaf |
| | min_split_gain | 0.0 | minimum gain required to make a split |
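Continuing the sketch, the tuning loop described above might look like this for LightGBM; the grids for the other models are analogous, and the candidate values shown are illustrative, not the study's actual search space.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Illustrative grid centred on Table 2's tuned values.
param_grid = {
    "max_depth": [4, 6, 8, 10],
    "num_leaves": [5, 10, 20],
    "learning_rate": [0.05, 0.1, 0.2],
}

# 10-fold cross-validation on the training set only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
search = GridSearchCV(models["LightGBM"], param_grid, cv=cv,
                      scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```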
Model assessment
We calculated the AUC from receiver operating characteristic (ROC) analysis to evaluate the predictive utility of the models, and compared the AUCs of the six machine learning models using the DeLong method (p-value < 0.05 was deemed to indicate statistical significance) [48]. Meanwhile, the logarithmic loss function (log-loss) was applied to quantify the accuracy of the classifiers by penalizing wrong classifications. Furthermore, evaluation indicators derived from the confusion matrix, including accuracy, sensitivity, specificity, precision and F1 score, were used to analyze the relationship between the actual and predicted values for the peak demand of CVDs admissions (see Equations 1-5 in the Supplementary Files).
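A minimal sketch of this evaluation step, continuing from the tuned model above; note that DeLong's test for comparing AUCs is not available in scikit-learn and would require a separate implementation.

```python
from sklearn.metrics import confusion_matrix, log_loss, roc_auc_score

y_prob = search.best_estimator_.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_test, y_prob)      # area under the ROC curve
loss = log_loss(y_test, y_prob)          # penalizes confident misclassifications

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)             # recall on the peak-demand class
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
```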