Hemorrhagic fever data
The daily national incidence data of hemorrhagic fever in China from January 2013 to December 2019 were collected from the official website of the Public Health Science Data Center. The dataset used in this study covered 84 months, 365 weeks and 2,557 days.
Algorithms
The ARIMA and LSTM models developed for forecasting the time series were combined with a “rolling forecasting origin”. In this scheme, each forecast targets only the next data point. It uses a sequence of training sets, each containing one more observation than the previous one, so that every forecast looks one day ahead. In other words, the next value is forecast after the latest observation has been added to the training data. Rolling forecasts come in several variants: one-step or multi-step forecasts, made with or without re-estimating the model at each origin. In this study, one-step forecasting with re-estimation was combined with both models. The hemorrhagic fever incidence data from January 2013 to December 2018 were used to build the ARIMA and LSTM models, and the data from January to December 2019 were used to evaluate their forecasting performance.
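The one-step rolling origin with re-estimation can be sketched as a simple loop. This is a minimal Python illustration under stated assumptions, not the study's R code: `fit_mean_model` is a hypothetical stand-in for fitting an ARIMA or LSTM model, and the series holds toy values rather than incidence data.

```python
from statistics import mean


def fit_mean_model(history):
    """Hypothetical stand-in for model fitting: returns a one-step forecaster.

    A real run would fit an ARIMA or LSTM model to `history` here.
    """
    level = mean(history)
    return lambda: level


def rolling_origin_forecast(series, n_test, fit=fit_mean_model):
    """One-step-ahead forecasts, re-estimating the model at every origin."""
    forecasts = []
    first_origin = len(series) - n_test
    for origin in range(first_origin, len(series)):
        history = series[:origin]      # training set grows by one observation per step
        predict_next = fit(history)    # re-estimate on all data up to the origin
        forecasts.append(predict_next())
    return forecasts


series = [2, 4, 6, 8, 10]              # toy values, not study data
forecasts = rolling_origin_forecast(series, n_test=2)
print(forecasts)
```

Each forecast is compared only with the observation that immediately follows its training set, which then joins the training data for the next step.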
ARIMA model
ARIMA is a class of linear models that captures temporal structure in time series data and forecasts from it. It is well suited to one-step out-of-sample forecasting, also known as rolling forecasting, in which the model is re-fitted each time a new observation is added so that the estimates remain up to date. The ARIMA family includes the autoregressive (AR) model, the moving average (MA) model, the seasonal autoregressive integrated moving average (SARIMA) model, and others. The model is generally written as ARIMA(p, d, q), where p is the order of autoregression, d is the degree of differencing and q is the order of the moving average.[7, 16, 22] Establishing the ARIMA model involved four steps: stationarity testing, parameter estimation, model checking and prediction.
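For reference, the ARIMA(p, d, q) structure is commonly written with the backshift operator B, defined by B y_t = y_{t-1}. The symbols below follow the usual textbook notation and are an assumption, not notation taken from this study:

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right) (1 - B)^d \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) \varepsilon_t
```

Here y_t is the observed series, \phi_i and \theta_j are the autoregressive and moving average coefficients, c is a constant and \varepsilon_t is white noise. A seasonal model with period s adds analogous factors in B^s, giving the ARIMA(p, d, q)(P, D, Q)_s form.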
Time series stability.
An ARIMA model requires a stationary time series, that is, a series whose statistical properties show no trend or periodicity over time. The Augmented Dickey-Fuller (ADF) unit-root test was used to assess whether the time series was stationary. Log transformation and differencing are the preferred ways to make a series stationary.[16] In this study, seasonal differencing was adopted to remove the trend and periodicity.
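The effect of differencing can be illustrated on a toy series. The sketch below is in Python for illustration only; the period of 4, the series values and the use of `log1p` (chosen here so that zero counts would not break the transform) are assumptions, not details from the study.

```python
import math


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag.

    lag=1 removes a linear trend; lag=s removes a repeating pattern of period s.
    """
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Toy series: linear trend plus a repeating period-4 pattern (not study data).
pattern = [0, 3, 1, 2]
series = [t + pattern[t % 4] for t in range(12)]

seasonal_diff = difference(series, lag=4)       # periodic pattern cancels out
stationary = difference(seasonal_diff, lag=1)   # remaining level shift removed

# Log transformation, applied before differencing when the variance grows with the level.
log_series = [math.log1p(x) for x in series]

print(seasonal_diff)
print(stationary)
```

In practice the differenced series would then be passed to the ADF test to confirm stationarity.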
Parameter estimation.
Autocorrelation function (ACF) and partial autocorrelation function (PACF) plots were used to estimate the parameters.[13] Both automatic identification and manual estimation were adopted in this study. The “auto.arima()” command in R was first used to identify the model parameters automatically. The ACF, the PACF and differencing were then used to identify the non-seasonal orders p, d, q and the seasonal orders P, D, Q.
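To show what an ACF plot displays, the sample autocorrelation can be computed directly. This is a minimal Python sketch with a hypothetical function name and a toy series of period 8. It covers the ACF only, since the PACF needs an extra recursion that is omitted here.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    n = len(series)
    mean_value = sum(series) / n
    deviations = [x - mean_value for x in series]
    total_sum_squares = sum(d * d for d in deviations)
    acf = []
    for lag in range(max_lag + 1):
        # Cross-product of the series with itself shifted by `lag`.
        cross_sum = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
        acf.append(cross_sum / total_sum_squares)
    return acf


# Toy series that repeats every 8 points (not study data).
series = [1, 2, 3, 4, 5, 4, 3, 2] * 2
acf = sample_acf(series, max_lag=8)

# Approximate 95% bound used on ACF plots to judge whether a spike is significant.
significance_bound = 1.96 / len(series) ** 0.5

print([round(r, 3) for r in acf])
```

A positive spike at the seasonal lag and a negative one at half the period, as in this toy series, is the kind of pattern that points to a seasonal term.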
Model evaluation.
Q-Q plots were used to check whether the model residuals followed a normal distribution. All models whose residuals passed the Box-Ljung test (that is, behaved as a white noise sequence) were compared using the Akaike information criterion (AIC),[15] and the model with the lowest AIC value was generally taken as the best. In this study, the incidence of hemorrhagic fever from January 2013 to December 2018 was used to build and test the ARIMA model.
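The two checks can be sketched as follows: the Box-Ljung Q statistic on the residuals, then selection by lowest AIC. This is a Python illustration with hypothetical names. The log-likelihoods and parameter counts are invented for the example and are not results from this study.

```python
def ljung_box_q(residuals, max_lag):
    """Box-Ljung statistic Q = n (n + 2) * sum_k r_k^2 / (n - k), for k = 1..max_lag.

    Under white noise, Q is approximately chi-square distributed;
    the p-value lookup is omitted from this sketch.
    """
    n = len(residuals)
    mean_value = sum(residuals) / n
    deviations = [r - mean_value for r in residuals]
    total_sum_squares = sum(d * d for d in deviations)
    weighted_sum = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(deviations[t] * deviations[t - k] for t in range(k, n)) / total_sum_squares
        weighted_sum += r_k ** 2 / (n - k)
    return n * (n + 2) * weighted_sum


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: name -> (log-likelihood, number of estimated parameters).
candidates = {
    "model_a": (-120.0, 3),
    "model_b": (-118.5, 5),
    "model_c": (-119.0, 3),
}
aic_values = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best_model = min(aic_values, key=aic_values.get)

print(aic_values, best_model)
```

Only candidates whose residuals pass the white noise test would enter the AIC comparison.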
LSTM model
LSTM is a special type of recurrent neural network (RNN) that can retain values from earlier time steps for later use, which makes it well suited to time series forecasting.[14, 20]
The LSTM deep learning algorithm, developed by Hochreiter and Schmidhuber (1997), preserves the error signal that is backpropagated through time and through layers. By maintaining a more constant error, the network can continue to learn over many time steps and can therefore learn long-term dependencies. Adam, SGDM and RMSProp are general-purpose optimizers that carry out gradient descent using gradients obtained by backpropagation through time.[23]
LSTM networks counter the vanishing and exploding gradient problems by introducing gates and an explicitly defined memory cell. These are inspired more by circuitry than by biology. Each neuron contains one memory cell and three gates: input, output and forget.[24] The function of these gates is to safeguard information by stopping or allowing its flow. The input gate determines how much of the information from the previous layer is stored in the cell. The output gate does the job at the other end and determines how much the next layer gets to know about the state of the cell. The forget gate allows some prior values to be discarded, i.e., it controls the extent to which a value remains in the cell for future use.
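The interaction of the three gates with the memory cell can be sketched as a single time step of a one-unit LSTM with scalar input and state. This Python illustration uses the common textbook formulation. The function names and the all-zero weights are hypothetical and unrelated to the network fitted in this study.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each of "input", "forget", "output", "candidate" to a
    (weight_on_x, weight_on_h_prev, bias) triple.
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    input_gate = sigmoid(pre_activation("input"))     # how much new information to store
    forget_gate = sigmoid(pre_activation("forget"))   # how much of the old cell state to keep
    output_gate = sigmoid(pre_activation("output"))   # how much of the cell state to expose
    candidate = math.tanh(pre_activation("candidate"))

    c_new = forget_gate * c_prev + input_gate * candidate
    h_new = output_gate * math.tanh(c_new)
    return h_new, c_new


# Hypothetical untrained parameters: all weights and biases zero,
# so every gate is half open and the candidate value is zero.
zero_params = {name: (0.0, 0.0, 0.0) for name in ("input", "forget", "output", "candidate")}
h, c = lstm_cell_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
print(h, c)
```

Because the cell state is updated additively, with the forget gate scaling the previous state, the error signal can pass across many time steps without vanishing.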
Forecast accuracy assessment
Three indexes were employed to assess model fitting and forecasting performance: root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE).[2, 7] These three indexes are defined as:
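In their standard form, with y_t the observed value, ŷ_t the forecast and n the number of forecast points, the indexes are RMSE = sqrt((1/n) Σ (y_t − ŷ_t)²), MAE = (1/n) Σ |y_t − ŷ_t| and MAPE = (100%/n) Σ |(y_t − ŷ_t) / y_t|. This notation is an assumption based on the usual definitions. A minimal Python sketch of the same quantities, with hypothetical function names and toy values, is:

```python
import math


def rmse(actual, forecast):
    """Root mean square error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        # MAPE divides by the observed value, so zero observations are not allowed.
        raise ValueError("MAPE is undefined when an observed value is zero")
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


actual = [10, 20, 40]      # toy observed values, not study data
forecast = [12, 18, 40]    # toy forecasts

print(rmse(actual, forecast), mae(actual, forecast), mape(actual, forecast))
```

RMSE and MAE are expressed in the units of the incidence series, whereas MAPE is scale-free but cannot be computed for periods with zero reported cases.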
Data and analysis
Excel 2016 was used to build the database of monthly, weekly and daily incidence of hemorrhagic fever in China, and R 3.6.2 was used to develop the ARIMA and LSTM models. The significance level was set at 0.05.