4.1 Dataset
COVID-19 epidemiological data have been compiled by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) [22]. The data are provided as three separate datasets for confirmed, recovered, and death cases, starting on January 22, 2020, and are updated daily. Each dataset contains one record (row) per geographic region. The variables in each dataset are province/state, country/region, latitude, longitude, and one column for each date since January 22. For each region, the value for a given date is the cumulative number of confirmed, recovered, or death cases since January 22.
To meet the input requirements of the proposed model, we merged the three separate datasets for confirmed, recovered, and death cases into a single dataset. In this new dataset, each record (row) gives the number of confirmed, recovered, or death cases on one day for one geographic region. The variables in the new dataset are therefore Province/State, Country/Region, Latitude (Lat), Longitude (Long), Date (a specific date), Cases (the number of confirmed, recovered, or death cases on that date), and Type (the type of cases, i.e., confirmed, recovered, or death), as suggested by Rami Krispin [23].
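The reshaping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `wide_to_long`, the toy rows, and the date labels are hypothetical, and the sketch assumes that "cases per day" means the difference between consecutive cumulative totals.

```python
def wide_to_long(wide_rows, dates, case_type):
    """Reshape cumulative wide rows (one per region) into daily long rows.

    `dates` lists the date column names in chronological order.
    """
    id_columns = ("Province/State", "Country/Region", "Lat", "Long")
    long_rows = []
    for row in wide_rows:
        previous_total = 0
        for d in dates:
            record = {name: row[name] for name in id_columns}
            record["Date"] = d
            # Assumption: daily count = difference of consecutive cumulative totals.
            record["Cases"] = row[d] - previous_total
            record["Type"] = case_type
            long_rows.append(record)
            previous_total = row[d]
    return long_rows


# Toy input: one hypothetical region, three dates, cumulative counts.
dates = ["1/22/20", "1/23/20", "1/24/20"]
base = {"Province/State": "", "Country/Region": "Exampleland",
        "Lat": 10.0, "Long": 20.0}
confirmed_wide = [dict(base, **{"1/22/20": 1, "1/23/20": 4, "1/24/20": 4})]
deaths_wide = [dict(base, **{"1/22/20": 0, "1/23/20": 0, "1/24/20": 1})]

# One combined dataset with seven variables per record.
long_rows = (wide_to_long(confirmed_wide, dates, "confirmed")
             + wide_to_long(deaths_wide, dates, "death"))
```

Each output record carries the seven variables listed above, so the three source datasets can be concatenated into one.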
In this study, data up to March 29, 2020 were used, giving 50660 records and 7 variables. By March 29, the dataset covered 177 countries and 252 different regions around the world and contained 720139 confirmed, 33925 death, and 149082 recovered cases.
4.2 Preprocessing Step
The dataset was preprocessed before the proposed model was trained. Figure 1 shows the preprocessing steps. The dataset was first examined for noise, defined here as negative values in the Cases variable. The dataset contained 42 negative values in this variable. After the corresponding records were deleted, the number of records was reduced to 50618.
Subsequently, the Date variable was converted to a numerical format and renamed "Day". January 22 was taken as the beginning of the outbreak (the origin), and each later date was expressed as its distance in days from the origin. As a result, January 22 and March 29 correspond to Day 1 and Day 68, respectively.
Since each region is uniquely identified by its latitude and longitude, the Province/State and Country/Region variables were excluded from the dataset. Moreover, as the study aimed at predicting the incidence in any geographical region, we kept only the records on confirmed cases (17179 records) and discarded those on death and recovered cases. Once only the records with the value "Confirmed" in the Type variable remained, the Type variable itself was deleted from the dataset. In this study, Cases is the dependent variable.
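The preprocessing steps above can be illustrated with a short sketch. It is written in Python for illustration only (the study itself used MATLAB), and the record layout, the names `day_number` and `preprocess`, and the sample values are assumptions rather than the authors' code.

```python
from datetime import date

ORIGIN = date(2020, 1, 22)  # taken as Day 1 in the text


def day_number(d):
    """Convert a calendar date to its 1-based distance from the origin."""
    return (d - ORIGIN).days + 1


def preprocess(records):
    """Drop negative counts, keep confirmed cases, and reduce to 4 variables."""
    cleaned = []
    for r in records:
        if r["Cases"] < 0:  # noise: negative values in Cases
            continue
        if r["Type"].lower() != "confirmed":  # discard death and recovered records
            continue
        cleaned.append({
            "Lat": r["Lat"],
            "Long": r["Long"],
            "Day": day_number(r["Date"]),
            "Cases": r["Cases"],
        })
    return cleaned


# Hypothetical sample records in the seven-variable layout.
sample = [
    {"Province/State": "", "Country/Region": "Exampleland", "Lat": 10.0,
     "Long": 20.0, "Date": date(2020, 3, 29), "Cases": 12, "Type": "Confirmed"},
    {"Province/State": "", "Country/Region": "Exampleland", "Lat": 10.0,
     "Long": 20.0, "Date": date(2020, 3, 28), "Cases": -3, "Type": "Confirmed"},
    {"Province/State": "", "Country/Region": "Exampleland", "Lat": 10.0,
     "Long": 20.0, "Date": date(2020, 3, 29), "Cases": 1, "Type": "Death"},
]
cleaned = preprocess(sample)
```

Only the first sample record survives: the second is removed as noise and the third is not a confirmed-case record.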
4.3 Constructing the Prediction Model
An ensemble method of regression learners was utilized to predict the incidence of COVID-19 in different regions. The idea of ensemble learning is to build a prediction model by combining the strengths of a collection of simpler base models called weak learners [24]. At every step, the ensemble fits a new learner to the difference between the observed response and the aggregated prediction of all learners grown previously. One of the most commonly used loss functions is least-squares (LS) error [25].
In this study, the model employed a set of individual least-squares boosting (LSBoost) learners that try to minimize the mean squared error (MSE). The output of the model at step m, $F_m(x)$, was calculated using Equation 1:
$$F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m) \tag{1}$$
where x is the input variable and h(x; a) is a parameterized function of x, characterized by the parameters a [25]. The values of ρ and a were obtained from Equation 2:
$$(\rho_m, a_m) = \arg\min_{a,\rho} \sum_{i=1}^{N} \left[\tilde{y}_i - \rho\, h(x_i; a)\right]^2 \tag{2}$$
where N is the number of training data and $\tilde{y}_i = y_i - F_{m-1}(x_i)$ is the difference between the observed response and the aggregated prediction up to the previous step.
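To make the boosting update concrete, the following is a minimal sketch of least-squares boosting on a one-dimensional input. It is not the MATLAB implementation used in the study: the choice of regression stumps as the base learner h, the initialization with the mean response, and the names `fit_stump` and `ls_boost` are all assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a least-squares regression stump (one threshold, two leaf means)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:  # all inputs identical: fall back to a constant learner
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def ls_boost(xs, ys, n_steps=20):
    """Least-squares boosting: each step fits a learner to the current residuals."""
    f0 = sum(ys) / len(ys)  # assumed starting point: mean response
    predictions = [f0] * len(ys)
    learners = []
    for _ in range(n_steps):
        residuals = [y - p for y, p in zip(ys, predictions)]
        h = fit_stump(xs, residuals)
        hx = [h(x) for x in xs]
        denom = sum(v * v for v in hx)
        # Closed-form least-squares step size rho for the fitted learner.
        rho = sum(r * v for r, v in zip(residuals, hx)) / denom if denom else 0.0
        learners.append((rho, h))
        predictions = [p + rho * v for p, v in zip(predictions, hx)]

    def predict(x):
        return f0 + sum(rho * h(x) for rho, h in learners)

    return predict


# Toy data with a single step change.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]
model = ls_boost(xs, ys)
```

On the toy data, the first stump already captures the step, so later learners receive zero residuals and contribute nothing.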
Because the incidence of COVID-19 had changed markedly worldwide over the preceding 2 weeks, we aimed to predict the number of new cases, as an indicator of prevalence, over the next 2 weeks. The structure of the proposed method is shown in Figure 2.
Since the incubation period of COVID-19 can be 14 days [26], we assumed that at least 14 days of prior information were needed to predict the incidence of COVID-19 on a given day. Therefore, the proposed model examined all possible intervals between the first and the last 14 days to find the optimal time period whose information should be used to predict the number of cases in the coming days.
We hypothesized that the incidence in any region might follow the pattern of recent days in the same region and in nearby regions. Therefore, after determining the optimal time period, the model added the confirmed cases in each region and in its neighbors during that period to the region's record in the dataset.
After the time interval [A, B] and the number of neighbors were set, the dataset was rearranged. In this rearrangement, the number of records was reduced from N to M, where M is given by Equation 3 in terms of R, the number of different regions in the dataset, and B, the last day of the time period. Similarly, the number of variables stored for each record increased from the original 4 variables (Lat, Long, Day, and Cases) to F, which is calculated from Equation 4:
$$F = 4 + 2\,NN + NN\,|B-A+1| + 2\,|B-A+1| \tag{4}$$
where NN is the number of neighbors and 4 is the number of variables in the original dataset, since Lat, Long, Day, and Cases are stored for each geographical region. |B-A+1| is the number of days within the period that participate in the forecast of the next 14 days. NN is multiplied by 2 because the latitude and longitude of each neighbor are added to the record. Furthermore, the Cases of each neighbor on each day within the period are added to the record, so NN is multiplied by |B-A+1|. For the region itself, the Day and Cases data during the period are added as well, so |B-A+1| is multiplied by 2. It should be noted, however, that the dependent variable remained the Cases of the current day.
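The variable count described above can be checked by assembling one rearranged record explicitly. The sketch below rests on several assumptions that the text does not state: A and B are treated as lags (days before the current day), neighbors are chosen by great-circle distance, missing counts are treated as 0, and all function names and sample values are hypothetical.

```python
import math


def feature_count(nn, a, b):
    """Variables per record: 4 originals, neighbor coordinates, and histories."""
    days = abs(b - a + 1)
    return 4 + 2 * nn + nn * days + 2 * days


def distance_km(p, q):
    """Great-circle (haversine) distance between two (lat, long) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def build_record(region, day, cases, regions, nn, a, b):
    """Assemble one rearranged record for `region`, a (lat, long) pair, on `day`.

    `cases` maps (region, day) to confirmed cases; missing entries count as 0.
    """
    lags = range(a, b + 1)  # assumption: A and B are lags in days
    record = [region[0], region[1], day, cases.get((region, day), 0)]
    for lag in lags:  # the region's own Day and Cases over the period
        record += [day - lag, cases.get((region, day - lag), 0)]
    neighbors = sorted((r for r in regions if r != region),
                       key=lambda r: distance_km(region, r))[:nn]
    for nb in neighbors:  # each neighbor's coordinates and Cases over the period
        record += [nb[0], nb[1]]
        record += [cases.get((nb, day - lag), 0) for lag in lags]
    return record


# Hypothetical regions and counts.
regions = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
cases = {((0.0, 0.0), 5): 7, ((0.0, 0.0), 4): 3, ((0.0, 1.0), 4): 2}
record = build_record((0.0, 0.0), 5, cases, regions, nn=2, a=1, b=3)
```

With NN = 2 and a three-day period, the record holds 4 + 4 + 6 + 6 = 20 values, which matches `feature_count(2, 1, 3)`.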
Since neither the number of nearby regions nor the number of previous days that are effective in forecasting was known in advance, we treated both as unknown variables and obtained the most accurate model by examining all possible combinations of their values in an iterative process.
The accuracy of the model was evaluated in terms of mean squared error (MSE) and mean absolute error (MAE). To do so, the data of the last two weeks for all regions were used as a validation set, and the model was trained on the remaining data.
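The exhaustive search and the two error measures can be sketched as follows. The search bounds, the names `candidate_settings` and `best_setting`, and the toy error function are hypothetical. In practice the error function would train the ensemble for a given (A, B, NN) and return its validation MSE.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def candidate_settings(max_day, max_neighbors):
    """Yield every (A, B, NN) with 1 <= A <= B <= max_day and 0 <= NN <= max_neighbors."""
    for a in range(1, max_day + 1):
        for b in range(a, max_day + 1):
            for nn in range(max_neighbors + 1):
                yield a, b, nn


def best_setting(settings, validation_error):
    """Return the setting with the smallest validation error."""
    return min(settings, key=validation_error)


# Toy stand-in for "train on the earlier data, score on the last two weeks".
def toy_error(setting):
    a, b, nn = setting
    return abs(a - 2) + abs(b - 3) + abs(nn - 1)


settings = list(candidate_settings(max_day=3, max_neighbors=1))
chosen = best_setting(settings, toy_error)
```

The number of candidates grows quadratically with the number of days considered and linearly with the number of neighbors, so the search stays tractable for a 14-day window.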
4.4 Forecasting Incidence in the Next Two Weeks
A new test set was created to predict the incidence in the next two weeks (through April 12). The number of records in this test set was equal to the number of unique geographical regions in the COVID-19 dataset. Then, the necessary features were generated for each record according to the best neighborhood and the optimal time interval identified in the previous step. After that, the best model from the previous step was retrained on the entire dataset as a training set. Finally, this model was applied to the new test set to predict the incidence rate.
4.5 Evaluating the Actual Performance of the Proposed Model
Given that the actual number of confirmed cases for the period March 30 to April 12, 2020 was available at the time of review, the performance of the proposed model was measured as the percent error between the predicted and the actual values. The percent error, δ, was calculated from Equation 5 using the actual observed value, va, and the expected (predicted) value. Furthermore, based on the predicted and actual confirmed cases in the 252 geographical regions in the dataset, the continental incidence rate was calculated using Equation 6, where IC is the incidence in each continent and IW is the global incidence of COVID-19 from March 30 to April 12.
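As a rough illustration of these two measures, the sketch below assumes the common definition of percent error, normalized by the actual value, and treats the continental incidence rate as a continent's share of the global count in percent. Both definitions are assumptions and may differ from Equations 5 and 6. The function names and the numbers are hypothetical.

```python
def percent_error(actual, predicted):
    """Absolute difference relative to the actual value, in percent (assumed form)."""
    return abs(actual - predicted) / abs(actual) * 100.0


def continental_share(continent_cases, world_cases):
    """A continent's share of the global count, in percent (assumed form)."""
    return continent_cases / world_cases * 100.0


# Hypothetical numbers, not results from the study.
error = percent_error(actual=200, predicted=150)
share = continental_share(continent_cases=50, world_cases=200)
```

If the percent error is instead normalized by the predicted value, only the denominator in `percent_error` changes.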
The experiments were run on an Intel® Core™ i7-8550U CPU @ 1.80 GHz (1.99 GHz) with 12.0 GB of RAM under 64-bit MS Windows 10. Preprocessing and model construction were implemented in MATLAB.