The following subsections contain a brief description of the considered approaches.
3.1. ARIMA
ARIMA models are among the most widely used linear univariate time series modelling techniques. They are built from the autoregressive (AR) model, the moving average (MA) model, and the ARMA model as a combination of the two. The AR model expresses the variable as a linear combination of its past values. An AR model of order p, AR(p), can be written as:
$${Y_t}=c+\sum\limits_{{i=1}}^{p} {{\phi _i}{Y_{t - i}}} +{\varepsilon _t}$$
1
where c is a constant and \({\varepsilon _t}\) is a white noise sequence assumed to be a normal random variable with zero mean and variance \({\sigma ^2}\).
The MA model uses past forecast errors as regressors. An MA model of order q, MA(q), can be written as:
$${Y_t}=c+{\varepsilon _t}+\sum\limits_{{i=1}}^{q} {{\theta _i}{\varepsilon _{t - i}}}$$
2
To apply an ARIMA model, the time series needs to be stationary. The letter “I” (integrated) indicates that differencing (typically the first difference) is applied to transform the considered time series into a stationary one. The full ARIMA model can be written as follows:
$$Y_{t}^{\prime }=c+\sum\limits_{{i=1}}^{p} {{\phi _i}Y_{{t - i}}^{\prime }} +{\varepsilon _t}+\sum\limits_{{i=1}}^{q} {{\theta _i}{\varepsilon _{t - i}}}$$
3
\(Y_{t}^{\prime }\) is the differenced time series. The equivalent integrated form of any ARIMA model is:
$${\phi _p}(B){(1 - B)^d}{Y_t}={\theta _q}(B){\varepsilon _t}$$
4
B represents the backshift operator, whose effect on a time series \({Y_t}\) can be summarized as:
$${B^d}{Y_t}={Y_{t - d}}$$
5
Seasonal ARIMA (SARIMA) models extend ARIMA to cover seasonal variations in the time series. The seasonal part of the model consists of terms similar to the non-seasonal part, except that it involves backshifts of the seasonal period rather than of the first lag. In integrated form, the SARIMA model can be expressed as follows:
$${\phi _p}(B){\Phi _P}({B^S}){(1 - B)^d}{(1 - {B^S})^D}{Y_t}={\theta _q}(B){\Theta _Q}({B^S}){\varepsilon _t}$$
6
Equation (6) represents a \({\text{SARIMA(p,d,q) × (P,D,Q)}}\) model, where \(\Phi\) and \(\Theta\) are the seasonal ARMA coefficients and the seasonal differencing operator \({(1 - {B^S})^D}\) of order D is applied to eliminate seasonal patterns. The modelling procedure of an ARIMA (SARIMA) model is summarized in Fig. 2. ARIMA modelling was conducted in Python 3.10 using the pmdarima package (Smith et al., 2017).
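To make the procedure concrete, the snippet below is a minimal sketch of automatic SARIMA order selection and forecasting with pmdarima; the synthetic monthly series and the seasonal period of 12 are illustrative assumptions, not the data used in this study.

```python
# Minimal sketch of SARIMA order selection and forecasting with pmdarima.
# The series `y` and the 12-month seasonal period are illustrative assumptions.
import numpy as np
import pmdarima as pm

rng = np.random.default_rng(0)
y = 100 + 10 * np.sin(np.arange(120) * 2 * np.pi / 12) + rng.normal(0, 2, 120)

# auto_arima searches over (p, d, q)(P, D, Q)_S orders and keeps the model
# with the lowest information criterion (AIC by default).
model = pm.auto_arima(
    y,
    seasonal=True, m=12,     # seasonal period S = 12 (monthly data)
    d=None, D=None,          # let the differencing orders be chosen by unit-root tests
    stepwise=True,
    suppress_warnings=True,
)

forecast = model.predict(n_periods=12)   # 12 steps ahead
print(model.order, model.seasonal_order)
print(forecast)
```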
3.2. Long Short-Term Memory (LSTM) models
LSTM is a type of Recurrent Neural Network (RNN). Owing to their looped architecture, RNNs are capable of recognizing sequential characteristics in data and are therefore well suited to sequence prediction problems (Hewamalage 2021). However, RNNs suffer from a long-term dependency problem: due to vanishing and exploding gradients, they have difficulty learning long-term dependencies (Somu et al. 2020). Another issue with the standard RNN architecture is that training an RNN model requires a predetermined delay window length, and it is difficult to obtain the optimal value of this parameter automatically (Xu et al. 2022).
LSTM solves the long-term dependency problem thanks to an improved recurrent neural network architecture with feedback, so it can process not only individual data points but entire sequences as well (Hochreiter and Schmidhuber 1997; Manowska et al. 2021). An LSTM neural network is composed of one input layer, one recurrent hidden layer and one output layer. The architectural improvement consists of replacing the hidden layer of RNN cells with LSTM cells (hereafter memory blocks) to achieve long-term memory. Self-connected LSTM memory blocks enable the model to learn long-term dependencies while handling sequential data (Somu et al. 2020). In contrast to RNNs, which have only one hidden state, in an LSTM network two states are passed to each cell: the cell state and the hidden state. The cell state provides a long-term memory capability, whereas the hidden state provides a working memory capability that contains only near-past information and is overwritten at every step.
Memory blocks are responsible for memorizing, and manipulations between blocks are performed by special multiplicative units called gates, which control the flow of information (Ma et al. 2015; Hrnjica and Mehr 2020). The input gate controls the flow of input activations into the memory cell, and the output gate controls the output flow of the cell activation. Besides these two gates, there is also a forget gate, which filters the information from the input and the previous output and decides what should be remembered and what forgotten and dropped (Hrnjica and Mehr 2020). The adaptive gating mechanism ensures that whenever the content of the cell is outdated, the forget gate resets the cell state, while the input and output gates control the input and the output, respectively. In essence, the gates are small neural networks that decide which information is allowed into the cell state. Besides the gates, the core of the memory cell is a recurrently self-connected linear unit, the Constant Error Carousel (CEC), whose activation represents the cell state. The presence of the CEC solves the vanishing and exploding gradient problem, since the multiplicative gates can learn to open and close, allowing the LSTM cell state to enforce a constant error flow.
Figure 3 provides an insight into the internal architecture of the LSTM. Symbols \({i_t}\), \({f_t}\), \({C_t}\) and \({o_t}\) represent the input gate, the forget gate, the cell state vector and the output gate, respectively. \(\sigma\) and tanh are the sigmoid and hyperbolic tangent activation functions, respectively. The elementwise multiplication of two vectors is denoted by \(\otimes\).
For the t-th time step, an LSTM takes as input \({x_t}\), \({h_{t - 1}}\) and \({C_{t - 1}}\) and produces the hidden state \({h_t}\) as well as the cell state \({C_t}\) according to the following formulas:
$${i_t}=\sigma ({W_i}{x_t}+{U_i}{h_{t - 1}}+{b_i})$$
7
$${f_t}=\sigma ({W_f}{x_t}+{U_f}{h_{t - 1}}+{b_f})$$
8
$${C_t}={f_t} \otimes {C_{t - 1}}+{i_t} \otimes {\underline {C} _t}$$
9
$${\underline {C} _t}=\tanh ({W_c}{x_t}+{U_c}{h_{t - 1}}+{b_c})$$
10
$${o_t}=\sigma ({W_o}{x_t}+{U_o}{h_{t - 1}}+{b_o})$$
11
$${h_t}={o_t} \otimes \tanh ({C_t})$$
12
where \({W_k} \in {R^{n \times m}}\) and \({U_k} \in {R^{n \times n}}\) are weight matrices and \({b_k} \in {R^n}\) are bias vectors, for \(k=\{ i,f,c,o\}\). The symbols \(\sigma ( \cdot )\) and \(\tanh ( \cdot )\) refer to the element-wise sigmoid and hyperbolic tangent functions. Element-wise multiplication is represented by the symbol \(\otimes\).
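As an illustration of Eqs. (7)–(12), the following NumPy sketch performs a single LSTM cell update; the randomly initialized weights and the chosen dimensions are purely hypothetical.

```python
# Illustrative NumPy sketch of one LSTM step following Eqs. (7)-(12).
# Weight matrices are randomly initialized here purely for demonstration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One LSTM cell update; W, U, b are dicts keyed by gate name."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])    # input gate, Eq. (7)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])    # forget gate, Eq. (8)
    C_bar = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate state, Eq. (10)
    C_t = f_t * C_prev + i_t * C_bar                          # cell state, Eq. (9)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])    # output gate, Eq. (11)
    h_t = o_t * np.tanh(C_t)                                  # hidden state, Eq. (12)
    return h_t, C_t

n, m = 4, 3   # hidden size n, input size m (illustrative)
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n, m)) for k in "ifco"}
U = {k: rng.normal(size=(n, n)) for k in "ifco"}
b = {k: np.zeros(n) for k in "ifco"}
h, C = np.zeros(n), np.zeros(n)
h, C = lstm_step(rng.normal(size=m), h, C, W, U, b)
print(h, C)
```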
In this paper, the LSTM network was built with the Keras framework on the Python 3.10 platform. Before modelling, the training data of each border-crossing time series were normalized, i.e., rescaled from the original range so that all values lie between 0 and 1. The methods and parameters of the LSTM model then need to be configured. Depending on the time series, the hidden layer was built from 100 to 200 LSTM cells, the number of iterations varied from 100 to 300, and the batch size ranged from 2 to 12. The activation function was the rectified linear unit (ReLU), the loss function was the MSE, and the optimizer was stochastic gradient descent (SGD).
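A minimal Keras sketch of this setup is given below; the placeholder series, the window length, the scaling with scikit-learn's MinMaxScaler and the specific hyperparameter values are illustrative assumptions, since the actual values were tuned per border-crossing series within the ranges stated above.

```python
# Sketch of the LSTM setup described above (Keras / TensorFlow backend).
# The window length, unit count and other hyperparameters shown here are
# illustrative; in the study they were tuned per border-crossing series.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(series, window):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X)[..., np.newaxis], np.array(y)

series = np.sin(np.linspace(0, 20, 300))            # placeholder series
scaler = MinMaxScaler(feature_range=(0, 1))         # rescale to [0, 1]
scaled = scaler.fit_transform(series.reshape(-1, 1)).ravel()

window = 12
X, y = make_windows(scaled, window)

model = Sequential([
    LSTM(100, activation="relu", input_shape=(window, 1)),  # 100-200 cells were used
    Dense(1),
])
model.compile(loss="mse", optimizer="sgd")
model.fit(X, y, epochs=100, batch_size=4, verbose=0)

pred = scaler.inverse_transform(model.predict(X[-1:]))      # back to the original scale
```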
3.4. Singular Spectrum Analysis (SSA) algorithm
The Singular Spectrum Analysis algorithm is a very useful tool for forecasting different phenomena in the field of transportation. Many authors use the SSA algorithm to filter a monitored time series, i.e., to separate the signal component from the noise components. The signal component is then used as the principal component to forecast future values of the time series, and different methods can be used to forecast these values. Such approaches are known as hybrid forecasting algorithms.
Kolidakis et al. (2020) developed a hybrid model composed of Singular Spectrum Analysis and Artificial Neural Networks to forecast intraday traffic volume. Shang et al. (2016) used Singular Spectrum Analysis to filter traffic flow data and a Kernel Extreme Learning Machine to predict short-term traffic flow. Zhou et al. (2020) forecasted passenger flow in a metro transfer station with a novel model composed of Singular Spectrum Analysis and an AdaBoost-Weighted Extreme Learning Machine. Shuai et al. (2021) applied Singular Spectrum Analysis, Long Short-Term Memory, and Support Vector Regression to expressway traffic flow prediction.
This section explains the Singular Spectrum Analysis (SSA) algorithm for time series forecasting. For more information, readers are referred to the papers of Hassani and Zhigljavsky (2009), Hassani and Mahmoudvand (2013), Hassani (2007), and Harris and Yan (2010). Briefly, the algorithm is composed of two main stages: decomposition and reconstruction. In the first stage, the monitored time series is represented as a spectrum of independent components such as trend, periodic oscillatory and noise components. In the second stage, the monitored time series is reconstructed using the less noisy components, i.e., the principal components of the time series are created.
The decomposition stage consists of the two following steps: embedding and singular value decomposition. The embedding process transforms the one-dimensional time series into a matrix of dimension \(L \times K\), where L is the window length, \(K=N+1 - L\), and N is the number of data points. The window length should be an integer within the interval \(2 \leqslant L \leqslant \frac{1}{2}N\). The concept of the window length is similar to the concept of the \(k - th\) order autoregressive model of a time series, but taking into account original data from \(t=1\) to \(t=L\). Consider a stochastic process \(\{ x(t);t=1,2,...,N\}\) and suppose \(X(t)=\{ x(1),x(2),...,x(N)\}\) is a realization of this process. The set X(t) is a time-invariant series and, for simplicity, we rewrite it as \({X_N}=\{ {x_1},{x_2},...,{x_N}\}\). The output of the embedding process is the trajectory matrix of the following form:
$$H=\left| {{H_1},{H_2},...,{H_K}} \right|=\left| {{x_{ij}}} \right|_{{i,j=1}}^{{L,K}}=\left| {\begin{array}{*{20}{c}} {{x_1}}&{{x_2}}& \cdots &{{x_K}} \\ {{x_2}}&{{x_3}}& \cdots &{{x_{K+1}}} \\ \vdots & \vdots & \ddots & \vdots \\ {{x_L}}&{{x_{L+1}}}& \cdots &{{x_N}} \end{array}} \right|$$
14
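The embedding step of Eq. (14) can be sketched as follows; the toy series and window length are illustrative only.

```python
# Minimal sketch of the SSA embedding step (Eq. 14): a 1-D series of length N
# is mapped to an L x K trajectory (Hankel) matrix with K = N - L + 1.
import numpy as np

def trajectory_matrix(x, L):
    x = np.asarray(x, dtype=float)
    N = len(x)
    K = N - L + 1
    # Column j holds the lagged window x[j], x[j+1], ..., x[j+L-1].
    return np.column_stack([x[j:j + L] for j in range(K)])

x = np.arange(1, 11)           # toy series, N = 10
H = trajectory_matrix(x, L=4)  # 4 x 7 Hankel matrix
print(H)
```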
The trajectory matrix is known as a Hankel matrix: all elements along each anti-diagonal, \(i+j=const\), are equal. The main objective of the singular value decomposition is to express the Hankel matrix as a sum of weighted orthogonal matrices. The spectral decomposition is performed over the lag-covariance matrix \(H{H^T} \in {R^{L \times L}}\). Let \({\sigma _1},{\sigma _2},...,{\sigma _L}\) be the eigenvalues (singular values) of \(H{H^T}\) arranged in decreasing order, \({\sigma _1} \geqslant {\sigma _2} \geqslant ... \geqslant {\sigma _L} \geqslant 0\), and \({U_1},{U_2},...,{U_L}\) the corresponding eigenvectors. If r is the number of nonzero eigenvalues, i.e., the rank of the matrix H, then the trajectory matrix can be represented as:
$$\widehat {H}=\sum\limits_{{i=1}}^{r} {{U_i}U_{i}^{T}} H={\widehat {H}_1}+{\widehat {H}_2}+...+{\widehat {H}_r}$$
15
where r is the rank of \(H\).
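The decomposition of Eq. (15) can be sketched through the SVD of the trajectory matrix, since each projection term \({U_i}U_{i}^{T}H\) is exactly the rank-one elementary matrix obtained from the SVD of H; the helper below reuses the hypothetical trajectory matrix H from the previous sketch.

```python
# Sketch of the SSA decomposition step (Eq. 15): SVD of the trajectory matrix
# and its representation as a sum of rank-one elementary matrices.
import numpy as np

def ssa_decompose(H):
    U, s, Vt = np.linalg.svd(H, full_matrices=False)  # singular values in decreasing order
    r = int(np.sum(s > 1e-10))                        # numerical rank of H
    elementary = [s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r)]
    return U[:, :r], s[:r], elementary

# Reusing the trajectory matrix H from the embedding sketch above:
# U, s, elementary = ssa_decompose(H)
# np.allclose(sum(elementary), H)  -> True: the elementary matrices sum back to H
```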
The second stage, called reconstruction, is accomplished in two steps: grouping and diagonal averaging (Hankelization). In the grouping step, the set of matrices \(\{ {\widehat {H}_1},{\widehat {H}_2},...,{\widehat {H}_r}\}\) is divided into several disjoint subsets: \(\{ {\widehat {H}_1},{\widehat {H}_2},...,{\widehat {H}_r}\} \to \{ {E_1},{E_2},...,{E_m}\} ,{\text{ }}m<r\). After that, all matrices within each subset are summed. The simplest case distinguishes only the signal and the noise component: \(\{ {\widehat {H}_1},{\widehat {H}_2},...,{\widehat {H}_r}\} \to \{ {E_1},{E_2}\} ,m=2,{E_1} \cap {E_2}=\emptyset\). In that case there are only two subsets, \({E_1}=\sum\limits_{{i=1}}^{d} {{{\widehat {H}}_i}}\) and \({E_2}=\sum\limits_{{i=d+1}}^{r} {{{\widehat {H}}_i}}\). Subset \({E_1}\) is associated with the signal component, while \({E_2}\) is associated with the noise component. Selection of the appropriate value of d is based on the plot of the logarithms of the singular values, \(\log {\sigma _1},\log {\sigma _2},...,\log {\sigma _L}\), \(\forall {\sigma _i}>0\). The point in the plot where a significant drop in values occurs can be treated as the start of the noise components. Diagonal averaging transforms each reconstructed trajectory matrix (16) into a new time series of length N. Matrix elements over the anti-diagonals, \(i+j=k+1\), are averaged. The reconstructed trajectory matrix has the following form:
$$\widehat {H}=U{U^T}H=\left| {{h_{ij}}} \right|_{{i,j=1}}^{{L,K}}=\left| {\begin{array}{*{20}{c}} {{h_{11}}}&{{h_{12}}}& \cdots &{{h_{1K}}} \\ {{h_{21}}}&{{h_{22}}}& \cdots &{{h_{2K}}} \\ \vdots & \vdots & \ddots & \vdots \\ {{h_{L1}}}&{{h_{L2}}}& \cdots &{{h_{LK}}} \end{array}} \right|$$
16
Elements of the new time series are extracted from \(\widehat {H}\) by the following calculations:
$${\widehat {X}_N}=\{ {x_1},{x_2},...,{x_N}\} =\{ {h_{11}},\frac{{{h_{21}}+{h_{12}}}}{2},\frac{{{h_{31}}+{h_{22}}+{h_{13}}}}{3},...,{h_{LK}}\}$$
17
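A sketch of the grouping and diagonal-averaging step (Eqs. 16–17) follows; the names elementary and d refer to the hypothetical outputs of the previous sketches.

```python
# Sketch of grouping and diagonal averaging (Hankelization, Eqs. 16-17): a
# reconstructed trajectory matrix is turned back into a series of length N
# by averaging its anti-diagonal elements (i + j = const).
import numpy as np

def diagonal_averaging(H_hat):
    L, K = H_hat.shape
    N = L + K - 1
    series = np.zeros(N)
    counts = np.zeros(N)
    for i in range(L):
        for j in range(K):
            series[i + j] += H_hat[i, j]
            counts[i + j] += 1
    return series / counts

# Grouping example: the "signal" component is the sum of the first d elementary
# matrices, the noise component the remaining ones (variables from the sketch above).
# signal = diagonal_averaging(sum(elementary[:d]))
# noise  = diagonal_averaging(sum(elementary[d:]))
```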
Finally, the original time series \({X_N}\) is expressed as a sum of d principal vectors:
$${X_N}=\sum\limits_{{i=1}}^{d} {{{\overrightarrow {\widehat {X}} }_{iN}}} ={\overrightarrow {\widehat {X}} _{1N}}+{\overrightarrow {\widehat {X}} _{2N}}+...+{\overrightarrow {\widehat {X}} _{dN}}$$
18
The new time series is used to forecast the future values for \(N+1,N+2,...,N+p\). A linear recurrent formula is used to forecast future values of the monitored time series. An eigenvector obtained by the singular value decomposition has the following form:
$$U={[{u_1}{\text{ }}{{\text{u}}_{\text{2}}}{\text{ }}...{\text{ }}{{\text{u}}_{{\text{L-1}}}}{\text{ }}{{\text{u}}_{\text{L}}}]^T}$$
19
Let \({U^\nabla }={[{u_1}{\text{ }}{{\text{u}}_{\text{2}}}{\text{ }}...{\text{ }}{{\text{u}}_{{\text{L-1}}}}]^T}\) be the vector composed of the first \(L - 1\) coordinates of the eigenvector, and \(\pi ={u_L}\) its last coordinate. Accordingly, the verticality coefficient is defined as:
$${v^2}=\sum\limits_{{i=1}}^{d} {\pi _{i}^{2}} =\pi _{1}^{2}+\pi _{2}^{2}+...+\pi _{d}^{2}$$
20
The condition \({v^2}<1\) must be met if we want to use Singular Spectrum Analysis to forecast p values ahead. Obviously, the value of d, which separates the signal components from the noise components, must be carefully selected so that this inequality is satisfied as well.
The vector of linear coefficients \(R={[{\beta _{L - 1}},{\beta _{L - 2}},...,{\beta _1}]^T}\) is calculated by the following equation:
$$R=\frac{1}{{1 - {v^2}}}\sum\limits_{{i=1}}^{d} {{\pi _i}U_{i}^{\nabla }}$$
21
Forecasting of future values is achieved by:
$${\{ \widetilde {x}(t)\} ^T}=\left\{ {\begin{array}{*{20}{c}} {\{ \widetilde {x}(1),\widetilde {x}(2),...,\widetilde {x}(t)\} ,{\text{ }}t=1,2,...,N} \\ {{R^T}{X_p}(t),{\text{ }}t=N+1,N+2,...,N+p} \end{array}} \right.$$
22
where
$${X_p}(t)={\{ \widetilde {x}(N - L+p+1),...,\widetilde {x}(N+p - 1)\} ^T}$$
23
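The recurrent forecasting of Eqs. (19)–(23) can be sketched as follows; the arguments recon (the reconstructed series), U (a matrix whose columns are the leading eigenvectors) and d are assumed to come from the earlier sketches and are illustrative.

```python
# Sketch of SSA recurrent forecasting (Eqs. 19-23): the coefficient vector R is
# built from the leading d eigenvectors and applied recursively to the last
# L - 1 reconstructed values.
import numpy as np

def ssa_forecast(recon, U, d, p):
    L = U.shape[0]
    U_head = U[:-1, :d]                 # first L - 1 coordinates of each eigenvector
    pi = U[-1, :d]                      # last coordinates of the eigenvectors
    v2 = float(np.sum(pi ** 2))        # verticality coefficient, Eq. (20)
    assert v2 < 1, "recurrent forecasting requires v^2 < 1"
    R = (U_head @ pi) / (1.0 - v2)      # coefficient vector, Eq. (21)
    series = list(recon)
    for _ in range(p):                  # Eq. (22): recurrent one-step forecasts
        window = np.array(series[-(L - 1):])   # X_p(t), oldest value first
        series.append(float(R @ window))
    return np.array(series[len(recon):])
```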
Many time series encountered in the real world exhibit nonstationary behaviour, mainly due to the presence of a trend, seasonal variation, or a change in the local mean. To reduce a nonstationary series with a trend to a stationary series (without trend), we apply the first differences of the monitored time series:
$$w(t)=x(t) - x(t - 1),{\text{ }}t=2,3,...,N$$
24
The differenced data are often easier to model than the original data. Hence, Singular Spectrum Analysis is now performed over the differenced time series \({W_N}=\{ {w_2},{w_3},...,{w_N}\}\). In analogy with Eqs. (22) and (23), forecasting of the differenced data is as follows:
$${\{ \widetilde {w}(t)\} ^T}=\left\{ {\begin{array}{*{20}{c}} {\{ w(2),w(3),...,w(t)\} ,{\text{ }}t=2,3,...,N} \\ {{R^T}{W_p}(t),{\text{ }}t=N+1,N+2,...,N+p} \end{array}} \right.$$
25
where
$${W_p}(t)={\{ \widetilde {w}(N - L+p+1),...,\widetilde {w}(N+p - 1)\} ^T}$$
26
Since we want to forecast the original values \({X_N}\), not the differenced values \({W_N}\), we must transform the model from the \(\{ \widetilde {w}(t)\}\) form to the \(\{ \widetilde {x}(t)\}\) form. Recalling that \(w(t)=x(t) - x(t - 1)\), the forecasted values become:
$$\widetilde {x}(t)=\widetilde {x}(t - 1)+\widetilde {w}(t),{\text{ }}t=2,3,...,N$$
27
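The overall differencing workflow of Eqs. (24)–(27) can be sketched as follows; the forecast_fn argument stands for any forecaster of the differenced series (e.g. the hypothetical ssa_forecast helper above), and the persistence forecaster in the usage comment is only a placeholder.

```python
# Sketch of forecasting on first differences (Eqs. 24-27): the series is
# differenced, forecasts are produced on the differenced scale, and the
# original scale is recovered by cumulative summation from the last observation.
import numpy as np

def forecast_with_differencing(x, forecast_fn, p):
    w = np.diff(x)                    # Eq. (24): w(t) = x(t) - x(t-1)
    w_future = forecast_fn(w, p)      # p forecasts of the differenced series
    # Eq. (27): x(t) = x(t-1) + w(t), applied recursively from the last observation
    return x[-1] + np.cumsum(w_future)

# Example with a simple persistence forecaster as a stand-in:
# x_future = forecast_with_differencing(x, lambda w, p: np.repeat(w[-1], p), p=5)
```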