Development of a reliable data-driven model may include four tasks:
- Mathematical Structure Definition
- Parameter Identification
- Overfitting Avoidance
- Cross Validation
Up to three separate data sets (modelling, validation and test data) were used to perform the listed tasks for each problem defined in section 2. Generally speaking, the purpose of these four tasks is to minimise the error, ‘the discrepancy between the real output, from an experiment, and the estimated output by the model’. Depending on whether the error is calculated with the modelling, validation or test data, it is called the modelling, validation or test error, respectively. Eq. (3) mathematically defines the error in this research [23]:
$$E=\frac{\sum\limits_{i=1}^{n_d}{\left(\hat{y}_i-y_i\right)^2}}{n_d}. \quad (3)$$
where y is an output, nd is the number of samples in the data set used to calculate the error, and ^ denotes estimated values. The four aforementioned tasks of data-driven modelling are briefly introduced in the following:
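As a concrete illustration, the error of Eq. (3) can be computed in a few lines of Python; this is a sketch, as the paper itself provides no code:

```python
import numpy as np

def mean_squared_error(y_hat, y):
    """Error of Eq. (3): mean of squared deviations between the
    estimated outputs (y_hat) and the measured outputs (y)."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    n_d = y.size  # number of samples in the data set used for the error
    return np.sum((y_hat - y) ** 2) / n_d

# Example: error over three samples
print(mean_squared_error([1.1, 1.9, 3.2], [1.0, 2.0, 3.0]))  # ≈ 0.02
```

Computed with the modelling, validation or test data, the same formula yields the modelling, validation or test error, respectively.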
Mathematical Structure Definition
In some models, the mathematical structure is not certain from the beginning. For instance, in a neuro-fuzzy network (or, in short, a fuzzy model), the number of rules can be defined from the modelling data through subtractive clustering; similarly, in exact RBFNs, the size of the model depends on the modelling data.
Parameter Identification
The parameters of a data-driven model with a known mathematical structure are identified using the modelling data. Methods of parameter identification generally minimise the modelling error and fall into two categories: single-step and iterative methods. Some models, e.g. linear and RBFN models, use single-step identification methods such as non-recursive least square of error (LSE) [24]. In iterative methods, e.g. those based on error propagation [25], the parameters are tuned step by step to minimise the modelling error (also known as the training error, as detailed in Appendix A of [2]).
Overfitting Avoidance
Overfitting refers to an excessive focus on decreasing the modelling error, which diminishes the generality of data-driven models [26, 27]. In iterative parameter identification, e.g. for MLPs, FCCs and neuro-fuzzy networks, at each iteration the error is calculated for both the modelling and the validation data sets, although the latter is not used for parameter identification. A discrepancy in the trends of these two errors (typically, an increase in the validation error alongside an ongoing decrease in the modelling error) is considered a sign of overfitting and triggers the stop of parameter identification [2]. In models with single-step parameter identification, e.g. RBFNs, some specific parameters are identified with the validation data rather than the modelling data to avoid overfitting [4].
Cross Validation
Any data-driven model should fulfil the requirements of cross validation. In this paper, one-round cross validation, also known as hold-out, was employed, which requires that the estimation error of the model calculated with the test data (used neither in parameter identification nor in overfitting avoidance) is acceptable [28]. In short, the test error should be reasonably low for a model to be cross validated. It should be noted that the validation data were not used to perform cross validation.
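The hold-out scheme described above can be sketched as follows; the data-set size and the 60/20/20 proportions are illustrative assumptions, not values from this research:

```python
import numpy as np

# A minimal sketch of the one-round (hold-out) scheme: three disjoint data
# sets, with the test set kept apart from both parameter identification and
# overfitting avoidance. Sizes and proportions are illustrative assumptions.
rng = np.random.default_rng(0)
idx = rng.permutation(100)          # indices of 100 samples, shuffled
modelling_idx = idx[:60]            # parameter identification (and structure definition)
validation_idx = idx[60:80]         # overfitting avoidance only
test_idx = idx[80:]                 # cross validation only, never seen in training
print(len(modelling_idx), len(validation_idx), len(test_idx))  # 60 20 20
```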
Six types of data-driven models were developed in this research to tackle the problems detailed in section 2. In the following subsections, a brief explanation of each model is presented, with a focus on the four aforementioned tasks of data-driven modelling and on the correct use of the modelling and the validation data. All the developed models have a single output, y, and n inputs, ui, i = 1,…,n.
3.1. Linear Models
In these models, the output is a linear combination of the inputs:
$$y=\sum\limits_{i=1}^{n}{{\mathbf{A}}_i u_i}+{\mathbf{A}}_{n+1}. \quad (4)$$
Nothing needs to be done to define the mathematical structure of this model (i.e. task 1 in the list at the beginning of section 3), as the structure is evident. The model parameters (elements of A) were identified with the single-step LSE method [24]. Overfitting was disregarded in the development of (4) (i.e. task 3 was not performed); thus, both the modelling and the validation data were used for parameter identification.
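The single-step identification of (4) can be sketched as follows, on synthetic data, with NumPy's least-squares solver standing in for the LSE method of [24]:

```python
import numpy as np

# Identify A for the linear model of Eq. (4), y = sum_i A_i u_i + A_{n+1},
# with a single-step least-squares solve. Data are synthetic (noiseless).
rng = np.random.default_rng(1)
U = rng.uniform(size=(50, 2))                    # 50 samples, n = 2 inputs
y = 3.0 * U[:, 0] - 1.5 * U[:, 1] + 0.5          # known ground-truth coefficients

U_aug = np.hstack([U, np.ones((50, 1))])         # column of 1s carries the bias A_{n+1}
A, *_ = np.linalg.lstsq(U_aug, y, rcond=None)    # single-step LSE
print(np.round(A, 3))                            # ≈ [ 3.  -1.5  0.5]
```

Because the synthetic data are noiseless, the solver recovers the ground-truth coefficients exactly (up to floating-point precision).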
3.2. Multi-layer Perceptrons (MLPs)
The employed MLPs have one hidden layer with m neurons and activation function φ:
$$y=\sum\limits_{j=1}^{m}{{\mathbf{B}}_j\,\phi\left(\sum\limits_{i=1}^{n}{{\mathbf{C}}_{ij}u_i}+{\mathbf{D}}_j\right)}+{\mathbf{D}}_{m+1}, \quad (5)$$
where
$$\phi(x)=\frac{2}{1+\exp(-2x)}-1. \quad (6)$$
MLPs, presented by (5) and (6), are universal approximators; that is, the model has a proven capability to model any system when sufficient data are available [29, 30].
In this research, m = 2n + 1 (7), based on the recommendation of [25]. Considering (7), the mathematical structure is known. The Nguyen-Widrow algorithm was used to suggest initial values for the parameters [31]. Then, error back propagation with the Levenberg-Marquardt algorithm [32] was utilised to minimise the modelling error iteratively and thereby identify the MLP parameters. At each iteration, the validation error was calculated; parameter identification stopped once the trends of the modelling and the validation errors became discrepant, i.e. once overfitting happened. Even with parameter initialisation algorithms, some initial values may cause the parameter identification method to become trapped in a local minimum of the modelling error, leading to a low-accuracy model [33]. Consequently, parameter identification was repeated with different initial parameters, and the model with the lowest validation error was chosen in the end.
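The discrepant-trend stopping rule can be illustrated with a small helper function. The `patience` threshold and the list-based interface are illustrative assumptions, since the text does not specify how many iterations of rising validation error trigger the stop:

```python
def should_stop(modelling_errors, validation_errors, patience=3):
    """Early-stopping sketch: stop when the validation error has risen for
    `patience` consecutive iterations while the modelling error kept
    falling -- the discrepant-trend sign of overfitting."""
    if len(validation_errors) <= patience:
        return False
    recent_val = validation_errors[-(patience + 1):]
    recent_mod = modelling_errors[-(patience + 1):]
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    mod_falling = all(b < a for a, b in zip(recent_mod, recent_mod[1:]))
    return val_rising and mod_falling

# Both errors still falling together: keep identifying parameters
print(should_stop([5, 4, 3, 2, 1], [5, 4, 3, 2.5, 2]))      # False
# Modelling error falls while validation error climbs: stop
print(should_stop([5, 4, 3, 2, 1], [5, 4, 4.2, 4.5, 5.0]))  # True
```

The restart strategy of the paragraph above then simply keeps, among the restarted runs, the model whose final validation error is lowest.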
3.3. Fully Connected Cascade (FCC) Networks
The employed FCC networks are very similar to the MLPs, with extra parameters (the elements of E) which connect the inputs directly to the output:
$$y=\sum\limits_{j=1}^{m}{\bar{\mathbf{B}}_j\,\phi\left(\sum\limits_{i=1}^{n}{\bar{\mathbf{C}}_{ij}u_i}+\bar{\mathbf{D}}_j\right)}+\sum\limits_{i=1}^{n}{{\mathbf{E}}_i u_i}+\bar{\mathbf{D}}_{m+1}. \quad (8)$$
FCC networks have shown their merit in solving some non-engineering benchmarks [34]. The number of hidden layer neurons, m, was considered the same as for the MLPs, as the recommendation of (7) is also valid for FCC networks [34]. Parameter identification, overfitting avoidance and evasion of local minima of the modelling error in FCC networks were handled in the same way as for the MLPs.
3.4. Neuro-fuzzy Networks
Linear Sugeno-type fuzzy models, which are convertible to neuro-fuzzy networks [35], were used in this research. Such fuzzy models have k rules, each with n membership functions (one per input). For the jth rule and the ith input, the Gaussian membership function of (9) was employed to produce a membership grade, μij, from the input ui [36]:
$$\mu_{ij}=\exp\left(-\frac{\left(u_i-\mathbf{F}_{ij}\right)^2}{2\mathbf{G}_{ij}^{2}}\right). \quad (9)$$
The product of the membership grades of a rule was considered as the weight of that rule, a number between zero and one. The output of the whole model is the weighted average of the rule outputs [36]:
$$y=\frac{\sum\limits_{j=1}^{k}{\left(\overbrace{\left(\sum\limits_{i=1}^{n}{\mathbf{H}_{ij}u_i}+\mathbf{I}_j\right)}^{j\text{th rule output}}\prod\limits_{i=1}^{n}{\mu_{ij}}\right)}}{\sum\limits_{j=1}^{k}{\underbrace{\prod\limits_{i=1}^{n}{\mu_{ij}}}_{j\text{th rule weight}}}}. \quad (10)$$
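As an illustration, the forward pass of (9) and (10) can be evaluated for a single sample as follows; this is a sketch of the model equations, not the authors' code, and the array shapes are assumptions:

```python
import numpy as np

def sugeno_output(u, F, G, H, I):
    """Evaluate Eqs. (9)-(10) for one sample. u holds n inputs; F and G
    (n x k) hold the Gaussian centres and widths; H (n x k) and I (k,)
    hold the linear rule-output parameters."""
    mu = np.exp(-(u[:, None] - F) ** 2 / (2.0 * G ** 2))   # Eq. (9), n x k grades
    w = np.prod(mu, axis=0)                                # rule weights in (0, 1]
    rule_out = u @ H + I                                   # linear Sugeno rule outputs
    return np.sum(w * rule_out) / np.sum(w)                # Eq. (10), weighted average

# A rule centred exactly on the input dominates the output
u = np.array([0.0, 1.0])
F = np.array([[0.0, 5.0], [1.0, 5.0]])   # rule 1 centred on u, rule 2 far away
G = np.ones((2, 2))
H = np.array([[1.0, 1.0], [1.0, 1.0]])
I = np.array([2.0, -2.0])
print(sugeno_output(u, F, G, H, I))      # ≈ 3.0, i.e. rule 1's output 0 + 1 + 2
```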
Neuro-fuzzy models, presented by (9) and (10), are universal approximators [37]. The mathematical structure of the fuzzy model, e.g. the number of rules (k), was defined through subtractive clustering with use of the modelling data; the utilised subtractive clustering algorithm is similar to the one detailed in subsection 2.3 of [38].
The parameters were identified using an iterative method. At each iteration, the gradient descent error back propagation algorithm was used to adjust the elements of F and G, and LSE was used to adjust the elements of H and I [39, 40]. The validation error, calculated at every iteration, was used to stop the parameter identification procedure and thus avoid overfitting, in the same way as for the MLPs, detailed in subsection 3.2.
3.5. Radial Basis Function Networks
RBFNs, which are universal approximators too [41], are presented as the combination of (11) and (12). They receive an array of inputs rather than the inputs of a single data sample; each data sample has n inputs. An RBFN can estimate the outputs of at most w data samples, where w is the number of data samples used to develop the model. If the inputs of a smaller number of data samples, z, are fed into the model, the first z columns of O and L are used.
$$\mathbf{O}_{ik}=\exp\left(-\left(S\underbrace{\sum\limits_{j=1}^{n}{\left(\mathbf{J}_{ij}-\mathbf{U}_{jk}\right)^2}}_{\text{distance between input and weight arrays}}\right)^{2}\right). \quad (11)$$
$$\hat{\mathbf{Y}}_{1\times w}=\mathbf{K}_{1\times w}\,\mathbf{O}_{w\times w}+\mathbf{L}_{1\times w}. \quad (12)$$
Eq. (12) indicates that greater elements of O are more influential on the network's output. In addition, (11) shows that (i) the range of the elements of O is [0, 1] and (ii) if the ith row of J is identical to the kth column of U, then Oik reaches its maximum, 1.
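A sketch of the forward pass of (11) and (12), following the paper's notation, is given below; the function name and the array-based interface are illustrative assumptions:

```python
import numpy as np

def rbfn_output(U_in, J, K, L, S):
    """Evaluate Eqs. (11)-(12). U_in is n x z (z input samples), J is w x n,
    K and L are 1 x w rows, S is the spread. Only the first z columns of L
    are used when z < w, as stated in the text."""
    z = U_in.shape[1]
    # Eq. (11): summed squared differences between each J row and each input
    # column, scaled by S and squared again inside the exponential
    d2 = ((J[:, :, None] - U_in[None, :, :]) ** 2).sum(axis=1)   # w x z
    O = np.exp(-(S * d2) ** 2)
    return K @ O + L[:, :z]                                      # Eq. (12), 1 x z

# If a row of J coincides with an input column, the matching O element is 1
J = np.array([[0.0, 0.0], [2.0, 2.0]])   # w = 2 weight rows, n = 2 inputs
K = np.array([[2.0, 1.0]])
L = np.array([[0.5, 0.5]])
U_in = np.array([[0.0], [0.0]])          # one sample identical to J's first row
print(rbfn_output(U_in, J, K, L, S=1.0)) # ≈ [[2.5]]: 2*1 + 1*~0 + 0.5
```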
In RBFN modelling, the arrays Jw×n, K and L and the scalar S, namely the ‘spread’, should be identified. At the model development stage, where the modelling data were used, (13) was used instead of (12); the ^ is unnecessary in (13), since no estimation happens during model development:
$$\mathbf{Y}_{1\times w}={\left[\mathbf{K}\;\;\mathbf{L}\right]_{1\times 2w}}{\left[\begin{array}{c}\mathbf{O}\\ \mathbf{I}\end{array}\right]_{2w\times w}}. \quad (13)$$
In exact RBFNs, J = UMT (14), where UMT is the transpose of the array of all inputs of the modelling data. Hence, w equals the number of modelling data samples, and the mathematical structure is known from the beginning. For instance, for the second problem of section 2, UMT has the size 2×30. In order to maximise the effect of the elements of O on the output, all of them, calculated with (11) and the inputs of the modelling data, were considered to be 1. Here is a pseudo-algorithm of exact RBFN modelling (to find J, K, L and S using the input and output arrays of the modelling data, UM and YM):
1. Set J = UMT.
2. Set Ow×w = ones(w×w).
3. Form and solve (13) with YM and O from step 2 to find K1×w and L1×w.
4. Find S, by trial and error, so as to minimise the validation error of the developed RBFN (anti-overfitting step).
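Steps 1 to 3 of this pseudo-algorithm can be sketched as follows on synthetic modelling data; step 4 (tuning S against the validation error) is problem-specific and omitted. The minimum-norm least-squares solve is one plausible way, assumed here, to handle the underdetermined system that (13) becomes once O is forced to ones:

```python
import numpy as np

# Steps 1-3 of exact RBFN modelling on synthetic data (UM: n x w, YM: 1 x w).
rng = np.random.default_rng(2)
n, w = 2, 5
UM = rng.uniform(size=(n, w))
YM = rng.uniform(size=(1, w))

J = UM.T                                  # step 1: J = UM^T (w x n)
O = np.ones((w, w))                       # step 2: O elements considered to be 1
M = np.vstack([O, np.eye(w)])             # [O; I] of Eq. (13), 2w x w
KL, *_ = np.linalg.lstsq(M.T, YM.T, rcond=None)   # step 3: solve Eq. (13)
K, L = KL[:w].T, KL[w:].T                 # split the 1 x 2w row into K and L

# Eq. (13) is reproduced exactly on the modelling data
print(np.allclose(np.hstack([K, L]) @ M, YM))     # True
```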
An alternative to exact RBFN modelling is efficient RBFN modelling, which may produce RBFNs with fewer parameters. In this research, unlike in exact RBFNs, which employ the transpose of the inputs array of the modelling data as J, in efficient RBFN modelling some columns of UM were selected and transposed to form J [42]. Hence, the number of rows of J, w, is smaller than or equal to the number of columns of UM, named wmax in this paper.
Prior to selecting the UM columns to be used as J rows, S and a target error, Et, should be defined. For each pair of S and Et, every single column of UM was transposed and tried as a single-row J. Then, the corresponding RBFN was created using K and L calculated with (13). The column of UM leading to the smallest modelling error was selected, transposed and used as the first row of J. Afterwards, the remaining columns of UM were tested to find the one whose transpose, added to J, led to the largest drop in the modelling error; the transpose of that column was then added to J. This continued until the modelling error reached Et. Thus, the mathematical structure of efficient RBFNs is defined with use of the modelling data. In this research, the entire process of finding J was repeated for different pairs of S and Et, and the validation error was calculated for each pair.
Here is a pseudo-algorithm of efficient RBFN modelling:
1. J = null, Urem = UM, Uopt = null, E = VEX = 1000 (a large number), TJ = null (temporary weight matrix).
2. Choose a large S and a target modelling error, Et.
3. Set w = 1.
4. Set k = 1.
5. Add the transpose of the kth column of Urem to J to form TJ.
6. Calculate O from (11) with Urem, TJw×n and S defined at steps 5 and 2.
7. Solve (13) to find K and L (YM and O are available from the modelling data and step 6).
8. Find the modelling error, ME. The model needs to be run more than once, as w < wmax.
9. If ME < E, then E = ME and Uopt = Uk.
10. k = k + 1.
11. If k ≤ (wmax − w + 1), then go to step 5.
12. Remove Uopt from Urem and add it to J.
13. w = w + 1.
14. If E > Et, then go to step 4.
15. Find the validation error, VE.
16. If VE < VEX, then VEX = VE, SX = S and EtX = Et.
17. If VEX is unacceptable, go to step 2.
The choice of S and Et was performed using a full-space search with zooming (use of a smaller step size) in low-error areas.
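The greedy column selection at the core of the pseudo-algorithm (growing J one transposed UM column at a time until the modelling error reaches Et) can be sketched as below. The helper names are assumptions, and a single scalar bias is used in place of the paper's full L row, as a simplification:

```python
import numpy as np

def modelling_error(J, UM, YM, S):
    """Fit the output weights by least squares for the current J, then
    return the mean squared modelling error of Eq. (3). Distances and the
    spread S follow Eq. (11); a scalar bias replaces the full L row."""
    d2 = ((J[:, :, None] - UM[None, :, :]) ** 2).sum(axis=1)  # w x wmax
    O = np.exp(-(S * d2) ** 2)
    M = np.vstack([O, np.ones((1, O.shape[1]))])              # bias row
    KL, *_ = np.linalg.lstsq(M.T, YM.T, rcond=None)
    resid = KL.T @ M - YM
    return float((resid ** 2).mean())

def efficient_rbfn_J(UM, YM, S, E_t, max_rows=None):
    """Greedy selection: repeatedly add the UM column whose transpose gives
    the largest drop in modelling error, until the error reaches E_t."""
    wmax = UM.shape[1]
    chosen, J = [], np.empty((0, UM.shape[0]))
    while len(chosen) < (max_rows or wmax):
        errs = {k: modelling_error(np.vstack([J, UM[:, k]]), UM, YM, S)
                for k in range(wmax) if k not in chosen}
        k_best = min(errs, key=errs.get)
        chosen.append(k_best)
        J = np.vstack([J, UM[:, k_best]])
        if errs[k_best] <= E_t:
            break
    return J

# Illustrative run on synthetic modelling data (n = 2 inputs, wmax = 8 samples)
rng = np.random.default_rng(3)
UM = rng.uniform(size=(2, 8))
YM = rng.uniform(size=(1, 8))
J = efficient_rbfn_J(UM, YM, S=1.0, E_t=1e-3, max_rows=4)
print(J.shape)   # at most (4, 2): up to four transposed UM columns
```

Repeating this construction over a grid of S and Et pairs, and scoring each resulting network by its validation error, reproduces the outer loop of the pseudo-algorithm.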
Step 4 of the exact RBFN modelling pseudo-algorithm and steps 14–17 of the efficient RBFN modelling pseudo-algorithm use the validation data to tackle overfitting. Use of the modelling data at these steps would leave the model with no generalisation, and use of the test data would violate the conditions of cross validation.
3.6. Section Summary
Table 1 summarises the tasks performed in the development of each model and the data used for each task. MD and VD refer to the modelling and the validation data, respectively. The last two columns refer to avoidance of overfitting through different strategies: (1) stopping parameter identification in the case of discrepant trends of the modelling and validation errors, used for MLP, FCC and neuro-fuzzy networks, and (2) identifying some parameters with the validation data to improve the generality of the models (dual identification), used for RBFNs.
Table 1. Development stages for different models and their associated data
| Model  | Structure Definition | Parameter Identification | Overfitting Avoidance: Stop Process | Overfitting Avoidance: Dual Identification |
|--------|----------------------|--------------------------|-------------------------------------|--------------------------------------------|
| Linear |                      | MD + VD                  |                                     |                                            |
| MLP    |                      | MD                       | VD                                  |                                            |
| FCC    |                      | MD                       | VD                                  |                                            |
| Fuzzy  | MD                   | MD                       | VD                                  |                                            |
| RBFN   | MD                   | MD                       |                                     | VD                                         |