The ANN model learns patterns from the sequence data and attempts to make its predictions as accurate as possible. Assume a set of X data pairs containing the variables and the results, (m1, t1), (m2, t2), …, (mX, tX), in which mi is the input value and ti is the target value for i = 1, 2, 3, …, X. We would like to build a neural net F so that, ideally,
$$F\left({m}^{i}\right)= {t}^{i}$$
3.1
In practice, however, an error \({\epsilon }^{i}\) must be allowed for. Let \({n}^{i}\) denote the output of the ANN, as expressed in Eqs. 3.2 and 3.3.
$${n}^{i}=F\left({m}^{i}\right)$$
3.2
And
$${t}^{i}={n}^{i}+{\epsilon }^{i}$$
3.3
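As a concrete illustration of Eqs. 3.1–3.3, the sketch below (Python with NumPy) computes the output \({n}^{i}\) for each input and the corresponding residual error \({\epsilon }^{i}={t}^{i}-{n}^{i}\). The toy data and the placeholder function F are purely illustrative assumptions, not values or a model from this study.

```python
import numpy as np

# Toy data pairs (m^i, t^i); purely illustrative values.
m = np.array([0.5, 1.0, 1.5, 2.0])   # inputs m^1 .. m^X
t = np.array([1.1, 2.0, 2.9, 4.1])   # targets t^1 .. t^X

def F(m):
    # Placeholder "network": a simple linear map n = 2*m, standing in for
    # whatever function the trained ANN actually realizes (an assumption).
    return 2.0 * m

n = F(m)        # Eq. 3.2: n^i = F(m^i)
eps = t - n     # Eq. 3.3 rearranged: epsilon^i = t^i - n^i
print(n, eps)
```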
However, \({n}^{i}\) depends on the network parameters, namely the weights and biases, which turns training into an optimization problem. The ANN F is therefore set so as to minimize the error function given in Eq. 3.4.
$$E= \frac{1}{N}\sum _{i=1}^{X}{‖{t}^{i}-{n}^{i}‖}^{2}$$
3.4
Where,
N = Number of training patterns.
For a two-way classification network, N = 2. From this equation, E is a function of the parameters of F, and the weight values that minimize the error are determined by differentiating E.
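A minimal sketch of the error function in Eq. 3.4 follows. The targets and outputs shown are illustrative assumptions, arranged as arrays of shape (N, d) with N training patterns and d output components.

```python
import numpy as np

def error(t, n):
    # Eq. 3.4: E = (1/N) * sum_i ||t^i - n^i||^2,
    # with t and n of shape (N, d): N training patterns, d output components.
    diff = t - n
    return np.mean(np.sum(diff**2, axis=1))

# Illustrative two-component (two-way) targets and outputs.
t = np.array([[1.0, 0.0], [0.0, 1.0]])
n = np.array([[0.8, 0.1], [0.2, 0.7]])
print(error(t, n))
```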
Focusing on a single term of the sum, it is expressed in Eq. 3.5.
$${‖t-n‖}^{2}= {\left({t}_{1}-{n}_{1}\right)}^{2}+{\left({t}_{2}-{n}_{2}\right)}^{2}+\dots +{\left({t}_{x}-{n}_{x}\right)}^{2}$$
3.5
Here the input and target values are fixed, and only the weights remain to be determined; differentiating both sides with respect to the weights gives Eq. 3.6.
$$\frac{\partial }{\partial {W}_{ij}}\left({‖t-n‖}^{2}\right)= -2\left(t-n\right)\cdot \frac{\partial n}{\partial {W}_{ij}}$$
3.6
This result is general; it can now be specialised to the context of a neural net, whose output is defined as \({n}^{i}= {W}_{ij}{m}^{i}+b\).
Hence, the output depends on the weights, and differentiating both sides with respect to \({W}_{ij}\) by the chain rule gives Eq. 3.7.
$$\frac{\partial }{\partial {W}_{ij}}\left({‖t-n‖}^{2}\right)= -2\left({t}_{i}-{n}_{i}\right){m}_{j}$$
3.7
Where,
\({m}_{j}\) is the j-th coordinate of the input. The derivative points in the direction of maximum increase, so the minimum is reached by moving in the opposite direction of this gradient. The minimum of the error is obtained by driving this derivative as close to 0 as possible.
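The sketch below illustrates Eq. 3.7 and the gradient-descent step described above for a single linear layer n = W m + b. The input, target, weight shapes and learning rate are illustrative assumptions.

```python
import numpy as np

# Illustrative single pattern: input m (3 components), target t (2 components).
m = np.array([0.5, -1.0, 2.0])
t = np.array([1.0, 0.0])

# Illustrative weights W (2x3) and bias b (2,) for n = W m + b.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
b = np.zeros(2)

lr = 0.1                          # learning rate (assumed value)
for _ in range(100):
    n = W @ m + b                 # forward pass: n_i = sum_j W_ij m_j + b_i
    # Eq. 3.7: d||t - n||^2 / dW_ij = -2 (t_i - n_i) m_j
    grad_W = -2.0 * np.outer(t - n, m)
    grad_b = -2.0 * (t - n)
    # Move opposite to the gradient to reduce the error.
    W -= lr * grad_W
    b -= lr * grad_b

print(np.sum((t - (W @ m + b))**2))   # error is driven close to 0
```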
The ANN method is a layered network of artificial neurons that generally comprises an input layer, a hidden layer and an output layer. Figure 4 illustrates the ANN structure, which is capable of identification, robust mapping and parallel data processing.
a. Input Layer
This is the topmost layer of the ANN, which initially receives the collected input data in the form of text, image, audio and video files.
b. Hidden Layer
This is the subsequent layer of the ANN model and may consist of a single perceptron or several hidden layers. Based on the input data, these hidden layers execute various mathematical operations and recognize the patterns in the data.
c. Output Layer
Based on the feature weights computed in the middle (hidden) layers, the output layer performs accurate and exact computations and produces the final result.
Several positive or negative weights are associated with every connection to an artificial neuron, exciting or inhibiting the corresponding input. The activation function plays a major role in controlling the performance of the artificial neuron. Each artificial neuron collects its input signals and computes a net input signal as a function of the associated weights; the net input is then passed through the activation function to calculate the neuron's output signal. The units are arranged in layers: a single unit is called a neuron, the units of the input layer consume the various inputs from the acquired data, and the input data is then passed on to the hidden-layer units, which transform it into the output.
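A minimal sketch of the layered forward pass described above: each unit forms a net input as a weighted sum of its inputs and passes it through an activation function. The layer sizes, random weights and the particular ReLU/sigmoid pairing here are assumptions for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
# Assumed sizes: 4 input units, 8 hidden units, 2 output units.
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((2, 8)), np.zeros(2)

def forward(x):
    # Input layer simply passes the collected data onwards.
    h = relu(W1 @ x + b1)         # hidden layer: net input + ReLU activation
    return sigmoid(W2 @ h + b2)   # output layer: net input + sigmoid activation

print(forward(np.array([0.2, -0.5, 1.0, 0.3])))
```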
ReLU is an activation function whose major benefit is that it does not activate all neurons simultaneously: a neuron receiving a negative value is converted to zero, i.e. it is deactivated. As a result, the network becomes sparse and computationally efficient. The gradient of the function is zero on the negative side, which means that deactivated neurons are never updated during back propagation. The sigmoid function, which maps the output values into the range 0 to 1, is used to handle multi-class issues and works best in the classifier's output layer. There is no fixed guideline for selecting the activation function; the characteristics of the problem can help in choosing a network that converges faster. Based on those characteristics, ReLU and sigmoid are used as the activation functions in this research.
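A brief sketch of the two activation functions discussed above: it simply demonstrates that ReLU zeroes (deactivates) negative inputs and has zero gradient on the negative side, while the sigmoid maps any value into the range 0 to 1. The sample inputs are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Zero gradient on the negative side: such neurons receive no update
    # during back propagation.
    return (z > 0).astype(float)

def sigmoid(z):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # negative values become 0
print(relu_grad(z))  # gradient is 0 where the input was negative
print(sigmoid(z))    # all outputs lie between 0 and 1
```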
Optimization is a procedure that aims to decrease the network error, which is essential for increasing the model's accuracy. Three different optimizers are considered: Adam, SGD, and RMSProp. SGD is an iterative gradient-descent technique that searches for the minimum error iteration by iteration. The model produces predictions in every iteration, and the predicted results are compared with the actual results; the error is defined as the difference between the predicted value and the actual result. The internal parameters of the model, i.e. the weights of the network, are updated using this error, and the back propagation algorithm performs this updating process. SGD also finds it challenging to escape from saddle points; the most popular choices for handling saddle points are AdaGrad, RMSProp, ADAM, and AdaDelta. The Nesterov accelerated gradient is used to modify the gradient with the slope and thus speed up SGD. Because it performs larger updates for infrequent parameters and smaller updates for frequent parameters, AdaGrad outperforms the Nesterov accelerated gradient, making SGD faster, more scalable, and more resilient; it is therefore used to train the ANN. The fundamental disadvantage of the AdaGrad optimizer is that it reduces the model's ability to keep learning from the sequential pattern as training progresses. Two optimizers, RMSProp and AdaDelta, were created independently to address this problem with AdaGrad. AdaDelta and RMSProp are largely identical; the main difference is that AdaDelta does not fix an initial learning rate as a constant. Adam is an optimizer that incorporates the beneficial features of RMSProp and AdaDelta, and it is deemed a good choice since it improves on both. As a result, the optimizers used in this paper are Adam, RMSProp, and SGD.
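The update rules of the three optimizers used in this paper are sketched below. The hyper-parameter values (learning rates, decay factors, epsilon) are common defaults assumed for illustration, not values taken from this study.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain stochastic gradient descent: step opposite to the gradient.
    return w - lr * grad

def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    # RMSProp: scale the step by a running average of squared gradients.
    cache = decay * cache + (1 - decay) * grad**2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: combine momentum (first moment) with RMSProp-style scaling (second moment).
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)           # bias-corrected first moment
    v_hat = v / (1 - b2**t)           # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```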