2.2. Model description
In this study, a Conv1D model, i.e., a one-dimensional convolutional neural network, was employed. Conv1D creates a convolution kernel that is convolved with the input data along one spatial dimension to generate the output (Dewantara et al., 2020). Batch Normalization was used in the model to support classification: this layer introduces random perturbations into the decision boundary of deep networks, forcing the model to learn boundaries with increased margins to the nearest training samples (Balestriero and Baraniuk, 2022). MaxPooling1D was used to perform pooling with the maximum values along a single spatial dimension when creating the output data (Dewantara et al., 2020). The flattening layer was utilized because it converts an array into a vector. The Dropout layer was applied to counteract overfitting, which often occurs when using a deep learning approach; it randomly deactivates neuron units in the network, decreasing the number of active connections in each iteration of the learning process. The dense layer used is a simple layer of neurons in which each neuron receives input from all neurons in the previous layer (Dumane, 2020). The model utilized in this study is a sequential model.
The sequential model applied to the analyzed data for mustard varieties is presented in Table 2. It was initialized using the Keras library (an open-source Python library for machine learning). The layers of the model include two convolutional layers (Conv1D), two Batch Normalization layers, a pooling layer (MaxPooling1D), a flattening layer (Flatten), a Dropout layer, and two Dense layers. Each layer has its own parameters, such as size, activation function, and input shape. In total, the model has 2,408 parameters, of which 2,344 are trainable and 64 are not, the latter being parameters of the Batch Normalization layers. The activation function in layers 1, 3, and 7 is the rectified linear unit (ReLU), shown in Fig. 1. If the input value is below zero, the output value is zero; when the input value rises above this threshold, the output has a linear relationship with the input (Bisen, 2021). The formula for the ReLU function is provided below (Formula 1). The activation function in layer 9 is Softmax. This function is a combination of multiple sigmoid functions (Sharma et al., 2020). Since the sigmoid function returns values in the range from 0 to 1, these values can be treated as the probabilities of a data point belonging to a specific class. Unlike the sigmoid function used for binary classification, the Softmax function can be applied to multi-class classification problems: it returns, for each data point, a probability for every individual class. When building a network or model for multi-class classification, the output layer of the network has the same number of neurons as the number of classes in the target object. The formula for the Softmax function is provided below (Formula 2).
Table 2
Model summary (Model: "sequential").
Layer Number | Layer | Output Shape | Parameter |
1 | conv1d | (None, 13, 16) | 64 |
2 | batch_normalization | (None, 13, 16) | 64 |
3 | conv1d | (None, 11, 16) | 784 |
4 | batch_normalization | (None, 11, 16) | 64 |
5 | max_pooling1d | (None, 5, 16) | 0 |
6 | flatten | (None, 80) | 0 |
7 | dense | (None, 16) | 1296 |
8 | dropout | (None, 16) | 0 |
9 | dense_1 | (None, 8) | 136 |
| Total parameters: | 2,408 | |
| Trainable parameters: | 2,344 | |
| Non-trainable parameters: | 64 | |
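Based on the layer shapes and parameter counts in Table 2, the architecture can be reconstructed in Keras roughly as follows. This is a sketch, not the authors' original code: the kernel size (3) and pooling size (2) are inferred from the output shapes, and the Dropout rate is not stated in the text, so 0.5 is an assumption.

```python
from tensorflow.keras import layers, models

# Sketch of the Table 2 architecture. Kernel size 3 and pool size 2 are
# inferred from the layer output shapes; the Dropout rate (0.5) is assumed.
model = models.Sequential([
    layers.Input(shape=(15, 1)),                          # 15 input features
    layers.Conv1D(16, kernel_size=3, activation='relu'),  # (None, 13, 16), 64 params
    layers.BatchNormalization(),                          # (None, 13, 16), 64 params
    layers.Conv1D(16, kernel_size=3, activation='relu'),  # (None, 11, 16), 784 params
    layers.BatchNormalization(),                          # (None, 11, 16), 64 params
    layers.MaxPooling1D(pool_size=2),                     # (None, 5, 16)
    layers.Flatten(),                                     # (None, 80)
    layers.Dense(16, activation='relu'),                  # (None, 16), 1,296 params
    layers.Dropout(0.5),                                  # rate not given in the text
    layers.Dense(8, activation='softmax'),                # (None, 8), 136 params
])
```

With these settings the model has 2,408 parameters in total, matching the summary in Table 2.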
Formula 1. ReLU activation function [Bisen, 2021].
$$f(x)=\begin{cases}0, & x<0\\ x, & x\ge 0\end{cases}$$
Where x is the input to the ReLU activation function.
Formula 2. Softmax activation function [Sharma et al., 2020].
$$\sigma(z)_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{K}e^{z_{k}}}\quad\text{for }j=1,\dots,K.$$
Where:
\(\sigma(z)_{j}\) is the probability assigned by the Softmax function to class \(j\) for a given data point,
\(z_{j}\) is the \(j\)-th element of the input vector \(z\) to the Softmax activation function.
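As an illustration (not part of the original model code), Formula 2 can be computed directly in NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(z):
    # subtract the maximum for numerical stability (the result is unchanged)
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs sums to 1, and the largest input receives the highest probability
```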
In the model, the applied loss function was 'categorical_crossentropy', the optimizer was 'adam', and the metric was 'accuracy'. The training parameters of the model were configured as follows: the 'epochs' parameter, determining the number of training epochs, was set to 80; the 'batch_size' parameter, defining the size of the training data batches, was set to 50; and the 'validation_data' parameter was set to the test data, which therefore served as the validation set during training.
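In Keras, these settings correspond to compile() and fit() calls of the following form. This is a minimal sketch: only the loss, optimizer, metric, epochs, batch_size, and validation_data settings come from the text, while the tiny stand-in network and the randomly generated data are placeholders for the real model and measurements.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.utils import to_categorical

# Random stand-in data: 15 features per sample, 8 classes (placeholder only)
rng = np.random.default_rng(0)
X_train = rng.random((100, 15, 1)).astype('float32')
y_train = to_categorical(rng.integers(0, 8, 100), num_classes=8)
X_test = rng.random((20, 15, 1)).astype('float32')
y_test = to_categorical(rng.integers(0, 8, 20), num_classes=8)

# Tiny placeholder network standing in for the Table 2 model
model = models.Sequential([
    layers.Input(shape=(15, 1)),
    layers.Flatten(),
    layers.Dense(8, activation='softmax'),
])

# Training settings as stated in the text
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=80, batch_size=50,
                    validation_data=(X_test, y_test), verbose=0)
```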
The results for the applied algorithm were not satisfactory, so cross-validation was applied, but the results did not improve. It was then decided to use a new approach: dividing the data into subsets. The best solution turned out to be a division into 8 subsets, as shown in Fig. 2 and presented in Table 3. Subset number 8 contains eight varieties, and each of the remaining subsets contains seven. The aim was to improve the classification accuracy on a complex dataset by creating smaller, more manageable subsets.
Table 3
Division of the variety dataset into subsets.
Subset Number | Varieties |
1 | 1, 2, 3, 4, 5, 6, 7 |
2 | 8, 9, 11, 13, 14, 15, 16 |
3 | 17, 18, 20, 21, 8000, 8001, 8023 |
4 | 8002, 8003, 8006, 8008, 8009, 8010, 8012 |
5 | 8007, 8011, 8013, 8014, 8016, 8017, 8018 |
6 | 19, 8019, 8020, 8021, 8022, 8024, 8025 |
7 | 8026, 8027, 8028, 8029, 8030, 8031, 8032 |
8 | 8033, 8034, 8035, 8036, 8037, 8038, 8039, 8040 |
The detailed operation of the data subdivision used in this study is presented in Fig. 3. First, the user inputs the total number of subsets to be processed (variable n), which in our case is eight. The loop counter i is then initialized to 1, marking the start of the loop over the subsets. For the current subset, a CNN model is applied; this step involves training the model on the subset data. Next, two plots are generated and shown: a Train and Validation Accuracy plot, showing the accuracy of the model on both the training and validation data over the epochs, and a Confusion Matrix for the predictions, showing how well the model classified the data. The evaluation metrics (e.g., accuracy, F1 score, precision, recall) for the current subset are then printed, followed by the classification decision for new data, indicating whether the new data is classified as an existing variety or as a new variety. A decision point checks whether the loop counter i equals n, i.e., whether all subsets have been processed. If not, i is incremented by 1 and the process loops back to apply the CNN model to the next subset. If i equals n, the process proceeds to the final step: the mean evaluation metrics over all subsets are printed, together with the final classification decision based on the collective results of all subsets.
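The loop described above can be sketched as a small driver function; `train_and_evaluate` is a hypothetical callback standing in for the actual CNN training, plotting, and metric computation for one subset (the names and dictionary structure are illustrative, not taken from the original code):

```python
def process_subsets(subsets, train_and_evaluate):
    """Apply the CNN to each subset in turn (the loop of Fig. 3) and
    return the mean of each evaluation metric across all subsets."""
    all_metrics = []
    for i, subset in enumerate(subsets, start=1):    # loop counter i = 1..n
        metrics = train_and_evaluate(subset)         # train CNN, plot, evaluate
        print(f"Subset {i}: {metrics}")
        all_metrics.append(metrics)
    # mean evaluation metrics over all n subsets
    return {key: sum(m[key] for m in all_metrics) / len(all_metrics)
            for key in all_metrics[0]}

# Toy usage with a dummy evaluator (real code would train the Keras model)
means = process_subsets([[1, 2], [3, 4]],
                        lambda s: {'accuracy': 0.9, 'f1': 0.8})
```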
The computations were conducted in Python 3.7.7, utilizing libraries such as tensorflow, scikit-learn, pandas, numpy, scipy, matplotlib, and seaborn.
2.3. Application of the model to the data
Subsequently, the created Convolutional Neural Network (CNN) model, built with the Keras library, was applied to each subset. The input data consist of 15 features describing a mustard variety, and the output label is the assigned variety number. Using the StandardScaler method from the sklearn.preprocessing library, the input data were standardized so that each feature has a mean of 0 and a variance of 1. This standardization is achieved with the fit_transform() method, which calculates the mean and standard deviation of each feature based on the available training data and then transforms the input data by subtracting the mean and dividing by the standard deviation. This process was applied to enhance the stability and effectiveness of the machine learning.
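The standardization step corresponds to the following sklearn usage; the toy matrix here merely stands in for the 15-feature input data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the 15-feature input matrix
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # per feature: subtract mean, divide by std
# each column of X_std now has mean 0 and variance 1
```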
Next, using the 'to_categorical' function from the tensorflow.keras.utils library, the output labels are transformed into binary form through a process known as one-hot encoding. This transformation ensures that each label is represented as a binary vector with a length equal to the number of classes, which is required for training the neural network.
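One-hot encoding with 'to_categorical' works as follows (the toy labels are for illustration only):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 2, 1])                    # toy variety labels
onehot = to_categorical(labels, num_classes=3)
# each row is a binary vector whose length equals the number of classes:
# rows (as floats): [1, 0, 0], [0, 0, 1], [0, 1, 0]
```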
Thereafter, the data is divided into a training set and a test set using the ‘train_test_split’ function from the sklearn.model_selection library. The parameter ‘test_size = 0.2’ indicates that 20% of the data will constitute the test set. The parameter ‘stratify = y’ ensures that the split maintains the class proportions in the output labels. The parameter ‘random_state = 42’ sets the seed for randomness to ensure reproducibility of the split.
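The split corresponds to the following call; the small balanced dataset here is a stand-in for the real feature matrix and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # 20 toy samples, 2 features
y = np.array([0, 1] * 10)          # two balanced toy classes

# 20% test set, stratified on y, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# 4 of the 20 samples land in the test set, 2 from each class
```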
The model is evaluated separately for each subset, and at the end the mean of each evaluation metric is calculated over all subsets. Additionally, accuracy is calculated on the test set. The evaluation metrics include train accuracy, test accuracy, F1 score, precision, and recall.
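The listed metrics can be computed with sklearn as shown below; the macro averaging used here is an assumption, as the text does not state which averaging mode was chosen for the multi-class case:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = np.array([0, 1, 2, 2, 1, 0])   # toy true labels
y_pred = np.array([0, 1, 2, 1, 1, 0])   # toy predictions (one error)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='macro')         # averaging assumed
prec = precision_score(y_true, y_pred, average='macro')
rec = recall_score(y_true, y_pred, average='macro')
```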
The model also allows for the existence of previously unknown varieties: when assigning the given data to a specific variety, the model may instead recognize the data as a new variety. Therefore, when submitting data for variety determination, at least 10 observations should be provided. The model iterates through all subsets, assigns each observation to a specific variety, and then checks whether every observation is assigned to the same variety. If not, the data are classified as a new variety in that subset. After this check has been performed for all subsets, if in any of them all the given observations are classified into one specific variety, the model classifies the given data as that variety. If in every subset the result is 'The new data is classified as a new variety.', the data belong to a new, previously unknown variety.
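The decision rule described above can be sketched as a small helper. Here `per_subset_predictions` holds, for each subset's model, the predicted variety labels of the submitted observations; the function name and structure are illustrative, not taken from the original code:

```python
def classify_or_flag_new(per_subset_predictions):
    """Return the variety number if, in any subset, all submitted
    observations (at least 10 in practice) are assigned the same
    variety; otherwise return None, i.e., a new variety."""
    for preds in per_subset_predictions:
        if len(set(preds)) == 1:   # all observations agree in this subset
            return preds[0]
    return None                    # new, previously unknown variety
```

For example, predictions of `[[1, 2, 1], [3, 3, 3]]` over two subsets yield variety 3, because the second subset's model assigns every observation to the same variety, while `[[1, 2], [4, 5]]` yields None.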