In this study, a dataset of 2076 lemon images, 1125 of good quality and 951 of bad quality, was used to determine lemon quality. Before training with the deep learning and transformer methods, data augmentation was applied to the images: rescaling, random zoom, random flip, and random rotation. To determine lemon quality, two transformer methods, Vision Transformer (ViT) and Swin Transformer, and eight deep learning methods, Xception, ResNet50, InceptionV3, NASNetMobile, EfficientNetB5, InceptionResNetV2, ResNet152, and DenseNet201, were used. For performance evaluation of the deep learning and transformer models, the dataset was divided into 70% training (1453 images, expanded to 5812 images after augmentation) and 30% testing (623 images). The block diagram of our proposed model, which includes data augmentation and the deep learning methods, is shown in Fig. 5.
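For concreteness, the following is a minimal sketch of this augmentation pipeline and 70/30 split using Keras preprocessing layers; the augmentation factors, directory layout, and seed are illustrative assumptions, not the authors' exact settings.

```python
# A hedged sketch of the described augmentation (rescaling, random zoom,
# random flip, random rotation) and the 70/30 split; factors are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

augmentation = tf.keras.Sequential([
    layers.Rescaling(1.0 / 255),                    # rescaling to [0, 1]
    layers.RandomFlip("horizontal_and_vertical"),   # random flip
    layers.RandomRotation(0.1),                     # random rotation (±10% of 2π)
    layers.RandomZoom(0.1),                         # random zoom (±10%)
])

# 70/30 train/test split of the 2076-image dataset (directory layout assumed:
# lemon_dataset/good_quality and lemon_dataset/bad_quality).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "lemon_dataset", validation_split=0.3, subset="training",
    seed=42, image_size=(300, 300), batch_size=8)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "lemon_dataset", validation_split=0.3, subset="validation",
    seed=42, image_size=(300, 300), batch_size=8)

# Augmentation is applied on the fly to the training set only.
train_ds = train_ds.map(lambda x, y: (augmentation(x, training=True), y))
```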
The hybrid models shown in Fig. 5 were tested using Python on a computer with an i9-12950 processor, an RTX 3080 Ti graphics card, and 32 GB of RAM.
3.1. Evaluation Criteria
In the field of machine learning, evaluating model performance is crucial for assessing the effectiveness and generalization capability of trained models. Several metrics are available for evaluating classification methods, including validation accuracy, validation loss, precision, recall, and F1 score [37]. In this study, validation accuracy, validation loss, precision, and recall were used to evaluate the performance of the deep learning methods. These metrics provide valuable insight into a model's ability to make accurate predictions on unseen data and are widely employed in model selection and performance comparison.
Validation loss measures the discrepancy between the model's predicted outputs and the true target values on a validation dataset, which consists of examples not used during training. It is typically computed with a specific loss function that quantifies the dissimilarity between predicted and true values [38]. By monitoring the validation loss, researchers and practitioners can gauge the model's ability to generalize to unseen data and detect signs of overfitting. A low validation loss indicates that the model performs well on the validation set, implying that it effectively captures the underlying patterns and regularities in the data; a high validation loss suggests that the model is struggling to generalize or is overfitting to the training data [39]. The goal is to minimize the validation loss, as it reflects the model's performance on unseen instances and serves as a proxy for its performance in real-world scenarios. The loss calculation is given in Eq. 5.
\(Loss=\frac{1}{N}\sum_{i=1}^{N}f(\hat{y}_{i},y_{i})\) (Eq. 5)
where \(N\) is the number of samples, \(\hat{y}_{i}\) and \(y_{i}\) are the predicted and true values for the \(i\)-th sample, and \(f\) is the loss function.
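As an illustration, Eq. 5 can be computed directly; here \(f\) is taken to be categorical cross-entropy, the usual companion of a softmax output (an assumption, since the paper does not name \(f\) explicitly).

```python
# Direct implementation of Eq. 5 with f assumed to be cross-entropy.
import numpy as np

def mean_loss(y_pred, y_true, eps=1e-12):
    """Average of f(y_hat_i, y_i) over the N validation samples (Eq. 5)."""
    y_pred = np.clip(y_pred, eps, 1.0)                     # avoid log(0)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=1)  # f = cross-entropy
    return per_sample.mean()                               # (1/N) * sum_i f(...)

# Two samples, two classes (good / bad quality):
y_true = np.array([[1, 0], [0, 1]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
print(mean_loss(y_pred, y_true))  # ~0.164
```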
Accuracy, on the other hand, measures the proportion of correctly predicted instances out of the total number of examples in the dataset. It is particularly relevant in classification tasks, where the model's output is a class label or a probability distribution over classes [39]. Accuracy measures how well the model classifies unseen data, offering insight into its overall predictive capability. A high accuracy implies that the model makes accurate predictions on the validation set, correctly assigning instances to their respective classes; a low validation accuracy suggests that the model struggles to generalize or has difficulty distinguishing between classes. In contrast to the loss, the objective is to maximize accuracy, indicating that the model performs well on unseen data [40]. The accuracy calculation is given in Eq. 6; recall and precision are given in Eqs. 7 and 8.
\(Accuracy=\frac{TP+TN}{TP+FP+FN+TN}\times 100\) (Eq. 6)

\(Recall=\frac{TP}{TP+FN}\times 100\) (Eq. 7)

\(Precision=\frac{TP}{TP+FP}\times 100\) (Eq. 8)
In these equations, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Loss and accuracy are complementary metrics that together provide a comprehensive evaluation of a trained model's performance: while loss quantifies the model's prediction errors in a continuous manner, accuracy provides a more interpretable measure of classification correctness. Precision is the proportion of correctly predicted positive instances (true positives) out of all instances predicted as positive. It measures the accuracy of positive predictions, indicating how reliable the model is when it identifies positive samples. Recall, also known as sensitivity or true positive rate, is the proportion of correctly predicted positive instances (true positives) out of all actual positive instances. It measures the model's ability to identify all positive samples, indicating how effectively it captures the relevant instances. These metrics play a vital role in model evaluation, enabling researchers and practitioners to compare different models, assess their generalization capabilities, and make informed decisions about model selection and hyperparameter tuning [41].
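The sketch below computes Eqs. 6–8 directly from raw confusion-matrix counts, treating good-quality lemons as the positive class; the counts in the usage example are hypothetical, not the study's actual results.

```python
# Eqs. 6-8 from TP/TN/FP/FN counts (positive class = good quality, assumed).
def classification_metrics(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn) * 100   # Eq. 6
    recall    = tp / (tp + fn) * 100                    # Eq. 7
    precision = tp / (tp + fp) * 100                    # Eq. 8
    return accuracy, recall, precision

# Hypothetical counts on a 623-image test set:
print(classification_metrics(tp=335, tn=280, fp=4, fn=4))
```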
After applying the data augmentation techniques to the dataset of lemon images, eight different deep learning models, namely Xception, ResNet50, InceptionV3, NASNetMobile, EfficientNetB5, InceptionResNetV2, ResNet152, and DenseNet201, were applied. The training parameters used for these models are given in Table 1.
Table 1
Hyperparameters of Deep learning models
| Hyperparameters | Value |
| --- | --- |
| Epoch | 20 |
| Learning-Rate | 0.01 |
| Batch-size | 8 |
| Input-Shape | 300x300 |
| Optimizer | Adam |
| Dropout | 0.1 |
| Activation Function | ReLU |
| Output Function | Softmax |
As a result of experimental tests, the values yielding high classification accuracy and low loss were chosen as the training parameters for the deep learning models shown in Table 1. The Epoch value, which indicates how many times the deep learning models pass over the training dataset, was set to 20. The Learning-rate, which affects the learning capacity and the training time, was set to 0.01. The Batch-size, the number of samples over which the loss function is computed before the weights are updated at each training step, was set to 8. The Adam optimizer was selected to update the weights. The Dropout value, which randomly breaks connections between neurons, was set to 0.1 to avoid overfitting. The results obtained by the deep learning models with these training settings are provided in Table 2, which shows the best accuracy and lowest loss values that each model achieved after training.
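A hedged sketch of how the Table 1 settings could be wired into a Keras training run is shown below, using EfficientNetB5 as the backbone; the classification-head layout is an assumption rather than the authors' exact code.

```python
# Mapping the Table 1 hyperparameters onto a Keras model; head is assumed.
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.EfficientNetB5(
    include_top=False, weights="imagenet",
    input_shape=(300, 300, 3), pooling="avg")   # Input-Shape: 300x300

model = tf.keras.Sequential([
    base,
    layers.Dense(128, activation="relu"),       # Activation Function: ReLU
    layers.Dropout(0.1),                        # Dropout: 0.1
    layers.Dense(2, activation="softmax"),      # Output Function: Softmax
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),  # Adam, lr 0.01
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

# Epoch: 20; Batch-size: 8 (set when the datasets were built earlier).
# model.fit(train_ds, validation_data=test_ds, epochs=20)
```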
Table 2
Results of the experimental studies with the deep learning models
| Models | Epoch | Recall (%) | Precision (%) | Accuracy (%) | Loss |
| --- | --- | --- | --- | --- | --- |
| Xception | 20 | 98.35 | 97.81 | 98.02 | 0.0736 |
| ResNet50 | 20 | 96.53 | 95.96 | 96.17 | 0.1123 |
| InceptionV3 | 20 | 98.02 | 97.59 | 97.83 | 0.0822 |
| NASNetMobile | 20 | 96.95 | 96.38 | 96.62 | 0.1086 |
| EfficientNetB5 | 20 | 99.29 | 98.86 | 99.03 | 0.0382 |
| InceptionResNetV2 | 20 | 98.82 | 98.19 | 98.43 | 0.0694 |
| ResNet152 | 20 | 97.21 | 96.66 | 96.84 | 0.0985 |
| DenseNet201 | 20 | 98.96 | 98.49 | 98.65 | 0.0563 |
As seen in Table 2, among the eight deep learning models applied to the lemon images, EfficientNetB5 and DenseNet201 achieved higher accuracy values than the other models in the table. In addition to accuracy, recall and precision values were also calculated: the recall values are higher than the accuracy values, while the precision values are lower. Recall gives the proportion of actual positive samples that are correctly classified. Recall is very important in classification processes because false negative classifications can cause serious problems: a false negative overlooks what the object is and creates obstacles to making the right decision. Precision is the key evaluation metric in classification processes where false positives are the priority. The fact that recall is higher than accuracy in this study indicates that good-quality lemons are classified with high accuracy. Considering the products used in fruit juice factories, where medium-quality fruit is still used in juice production, it is meaningful that the recall value is higher than the accuracy value in our study. In addition to the deep learning models, recently popular vision transformer models were also applied to the lemon images in order to increase the accuracy values. The training parameters used for the transformer models are given in Table 3.
Table 3
Hyperparameters of Transformer models
| Hyperparameters | Vision Transformer | Swin Transformer |
| --- | --- | --- |
| Epoch | 100 | 100 |
| Learning-Rate | 0.0001 | 0.0001 |
| Batch Size | 8 | 8 |
| Optimizer | Adam | Adam |
| Input Shape | 300x300 | 300x300 |
| Patch Size | 15 | 10 |
| Projection Dimension | 225 | 200 |
| MLP Units | 1800, 900 | 1024 |
| Number of Transformer Layers | 5 | - |
| Number of Heads | 45 | 8 |
| Window Size | - | 5 |
| Shift Size | - | 1 |
| Label Smoothing | - | 0.1 |
| Activation Function | ReLU | ReLU |
| Output Function | Softmax | Softmax |
In the vision transformer, the model is trained by splitting the image into patches; the Patch Size parameter is the size of these patches. The Projection Dimension is the length of the vector to which each patch is mapped by linear projection. After projection, the resulting vectors are fed into the multi-head attention layers of the transformer encoders, which decide how much attention to pay to each patch according to how much it affects the result. The Number of Heads parameter is the number of heads in these multi-head attention layers. A transformer layer consists of normalization, multi-head attention, and MLP layers, and the Number of Transformer Layers parameter indicates how many such layers are stacked. The MLP layers follow the transformer layers, and the MLP Units parameter specifies their sizes. The Swin transformer differs from the vision transformer by its shifted-window structure: this mechanism processes the image by selecting windows over the patches and shifting these windows. The Window Size parameter is the size of the windows over the patches, and the Shift Size is the number of pixels by which the windows are shifted. The label smoothing parameter in the Swin Transformer is a correction factor used to smooth the sharp target distribution usually caused by hard-coded labels. It takes a value in the range [0, 1]: 0 means no label smoothing, while 1 means maximum label smoothing [34]. The evaluation metrics obtained by applying the transformer models, with the parameters specified in Table 3, and the two most successful deep learning models to the lemon image dataset are given in Table 4.
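To make the roles of these parameters concrete, the sketch below assembles a minimal ViT in Keras using the Table 3 values (patch size 15, projection dimension 225, 45 heads, 5 transformer layers, MLP units 1800 and 900); details beyond those values, such as the pooling head and residual layout, are assumptions.

```python
# Minimal ViT sketch built from the Table 3 values; head layout is assumed.
import tensorflow as tf
from tensorflow.keras import layers

PATCH, PROJ, HEADS, DEPTH = 15, 225, 45, 5
NUM_PATCHES = (300 // PATCH) ** 2          # 300x300 image -> 20x20 = 400 patches

class AddPositionEmbedding(layers.Layer):
    """Adds a learned position embedding to each patch token."""
    def build(self, input_shape):
        self.pos = self.add_weight(
            name="pos", shape=(1, input_shape[1], input_shape[2]),
            initializer="random_normal")
    def call(self, x):
        return x + self.pos

inputs = layers.Input(shape=(300, 300, 3))

# Patch embedding: split into 15x15 patches, linearly project each to 225 dims.
x = layers.Conv2D(PROJ, kernel_size=PATCH, strides=PATCH)(inputs)
x = layers.Reshape((NUM_PATCHES, PROJ))(x)
x = AddPositionEmbedding()(x)

# Each transformer layer: normalization, multi-head attention, and an MLP.
for _ in range(DEPTH):
    h = layers.LayerNormalization()(x)
    h = layers.MultiHeadAttention(num_heads=HEADS, key_dim=PROJ // HEADS)(h, h)
    x = x + h                                     # residual connection
    h = layers.LayerNormalization()(x)
    h = layers.Dense(1800, activation="relu")(h)  # MLP Units: 1800, 900
    h = layers.Dense(900, activation="relu")(h)
    h = layers.Dense(PROJ)(h)                     # project back (assumption)
    x = x + h

# Classification head: pool the tokens, then softmax over the 2 classes.
x = layers.GlobalAveragePooling1D()(layers.LayerNormalization()(x))
outputs = layers.Dense(2, activation="softmax")(x)
vit = tf.keras.Model(inputs, outputs)
```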
Table 4
Results of the experimental studies with the transformer models and the two most successful deep learning models
| Models | Recall (%) | Precision (%) | Accuracy (%) | Loss |
| --- | --- | --- | --- | --- |
| EfficientNetB5 | 99.29 | 98.86 | 99.03 | 0.0382 |
| DenseNet201 | 98.96 | 98.49 | 98.65 | 0.0563 |
| VisionTransformer | 99.95 | 99.66 | 99.84 | 0.0070 |
| SwinTransformer | 99.38 | 99.12 | 99.23 | 0.0174 |
As seen in Table 4, the transformer models are more successful than the deep learning models. Among the transformer models, the Vision Transformer performs a more successful classification than the Swin Transformer, with an accuracy of 99.84%. To show the consistency of the accuracy and loss values of these four models, box-plot graphs are drawn in Figs. 6 and 7.
Figures 6 and 7 show the average loss and accuracy values for the dataset prepared to determine lemon quality. Experimental evaluations were carried out on the EfficientNetB5 and DenseNet201 deep learning architectures and the VisionTransformer and Swin Transformer architectures. In light of the results obtained, the VisionTransformer method has the best average loss and accuracy values compared to the other methods: its accuracy and loss values lie between 0.9871–0.9984 and 0.0070–0.0076, respectively. As seen in Figs. 6 and 7, the boxplot of the Vision Transformer architecture is much smaller than those of the other architectures, and the distance between its extreme values is very small, as is the spread of its accuracy rates. The box lengths of the Vision Transformer architecture are shorter than those of the other architectures, its whiskers are closer to the box, and its median lies in the middle of the box. According to these results, the Vision Transformer architecture offers more stable results on the lemon-quality dataset than the other architectures. To show the contribution to the literature of the Vision Transformer method, the most successful method proposed in this study, comparisons were made with studies conducted on the same dataset; the results are shown in Table 5.
Table 5

Comparison of studies using the Lemon Quality Dataset in the literature

| Authors | Year | Data | Number of Data | Method | Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| He et al. | 2021 | Lemon | 1847 | VGG16 | 95.44 |
| Pramanik et al. | 2021 | Lemon | 314 | Xception | 94.34 |
| Hernandez et al. | 2021 | Lemon | 913 | CNN | 92 |
| Bird et al. | 2022 | Lemon | 2690 | VGG16 | 83.77 |
| Bird et al. | 2022 | Lemon | 2690 + 400 | VGG16 after CGAN | 88.75 |
| Sharma et al. | 2022 | Lemon | 3000 | CNN + LSTM | 94.2 |
| Yılmaz et al. | 2023 | Lemon | 2076 | SAE-CNN | 98.96 |
| Proposed Method | 2023 | Lemon | 2076 | VisionTransformer | 99.84 |
As seen in Table 5, when studies on the quality evaluation of the lemon product are examined, the Vision Transformer model used in this study achieved a higher success rate than the other studies. Fruit diseases are among the most serious problems in lemon cultivation, so detecting them is of vital importance for the cultivation of lemons and other fruits. Lemon is a fruit that is frequently consumed in many parts of the world; since it is a potential therapeutic for diseases such as cancer and tumors, and because the vitamins it contains are extremely important for human health, lemon quality and the detection of lemon diseases are important issues. Previously, these diseases could only be detected by observation; today, they can be detected automatically with image processing methods. In this study, various deep learning methods were used to classify lemon quality. The Vision Transformer and Swin Transformer methods, which are new in the literature, and pre-trained models such as EfficientNetB5 and DenseNet201 were used, and the performance of these models was compared. The proposed Vision Transformer model performed better than the other models, and as seen in Table 5, this study makes a successful contribution to the literature on lemon quality classification.