2.1. Overview
The method proposed in this paper to study the regurgitation behavior of fruit flies is divided into three main parts:
1. Detect and recognize the regurgitation behavior of fruit flies with a behavior recognition network.
2. Segment the regurgitated spots with a Unet network combined with the CBAM attention mechanism (and comparison networks), so that the spots can be extracted precisely; the area of each spot is then calculated with OpenCV, allowing the total amount of regurgitation to be estimated.
3. To study insect regurgitation more comprehensively, track the insects' movement with Yolov5 combined with DeepSort, so that the number of insects and their trajectories during regurgitation can be recorded at the same time.
2.2. Experimental equipment and environment
The computer used for the behavior recognition experiments had an Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz and an NVIDIA GeForce RTX 2080Ti GPU with 11 GB of video memory. The software environment was Ubuntu 20.04.1, Python 3.7, Cuda 11.3, and the deep learning framework Pytorch 1.10.0.
The computer used for the regurgitated spot extraction and insect trajectory tracking experiments had an 11th Gen Intel(R) Core(TM) i5-11400H CPU @ 2.70 GHz and an NVIDIA GeForce RTX 3060 GPU with 6 GB of video memory. The software environment was a Chinese-language operating system, Python 3.8, Cuda 11.5, and the deep learning framework Pytorch 1.10.0.
2.3. Model performance metrics
The first part, the behavior recognition experiment, is a classification task, and Top-1 Accuracy is used to evaluate model accuracy. Top-1 Accuracy and Top-5 Accuracy are both common metrics for evaluating classification models. Top-1 Accuracy takes the category with the highest predicted probability as the prediction; the prediction is judged correct if it matches the actual label. Top-5 Accuracy takes the five categories with the highest predicted probabilities as the predictions; the prediction is judged correct if any of them matches the actual label. Since the behavior recognition experiment distinguishes only fruit fly regurgitation behavior from other behaviors, Top-1 Accuracy was chosen as the evaluation metric.
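As a concrete illustration, Top-1 Accuracy can be computed from prediction scores in a few lines of NumPy. This is a minimal sketch: the function name and the score values are hypothetical, chosen only to show the calculation.

```python
import numpy as np

def top1_accuracy(scores: np.ndarray, labels: np.ndarray) -> float:
    """Top-1 Accuracy: the highest-probability class must match the true label."""
    return float((scores.argmax(axis=1) == labels).mean())

# Two-class case (regurgitation vs. other behavior), invented scores:
scores = np.array([[0.9, 0.1],
                   [0.3, 0.7],
                   [0.6, 0.4]])
print(top1_accuracy(scores, np.array([0, 1, 1])))  # ≈ 0.667 (2 of 3 correct)
```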
In the second part, the regurgitated spot extraction experiment, the main purpose is to evaluate the semantic segmentation results by calculating the MIoU (Mean Intersection over Union). In semantic segmentation, the intersection over union of a single category is the ratio of the intersection to the union of the true labels and the predicted values of that category (Fig. 1).
Here the positive cases refer to regurgitated spots and the negative cases to non-regurgitated spots.
MIoU is the average of the intersection over union for each class of labels in the dataset. The calculation formula is as follows:
$$MIoU=\frac{1}{k+1}\sum _{i=0}^{k} \frac{{p}_{ii}}{\sum _{j=0}^{k} {p}_{ij}+\sum _{j=0}^{k} {p}_{ji}-{p}_{ii}}$$
1
Where \(i\) denotes the true value, \(j\) denotes the predicted value, and \({p}_{ij}\) denotes pixels of true class \(i\) predicted as class \(j\). This is equivalent to
$$MIoU=\frac{1}{k+1}\sum _{i=0}^{k} \frac{TP}{FN+FP+TP}$$
2
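Equations (1) and (2) can be computed directly from a confusion matrix. The helper below is a minimal NumPy sketch; the matrix values are invented for the example.

```python
import numpy as np

def mean_iou(conf_matrix: np.ndarray) -> float:
    """Compute MIoU from a (k+1)x(k+1) confusion matrix.

    conf_matrix[i, j] counts pixels of true class i predicted as class j,
    so the diagonal holds the per-class true positives (p_ii).
    """
    tp = np.diag(conf_matrix)           # p_ii
    fn = conf_matrix.sum(axis=1) - tp   # sum_j p_ij - p_ii
    fp = conf_matrix.sum(axis=0) - tp   # sum_j p_ji - p_ii
    iou = tp / (tp + fn + fp)           # TP / (FN + FP + TP) per class
    return float(iou.mean())

# Two classes: background (non-regurgitated) and regurgitated spots.
cm = np.array([[950, 50],
               [ 30, 70]])
print(round(mean_iou(cm), 4))  # 0.6945
```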
In the third part, the trajectory tracking experiment, two metrics, precision and recall, were used to evaluate the effectiveness of Yolov5 in detecting fruit flies. Precision is a measure of accuracy that describes how many of the predicted positive cases are true positives. Here positive cases refer to fruit flies and negative cases to non-fruit flies; precision is expressed as follows:
$$\text{Precision }=\frac{TP}{TP+FP}$$
3
Recall is a coverage metric that describes, from the perspective of the true outcomes, how many of the positive cases were found, with the following expression:
$$\text{Recall }=\frac{\text{T}\text{P}}{\text{T}\text{P}+\text{F}\text{N}}$$
4
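Equations (3) and (4) reduce to two one-line functions; the detection counts below are hypothetical, used only to illustrate the arithmetic.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted fruit-fly detections that are true fruit flies (Eq. 3)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual fruit flies that the detector found (Eq. 4)."""
    return tp / (tp + fn)

# Invented counts: 98 correct detections, 2 false alarms, 4 misses.
print(precision(98, 2))  # 0.98
print(recall(98, 4))     # ≈ 0.961
```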
The loss function serves as another evaluation metric during the training of each model; it estimates the degree of inconsistency between the model's predicted and true values and is a non-negative real-valued function. The smaller the loss, the better the robustness of the model.
2.4. Data collection and processing
In the process of data collection, this paper focused mainly on Bactrocera minax and Bactrocera tau. Both species were photographed at the Insect Ecology Laboratory of the College of Agriculture, Yangtze University. Bactrocera minax affects almost all fruits of the genus Citrus in the family Rutaceae, and its individuals are relatively large; Bactrocera tau is smaller than Bactrocera minax and mainly affects squash, cucumber, tomato and other fruits.
In order to induce regurgitation, Bactrocera minax was fed 5% honey water and then placed individually in closed petri dishes. Video of the regurgitation behavior and other actions of the fruit flies was obtained by vertical filming with a Sony video camera (FDR-AX60) at a resolution of 1920×1080 and a frame rate of 50 fps.
In the behavior recognition experiment, the video needed to be edited into clips. The video of Bactrocera minax regurgitating was edited into 50 clips of 10 s each, and the video of other actions (various grooming behaviors and resting states) was likewise edited into 50 clips of 10 s each.
In the semantic segmentation experiment, one image was extracted every 20 frames from the regurgitation video as the dataset for semantic segmentation, yielding 200 images in total.
In the trajectory tracking experiments, videos containing insects at rest, walking, and regurgitating were selected; one image was extracted every 50 frames, yielding 300 images. This part of the experiment used Bactrocera tau, which is much smaller than Bactrocera minax. Its small size makes it harder to track and therefore a better test of whether the network meets the criteria for tracking insects. In addition, since the petri dish is limited in size, the smaller flies move more randomly, which makes the tracking test more demanding and the results more convincing.
2.5. Regurgitation behavior recognition experiment
The behavior recognition task is to identify different behavioral actions from video; the actions can occur continuously or intermittently. Behavior recognition can be seen as an extension of image classification to multi-frame detection, aggregating the predictions for each frame. Traditional behavior recognition focuses on feature extraction from video: it extracts local high-dimensional visual features from video regions, combines them into fixed-size video-level descriptions, and finally uses classifiers for prediction. With the development of deep learning, 2D convolutional neural networks (2DCNN) were applied to behavior recognition. A 2DCNN takes a two-dimensional matrix as input, so the input video must be transformed into images, and the sliding-window operation can only be performed within a single frame. This approach cannot take into account inter-frame motion information in the time dimension, so the application of 2DCNN to behavior recognition is not satisfactory. With 3D convolutional neural networks (3DCNN), however, behavior recognition can be done more effectively. A 3D convolution operates over three dimensions, image width, image height, and time (the frame sequence), and the kernel can move in all three directions; the input video is mapped to an output volume that retains the temporal information, better capturing the temporal and spatial information in the video.19–23
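The shape difference can be illustrated with a short PyTorch sketch: a 3D convolution keeps the frame dimension in its output, so inter-frame motion information survives the operation. The tensor sizes below are invented for the example.

```python
import torch
import torch.nn as nn

# A 16-frame RGB clip: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)

# A 3x3x3 kernel slides over time as well as space, so the output
# is still a video-shaped feature volume, not a single feature map.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
out = conv3d(clip)
print(out.shape)  # torch.Size([1, 64, 16, 112, 112]) -- frame axis preserved
```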
In this paper, three typical networks are used for the experiments: 3D Convolutional Networks (C3D),24 Inflated 3D ConvNet (I3D)25 and Expanding Architectures for Efficient Video Recognition (X3D).26 C3D can be regarded as a breakthrough, as it was one of the earliest proposals to apply 3DCNN methods to behavior recognition. It applies 3D convolutional operations to extract spatial and temporal features from video data; these 3D feature extractors operate in both the spatial and temporal dimensions, thus capturing the motion information in the video stream. The structure generates information channels from adjacent video frames and performs convolution and subsampling in each channel separately, combining the information from all channels to obtain the final features. Compared with a 2DCNN, the C3D network is better suited to learning spatio-temporal features: it can model temporal information through 3D convolution and 3D pooling, whereas 2D convolution can only learn features spatially. The I3D network inflates 2D architectures into 3D by temporally inflating all filters and pooling kernels. The main advantage of this method is that the model parameters can be extended to 3D from pre-trained 2D image models, which solves the problem of lacking 3D pre-trained parameters. X3D is a relatively new network model that improves on the previous networks. Earlier 3D networks mainly expanded a 2D convolutional neural network along the time dimension, but expanding along the time scale is not necessarily the best choice; it is worth trying to expand along other axes, such as the total frame length of the input data, the frame rate, the size of the input frames, and the network width and depth.
X3D eventually outperforms the previous networks in accuracy while requiring only one-fifth of their computation and parameters, and it was found that the network can keep the number of channels low while maintaining high input resolution.
The above networks were designed for human action recognition datasets such as Kinetics, UCF101, and HMDB-51. For example, Kinetics contains 400 classes; each sample comes from a different Youtube video, and the corresponding human action is extracted from the video into a clip of about 10 seconds.
The main idea of this paper is to extend human behavior recognition to insect behavior recognition. Because insect movements are much smaller in scale than human movements, it is uncertain whether the network models can extract the fine-grained features of insect actions well when detecting their behaviors. We therefore investigated this question experimentally: we labeled the prepared video clips, labeling each short video as one action, and then fed them into the C3D, I3D and X3D networks for training respectively. (Fig. 2)
Table 1
Values of the hyperparameters for the three different network models evaluated in the study.

| Model | Batch | Momentum | Optimizer | Initial learning rate | Training epochs |
| --- | --- | --- | --- | --- | --- |
| C3D | 16 | 0.9 | SGD | 1e-4 | 50 |
| I3D | 16 | 0.9 | SGD | 1e-4 | 50 |
| X3D | 16 | 0.9 | SGD | 1e-4 | 50 |

SGD, stochastic gradient descent.
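The hyperparameters in Table 1 (SGD, momentum 0.9, initial learning rate 1e-4, batch size 16, 50 epochs) translate into a standard PyTorch training setup. The sketch below is illustrative only: the tiny `nn.Sequential` model is a placeholder, not the actual C3D/I3D/X3D architecture, and random tensors stand in for the labeled clips.

```python
import torch
import torch.nn as nn

# Placeholder stand-in for C3D/I3D/X3D; the real networks come from
# their respective published implementations.
model = nn.Sequential(
    nn.Conv3d(3, 8, 3, padding=1),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(8, 2),          # regurgitation vs. other behavior
)

# Hyperparameters from Table 1: SGD, momentum 0.9, initial lr 1e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()

clips = torch.randn(16, 3, 16, 56, 56)   # batch of 16 dummy clips
labels = torch.randint(0, 2, (16,))
for epoch in range(1):                   # 50 epochs in the actual experiments
    optimizer.zero_grad()
    loss = criterion(model(clips), labels)
    loss.backward()
    optimizer.step()
```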
2.6. Regurgitation spots extraction experiment
After detecting regurgitation of fruit flies, we need to semantically segment their regurgitated spots and then calculate the spot area by threshold segmentation, which provides a quantitative assessment for regurgitation studies.
In segmenting the regurgitated spots, the Unet network was used first. Why was Unet chosen? We were inspired by medical image segmentation. Medical image semantics are simpler and more fixed in structure: organs have a fixed structure and are not particularly rich in semantic information, so both high-level semantic information and low-level features are important. The skip connections and U-shaped structure of Unet combine high-level semantic information with low-level features, making it well suited to medical semantic segmentation. The features of fruit fly regurgitation images are similar to those of medical images: the regurgitated spots resemble a group of ellipse-shaped cells, their structure is relatively fixed, and the semantic structure is relatively simple, so all that is needed is accurate segmentation.27–29
In order to obtain better segmentation results, we modified the backbone network of Unet, using Vgg16 and ResNet50 respectively, and then added the CBAM attention mechanism, which further improved the segmentation. To make the experiments more rigorous, ablation experiments were also performed: the semantic segmentation network DeeplabV3+ was used, with Xception and MobileNetv2 as its backbone networks.30–37 The training hyperparameter settings and training results are shown in Table 2.
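The CBAM module mentioned above applies channel attention followed by spatial attention to a feature map. The sketch below is a compact illustration of that idea, not the exact module used in the experiments; the reduction ratio and kernel size follow the common defaults from the CBAM paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel then spatial attention."""
    def __init__(self, channels: int, reduction: int = 16, kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.spatial(s))

feat = torch.randn(1, 64, 32, 32)
print(CBAM(64)(feat).shape)  # torch.Size([1, 64, 32, 32]) -- shape unchanged
```

Because CBAM preserves the feature map's shape, it can be inserted after a backbone stage (e.g. a Vgg16 or ResNet50 block) without altering the rest of the Unet.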
Table 2
Model performance metrics at different training hyperparameter settings for the two convolutional neural networks evaluated in the study

| Model | Backbone | Optimizer | Initial learning rate | Miou | Loss |
| --- | --- | --- | --- | --- | --- |
| Unet | Vgg16 | adam | 1e-4 | 89.5 | 0.056 |
| Unet + CBAM | Vgg16 | adam | 1e-4 | 90.96 | 0.055 |
| Unet | ResNet50 | adam | 1e-4 | 85.20 | 0.096 |
| Unet + CBAM | ResNet50 | adam | 1e-4 | 85.95 | 0.077 |
| deeplabV3+ | Xception | SGD | 7e-3 | 80.69 | 0.3144 |
| deeplabV3+ | MobileNetv2 | SGD | 7e-3 | 80.66 | 0.3211 |
As shown in Table 2, Unet's accuracy is highest when Vgg16 is used as the backbone and combined with the CBAM attention mechanism. Therefore, the training weights of this model were chosen to segment randomly selected fruit fly regurgitation images. Before segmentation, a one-square-millimeter piece of labeled paper was placed in the petri dish as a "scale" and photographed together with the fruit fly, so that the base area of the regurgitated spots could be derived from their pixel counts via the known area and pixel count of the paper.
After segmentation, the extracted regurgitated spots can be seen clearly, but the segmented image contains impurities, such as the fruit flies themselves and tiny impurities on the petri dish, which not only cause visual disturbance but also affect the next step of calculating the spot area. Threshold segmentation can therefore be used to remove the impurities and background. Since only the regurgitated spots and the marker paper need to be retained, we chose binarization, the simplest form of threshold segmentation, to assign black to all impurities and background and to keep and deepen the color of the regurgitated droplets and marker paper, thereby obtaining a cleanly extracted image of the regurgitated spots. The extraction process is shown in Fig. 3.
The number of closed shapes in the image and the pixel count of each closed shape are calculated with OpenCV, and the area of each spot is then obtained from the pixel count and known area of the marker paper.
2.7. Trajectory tracking experiment
In this paper, the Yolov5 object detection algorithm is combined with the DeepSort algorithm, which currently offers good tracking performance, to track the trajectories of fruit flies during regurgitation; it also enables counting of the flies. The most important feature of the DeepSort network is its use of the Kalman filter algorithm and the Hungarian algorithm, both of which greatly improve the accuracy and speed of multi-object tracking. The Kalman filter algorithm consists of two steps, prediction and update. Prediction: as the target moves, the target-box position, velocity and other parameters in the current frame are predicted from the position and velocity parameters of the previous frame. Update: the two normally distributed states, the predicted and observed values, are linearly weighted to obtain the current predicted system state. In other words, the Kalman filter can predict the position at the current moment based on the target's position at the previous moment, and can estimate the target's position more accurately than the sensor alone. The Hungarian algorithm mainly computes similarities to obtain a similarity matrix between two consecutive frames, so as to determine whether a target in the current frame is the same as a target in the previous frame.38–44
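The predict/update cycle described above can be illustrated with a minimal one-dimensional constant-velocity Kalman filter in NumPy. This is a didactic sketch of the general algorithm, not DeepSort's actual multi-dimensional bounding-box filter; the noise matrices and measurements are invented.

```python
import numpy as np

# Minimal 1-D constant-velocity Kalman filter.
F = np.array([[1.0, 1.0],    # state transition: position += velocity per frame
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])   # we observe only the position
Q = np.eye(2) * 1e-3         # process noise covariance
R = np.array([[0.1]])        # measurement noise covariance

x = np.array([[0.0], [1.0]])  # initial state: position 0, velocity 1
P = np.eye(2)                 # initial state covariance

for z in [1.1, 2.0, 2.9]:     # noisy per-frame position measurements
    # Predict: propagate the previous state one frame forward.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update: linearly weight prediction and observation via the Kalman gain.
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (np.array([[z]]) - H @ x)
    P = (np.eye(2) - K @ H) @ P

print(x[0, 0])  # estimated position, close to the last measurement
```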
Although the DeepSort network achieves high accuracy and speed in multi-object tracking, it is mostly used for tracking and counting pedestrians and vehicles. It achieves good results on relatively large targets with obvious features, but it is seldom used for insect trajectory tracking and counting.45,46 This is because insects are small, their features are relatively inconspicuous, and their trajectories are much messier than the straight-line movements of vehicles and pedestrians, with no clear movement pattern. In this paper, we attempt to use the DeepSort network to track insects and explore whether a network model can meet the requirements of insect tracking.
Therefore, 270 images of fruit flies were used to train the Yolov5 network, and 30 images were used to verify its performance. After 50 training iterations, the accuracy of the network reached 99.8 percent. The best weights from Yolov5 training were used as the detection weights for DeepSort object tracking. Two Bactrocera tau flies in a video were detected and tracked.