3.1 Experimental Environment
This experiment was conducted on the Windows 10 platform and implemented in Python using TensorFlow, PyTorch, and Keras. CUDA 11.0 and the matching cuDNN library were used to accelerate computation. For ease of computation, the experiments used 256×256 remote sensing image patches obtained after data augmentation, with 10,000 training images and 5,000 test images. The batch size was set to 4, the shuffle buffer size to 100, and the number of iterations (epochs) to 20.
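For reference, the input pipeline described above can be expressed as a short tf.data sketch. This is a minimal illustration, not our exact training code: `image_paths` and `mask_paths` are hypothetical lists of patch and mask file paths, while the batch size, buffer size, and epoch count follow the settings listed above.

```python
import tensorflow as tf

BATCH_SIZE = 4     # batch size from Section 3.1
BUFFER_SIZE = 100  # shuffle buffer size from Section 3.1
EPOCHS = 20        # number of iterations (epochs); passed to model.fit later

def parse_pair(image_path, mask_path):
    # Load one 256x256 RGB patch and its single-channel label mask.
    image = tf.io.decode_png(tf.io.read_file(image_path), channels=3)
    mask = tf.io.decode_png(tf.io.read_file(mask_path), channels=1)
    return tf.cast(image, tf.float32) / 255.0, mask

# image_paths and mask_paths are hypothetical lists of file paths.
train_ds = (
    tf.data.Dataset.from_tensor_slices((image_paths, mask_paths))
    .map(parse_pair, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)
```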
3.2 Dataset
The dataset was provided by the CCF Big Data Competition and consists of high-resolution remote sensing images of a city in southern China captured in 2015. The dataset is relatively small, containing five large annotated RGB remote sensing images ranging in size from 3000×3000 to 6000×6000 pixels. The annotations distinguish four object classes plus a background class: vegetation (label 1), buildings (label 2), water bodies (label 3), roads (label 4), and others (label 0). To better visualize the labels, three of the training images are rendered with blue for water, yellow for buildings, green for vegetation, and brown for roads, as shown in Fig. 4.
Because the remote sensing images are very large and vary in size, they cannot be fed directly into the network for training under our memory constraints. We therefore randomly crop 256×256 patches from these images and apply data augmentation. The results are shown in Fig. 5.
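A minimal sketch of the cropping and augmentation step, assuming each large image and its label mask are loaded as aligned NumPy arrays; the flip and rotation transforms shown are illustrative choices, as the exact set of augmentations is not enumerated here.

```python
import numpy as np

def random_crop_pair(image, mask, size=256):
    # Randomly crop an aligned (image, mask) pair from a large remote sensing image.
    h, w = image.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return (image[top:top + size, left:left + size],
            mask[top:top + size, left:left + size])

def augment_pair(image, mask):
    # Apply a random horizontal flip and a random 90-degree rotation
    # (illustrative augmentations; other transforms are possible).
    if np.random.rand() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = np.random.randint(4)  # rotate by 0, 90, 180, or 270 degrees
    return np.rot90(image, k), np.rot90(mask, k)
```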
3.3 Experimental Results and Analysis
In this experiment, Unet was used as the backbone and combined with CBAM, SAM, and the Dual Channel Attention Module (DCAM), and the different attention structures were compared and analyzed. Figure 6 shows the loss and accuracy curves of the Unet, Unet + CBAM, Unet + SAM, and Unet + DCAM models on the training data, as well as the IoU curve on the test data. From the training loss curve in Fig. 6(a), Unet converges to the highest loss value; CBAM and SAM converge similarly in both speed and final value; the DCAM curve converges faster and to a noticeably smaller value. Figure 6(b) shows the training accuracy, which mirrors the training loss: Unet has the lowest accuracy, CBAM and SAM remain close to each other throughout, and DCAM achieves the highest accuracy. Figure 6(c) shows the IoU on the test data: Unet has the lowest IoU; although the IoU of SAM oscillates noticeably during training, it is on the whole comparable to CBAM; DCAM attains the highest IoU. The loss, accuracy, and IoU statistics of the comparative experiments in Fig. 6 make it clear that the DCAM model is effective and markedly improves the semantic segmentation of remote sensing images.
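For orientation, the standard CBAM baseline used above can be sketched in PyTorch as follows: channel attention reweights feature channels, and spatial attention then reweights spatial positions. This is a generic CBAM sketch for context only; the structure of our DCAM is defined earlier in this paper and is not reproduced here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Pool over the spatial dimensions, then gate each channel.
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return torch.sigmoid(avg + mx).view(b, c, 1, 1) * x

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Pool over the channel dimension, then gate each spatial position.
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1))) * x

class CBAM(nn.Module):
    # CBAM applies channel attention followed by spatial attention.
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```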
To further validate the model, we computed objective metrics, namely loss, accuracy, and IoU, for the Unet, Unet + CBAM, Unet + SAM, and Unet + DCAM models on the training and test data. As shown in Table 1, Unet + DCAM achieves clearly better loss, accuracy, and IoU values than Unet and the single-channel Unet + CBAM and Unet + SAM models, on both the training and the test data.
Table 1
Objective indicators on the training and test data after 20 epochs for the Unet, Unet + CBAM, Unet + SAM, and Unet + DCAM models.
| Model | Train Loss | Train Accuracy (%) | Train IoU | Test Loss | Test Accuracy (%) | Test IoU |
|---|---|---|---|---|---|---|
| Unet | 0.1384 | 94.5 | 0.867 | 1.1824 | 69.5 | 0.508 |
| Unet + CBAM | 0.0792 | 95.9 | 0.875 | 1.1025 | 71.4 | 0.536 |
| Unet + SAM | 0.1124 | 95.6 | 0.889 | 1.1549 | 71.1 | 0.536 |
| Unet + DCAM | 0.0504 | 97.8 | 0.943 | 1.0480 | 73.9 | 0.567 |
The accuracy statistics in Fig. 6 and Table 1 are overall pixel accuracies. To further examine the segmentation accuracy for vegetation, buildings, water, roads, and others, we computed the per-class pixel accuracy on the training and test data, as shown in Fig. 7: Fig. 7(a) shows the per-class pixel accuracy on the training data, and Fig. 7(b) on the test data. From Figs. 7(a) and 7(b), the classification accuracy of Unet + DCAM is overall higher than that of Unet + CBAM, Unet + SAM, and Unet. Across individual classes, the segmentation accuracies for vegetation, buildings, and water do not differ greatly, while the accuracy for roads is comparatively low, which is related to the small size of road targets.
Figure 8 displays the semantic segmentation results of Unet, Unet + CBAM, Unet + SAM, and Unet + DCAM models on Test data. As shown in Fig. 8, each model has achieved commendable results in object segmentation. However, upon closer analysis, the segmentation results of Unet + DCAM are more precise and closer to the label image. Therefore, whether evaluated through objective indicators or intuitive perception of segmentation results, the Unet + DCAM model in this paper outperforms the Unet, Unet + CBAM, and Unet + SAM models.
The experiments above support the claim that the proposed DCAM model performs better in semantic segmentation of remote sensing images than the single-channel CBAM and SAM models. To further validate the model, we compared it with other mainstream semantic segmentation methods on the CCF dataset, including FCN, SegNet, DeepLabV3 [19], DANet, OCNet, and the original CCNet. The FCN, SegNet, and DeepLabV3 models were implemented with TensorFlow and Keras using the following settings: batch size 8, 20 epochs, learning rate 0.0001, and sparse_categorical_crossentropy as the loss. The DANet, OCNet, and CCNet models were implemented in PyTorch with ResNet50 as the backbone; the batch size, epoch count, and learning rate matched those of the previous models, with cross_entropy_loss as the loss function. Figure 9 shows the results on the test data after each model was trained. Judging from the segmentation outputs, FCN, SegNet, DeepLabV3, DANet, OCNet, and CCNet each have their strengths and can segment the targets to a certain extent. However, in overall segmentation quality and accuracy of detail, the Unet + DCAM algorithm proposed in this paper outperforms the other models and is closer to the ground-truth labels.
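As an illustration, the TensorFlow/Keras comparison models were trained with a configuration equivalent to the following sketch. Here `build_fcn` is a hypothetical constructor standing in for the FCN, SegNet, or DeepLabV3 network definitions, and `train_images`/`train_masks` are placeholder arrays; only the optimizer, loss, batch size, and epoch count reflect the settings stated above.

```python
import tensorflow as tf

# build_fcn is a hypothetical constructor standing in for the
# FCN, SegNet, or DeepLabV3 network definitions (not shown here).
model = build_fcn(num_classes=5)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # learning rate 0.0001
    loss="sparse_categorical_crossentropy",                  # integer-valued label masks
    metrics=["accuracy"],
)
# train_images: (N, 256, 256, 3) float32; train_masks: (N, 256, 256) labels 0-4
model.fit(train_images, train_masks, batch_size=8, epochs=20)
```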
To quantify the performance of the FCN, SegNet, DeepLabV3, DANet, OCNet, and CCNet algorithms against the Unet + DCAM semantic segmentation algorithm, we analyzed the experimental results using four objective indicators: PA, MPA, MIoU, and Kappa [20, 21], defined below; a computation sketch follows the list.
· Pixel Accuracy (PA) is the ratio of correctly labeled pixels to the total number of pixels.
· Mean Pixel Accuracy (MPA) refines PA by computing the proportion of correctly classified pixels within each class and then averaging over all classes.
· Mean Intersection over Union (MIoU) computes the ratio of the intersection to the union of two sets, the ground truth and the predicted segmentation, averaged over the per-class IoU values.
· Kappa measures the consistency of two variables; taking the classification results and the verification samples as the two variables, it can be used to evaluate classification accuracy.
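All four indicators can be derived from a single confusion matrix. The sketch below shows one standard NumPy implementation, as our own illustration rather than the exact evaluation code behind Table 2; classes absent from both the prediction and the ground truth would need special handling.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes=5):
    # Accumulate a num_classes x num_classes confusion matrix from label maps.
    valid = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def metrics_from_cm(cm):
    total = cm.sum()
    pa = np.diag(cm).sum() / total                      # PA: observed agreement
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    mpa = per_class.mean()                              # MPA: mean per-class accuracy
    union = cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm)
    miou = (np.diag(cm) / np.maximum(union, 1)).mean()  # MIoU: mean per-class IoU
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2
    kappa = (pa - pe) / (1 - pe)                        # Cohen's Kappa
    return pa, mpa, miou, kappa
```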
Table 2
Comparison of objective indicators (PA, MPA, MIoU, and Kappa) for semantic segmentation of remote sensing images using the FCN, SegNet, DeepLabV3, DANet, OCNet, CCNet, and Unet + DCAM algorithms.
| Model | PA | MPA | MIoU | Kappa |
|---|---|---|---|---|
| FCN | 0.518 | 0.223 | 0.148 | 0.079 |
| SegNet | 0.539 | 0.383 | 0.183 | 0.253 |
| DeepLabV3 | 0.744 | 0.433 | 0.331 | 0.538 |
| DANet | 0.774 | 0.451 | 0.348 | 0.541 |
| CCNet | 0.803 | 0.524 | 0.482 | 0.582 |
| OCNet | 0.798 | 0.502 | 0.423 | 0.559 |
| Unet + DCAM | 0.878 | 0.547 | 0.503 | 0.671 |
Table 2 shows the objective indicator statistics for FCN, SegNet, DeepLabV3, DANet, OCNet, CCNet, and Unet + DCAM on the test data. The comparison shows that Unet + DCAM is significantly superior to the other algorithms in terms of PA, MPA, MIoU, and Kappa.