Labelling, in the context of machine learning, is the process of associating meaningful, descriptive tags, commonly referred to as labels or annotations, with input data (Mjolsness and DeCoste 2001). These labels serve as the ground truth, or target output, for a machine learning model during training. From the labelled data the model learns to identify patterns, enabling it to make accurate predictions or classifications on new, unseen data (Dietterich 2002). Traditionally, labelling has been carried out manually or semi-automatically, with human annotators or domain experts assigning labels to individual data instances by hand. These approaches are time-consuming and labour-intensive, particularly for large datasets. Automated labelling techniques address this challenge by offering faster, more efficient processing, allowing machine learning models to be trained on larger volumes of data. Unlike human annotators, who may introduce inconsistencies due to varying interpretations or biases, automated labelling ensures greater consistency across the entire dataset, leading to more reliable model training (Ng et al. 2001; Sebastian 1998; Russakovsky et al. 2015).
Labelling is particularly important in geospatial datasets, where it underpins the effectiveness and dependability of geospatial analyses. Accurate labels ensure that an algorithm learns from precise, representative examples, improving its ability to generalize and to make reliable predictions on unseen data. Labelled geospatial data is indispensable for a wide range of applications, from training machine learning models to supporting decision-making in domains such as emergency response, navigation, infrastructure planning, and environmental monitoring. The quality and precision of the labels directly determine the reliability and effectiveness of the resulting analyses and applications.
The use of Unmanned Aerial Vehicles (UAVs), commonly known as drones, has risen significantly in geospatial applications, and this trend is projected to continue. UAVs offer a cost-effective alternative to manned aerial surveys or satellite imagery, particularly for smaller-scale projects or areas that require frequent updates. Processing the ultra-high-resolution images obtained from UAVs involves multiple steps aimed at extracting meaningful information and generating accurate outputs. The success of the classification process depends on the quality of the training data, the appropriateness of the chosen model, and careful parameter tuning throughout the workflow.
The quality of the training data is paramount to the success of machine learning models applied to UAV data. Training data is the foundation from which a model learns patterns and relationships, and its quality directly affects the model's accuracy, generalization, and ability to make meaningful predictions. High-quality training data ensures that the model learns precise representations of the land cover or land use classes present in the UAV imagery. Accurate annotations and labels enable the model to differentiate between classes, improving classification accuracy and, in turn, decision-making. Automated labelling tools rooted in machine learning methodologies have made it possible to automate much of this process (Aksoy et al. 2012; Ghamisi and Benediktsson 2014).
The Segment Anything Model (SAM) is an AI platform that offers automated labelling capabilities. Developed by Meta AI, SAM serves as a foundation model for image segmentation (Kirillov et al. 2023). It has been trained on a vast dataset of images and masks, enabling it to segment a wide variety of objects accurately. SAM builds on deep learning techniques such as Convolutional Neural Networks (CNNs) and transformers, an architecture originally developed for natural language processing; the attention mechanism in its architecture allows the model to learn relationships between objects in images and to focus on the relevant parts of an image when making predictions. One of SAM's notable strengths is zero-shot learning: it adapts to diverse prompts, including clicks, bounding boxes, and textual descriptions, without explicit training on the specific data, instead leveraging knowledge from previously learned tasks and concepts (Palatucci et al. 2009). By contrast, one-shot learning aims to train models with minimal data exposure, ideally a single instance per class, for accurate classification or task recognition (Duan et al. 2017). SAM's segmentation ability is supported by self-supervised learning, in which a model learns from the intrinsic structure and relationships within the data itself, eliminating the need for manually labelled examples (Shurrab and Duwairi 2022); in SAM's case, the image encoder is pre-trained by reconstructing masked portions of images. Although initially developed for segmenting RGB datasets, SAM's accurate results have led to its adoption for geospatial datasets as well (Wang et al. 2023). The architecture of SAM consists of an image encoder, a prompt encoder, and a mask decoder; the image encoder is a Masked Autoencoder (MAE) pre-trained Vision Transformer (ViT) that handles high-resolution images effectively (Kirillov et al. 2023). Together, SAM's automated labelling capabilities and its adaptability to different prompts make it a valuable tool for a wide range of image segmentation applications.
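As an illustration of this prompt-driven workflow, the sketch below queries SAM with a single foreground click through Meta AI's segment-anything package; the checkpoint file name, image path, and click coordinates are placeholders for illustration only.

```python
# Minimal sketch of prompt-based segmentation with SAM, assuming the
# segment-anything package and a locally downloaded ViT-H checkpoint
# (file names and coordinates here are placeholders).
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H variant of SAM from a local checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image of shape (H, W, 3).
image = cv2.cvtColor(cv2.imread("uav_tile.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground click (label 1 = foreground).
point = np.array([[500, 375]])
label = np.array([1])
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,  # return several candidate masks
)
best_mask = masks[np.argmax(scores)]  # boolean (H, W) mask
```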
The Vision Transformers (ViTs) employed in SAM process high-resolution images more efficiently than Convolutional Neural Networks (CNNs), which frequently encounter memory constraints and heavy computational burdens (Dosovitskiy et al. 2020). ViTs leverage self-attention mechanisms to capture long-range dependencies across pixels, enabling the model to prioritize crucial segments of an input sequence by modelling the relationships between its elements (Vaswani et al. 2017). Consequently, SAM proves highly suitable for processing high-resolution geospatial datasets.
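Since the argument above hinges on self-attention, a minimal NumPy sketch of the scaled dot-product formulation from Vaswani et al. (2017) may help; this is a generic illustration, not SAM's actual implementation, and all array names are hypothetical.

```python
# Generic sketch of scaled dot-product self-attention (Vaswani et al. 2017);
# illustrative only, not SAM's actual implementation.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (n_tokens, d_model); w_q/w_k/w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # Attention scores: each token scores its relation to every other token.
    scores = q @ k.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    # Each output token mixes all value vectors, so long-range
    # dependencies are captured in a single step.
    return weights @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 64))            # e.g. 16 image-patch embeddings
w = [rng.normal(size=(64, 32)) for _ in range(3)]
out = self_attention(tokens, *w)              # result: (16, 32)
```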
Prominent earlier work applying SAM in remote sensing tested it across multiple datasets using input prompts such as bounding boxes, individual points, and text descriptors (Osco et al. 2023). Another study evaluates SAM's efficacy for image segmentation in agricultural and urban green space (UGS) contexts (Gui et al. 2024).
In our geospatial analysis, we relied on SamGeo, a module that wraps SAM with geospatial functionality and serves as a crucial component of our workflow. The module offers a selection of models, namely ViT-H, ViT-L, and ViT-B, each with distinct computational and architectural characteristics. Of these, ViT-H SAM stands out as the most capable, and we therefore adopted it in our tests to exploit SamGeo's capabilities fully (Kirillov et al. 2023).
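A minimal usage sketch of SamGeo's automatic mask generation follows, assuming the segment-geospatial package; the file paths and checkpoint name are placeholders.

```python
# Minimal sketch of automatic mask generation with SamGeo
# (segment-geospatial package); file paths are placeholders.
from samgeo import SamGeo

sam = SamGeo(
    model_type="vit_h",                 # ViT-H variant used in our tests
    checkpoint="sam_vit_h_4b8939.pth",  # local SAM checkpoint
)

# Segment a georeferenced UAV orthomosaic; each detected object is
# written to a single-band GeoTIFF with a unique integer per mask.
sam.generate("uav_orthomosaic.tif", output="segments.tif", unique=True)
```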
In this study, the training pixels produced by the SAM pipeline are used to train well-known machine learning models, namely Random Forest, Support Vector Machines (SVM), and XGBoost, and the classification outcomes of these algorithms, each trained on the SAM-labelled datasets, are compared.
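A sketch of this comparison is given below, assuming the SAM-derived labels have already been flattened into a feature matrix X (pixels by spectral bands) and a label vector y; the synthetic arrays stand in for real UAV data and all variable names are hypothetical.

```python
# Sketch: comparing pixel classifiers trained on SAM-derived labels.
# X and y are synthetic placeholders standing in for real UAV data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))     # 1000 pixels x 4 spectral bands
y = rng.integers(0, 3, size=1000)  # 3 land cover classes from SAM masks

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM": SVC(kernel="rbf"),
    "XGBoost": XGBClassifier(n_estimators=200, random_state=42),
}

# Fit each model on the SAM-labelled pixels and report test accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```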