A well-established method for addressing the class imbalance problem is resampling the data (Chawla et al., 2002; Ghorbani & Ghousi, 2020; Lin et al., 2023). The objective of resampling is to ensure that the samples used in the model closely resemble the population they originate from, facilitating accurate inferences about the true population from the sample (Rajer-Kanduč et al., 2003). Resampling encompasses two main approaches: undersampling, which involves removing observations from the majority class to align with the minority class, and oversampling, which entails adding more observations to the minority class (An & Suh, 2020; Xuan et al., 2018). Both undersampling and oversampling techniques aim to adjust the class ratio in an imbalanced dataset for modelling purposes.
The subsequent section reviews the literature on the undersampling and oversampling techniques that have gained prominence over the years. Special attention will be given to their effectiveness in addressing class imbalances, their application in various domains, and their impact on model performance and generalization. The literature to follow is illustrative rather than exhaustive. The intention is to explore the use and applicability of undersampling and oversampling techniques in machine learning applications that have been in the ascendancy in the literature.
3.1 Undersampling Majority Class
Undersampling is a technique whereby the number of observations in the majority class is reduced to match the minority class (An & Suh, 2020; Dal Pozzolo et al., 2014; Zuech et al., 2021). Figure 1 presents a diagrammatic illustration of the undersampling technique, with 12,000 observations in the majority class and 2,000 in the minority class. A model built with this sample distribution will be biased towards the majority class, because majority class observations are more likely to be fed into the algorithm repeatedly compared with minority class observations (An & Suh, 2020; Hernandez et al., 2013; Santos et al., 2018). As shown in Fig. 1, to create a balanced dataset, the data is resampled by drawing a random sample of 2,000 observations from the majority class to match the 2,000 observations of the minority class (An & Suh, 2020).
Undersampling is often favoured when the minority class contains enough observations to allow a balanced 50–50 split while maintaining a representative sample. This approach also helps data mining algorithms analyze the data within manageable computational limits (Dal Pozzolo et al., 2014; Fujiwara et al., 2020). With undersampling, majority class observations are discarded at random until the distribution of the data is more balanced. Balancing the minority and majority classes allows the classifier to weigh both classes equally and produce more representative results (Dal Pozzolo et al., 2014; Hernandez et al., 2013; Yap et al., 2014). Several undersampling techniques are commonly utilized in the literature, including random undersampling, Tomek links, NearMiss, and Condensed Nearest Neighbour (CNN). Below, we examine the existing literature on these undersampling techniques.
3.1.1 Random undersampling
Random undersampling (RUS) is a technique used in machine learning to address the issue of imbalanced datasets, where the majority class significantly outnumbers the minority class (Kamei et al., 2007; Saripuddin et al., 2021; Zuech et al., 2021). This approach involves randomly selecting a subset of samples from the majority class and removing them from the dataset to achieve a more balanced class distribution. Practitioners can implement RUS by specifying either a desired ratio of majority to minority class samples or a fixed number of samples to be removed from the majority class. RUS is simple to implement and computationally time- and resource-efficient (Zuech et al., 2021).
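For illustration, the following is a minimal sketch of how RUS might be applied in practice using the open-source imbalanced-learn Python library (the library choice and the synthetic dataset are illustrative assumptions, not drawn from the studies reviewed here); the generated class counts roughly mirror the 12,000/2,000 split in Fig. 1.

```python
# Minimal sketch of random undersampling (RUS) with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced data: roughly 12,000 majority vs 2,000 minority.
X, y = make_classification(n_samples=14_000, weights=[6 / 7], random_state=42)
print(Counter(y))  # approximate counts, e.g. {0: ~12000, 1: ~2000}

# sampling_strategy=1.0 requests a balanced 1:1 split; smaller values
# (e.g. 0.5) keep proportionally more majority samples.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))  # both classes now have roughly 2,000 samples
```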
Recent studies on RUS span various domains, each exploring the technique's effectiveness in addressing class imbalance in a particular setting. Some researchers have employed RUS to improve classification performance in anomaly detection (Huan et al., 2020; Saripuddin et al., 2021; Y. Yang et al., 2023). These studies underscore the versatility of RUS in detecting anomalies and improving the performance of machine learning models across domains. Others have utilized RUS to address the challenges of class imbalance in cybersecurity datasets (Bagui & Li, 2021; Silva et al., 2021; Zuech et al., 2021). By employing RUS, these researchers were able to enhance detection accuracy for cybersecurity threats, making the models more reliable and efficient at identifying potential threats within vast amounts of normal traffic data. Another area where RUS has performed very well is predictive modeling and disease diagnosis in the healthcare industry (Bauder & Khoshgoftaar, 2018). Researchers have applied RUS to Medicare fraud detection, showing that it significantly boosts classification accuracy (Hancock et al., 2022). Pias et al. (2023) explore the use of RUS to balance datasets for predicting diabetes and prediabetes in patients; the authors found that RUS achieved robust results and enhanced the fairness and performance of machine learning models (Pias et al., 2023). Others have noted that a 50:50 RUS split does not produce the best Medicare fraud detection results; rather, a 90:10 class distribution offered the best detection (Bauder & Khoshgoftaar, 2018). These findings suggest that a fully balanced RUS split may not be the best configuration for fraud prediction in the healthcare industry. That aside, these studies collectively underscore the significance of RUS in enhancing the performance of machine learning models in a variety of domains.
3.1.2 Tomek links
The Tomek links undersampling (TLUS) method balances imbalanced datasets by removing observations located at the boundary between two classes. The basic idea behind Tomek links is to identify pairs of observations, one from the majority class and one from the minority class, that are each other's nearest neighbours but belong to different classes (Alamri & Ykhlef, 2024; Devi et al., 2017). These pairs are called Tomek links (Devi et al., 2017, p. 3). TLUS works by eliminating the majority class instance from each Tomek link. Removing these majority class instances can effectively enhance the separation between classes and boost the efficiency of classification systems. TLUS focuses on the ambiguous cases that are prone to causing misclassification and eliminates them to provide a more distinct decision boundary between the classes. TLUS aims to improve the overall balance of the dataset while preserving the minority class instances that are farthest from the majority class.
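As a concrete sketch, imbalanced-learn provides a TomekLinks implementation; the example below (an illustrative assumption, using synthetic data with deliberate label noise so that links exist near the boundary) removes only the majority class member of each link.

```python
# Minimal sketch of Tomek links undersampling (TLUS) with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

# flip_y injects label noise so that cross-class nearest-neighbour
# pairs (Tomek links) appear near the decision boundary.
X, y = make_classification(n_samples=5_000, weights=[0.9], flip_y=0.05,
                           random_state=42)

# sampling_strategy='majority' deletes only the majority-class member of
# each Tomek link, preserving all minority instances.
tl = TomekLinks(sampling_strategy='majority')
X_res, y_res = tl.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # only boundary points are removed
```

Note that, unlike RUS, TLUS does not enforce a target ratio: it removes only as many majority instances as there are links, so the resulting dataset is cleaner rather than fully balanced.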
Research on TLUS to address class imbalance spans a variety of domains and demonstrates the versatility and effectiveness of the technique in enhancing the performance of machine learning models across different applications. Some studies propose an approach combining one-class SVM for anomaly detection with adapted TLUS pairs to eliminate redundant and overlapping cases and address data imbalance issues (Basit et al., 2022; Devi et al., 2017; Vuttipittayamongkol et al., 2021).
Interestingly, the findings suggest that class overlap has a more detrimental effect on performance than class imbalance. That said, there is evidence that TLUS has been very effective in addressing imbalance and overlap issues in medical datasets (Basit et al., 2022). Others have used TLUS in multi-label classification to remove boundary and noise samples from datasets (Ai-jun & Peng, 2020; Pereira et al., 2020). Like Devi et al. (2019), these studies address imbalance by selectively removing overlapping samples between classes to improve the performance of multi-label classification (Ai-jun & Peng, 2020; Pereira et al., 2020). Others have used TLUS in bioinformatics and medical diagnostics to improve machine learning model performance on imbalanced datasets (Ning et al., 2022; Zeng et al., 2016). The authors note that TLUS, in combination with the synthetic minority oversampling technique (SMOTE), enhances classification accuracy across different metrics, demonstrating the benefits of combining these resampling techniques for medical data classification (Zeng et al., 2016). These studies illustrate a growing interest in using TLUS to address the intertwined challenges of class imbalance and overlap across different applications.
3.1.3 NearMiss
NearMiss is an undersampling method that balances imbalanced datasets by removing majority class observations that lie closest to the minority class. NearMiss is a k-nearest-neighbour approach that balances the class distribution by selecting instances based on the distance between the majority class and the minority class (Mqadi et al., 2021, pp. 3–4; Oladunni et al., 2021, p. 3). When majority class instances lie too close to minority class instances, NearMiss removes majority class instances in order to increase the separation between the classes (Ha & Lee, 2016, p. 2). Three variants are used to find the closest instances of the majority class: NearMiss-1, NearMiss-2, and NearMiss-3. NearMiss-1 selects majority class instances whose average distance to the three closest minority class instances is smallest; NearMiss-2 selects majority class instances whose average distance to the three farthest minority class instances is smallest; and NearMiss-3 selects, for each minority class instance, a given number of the closest majority class instances (Ha & Lee, 2016, p. 2). What the NearMiss family has in common is that each variant retains the majority class instances closest to the minority class so that the decision boundary in the data can be better learned (p. 2).
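The three variants can be sketched with imbalanced-learn's NearMiss implementation (an illustrative assumption; the version parameter selects NearMiss-1, -2, or -3, and n_neighbors corresponds to the three closest or farthest instances in the definitions above):

```python
# Minimal sketch of the NearMiss-1/2/3 variants with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=42)

for version in (1, 2, 3):
    nm = NearMiss(version=version, n_neighbors=3)
    X_res, y_res = nm.fit_resample(X, y)
    print(f"NearMiss-{version}:", Counter(y_res))  # roughly 1:1 by default
```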
The literature on NearMiss undersampling is sparse, with only a few studies exploring its effectiveness across various domains. Bao et al. (2016) proposed boosted NearMiss undersampling on SVM ensembles (BNU-SVMs) for concept detection in large-scale imbalanced datasets. The authors found that BNU-SVMs can effectively manage large-scale imbalanced datasets by balancing and reducing the training dataset through undersampling, and that integrating multiple classifiers enhances performance. Other studies have employed NearMiss undersampling to address imbalances in financial crime datasets (Mqadi et al., 2021; Rubaidi et al., 2022); these studies found that machine learning algorithms performed very well using the NearMiss undersampling technique. NearMiss undersampling has also been used in detecting insider threats, specifically data leakage by malicious insiders prior to leaving an organization (Alsowail, 2022). The author found that NearMiss undersampling improved the detection of insider data leakage. Further research in healthcare has found NearMiss to be a promising undersampling technique in bioinformatics and in predicting disease with high accuracy (Alamsyah et al., 2022; Nayan et al., 2023).
3.1.4 Condensed Nearest Neighbor (CNN)
The Condensed Nearest Neighbor (CNN) undersampling algorithm operates by systematically removing redundant majority class instances that are correctly classified by their nearest neighbour in the condensed set, while preserving the essential information in the dataset. The algorithm follows two main steps:
- Initially, a random subset of the majority class samples is selected to form the "condensed set."
- The algorithm then iteratively discards majority class samples that are correctly classified by their nearest neighbour in the condensed set, moving misclassified samples into the condensed set instead. This procedure repeats until every remaining majority class sample is correctly classified by the condensed set.
The CNN algorithm employs the 1-NN rule, which dictates that all minority class instances are assigned to set S, one majority class instance is placed in set S, and the remaining majority class instances are allocated to set C (Chaplot et al., 2019, p. 95). Each sample from set C is individually assessed using the 1-NN rule (p. 95). If correctly classified, the sample is discarded; otherwise, it is moved to set S (p. 95). This process repeats until all instances in C have been evaluated using the 1-NN rule (p. 95). The underlying concept behind CNN is that instances correctly classified by their nearest neighbor are uninformative for establishing the decision boundary between classes and can be safely removed without compromising classification performance (Batista et al., 2004; Chaplot et al., 2019; Devi et al., 2017).
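A minimal sketch of this procedure, using imbalanced-learn's CondensedNearestNeighbour implementation (an illustrative assumption rather than the exact formulation of Chaplot et al.), is shown below:

```python
# Minimal sketch of CNN undersampling with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=42)

# n_neighbors=1 applies the 1-NN rule described above; random_state fixes
# which majority sample seeds the condensed set S.
cnn = CondensedNearestNeighbour(n_neighbors=1, random_state=42)
X_res, y_res = cnn.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # redundant majority samples removed
```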
While CNN has seen limited application, a few studies provide evidence of its effectiveness in addressing class imbalance. Bansal and Jain (2021) employed CNN to balance the number of instances between two classes by undersampling the majority class according to specific criteria, and found that CNN was the best performer among the techniques compared (Bansal & Jain, 2021). Batista et al. (2004) tested CNN along with other techniques across thirteen datasets and found that class imbalance by itself does not hinder the performance of learning systems (Batista et al., 2004); the problems appear to be rooted in learning from too few minority class examples in the presence of other complicating factors such as class overlap (p. 20). In another study, Xie et al. (2021) applied CNN and other undersampling techniques to 40 public benchmark datasets. The study found that CNN eliminates noisy or borderline instances from the majority class, which can benefit learning models dealing with imbalanced data. Nevertheless, even when undersampling techniques are used, a significant disparity between the majority and minority classes may still negatively impact the performance of the learning process (Xie et al., 2021, p. 8).
Each undersampling strategy has its own distinct advantages in mitigating class imbalances in machine learning applications. RUS has exhibited adaptability and enhanced effectiveness in multiple areas, such as anomaly detection, cybersecurity, and healthcare, and its implementation has been recognized for improving machine learning models. TLUS, by contrast, specifically targets ambiguous observations between different classes; it has shown potential in medical diagnostics and multi-label classification, where it has been successfully utilized to enhance model performance by removing irrelevant information and resolving overlaps in the data. NearMiss enhances the distinction between classes by deliberately eliminating the majority class instances closest to the minority class, and has been employed to improve machine learning models in financial crime detection and healthcare. Although CNN and the other undersampling approaches show potential for addressing class imbalance problems, the persistence of substantial disparities between the majority and minority classes highlights the ongoing difficulty of reaching optimal performance across varied datasets. The next section turns to oversampling techniques and how they have been used to address class imbalances in datasets.
3.2 Oversampling Minority Class
Random oversampling of the minority class is done by duplicating instances and thereby increasing their representation in the dataset (More, 2016, pp. 5–8). Referring back to the fraud (2%) and non-fraud (98%) example above, the data are heavily skewed towards the non-fraud class. Because only 2% of the observations belong to the fraud class, the model will mostly train on the 98% belonging to the non-fraud class. One way of oversampling is to generate new instances for the minority class by sampling with replacement (An & Suh, 2020; Chawla et al., 2002). Figure 2 presents a diagrammatic illustration of random oversampling. Note from Fig. 2 that the 2,000 fraud transactions were duplicated six times to balance the data. Oversampling is preferred when there is an abundance of data for the majority class but only rare events for the minority class (Elreedy & Atiya, 2019; Yap et al., 2014).
3.2.1 Random Oversampling
Random oversampling (ROS) is a method used in handling imbalanced datasets, where instances from the minority class are duplicated randomly to augment the dataset until it reaches the desired ratio or balance with the majority class (Elreedy & Atiya, 2019; Nayan et al., 2023). Unlike other techniques, ROS does not consider the similarity and characteristics of the data points and simply duplicates the instances without considering the relevant features in the dataset. ROS typically involves the following steps:
- Determining the number of samples in the minority class.
- Randomly selecting instances from the minority class and duplicating them to increase the overall count of minority class samples.
- Continuing the duplication process until the number of minority class samples matches that of the majority class.
While ROS can help address class imbalance, it can also lead to overfitting, especially when the same observations are replicated many times in the training data.
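As a minimal sketch (using the imbalanced-learn library on synthetic data, both illustrative assumptions), ROS can be applied as follows; resampling only the training split is one common way to limit the overfitting risk just noted, since duplicating before the split would leak copies of the same observation into the test set.

```python
# Minimal sketch of random oversampling (ROS) with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Roughly the 2% fraud / 98% non-fraud example above.
X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=42)

# Split first, then oversample only the training portion.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

ros = RandomOverSampler(random_state=42)  # default: duplicate to a 1:1 ratio
X_res, y_res = ros.fit_resample(X_tr, y_tr)
print(Counter(y_tr), '->', Counter(y_res))
```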
Studies that employ ROS have added other techniques to enhance model performance. In one study, the issue of imbalanced data in binary text classification is tackled through the introduction of distributional random oversampling (Moreo et al., 2016). This method utilizes the distributional hypothesis, which posits that the meaning of a feature is shaped by its distribution across extensive corpora, to create synthetic minority-class documents. The results suggest that distributional random oversampling enhances the accuracy of classification algorithms by creating balanced datasets. Others have used ROS to address imbalances in multilabel datasets, with very good results across different classification measures (Charte et al., 2015). In another variation of ROS, Zhao et al. (2016) address class imbalance through stratified random oversampling; the study found that the proposed stratified method effectively generates balanced and diverse training datasets (Zhao et al., 2016). Others have introduced the random walk oversampling approach, which balances the minority and majority classes by creating synthetic samples through random walks from the real data (Zhang & Li, 2014). The study found that random walk oversampling performs statistically much better than alternative methods on imbalanced datasets when implementing common baseline algorithms (p. 99).
3.2.2 SMOTE
SMOTE is an oversampling technique that creates synthetic samples for the minority class by interpolating between existing minority samples. SMOTE enables researchers to use synthetic elements to rebalance underrepresented data and is one of the most effective techniques to address imbalanced datasets (Almhaithawi et al., 2020; Branco et al., 2017; Chawla et al., 2002). As shown in Fig. 3, SMOTE uses the k nearest neighbours of each data point to create synthetic samples from the 2,000 fraud instances (Branco et al., 2017, p. 18). Instead of randomly oversampling the data with replacement, SMOTE takes "each minority class sample and introduces synthetic examples… joining any/all of the k minority class nearest neighbors" (Chawla et al., 2002, p. 327). Depending on the number of instances needed to balance the data, SMOTE randomly chooses minority samples and generates synthetic data points along the segments connecting them with their nearest neighbours (Srinilta & Kanharattanachai, 2021; Sun & Chen, 2021).
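A minimal sketch with imbalanced-learn's SMOTE implementation (the library and the synthetic fraud-like dataset are illustrative assumptions) makes the interpolation parameters explicit:

```python
# Minimal sketch of SMOTE with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Roughly the 2% fraud / 98% non-fraud example above.
X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=42)

# k_neighbors=5 (the default) means each synthetic point is interpolated
# between a minority sample and one of its five nearest minority neighbours.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # minority class synthetically grown to 1:1
```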
Since its publication in 2002, SMOTE has proven successful in a variety of applications across several domains (Fernandez et al., 2018). SMOTE has inspired several approaches to addressing class imbalance and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others (p. 863). SMOTE has been used effectively with machine learning algorithms to address class imbalance in the financial crime domain, with excellent results (Lokanan, 2023; Lokanan & Sharma, 2022). Others have used SMOTE to address class imbalances within high-dimensional datasets (Maldonado et al., 2019; Tiwari et al., 2018). Maldonado et al. (2019) propose a modified version of SMOTE designed for high-dimensional binary scenarios, such as natural language processing; the modification uses a new distance metric that focuses solely on the most important features when generating synthetic observations. Similarly, Tiwari et al. (2018) apply SMOTE to investigate the effect of various resampling ratios on observed and absent peptides in protein mass spectrometry data. Both studies found that class balance greatly improves the performance of machine learning models. Others have studied how closely the distribution of the patterns generated by SMOTE, for varying numbers of neighbours, matches the original distribution, finding that SMOTE performs better on large than on small datasets (Elreedy & Atiya, 2019).
3.2.3 SVMSMOTE
The Support Vector Machine Synthetic Minority Oversampling Technique (SVMSMOTE) is an oversampling method designed for imbalanced datasets; its goal is to mitigate the overfitting problems associated with standard SMOTE (AlJame et al., 2021, pp. 4–5). SVMSMOTE combines the SMOTE algorithm with Support Vector Machines (SVM) to generate synthetic samples that are less similar to the original minority class instances (Krayem et al., 2021; Nguyen et al., 2011). An SVM classifier is first trained on the original samples to locate the decision boundary, which is then used to identify "safe" and "borderline" regions within the minority class distribution. Synthetic samples are subsequently generated by interpolating between instances identified as "safe," thereby reducing the risk of overfitting.
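A minimal sketch using imbalanced-learn's SVMSMOTE implementation (an illustrative assumption; the internal SVM is fitted automatically) is shown below:

```python
# Minimal sketch of SVMSMOTE with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=42)

# An SVM is trained internally to locate the decision boundary;
# m_neighbors governs the safe/borderline test around each minority
# support vector, and k_neighbors governs the interpolation step.
svmsmote = SVMSMOTE(k_neighbors=5, m_neighbors=10, random_state=42)
X_res, y_res = svmsmote.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```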
SVMSMOTE has been widely used across various domains to address class imbalance with enhanced performance. Studies have utilized SVMSMOTE to address class imbalances in datasets pertaining to medicine, education, and cancer research. Researchers have used SVMSMOTE in diagnostic testing and in predicting prostate cancer from multiparametric data, with outstanding results (Barlow et al., 2019; Bertelli et al., 2022). Sujitha and Paramasivan (2023) employed SVMSMOTE to predict the stages of lung disease with enhanced performance (Sujitha & Paramasivan, 2023). Others have used SVMSMOTE to predict student performance in multi-class educational datasets, reporting better classification performance (Ghorbani & Ghousi, 2020; Tariq et al., 2023). Collectively, these studies show that SVMSMOTE can improve model performance and correct uneven data distributions in datasets from various fields.
3.2.4 SMOTETomek
SMOTETomek is a fusion of SMOTE and Tomek links, aimed at balancing imbalanced datasets while reducing noise. The process first applies the SMOTE algorithm to oversample the minority class and then removes Tomek links to refine the dataset (Z. Wang et al., 2019). This second step eliminates overlapping and noisy observations, thereby enhancing the quality of the dataset and the performance of the machine learning models. By combining the strengths of SMOTE in generating synthetic samples and Tomek links in cleaning the dataset, SMOTETomek creates a more balanced and refined dataset for machine learning tasks.
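The two-step process can be sketched with imbalanced-learn's combined sampler (an illustrative assumption; the smote and tomek components can also be configured individually):

```python
# Minimal sketch of SMOTETomek with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# flip_y adds label noise so the Tomek-link cleaning step has work to do.
X, y = make_classification(n_samples=5_000, weights=[0.9], flip_y=0.05,
                           random_state=42)

# Internally: SMOTE oversamples the minority class first, then Tomek
# links are removed to clean overlapping observations.
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # near-balanced, minus removed links
```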
SMOTETomek has been widely utilized in various domains to address imbalanced datasets and enhance model performance. In healthcare, SMOTETomek has been applied to build models for disease prediction, such as diabetes and hypertension, with improved accuracy and sensitivity (Fitriyani et al., 2019). Cancer research studies have also used SMOTETomek to address skewed data and accurately predict high-risk prostate and cervical cancer (Boratto et al., 2022; Ijaz et al., 2020; Lin et al., 2023; Tanimu et al., 2022). In computer science, SMOTETomek has been used for recommender systems and for predicting software bugs, demonstrating its applicability in a variety of settings and its effectiveness at correcting class imbalances (Arif et al., 2024; Boratto et al., 2022). SMOTETomek has also been used effectively to address severe sample distribution imbalances in personality recognition datasets (Z. Wang et al., 2019). These studies collectively underscore the utility of SMOTETomek across domains in addressing class imbalance in a wide range of datasets.
3.2.5 K-MeansSMOTE
K-MeansSMOTE is an oversampling technique designed to address class imbalance in datasets by generating synthetic samples for the minority class using k-means clustering. This method integrates the k-means clustering algorithm with SMOTE, executed in three distinct steps: clustering, filtering, and oversampling (Chen & Zhang, 2021; Douzas et al., 2018). Initially, the algorithm employs k-means clustering to partition the input space into clusters. The clusters are then filtered so that only those containing a sufficiently high proportion of minority class samples are selected for oversampling. Subsequently, synthetic samples are created for each selected cluster by interpolating feature values from the minority class samples within the cluster. These synthetic instances are then added to the original dataset, resulting in a rebalanced dataset (De & Prabu, 2022).
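The three steps map directly onto imbalanced-learn's KMeansSMOTE implementation, sketched below (an illustrative assumption; cluster_balance_threshold controls the filtering step and may need tuning per dataset, since the sampler raises an error when no cluster passes the filter):

```python
# Minimal sketch of K-MeansSMOTE with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=5_000, weights=[0.9],
                           n_clusters_per_class=2, random_state=42)

# Filtering step: only clusters whose minority share exceeds the
# threshold are selected for SMOTE-style interpolation.
kms = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=42)
X_res, y_res = kms.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```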
K-MeansSMOTE has proven to be an effective oversampling method and has improved model performance across various classification datasets (Chen & Zhang, 2021; Xu et al., 2021). In medical science, K-MeansSMOTE has been used to balance data effectively: in a recent study, the authors applied K-MeansSMOTE to eight UCI medical datasets with excellent classification scores (Xu et al., 2021). In other settings, such as predicting credit default, K-MeansSMOTE has proven very effective in addressing class imbalance with enhanced performance (Alam et al., 2020; Chen & Zhang, 2021; Srinivasan et al., 2024). K-MeansSMOTE has also been used to handle imbalance when classifying financially distressed companies, showing improved performance across various classification metrics (Aljawazneh et al., 2021). Another domain where K-MeansSMOTE has effectively addressed class imbalance and enhanced performance is churn prediction (De & Prabu, 2022).
3.2.6 SMOTE + ENN
SMOTE + ENN is an oversampling technique devised to rectify imbalanced datasets by combining SMOTE with the edited nearest neighbour (ENN) method. The technique operates in two sequential steps. First, SMOTE is applied to produce synthetic samples for the minority class: it selects a minority class sample along with its k nearest neighbours (k-NN), then creates new synthetic samples by interpolating the feature values between the chosen sample and its neighbours. Second, the ENN method is utilized to eliminate majority class samples that are misclassified by a k-NN classifier; ENN detects these misclassified samples and removes them from the dataset. By combining these two steps, SMOTE + ENN simultaneously generates new synthetic samples for the minority class and eliminates misclassified majority class samples (Lokanan, 2023; Sisodia et al., 2017; F. Yang et al., 2022).
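A minimal sketch of the two steps, using imbalanced-learn's SMOTEENN combined sampler (an illustrative assumption; its default ENN edits with three nearest neighbours), is shown below:

```python
# Minimal sketch of SMOTE + ENN with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# flip_y adds label noise for the ENN cleaning step to remove.
X, y = make_classification(n_samples=5_000, weights=[0.9], flip_y=0.05,
                           random_state=42)

# Step 1: SMOTE synthesizes minority samples.
# Step 2: ENN deletes samples misclassified by their nearest neighbours.
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # balanced, then edited down
```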
SMOTE + ENN has proven effective in balancing the class distribution and enhancing classification performance. Studies have found that SMOTE + ENN consistently yields superior outcomes compared to most oversampling methods (Batista et al., 2004; Singh et al., 2022; Sisodia et al., 2017). SMOTE + ENN has been extensively used to address class imbalances in medical data classification (Lamari et al., 2021). In healthcare, it has been heavily utilized for early detection tasks, such as predicting septic shock onset and diagnosing missed abortion, with enhanced diagnostic accuracy (Xu et al., 2020; F. Yang et al., 2022). Others have used SMOTE + ENN in predicting Parkinson's disease and chronic heart failure with very good classification results (Keller & Pandey, 2021; K. Wang et al., 2021). In the financial crimes arena, SMOTE + ENN has been effectively used to predict fraud with enhanced performance (Lokanan, 2023; Mienye & Sun, 2023). These studies found that SMOTE + ENN was excellent at balancing the datasets and enhancing the robustness of predictive models.
3.2.7 ADASYN
ADASYN, short for Adaptive Synthetic Sampling, is an oversampling technique specifically designed to address class imbalances in datasets. The algorithm generates synthetic samples for the minority class, adaptively adjusting the density of synthetic samples according to the difficulty of the classification problem (Fernandez et al., 2018, p. 870). The basic idea behind ADASYN is to generate more synthetic samples in regions where the decision boundary of the minority class is more complex, thereby increasing the diversity of the minority class samples (He et al., 2008). This adaptiveness makes ADASYN particularly effective in scenarios where the imbalance between classes is substantial and the classification problem is challenging.
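A minimal sketch with imbalanced-learn's ADASYN implementation (an illustrative assumption) shows the single neighbourhood parameter that drives the adaptive density estimate:

```python
# Minimal sketch of ADASYN with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=42)

# n_neighbors sets the neighbourhood used to score how "difficult" each
# minority sample is; harder regions receive more synthetic samples.
ada = ADASYN(n_neighbors=5, random_state=42)
X_res, y_res = ada.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # counts are approximately, not exactly, 1:1
```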
ADASYN oversampling has been widely applied across various domains, showcasing its effectiveness in addressing class imbalances and improving model performance. In medicine, researchers have used ADASYN to up-sample features and mitigate missing-value concerns in cervical cancer detection and breast cancer diagnosis, with exceptional accuracy (Khan et al., 2021; Kurniawati et al., 2018; Munshi, 2024). Others have employed ADASYN in fraud prediction: one study found that ADASYN proved more advantageous than the traditional SMOTE algorithm in telecom fraud identification (Lu et al., 2020). In other studies, ADASYN was used to predict insurance and credit card fraud, with enhanced effectiveness on a balanced dataset over an unbalanced one (Muranda et al., 2020; Singh et al., 2022; Subudhi & Panigrahi, 2018). ADASYN has also proven effective in customer churn prediction (Rao et al., 2024). These applications underscore the versatility and efficiency of using ADASYN to up-sample imbalanced datasets in different domains.
The foregoing review indicates that various oversampling techniques are employed to address the challenge of imbalanced datasets in machine learning, and different methods may be preferred depending on the dataset and domain. ROS balances the class distribution by duplicating minority class observations and has delivered enhanced performance across different applications. The SMOTE-based methods use SMOTE as a foundation to generate synthetic samples for the minority class and combine it with additional methods to balance the classes and enhance performance. ADASYN stands out as a distinct category by adjusting synthetic sample density based on the complexity of the decision boundary; in doing so, it addresses class imbalance by focusing on regions where classification appears to be more challenging. Each oversampling technique reviewed provides distinct advantages and can be chosen according to the specific attributes of the dataset and the classification problem being addressed.