A well-established method for addressing the class imbalance problem is resampling the data (Chawla et al., 2002; Ghorbani & Ghousi, 2020; Lin et al., 2023). The objective of resampling is to ensure that the samples used in the model closely resemble the population they originate from, facilitating accurate inferences about the true population from the sample (Rajer-Kanduč et al., 2003). Resampling encompasses two main approaches: undersampling, which involves removing observations from the majority class to align with the minority class, and oversampling, which entails adding more observations to the minority class (An & Suh, 2020; Xuan et al., 2018). Both undersampling and oversampling techniques aim to adjust the class ratio in an imbalanced dataset for modelling purposes.
The subsequent section reviews the literature on the undersampling and oversampling techniques that have gained prominence over the years. Special attention will be given to their effectiveness in addressing class imbalances, their application in various domains, and their impact on model performance and generalization. The literature to follow is illustrative rather than exhaustive. The intention is to explore the use and applicability of undersampling and oversampling techniques in machine learning applications that have been in the ascendancy in the literature.
3.1 Undersampling Majority Class
Undersampling is a technique whereby the number of observations in the majority class is reduced to match the minority class (An & Suh, 2020; Dal Pozzolo et al., 2014; Zuech et al., 2021). Figure 1 presents a diagrammatic illustration of the undersampling technique, with 12,000 observations in the majority class and 2,000 in the minority class. A model built with this sample distribution will be biased towards the majority class, because majority class observations are more likely to be fed into the algorithm repeatedly compared with minority class observations (An & Suh, 2020; Hernandez et al., 2013; Santos et al., 2018). As shown in Fig. 1, to create a balanced dataset, the data is resampled by drawing a random sample of 2,000 observations from the majority class to match the 2,000 observations of the minority class (An & Suh, 2020).
Undersampling is often favoured when the minority class contains enough observations to allow a balanced 50–50 split while maintaining a representative sample. This approach also helps data mining algorithms analyze the data within manageable computational limits (Dal Pozzolo et al., 2014; Fujiwara et al., 2020). With undersampling, majority class observations are discarded at random until the distribution of the data is more balanced. Balancing the minority and majority classes allows the classifier to weigh both classes equally and produce more representative results (Dal Pozzolo et al., 2014; Hernandez et al., 2013; Yap et al., 2014). Several undersampling techniques are commonly utilized in the literature, including random undersampling, Tomek links, NearMiss, and Condensed Nearest Neighbour (CNN). Below, we examine the existing literature on these undersampling techniques.
3.1.1 Random undersampling
Random undersampling (RUS) is a technique used in machine learning to address the issue of imbalanced datasets, where the majority class significantly outnumbers the minority class (Kamei et al., 2007; Saripuddin et al., 2021; Zuech et al., 2021). This approach involves randomly selecting a subset of samples from the majority class and removing them from the dataset to achieve a more balanced class distribution. Practitioners can implement RUS by specifying either a desired ratio of majority to minority class samples or a fixed number of samples to be removed from the majority class. RUS is simple to implement and computationally time- and resource-efficient (Zuech et al., 2021).
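For illustration, the following is a minimal sketch of how RUS might be applied in practice using the open-source imbalanced-learn Python library (the library choice and the synthetic dataset are illustrative assumptions, not drawn from the studies reviewed here); the generated class counts roughly mirror the 12,000/2,000 split in Fig. 1.

```python
# Minimal sketch of random undersampling (RUS) with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced data: roughly 12,000 majority vs 2,000 minority.
X, y = make_classification(n_samples=14_000, weights=[6 / 7], random_state=42)
print(Counter(y))  # approximate counts, e.g. {0: ~12000, 1: ~2000}

# sampling_strategy=1.0 requests a balanced 1:1 split; smaller values
# (e.g. 0.5) keep proportionally more majority samples.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))  # both classes now have roughly 2,000 samples
```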
Recent studies on RUS span various domains, each exploring the technique's effectiveness in addressing class imbalance in a particular setting. Some researchers have employed RUS to improve classification performance in anomaly detection (Huan et al., 2020; Saripuddin et al., 2021; Y. Yang et al., 2023). These studies underscore the versatility of RUS in detecting anomalies and improving the performance of machine learning models across domains. Others have utilized RUS to address the challenges of class imbalance in cybersecurity datasets (Bagui & Li, 2021; Silva et al., 2021; Zuech et al., 2021). By employing RUS, these researchers were able to enhance detection accuracy for cybersecurity threats, making the models more reliable and efficient at identifying potential threats within vast amounts of normal traffic data. Another area where RUS has performed very well is predictive modeling and disease diagnosis in the healthcare industry (Bauder & Khoshgoftaar, 2018). Researchers have applied RUS to Medicare fraud detection, showing that it significantly boosts classification accuracy (Hancock et al., 2022). Pias et al. (2023) explore the use of RUS to balance datasets for predicting diabetes and prediabetes in patients; the authors found that RUS achieved robust results and enhanced the fairness and performance of machine learning models (Pias et al., 2023). Others have noted that a 50:50 RUS split does not produce the best Medicare fraud detection results; rather, a 90:10 class distribution offered the best detection (Bauder & Khoshgoftaar, 2018). These findings suggest that a fully balanced RUS split may not be the best configuration for fraud prediction in the healthcare industry. That aside, these studies collectively underscore the significance of RUS in enhancing the performance of machine learning models in a variety of domains.
3.1.2 Tomek links
The Tomek links undersampling (TLUS) method balances imbalanced datasets by removing observations located at the boundary between two classes. The basic idea behind Tomek links is to identify pairs of observations, one from the majority class and one from the minority class, that are each other's nearest neighbours but belong to different classes (Alamri & Ykhlef, 2024; Devi et al., 2017). These pairs are called Tomek links (Devi et al., 2017, p. 3). TLUS works by eliminating the majority class instance from each Tomek link. Removing these majority class instances can effectively enhance the separation between classes and boost the efficiency of classification systems. TLUS focuses on the ambiguous cases that are prone to causing misclassification and eliminates them to provide a more distinct decision boundary between the classes. TLUS aims to improve the overall balance of the dataset while preserving the minority class instances that are farthest from the majority class.
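As a concrete sketch, imbalanced-learn provides a TomekLinks implementation; the example below (an illustrative assumption, using synthetic data with deliberate label noise so that links exist near the boundary) removes only the majority class member of each link.

```python
# Minimal sketch of Tomek links undersampling (TLUS) with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

# flip_y injects label noise so that cross-class nearest-neighbour
# pairs (Tomek links) appear near the decision boundary.
X, y = make_classification(n_samples=5_000, weights=[0.9], flip_y=0.05,
                           random_state=42)

# sampling_strategy='majority' deletes only the majority-class member of
# each Tomek link, preserving all minority instances.
tl = TomekLinks(sampling_strategy='majority')
X_res, y_res = tl.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # only boundary points are removed
```

Note that, unlike RUS, TLUS does not enforce a target ratio: it removes only as many majority instances as there are links, so the resulting dataset is cleaner rather than fully balanced.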
Research on TLUS to address class imbalance spans a variety of domains and demonstrates the versatility and effectiveness of the technique in enhancing the performance of machine learning models across different applications. Some studies propose an approach combining one-class SVM for anomaly detection with adapted TLUS pairs to eliminate redundant and overlapping cases and address data imbalance issues (Basit et al., 2022; Devi et al., 2017; Vuttipittayamongkol et al., 2021).
Interestingly, the findings suggest that class overlap has a more detrimental effect on performance than class imbalance. That said, there is evidence that TLUS has been very effective in addressing imbalance and overlap issues in medical datasets (Basit et al., 2022). Others have used TLUS in multi-label classification to remove boundary and noise samples from datasets (Ai-jun & Peng, 2020; Pereira et al., 2020). Like Devi et al. (2019), these studies address imbalance by selectively removing overlapping samples between classes to improve the performance of multi-label classification (Ai-jun & Peng, 2020; Pereira et al., 2020). Others have used TLUS in bioinformatics and medical diagnostics to improve machine learning model performance on imbalanced datasets (Ning et al., 2022; Zeng et al., 2016). The authors note that TLUS, in combination with the synthetic minority oversampling technique (SMOTE), enhances classification accuracy across different metrics, demonstrating the benefits of combining these resampling techniques for medical data classification (Zeng et al., 2016). These studies illustrate a growing interest in using TLUS to address the intertwined challenges of class imbalance and overlap across different applications.
3.1.3 NearMiss
NearMiss is an undersampling method that balances imbalanced datasets by removing majority class observations that lie closest to the minority class. NearMiss is a k-nearest-neighbour approach that balances the class distribution by selecting instances based on the distance between the majority class and the minority class (Mqadi et al., 2021, pp. 3–4; Oladunni et al., 2021, p. 3). When majority class instances lie too close to minority class instances, NearMiss removes majority class instances in order to increase the separation between the classes (Ha & Lee, 2016, p. 2). Three variants are used to find the closest instances of the majority class: NearMiss-1, NearMiss-2, and NearMiss-3. NearMiss-1 selects majority class instances whose average distance to the three closest minority class instances is smallest; NearMiss-2 selects majority class instances whose average distance to the three farthest minority class instances is smallest; and NearMiss-3 selects, for each minority class instance, a given number of the closest majority class instances (Ha & Lee, 2016, p. 2). What the NearMiss family has in common is that each variant retains the majority class instances closest to the minority class so that the decision boundary in the data can be better learned (p. 2).
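The three variants can be sketched with imbalanced-learn's NearMiss implementation (an illustrative assumption; the version parameter selects NearMiss-1, -2, or -3, and n_neighbors corresponds to the three closest or farthest instances in the definitions above):

```python
# Minimal sketch of the NearMiss-1/2/3 variants with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=42)

for version in (1, 2, 3):
    nm = NearMiss(version=version, n_neighbors=3)
    X_res, y_res = nm.fit_resample(X, y)
    print(f"NearMiss-{version}:", Counter(y_res))  # roughly 1:1 by default
```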
The literature on NearMiss undersampling is sparse, with only a few studies exploring its effectiveness across various domains. Bao et al. (2016) proposed boosted NearMiss undersampling on SVM ensembles (BNU-SVMs) for concept detection in large-scale imbalanced datasets. The authors found that BNU-SVMs can effectively manage large-scale imbalanced datasets by balancing and reducing the training dataset through undersampling, and that integrating multiple classifiers enhances performance. Other studies have employed NearMiss undersampling to address imbalances in financial crime datasets (Mqadi et al., 2021; Rubaidi et al., 2022); these studies found that machine learning algorithms performed very well using the NearMiss undersampling technique. NearMiss undersampling has also been used in detecting insider threats, specifically data leakage by malicious insiders prior to leaving an organization (Alsowail, 2022). The author found that NearMiss undersampling improved the detection of insider data leakage. Further research in healthcare has found NearMiss to be a promising undersampling technique in bioinformatics and in predicting disease with high accuracy (Alamsyah et al., 2022; Nayan et al., 2023).
3.1.4 Condensed Nearest Neighbor (CNN)
The Condensed Nearest Neighbor (CNN) undersampling algorithm operates by systematically removing redundant majority class instances that are correctly classified by their nearest neighbour in the condensed set, while preserving the essential information in the dataset. The algorithm follows two main steps:
- Initially, a random subset of the majority class samples is selected to form the "condensed set."
- The algorithm then iteratively discards majority class samples that are correctly classified by their nearest neighbour in the condensed set, moving misclassified samples into the condensed set instead. This procedure repeats until every remaining majority class sample is correctly classified by the condensed set.
The CNN algorithm employs the 1-NN rule, which dictates that all minority class instances are assigned to set S, one majority class instance is placed in set S, and the remaining majority class instances are allocated to set C (Chaplot et al., 2019, p. 95). Each sample from set C is individually assessed using the 1-NN rule (p. 95). If correctly classified, the sample is discarded; otherwise, it is moved to set S (p. 95). This process repeats until all instances in C have been evaluated using the 1-NN rule (p. 95). The underlying concept behind CNN is that instances correctly classified by their nearest neighbor are uninformative for establishing the decision boundary between classes and can be safely removed without compromising classification performance (Batista et al., 2004; Chaplot et al., 2019; Devi et al., 2017).
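A minimal sketch of this procedure, using imbalanced-learn's CondensedNearestNeighbour implementation (an illustrative assumption rather than the exact formulation of Chaplot et al.), is shown below:

```python
# Minimal sketch of CNN undersampling with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=42)

# n_neighbors=1 applies the 1-NN rule described above; random_state fixes
# which majority sample seeds the condensed set S.
cnn = CondensedNearestNeighbour(n_neighbors=1, random_state=42)
X_res, y_res = cnn.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # redundant majority samples removed
```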
While CNN has seen limited application, a few studies provide evidence of its effectiveness in addressing class imbalance. Bansal and Jain (2021) employed CNN to balance the number of instances between two classes by undersampling the majority class according to specific criteria, and found that CNN was the best performer among the techniques compared (Bansal & Jain, 2021). Batista et al. (2004) tested CNN along with other techniques across thirteen datasets and found that class imbalance by itself does not hinder the performance of learning systems (Batista et al., 2004); the problems appear to be rooted in learning from too few minority class examples in the presence of other complicating factors such as class overlap (p. 20). In another study, Xie et al. (2021) applied CNN and other undersampling techniques to 40 public benchmark datasets. The study found that CNN eliminates noisy or borderline instances from the majority class, which can benefit learning models dealing with imbalanced data. Nevertheless, even when undersampling techniques are used, a significant disparity between the majority and minority classes may still negatively impact the performance of the learning process (Xie et al., 2021, p. 8).
Each undersampling strategy has its own distinct advantages in mitigating class imbalances in machine learning applications. RUS has exhibited adaptability and enhanced effectiveness in multiple areas, such as anomaly detection, cybersecurity, and healthcare, and its implementation has been recognized for improving machine learning models. TLUS, by contrast, specifically targets ambiguous observations between different classes; it has shown potential in medical diagnostics and multi-label classification, where it has been successfully utilized to enhance model performance by removing irrelevant information and resolving overlaps in the data. NearMiss enhances the distinction between classes by deliberately eliminating the majority class instances closest to the minority class, and has been employed to improve machine learning models in financial crime detection and healthcare. Although CNN and the other undersampling approaches show potential for addressing class imbalance problems, the persistence of substantial disparities between the majority and minority classes highlights the ongoing difficulty of reaching optimal performance across varied datasets. The next section turns to oversampling techniques and how they have been used to address class imbalances in datasets.
3.2 Oversampling Minority Class
Random oversampling of the minority class is done by duplicating instances and thereby increasing their representation in the dataset (More, 2016, pp. 5–8). Referring back to the fraud (2%) and non-fraud (98%) example above, the data are heavily skewed towards the non-fraud class. Because only 2% of the observations belong to the fraud class, the model will mostly train on the 98% belonging to the non-fraud class. One way of oversampling is to generate new instances for the minority class by sampling with replacement (An & Suh, 2020; Chawla et al., 2002). Figure 2 presents a diagrammatic illustration of random oversampling. Note from Fig. 2 that the 2,000 fraud transactions were duplicated six times to balance the data. Oversampling is preferred when there is an abundance of data for the majority class but only rare events for the minority class (Elreedy & Atiya, 2019; Yap et al., 2014).
3.2.1 Random Oversampling
Random oversampling (ROS) is a method used in handling imbalanced datasets, where instances from the minority class are duplicated randomly to augment the dataset until it reaches the desired ratio or balance with the majority class (Elreedy & Atiya, 2019; Nayan et al., 2023). Unlike other techniques, ROS does not consider the similarity and characteristics of the data points and simply duplicates the instances without considering the relevant features in the dataset. ROS typically involves the following steps:
- Determining the number of samples in the minority class.
- Randomly selecting instances from the minority class and duplicating them to increase the overall count of minority class samples.
- Continuing the duplication process until the number of minority class samples matches that of the majority class.
While ROS can help address class imbalance, it can also lead to overfitting, especially when the same observations are replicated many times in the training data.
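As a minimal sketch (using the imbalanced-learn library on synthetic data, both illustrative assumptions), ROS can be applied as follows; resampling only the training split is one common way to limit the overfitting risk just noted, since duplicating before the split would leak copies of the same observation into the test set.

```python
# Minimal sketch of random oversampling (ROS) with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Roughly the 2% fraud / 98% non-fraud example above.
X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=42)

# Split first, then oversample only the training portion.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

ros = RandomOverSampler(random_state=42)  # default: duplicate to a 1:1 ratio
X_res, y_res = ros.fit_resample(X_tr, y_tr)
print(Counter(y_tr), '->', Counter(y_res))
```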
Studies that employ ROS have added other techniques to enhance model performance. In one study, the issue of imbalanced data in binary text classification is tackled through the introduction of distributional random oversampling (Moreo et al., 2016). This method utilizes the distributional hypothesis, which posits that the meaning of a feature is shaped by its distribution across extensive corpora, to create synthetic minority-class documents. The results suggest that distributional random oversampling enhances the accuracy of classification algorithms by creating balanced datasets. Others have used ROS to address imbalances in multilabel datasets, with very good results across different classification measures (Charte et al., 2015). In another variation of ROS, Zhao et al. (2016) address class imbalance through stratified random oversampling; the study found that the proposed stratified method effectively generates balanced and diverse training datasets (Zhao et al., 2016). Others have introduced the random walk oversampling approach, which balances the minority and majority classes by creating synthetic samples through random walks from the real data (Zhang & Li, 2014). The study found that random walk oversampling performs statistically much better than alternative methods on imbalanced datasets when implementing common baseline algorithms (p. 99).
3.2.2 SMOTE
SMOTE is an oversampling technique that creates synthetic samples for the minority class by interpolating between existing minority samples. SMOTE enables researchers to use synthetic elements to rebalance underrepresented data and is one of the most effective techniques to address imbalanced datasets (Almhaithawi et al., 2020; Branco et al., 2017; Chawla et al., 2002). As shown in Fig. 3, SMOTE uses the k nearest neighbours of each data point to create synthetic samples from the 2,000 fraud instances (Branco et al., 2017, p. 18). Instead of randomly oversampling the data with replacement, SMOTE takes "each minority class sample and introduces synthetic examples… joining any/all of the k minority class nearest neighbors" (Chawla et al., 2002, p. 327). Depending on the number of instances needed to balance the data, SMOTE randomly chooses minority samples and generates synthetic data points along the segments connecting them with their nearest neighbours (Srinilta & Kanharattanachai, 2021; Sun & Chen, 2021).
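A minimal sketch with imbalanced-learn's SMOTE implementation (the library and the synthetic fraud-like dataset are illustrative assumptions) makes the interpolation parameters explicit:

```python
# Minimal sketch of SMOTE with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Roughly the 2% fraud / 98% non-fraud example above.
X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=42)

# k_neighbors=5 (the default) means each synthetic point is interpolated
# between a minority sample and one of its five nearest minority neighbours.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # minority class synthetically grown to 1:1
```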
Since its publication in 2002, SMOTE has proven successful in a variety of applications across several domains (Fernandez et al., 2018). SMOTE has inspired several approaches to addressing class imbalance and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others (p. 863). SMOTE has been used effectively with machine learning algorithms to address class imbalance in the financial crime domain, with excellent results (Lokanan, 2023; Lokanan & Sharma, 2022). Others have used SMOTE to address class imbalances within high-dimensional datasets (Maldonado et al., 2019; Tiwari et al., 2018). Maldonado et al. (2019) propose a modified version of SMOTE designed for high-dimensional binary scenarios, such as natural language processing; the modification uses a new distance metric that focuses solely on the most important features when generating synthetic observations. Similarly, Tiwari et al. (2018) apply SMOTE to investigate the effect of various resampling ratios on observed and absent peptides in protein mass spectrometry data. Both studies found that class balance greatly improves the performance of machine learning models. Others have studied how closely the distribution of the patterns generated by SMOTE, for varying numbers of neighbours, matches the original distribution, finding that SMOTE performs better on large than on small datasets (Elreedy & Atiya, 2019).
3.2.3 SVMSMOTE
The Support Vector Machine Synthetic Minority Oversampling Technique (SVMSMOTE) is an oversampling method designed for imbalanced datasets; its goal is to mitigate the overfitting problems associated with standard SMOTE (AlJame et al., 2021, pp. 4–5). SVMSMOTE combines the SMOTE algorithm with Support Vector Machines (SVM) to generate synthetic samples that are less similar to the original minority class instances (Krayem et al., 2021; Nguyen et al., 2011). An SVM classifier is first trained on the original samples to locate the decision boundary, which is then used to identify "safe" and "borderline" regions within the minority class distribution. Synthetic samples are subsequently generated by interpolating between instances identified as "safe," thereby reducing the risk of overfitting.
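A minimal sketch using imbalanced-learn's SVMSMOTE implementation (an illustrative assumption; the internal SVM is fitted automatically) is shown below:

```python
# Minimal sketch of SVMSMOTE with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=42)

# An SVM is trained internally to locate the decision boundary;
# m_neighbors governs the safe/borderline test around each minority
# support vector, and k_neighbors governs the interpolation step.
svmsmote = SVMSMOTE(k_neighbors=5, m_neighbors=10, random_state=42)
X_res, y_res = svmsmote.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```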
SVMSMOTE has been widely used across various domains to address class imbalance with enhanced performance. Studies have utilized SVMSMOTE to address class imbalances in datasets pertaining to medicine, education, and cancer research. Researchers have used SVMSMOTE in diagnostic testing and in predicting prostate cancer from multiparametric data, with outstanding results (Barlow et al., 2019; Bertelli et al., 2022). Sujitha and Paramasivan (2023) employed SVMSMOTE to predict the stages of lung disease with enhanced performance (Sujitha & Paramasivan, 2023). Others have used SVMSMOTE to predict student performance in multi-class educational datasets, reporting better classification performance (Ghorbani & Ghousi, 2020; Tariq et al., 2023). Collectively, these studies show that SVMSMOTE can improve model performance and correct uneven data distributions in datasets from various fields.
3.2.4 SMOTETomek
SMOTETomek is a fusion of SMOTE and Tomek links, aimed at balancing imbalanced datasets while reducing noise. The process first applies the SMOTE algorithm to oversample the minority class and then removes Tomek links to refine the dataset (Z. Wang et al., 2019). This second step eliminates overlapping and noisy observations, thereby enhancing the quality of the dataset and the performance of the machine learning models. By combining the strengths of SMOTE in generating synthetic samples and Tomek links in cleaning the dataset, SMOTETomek creates a more balanced and refined dataset for machine learning tasks.
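The two-step process can be sketched with imbalanced-learn's combined sampler (an illustrative assumption; the smote and tomek components can also be configured individually):

```python
# Minimal sketch of SMOTETomek with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# flip_y adds label noise so the Tomek-link cleaning step has work to do.
X, y = make_classification(n_samples=5_000, weights=[0.9], flip_y=0.05,
                           random_state=42)

# Internally: SMOTE oversamples the minority class first, then Tomek
# links are removed to clean overlapping observations.
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # near-balanced, minus removed links
```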
SMOTETomek has been widely utilized in various domains to address imbalanced datasets and enhance model performance. In healthcare, SMOTETomek has been applied to build models for disease prediction, such as diabetes and hypertension, with improved accuracy and sensitivity (Fitriyani et al., 2019). Cancer research studies have also used SMOTETomek to address skewed data and accurately predict high-risk prostate and cervical cancer (Boratto et al., 2022; Ijaz et al., 2020; Lin et al., 2023; Tanimu et al., 2022). In computer science, SMOTETomek has been used for recommender systems and for predicting software bugs, demonstrating its applicability in a variety of settings and its effectiveness at correcting class imbalances (Arif et al., 2024; Boratto et al., 2022). SMOTETomek has also been used effectively to address severe sample distribution imbalances in personality recognition datasets (Z. Wang et al., 2019). These studies collectively underscore the utility of SMOTETomek across domains in addressing class imbalance in a wide range of datasets.
3.2.5 K-MeansSMOTE
K-MeansSMOTE is an oversampling technique designed to address class imbalance in datasets by generating synthetic samples for the minority class using k-means clustering. This method integrates the k-means clustering algorithm with SMOTE, executed in three distinct steps: clustering, filtering, and oversampling (Chen & Zhang, 2021; Douzas et al., 2018). Initially, the algorithm employs k-means clustering to partition the input space into clusters. The clusters are then filtered so that only those containing a sufficiently high proportion of minority class samples are selected for oversampling. Subsequently, synthetic samples are created for each selected cluster by interpolating feature values from the minority class samples within the cluster. These synthetic instances are then added to the original dataset, resulting in a rebalanced dataset (De & Prabu, 2022).
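The three steps map directly onto imbalanced-learn's KMeansSMOTE implementation, sketched below (an illustrative assumption; cluster_balance_threshold controls the filtering step and may need tuning per dataset, since the sampler raises an error when no cluster passes the filter):

```python
# Minimal sketch of K-MeansSMOTE with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=5_000, weights=[0.9],
                           n_clusters_per_class=2, random_state=42)

# Filtering step: only clusters whose minority share exceeds the
# threshold are selected for SMOTE-style interpolation.
kms = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=42)
X_res, y_res = kms.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```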
K-MeansSMOTE has proven to be an effective oversampling method and has improved model performance across various classification datasets (Chen & Zhang, 2021; Xu et al., 2021). In medical science, K-MeansSMOTE has been used to balance data effectively: in a recent study, the authors applied K-MeansSMOTE to eight UCI medical datasets with excellent classification scores (Xu et al., 2021). In other settings, such as predicting credit default, K-MeansSMOTE has proven very effective in addressing class imbalance with enhanced performance (Alam et al., 2020; Chen & Zhang, 2021; Srinivasan et al., 2024). K-MeansSMOTE has also been used to handle imbalance when classifying financially distressed companies, showing improved performance across various classification metrics (Aljawazneh et al., 2021). Another domain where K-MeansSMOTE has effectively addressed class imbalance and enhanced performance is churn prediction (De & Prabu, 2022).
3.2.6 SMOTE + ENN
SMOTE + ENN is an oversampling technique devised to rectify imbalanced datasets by combining SMOTE with the edited nearest neighbour (ENN) method. The technique operates in two sequential steps. First, SMOTE is applied to produce synthetic samples for the minority class: it selects a minority class sample along with its k nearest neighbours (k-NN), then creates new synthetic samples by interpolating the feature values between the chosen sample and its neighbours. Second, the ENN method is utilized to eliminate majority class samples that are misclassified by a k-NN classifier; ENN detects these misclassified samples and removes them from the dataset. By combining these two steps, SMOTE + ENN simultaneously generates new synthetic samples for the minority class and eliminates misclassified majority class samples (Lokanan, 2023; Sisodia et al., 2017; F. Yang et al., 2022).
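A minimal sketch of the two steps, using imbalanced-learn's SMOTEENN combined sampler (an illustrative assumption; its default ENN edits with three nearest neighbours), is shown below:

```python
# Minimal sketch of SMOTE + ENN with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# flip_y adds label noise for the ENN cleaning step to remove.
X, y = make_classification(n_samples=5_000, weights=[0.9], flip_y=0.05,
                           random_state=42)

# Step 1: SMOTE synthesizes minority samples.
# Step 2: ENN deletes samples misclassified by their nearest neighbours.
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # balanced, then edited down
```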
SMOTE + ENN has proven effective in balancing the class distribution and enhancing classification performance. Studies have found that SMOTE + ENN consistently yields superior outcomes compared to most oversampling methods (Batista et al., 2004; Singh et al., 2022; Sisodia et al., 2017). SMOTE + ENN has been extensively used to address class imbalances in medical data classification (Lamari et al., 2021). In healthcare, it has been heavily utilized for early detection tasks, such as predicting septic shock onset and diagnosing missed abortion, with enhanced diagnostic accuracy (Xu et al., 2020; F. Yang et al., 2022). Others have used SMOTE + ENN in predicting Parkinson's disease and chronic heart failure with very good classification results (Keller & Pandey, 2021; K. Wang et al., 2021). In the financial crimes arena, SMOTE + ENN has been effectively used to predict fraud with enhanced performance (Lokanan, 2023; Mienye & Sun, 2023). These studies found that SMOTE + ENN was excellent at balancing the datasets and enhancing the robustness of predictive models.
3.2.7 ADASYN
ADASYN, short for Adaptive Synthetic Sampling, is an oversampling technique specifically designed to address class imbalances in datasets. The algorithm generates synthetic samples for the minority class, adaptively adjusting the density of synthetic samples according to the difficulty of the classification problem (Fernandez et al., 2018, p. 870). The basic idea behind ADASYN is to generate more synthetic samples in regions where the decision boundary of the minority class is more complex, thereby increasing the diversity of the minority class samples (He et al., 2008). This adaptiveness makes ADASYN particularly effective in scenarios where the imbalance between classes is substantial and the classification problem is challenging.
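A minimal sketch with imbalanced-learn's ADASYN implementation (an illustrative assumption) shows the single neighbourhood parameter that drives the adaptive density estimate:

```python
# Minimal sketch of ADASYN with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=42)

# n_neighbors sets the neighbourhood used to score how "difficult" each
# minority sample is; harder regions receive more synthetic samples.
ada = ADASYN(n_neighbors=5, random_state=42)
X_res, y_res = ada.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))  # counts are approximately, not exactly, 1:1
```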
ADASYN oversampling has been widely applied across various domains, showcasing its effectiveness in addressing class imbalances and improving model performance. In medicine, researchers have used ADASYN to up-sample features and mitigate missing-value concerns in cervical cancer detection and breast cancer diagnosis, with exceptional accuracy (Khan et al., 2021; Kurniawati et al., 2018; Munshi, 2024). Others have employed ADASYN in fraud prediction: one study found that ADASYN proved more advantageous than the traditional SMOTE algorithm in telecom fraud identification (Lu et al., 2020). In other studies, ADASYN was used to predict insurance and credit card fraud, with enhanced effectiveness on a balanced dataset over an unbalanced one (Muranda et al., 2020; Singh et al., 2022; Subudhi & Panigrahi, 2018). ADASYN has also proven effective in customer churn prediction (Rao et al., 2024). These applications underscore the versatility and efficiency of using ADASYN to up-sample imbalanced datasets in different domains.
The foregoing review indicates that various oversampling techniques are employed to address the challenge of imbalanced datasets in machine learning, and different methods may be preferred depending on the dataset and domain. ROS balances the class distribution by duplicating minority class observations and has delivered enhanced performance across different applications. The SMOTE-based methods use SMOTE as a foundation to generate synthetic samples for the minority class and combine it with additional methods to balance the classes and enhance performance. ADASYN stands out as a distinct category by adjusting synthetic sample density based on the complexity of the decision boundary; in doing so, it addresses class imbalance by focusing on regions where classification appears to be more challenging. Each oversampling technique reviewed provides distinct advantages and can be chosen according to the specific attributes of the dataset and the classification problem being addressed.