4.1 Dataset Description and Ensemble Techniques
The dataset used for bankruptcy prediction was obtained from the Kaggle website. Kaggle is an online community owned by Google that helps data scientists and machine learning researchers publish and use datasets to build models [1]. The data was collected from the Emerging Markets Information Service, which contains information on emerging markets around the world. The dataset is imbalanced and contains duplicate samples. It consists of 64 financial features and 10001 instances, of which 207 belong to bankrupt companies and 9794 to non-bankrupt companies. All attributes are numeric except the nominal class attribute, which indicates whether the company is bankrupt or not.
The ensemble techniques proposed in this research work were implemented successfully to predict bankruptcy. This section gives a detailed description of the experimental setup used to implement the machine learning ensemble techniques. First, the dataset was divided into a training set and a testing set with a split ratio of 6:4 (60% of the dataset for training and 40% for testing). The division was done using stratified sampling, so that the class proportions were preserved in both sets, and each technique was trained and tested with this split. Second, the experiments were carried out in a Python environment with the help of libraries designed to assess the performance of the chosen techniques. The classifiers of the different ensemble techniques were fed the training set to produce the trained models, which were then validated on the testing set to make sure they perform well on unseen data. Third, the accuracy of the models was improved using a feature selection technique based on the correlation between the attributes and the class, namely Recursive Feature Elimination, which was used to find the subsets of features that give the best accuracy for a specific classifier. Fourth, performance measures were computed to verify that the techniques produce good accuracy; precision and recall were also computed to support the search for optimal parameters.
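As a concrete illustration of this setup, the sketch below wires the stratified 60/40 split and Recursive Feature Elimination together with scikit-learn. The file name bankruptcy.csv, the label column name class, the choice of logistic regression as the RFE estimator, and the subset size of 30 features are assumptions made for illustration, not details taken from the study.

```python
# A minimal sketch of the experimental setup described above. The file name
# "bankruptcy.csv", the label column "class", the RFE estimator, and the
# subset size of 30 features are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("bankruptcy.csv")            # 64 financial features + class
X = data.drop(columns=["class"])
y = data["class"]

# Stratified 60/40 split: preserves the bankrupt/non-bankrupt ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# Recursive Feature Elimination: repeatedly drops the weakest feature until
# the requested subset size is reached.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=30)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
```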
For the Balanced Bagging technique, the steps are defined as follows:
Step 1
Choose the number of estimators. As the number of estimators increases, the time required will also increase, but more accurate results can be achieved.
Step 2
Specify the maximum number of features.
Step 3
Compute the accuracy, precision, and recall values of each constructed model to ensure that the best results are achieved, as illustrated in the sketch below.
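A minimal sketch of these steps, assuming imbalanced-learn's BalancedBaggingClassifier and the train/test split from the earlier sketch; the parameter values mirror the optimal ones reported in Section 4.2, and the class label is assumed to be encoded as 0/1.

```python
# A minimal sketch of the steps above, assuming imbalanced-learn's
# BalancedBaggingClassifier and the split from the earlier sketch. Parameter
# values mirror Section 4.2; the class label is assumed encoded as
# 0 (non-bankrupt) / 1 (bankrupt).
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Steps 1-2: choose the number of estimators and the maximum number of features.
model = BalancedBaggingClassifier(n_estimators=70, max_features=64, random_state=0)
model.fit(X_train, y_train)

# Step 3: compute accuracy, precision, and recall on the testing set.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```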
For the AdaBoost technique, the steps are defined as follows:
Step 1
Choose the number of estimators. As the number of estimators increases, the time required will also increase, but more accurate results can be achieved.
Step 2
Change the random state value until the optimal result is achieved.
Step 3
Change the learning rate value until the optimal result is achieved.
Step 4
Compute the accuracy, precision, and recall values of each constructed model to ensure that the best results are achieved, as illustrated in the sketch below.
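A minimal sketch of these steps using scikit-learn's AdaBoostClassifier, again reusing the earlier split; the parameter values mirror the optimal ones reported in Table 3 and are illustrative rather than prescriptive.

```python
# A minimal sketch of the steps above using scikit-learn's AdaBoostClassifier.
# Parameter values mirror the optimal ones in Table 3.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Steps 1-3: number of estimators, random state, and learning rate.
model = AdaBoostClassifier(n_estimators=200, random_state=130, learning_rate=1.0)
model.fit(X_train, y_train)

# Step 4: compute accuracy, precision, and recall on the testing set.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```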
For the Random Forest technique, the steps are defined as follows:
Step 1
Choose the number of estimators. As the number of estimators increases, the time required will also increase, but more accurate results can be achieved.
Step 2
Specify the maximum number of features.
Step 3
Define the maximum depth of the decision trees that are being used.
Step 4
Change the random state value until the optimal result is achieved.
Step 5
Compute the accuracy, precision, and recall values of each constructed model to ensure that the best results are achieved, as illustrated in the sketch below.
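A minimal sketch of these steps using scikit-learn's RandomForestClassifier, reusing the earlier split; the parameter values mirror the optimal ones reported in Section 4.2 and serve only as an example.

```python
# A minimal sketch of the steps above using scikit-learn's
# RandomForestClassifier. Parameter values mirror Section 4.2.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Steps 1-4: number of estimators, max features, max depth, random state.
model = RandomForestClassifier(
    n_estimators=500, max_features=64, max_depth=100, random_state=200
)
model.fit(X_train, y_train)

# Step 5: compute accuracy, precision, and recall on the testing set.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```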
4.2 Optimization Strategy
Machine learning algorithms are rarely parameter-free. Parameter optimization is the process of choosing the optimal values of the parameters of a machine learning algorithm (Snoek et al., 2012). Table 2 shows the optimal parameter values chosen for the Balanced Bagging algorithm.
Table 2
Balanced Bagging optimal parameters
Parameter | Optimal value chosen
Number of Estimators | 70
Max Features | 64
The number of estimators is the number of base classifiers used by the Balanced Bagging algorithm. Increasing the number of estimators may improve performance, but it slows the algorithm down. Figure 1 illustrates how the accuracy of the Balanced Bagging algorithm changes as the number of estimators is varied from 10 to 100. The highest accuracy was obtained with 70 estimators.
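The sweep over the number of estimators can be reproduced with a simple loop, as in the sketch below; the step size of 10 and the random_state value are assumptions, and the split from Section 4.1 is reused.

```python
# A sketch of the estimator sweep described above (10 to 100 estimators);
# the step size of 10 and the random_state are assumptions.
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import accuracy_score

best_n, best_acc = None, 0.0
for n in range(10, 101, 10):
    model = BalancedBaggingClassifier(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_n, best_acc = n, acc
print(f"best number of estimators: {best_n} (accuracy {best_acc:.4f})")
```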
Max features is the number of features drawn from the dataset to train each base estimator. Different numbers of features were tried to find the optimal value, from 5 features up to a maximum of 64, i.e., the whole dataset. The highest accuracy was obtained using all 64 features. Figure 2 illustrates how the accuracy of the Balanced Bagging algorithm changes with the value of max features. Table 3 shows the optimal parameter values chosen for the AdaBoost algorithm.
Table 3
AdaBoost optimal parameters
Parameter | Optimal value chosen
Number of Estimators | 200
Random State | 130
Learning Rate | 1.0
The number of estimators is the number of weak classifiers used by the AdaBoost algorithm. The default number of estimators in the scikit-learn library is 50. The number of estimators was increased to study its effect on performance, and the best accuracy was obtained with 200 estimators. Figure 3 illustrates how the accuracy of the AdaBoost algorithm changes with the number of estimators.
The random state parameter seeds the pseudo-random number generator used by the algorithm. For AdaBoost, values from 10 to 150 were tested, and the highest accuracy was achieved with a random state of 130. Figure 4 illustrates how the accuracy of the AdaBoost algorithm changes with the value of the random state.
The learning rate determines how strongly each weak learner contributes to the ensemble at every boosting iteration. Values from 0.000001 to 2 were tested, and the highest accuracy was achieved with a learning rate of 1. Figure 5 illustrates how the accuracy of the AdaBoost algorithm changes with the value of the learning rate.
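The learning-rate search can be sketched as follows; the exact grid of candidate values between 0.000001 and 2 is an assumption, since only the endpoints of the tested range are reported.

```python
# A sketch of the learning-rate search described above; the exact grid of
# candidate values between 0.000001 and 2 is an assumption.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

best_lr, best_acc = None, 0.0
for lr in [1e-6, 1e-4, 1e-2, 0.1, 0.5, 1.0, 1.5, 2.0]:
    model = AdaBoostClassifier(n_estimators=200, random_state=130, learning_rate=lr)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_lr, best_acc = lr, acc
print(f"best learning rate: {best_lr} (accuracy {best_acc:.4f})")
```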
The number of estimators is the number of decision trees used by the Random Forest algorithm. Figure 6 illustrates how the accuracy of the Random Forest algorithm changes with the number of estimators. The highest accuracy was obtained with 500 estimators.
Max features is the number of features drawn from the dataset to train each base estimator. Different numbers of features were tried to find the optimal value, from 5 features up to 64, i.e., the whole dataset. The highest accuracy was obtained using all features. Figure 7 illustrates how the accuracy of the Random Forest algorithm changes with the value of max features.
Max depth is the maximum depth of each tree in the Random Forest technique. The deeper the tree, the more splits it has and the more information it captures about the data. Figure 8 illustrates how the accuracy of the Random Forest algorithm changes with the value of max depth. The highest accuracy was obtained with a max depth of 100.
The random state parameter seeds the pseudo-random number generator used by the algorithm. Figure 9 illustrates how the accuracy of the Random Forest algorithm changes with the value of the random state. The highest accuracy was obtained with a random state of 200.
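The max-depth and random-state sweeps for Random Forest can be combined in a nested loop, as sketched below; the candidate grids are assumptions, since only the optimal values of 100 and 200 are reported.

```python
# A sketch of the max-depth and random-state sweeps described above;
# the candidate grids are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

best_depth, best_seed, best_acc = None, None, 0.0
for depth in (10, 50, 100, 200):
    for seed in range(50, 201, 50):
        model = RandomForestClassifier(
            n_estimators=500, max_depth=depth, random_state=seed
        )
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        if acc > best_acc:
            best_depth, best_seed, best_acc = depth, seed, acc
print(f"best max_depth={best_depth}, random_state={best_seed}, "
      f"accuracy={best_acc:.4f}")
```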