4.1 Dataset Description and Ensemble Techniques
The dataset used for bankruptcy prediction was obtained from the Kaggle website. Kaggle is an online community owned by Google that helps data scientists and machine learning researchers publish and use datasets to build models [1]. The data was collected from the Emerging Markets Information Service, which contains information on emerging markets around the world. The dataset is imbalanced and contains duplicate samples. It consists of 64 financial features and 10001 instances, of which 207 belong to bankrupt companies and 9794 to non-bankrupt companies. All attributes are numeric except the nominal class attribute, which indicates whether the company is bankrupt or not.
The ensemble techniques proposed in this research work were implemented successfully to predict bankruptcy. This section gives a detailed description of the experimental setup used to implement the machine learning ensemble techniques. First, the dataset was divided into a training set and a testing set with a split ratio of 6:4 (60% of the dataset for training and 40% for testing). The division was done using stratified sampling, so that the class proportions were preserved in both sets, and each technique was trained and tested with this split. Second, the experiments were carried out in a Python environment with the help of libraries designed to assess the performance of the chosen techniques. The classifiers of the different ensemble techniques were fed the training set to produce the trained models, which were then validated on the testing set to make sure they perform well on unseen data. Third, the accuracy of the models was improved using a feature selection technique based on the correlation between the attributes and the class, namely Recursive Feature Elimination, which was used to find the subsets of features that give the best accuracy for a specific classifier. Fourth, performance measures were computed to verify that the techniques produce good accuracy; precision and recall were also computed to support the search for optimal parameters.
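As a concrete illustration of this setup, the sketch below wires the stratified 60/40 split and Recursive Feature Elimination together with scikit-learn. The file name bankruptcy.csv, the label column name class, the choice of logistic regression as the RFE estimator, and the subset size of 30 features are assumptions made for illustration, not details taken from the study.

```python
# A minimal sketch of the experimental setup described above. The file name
# "bankruptcy.csv", the label column "class", the RFE estimator, and the
# subset size of 30 features are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("bankruptcy.csv")            # 64 financial features + class
X = data.drop(columns=["class"])
y = data["class"]

# Stratified 60/40 split: preserves the bankrupt/non-bankrupt ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# Recursive Feature Elimination: repeatedly drops the weakest feature until
# the requested subset size is reached.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=30)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
```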
For the Balanced Bagging technique, the steps are defined as follows:
Step 1
Choose the number of estimators. As the number of estimators increases, the time required will also increase, but more accurate results can be achieved.
Step 2
Specify the maximum number of features.
Step 3
Compute the accuracy, precision, and recall values of each constructed model to ensure that the best results are achieved, as illustrated in the sketch below.
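A minimal sketch of these steps, assuming imbalanced-learn's BalancedBaggingClassifier and the train/test split from the earlier sketch; the parameter values mirror the optimal ones reported in Section 4.2, and the class label is assumed to be encoded as 0/1.

```python
# A minimal sketch of the steps above, assuming imbalanced-learn's
# BalancedBaggingClassifier and the split from the earlier sketch. Parameter
# values mirror Section 4.2; the class label is assumed encoded as
# 0 (non-bankrupt) / 1 (bankrupt).
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Steps 1-2: choose the number of estimators and the maximum number of features.
model = BalancedBaggingClassifier(n_estimators=70, max_features=64, random_state=0)
model.fit(X_train, y_train)

# Step 3: compute accuracy, precision, and recall on the testing set.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```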
For the AdaBoost technique, the steps are defined as follows:
Step 1
Choose the number of estimators. As the number of estimators increases, the time required will also increase, but more accurate results can be achieved.
Step 2
Change the random state value until the optimal result is achieved.
Step 3
Change the learning rate value until the optimal result is achieved.
Step 4
Compute the accuracy, precision, and recall values of each constructed model to ensure that the best results are achieved, as illustrated in the sketch below.
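A minimal sketch of these steps using scikit-learn's AdaBoostClassifier, again reusing the earlier split; the parameter values mirror the optimal ones reported in Table 3 and are illustrative rather than prescriptive.

```python
# A minimal sketch of the steps above using scikit-learn's AdaBoostClassifier.
# Parameter values mirror the optimal ones in Table 3.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Steps 1-3: number of estimators, random state, and learning rate.
model = AdaBoostClassifier(n_estimators=200, random_state=130, learning_rate=1.0)
model.fit(X_train, y_train)

# Step 4: compute accuracy, precision, and recall on the testing set.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```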
For the Random Forest technique, the steps are defined as follows:
Step 1
Choose the number of estimators. As the number of estimators increases, the time required will also increase, but more accurate results can be achieved.
Step 2
Specify the maximum number of features.
Step 3
Define the maximum depth of the decision trees that are being used.
Step 4
Change the random state value until the optimal result is achieved.
Step 5
Compute the accuracy, precision, and recall values of each constructed model to ensure that the best results are achieved, as illustrated in the sketch below.
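A minimal sketch of these steps using scikit-learn's RandomForestClassifier, reusing the earlier split; the parameter values mirror the optimal ones reported in Section 4.2 and serve only as an example.

```python
# A minimal sketch of the steps above using scikit-learn's
# RandomForestClassifier. Parameter values mirror Section 4.2.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Steps 1-4: number of estimators, max features, max depth, random state.
model = RandomForestClassifier(
    n_estimators=500, max_features=64, max_depth=100, random_state=200
)
model.fit(X_train, y_train)

# Step 5: compute accuracy, precision, and recall on the testing set.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```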
4.2 Optimization Strategy
Machine learning algorithms are rarely parameter-free. Parameter optimization is the process of choosing the optimal values of the parameters of a machine learning algorithm (Snoek et al., 2012). Table 2 shows the optimal parameter values chosen for the Balanced Bagging algorithm.
Table 2
Balanced Bagging optimal parameters
Parameter | Optimal value chosen
Number of Estimators | 70
Max Features | 64
The number of estimators is the number of base classifiers used by the Balanced Bagging algorithm. Increasing the number of estimators may improve performance, but it slows the algorithm down. Figure 1 illustrates how the accuracy of the Balanced Bagging algorithm changes as the number of estimators is varied from 10 to 100. The highest accuracy was obtained with 70 estimators.
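The sweep over the number of estimators can be reproduced with a simple loop, as in the sketch below; the step size of 10 and the random_state value are assumptions, and the split from Section 4.1 is reused.

```python
# A sketch of the estimator sweep described above (10 to 100 estimators);
# the step size of 10 and the random_state are assumptions.
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import accuracy_score

best_n, best_acc = None, 0.0
for n in range(10, 101, 10):
    model = BalancedBaggingClassifier(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_n, best_acc = n, acc
print(f"best number of estimators: {best_n} (accuracy {best_acc:.4f})")
```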
Max features is the number of features drawn from the dataset to train each base estimator. Different numbers of features were tried to find the optimal value, from 5 features up to a maximum of 64, i.e., the whole dataset. The highest accuracy was obtained using all 64 features. Figure 2 illustrates how the accuracy of the Balanced Bagging algorithm changes with the value of max features. Table 3 shows the optimal parameter values chosen for the AdaBoost algorithm.
Table 3
AdaBoost optimal parameters
Parameter | Optimal value chosen
Number of Estimators | 200
Random State | 130
Learning Rate | 1.0
The number of estimators is the number of weak classifiers used by the AdaBoost algorithm. The default number of estimators in the scikit-learn library is 50. The number of estimators was increased to study its effect on performance, and the best accuracy was obtained with 200 estimators. Figure 3 illustrates how the accuracy of the AdaBoost algorithm changes with the number of estimators.
The random state parameter seeds the pseudo-random number generator used by the algorithm. For AdaBoost, values from 10 to 150 were tested, and the highest accuracy was achieved with a random state of 130. Figure 4 illustrates how the accuracy of the AdaBoost algorithm changes with the value of the random state.
The learning rate determines how strongly each weak learner contributes to the ensemble at every boosting iteration. Values from 0.000001 to 2 were tested, and the highest accuracy was achieved with a learning rate of 1. Figure 5 illustrates how the accuracy of the AdaBoost algorithm changes with the value of the learning rate.
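The learning-rate search can be sketched as follows; the exact grid of candidate values between 0.000001 and 2 is an assumption, since only the endpoints of the tested range are reported.

```python
# A sketch of the learning-rate search described above; the exact grid of
# candidate values between 0.000001 and 2 is an assumption.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

best_lr, best_acc = None, 0.0
for lr in [1e-6, 1e-4, 1e-2, 0.1, 0.5, 1.0, 1.5, 2.0]:
    model = AdaBoostClassifier(n_estimators=200, random_state=130, learning_rate=lr)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_lr, best_acc = lr, acc
print(f"best learning rate: {best_lr} (accuracy {best_acc:.4f})")
```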
The number of estimators is the number of decision trees used by the Random Forest algorithm. Figure 6 illustrates how the accuracy of the Random Forest algorithm changes with the number of estimators. The highest accuracy was obtained with 500 estimators.
Max features is the number of features drawn from the dataset to train each base estimator. Different numbers of features were tried to find the optimal value, from 5 features up to 64, i.e., the whole dataset. The highest accuracy was obtained using all features. Figure 7 illustrates how the accuracy of the Random Forest algorithm changes with the value of max features.
Max depth is the maximum depth of each tree in the Random Forest technique. The deeper the tree, the more splits it has and the more information it captures about the data. Figure 8 illustrates how the accuracy of the Random Forest algorithm changes with the value of max depth. The highest accuracy was obtained with a max depth of 100.
The random state parameter seeds the pseudo-random number generator used by the algorithm. Figure 9 illustrates how the accuracy of the Random Forest algorithm changes with the value of the random state. The highest accuracy was obtained with a random state of 200.
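The max-depth and random-state sweeps for Random Forest can be combined in a nested loop, as sketched below; the candidate grids are assumptions, since only the optimal values of 100 and 200 are reported.

```python
# A sketch of the max-depth and random-state sweeps described above;
# the candidate grids are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

best_depth, best_seed, best_acc = None, None, 0.0
for depth in (10, 50, 100, 200):
    for seed in range(50, 201, 50):
        model = RandomForestClassifier(
            n_estimators=500, max_depth=depth, random_state=seed
        )
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        if acc > best_acc:
            best_depth, best_seed, best_acc = depth, seed, acc
print(f"best max_depth={best_depth}, random_state={best_seed}, "
      f"accuracy={best_acc:.4f}")
```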