All of the classifiers were subjected to two separate experiments: in the first, the data were left imbalanced; in the second, the class imbalance was addressed using SMOTE. Table 1 reports the evaluation metrics, including accuracy, precision, recall, F-score, and MCC, before applying SMOTE.
Table 1
Results of ML models before applying SMOTE
| Model | Class | Precision | Recall | F-Score | Accuracy | MCC |
|---|---|---|---|---|---|---|
| KNN | 0 | 0.76 | 0.69 | 0.72 | 0.80 | 0.571 |
| | 1 | 0.83 | 0.87 | 0.85 | | |
| | Macro avg | 0.79 | 0.78 | 0.78 | | |
| RF | 0 | 0.95 | 0.92 | 0.93 | 0.95 | 0.894 |
| | 1 | 0.95 | 0.97 | 0.96 | | |
| | Macro avg | 0.95 | 0.94 | 0.95 | | |
| DT | 0 | 0.89 | 0.92 | 0.90 | 0.93 | 0.847 |
| | 1 | 0.95 | 0.94 | 0.94 | | |
| | Macro avg | 0.92 | 0.93 | 0.92 | | |
| LR | 0 | 0.96 | 0.95 | 0.95 | 0.97 | 0.928 |
| | 1 | 0.97 | 0.98 | 0.97 | | |
| | Macro avg | 0.97 | 0.96 | 0.96 | | |
| NB | 0 | 0.85 | 0.86 | 0.85 | 0.89 | 0.762 |
| | 1 | 0.91 | 0.91 | 0.91 | | |
| | Macro avg | 0.88 | 0.88 | 0.88 | | |
| XGBoost | 0 | 0.93 | 0.83 | 0.88 | 0.91 | 0.816 |
| | 1 | 0.91 | 0.96 | 0.93 | | |
| | Macro avg | 0.92 | 0.90 | 0.91 | | |
| AdaBoost | 0 | 0.88 | 0.54 | 0.67 | 0.80 | 0.572 |
| | 1 | 0.78 | 0.96 | 0.86 | | |
| | Macro avg | 0.83 | 0.75 | 0.76 | | |
| MLP | 0 | 0.97 | 0.97 | 0.97 | 0.98 | 0.956 |
| | 1 | 0.98 | 0.98 | 0.98 | | |
| | Macro avg | 0.98 | 0.98 | 0.98 | | |
In the first experiment, it can be seen that MLP and LR outperformed all other ML classifiers in terms of accuracy, with scores of 0.98 and 0.97, respectively, followed by RF, DT, and XGBoost with scores of 0.95, 0.93, and 0.91. Among the remaining models, NB obtained the highest accuracy at 0.89, while KNN and AdaBoost shared the lowest accuracy of 0.80. In addition to accuracy, MCC has been recognized in the literature as a comprehensive performance measure for binary classification problems, particularly when both imbalanced and balanced datasets are used as evaluation criteria. In this regard, MLP scored the highest MCC with a value of 0.956, followed by LR with 0.928 and RF with 0.894, whereas AdaBoost and KNN had the lowest scores of 0.572 and 0.571, respectively.
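The per-class precision, recall, and F-score, the macro averages, accuracy, and MCC reported in the tables all derive from the binary confusion matrix. A minimal sketch of these computations, using illustrative counts rather than values from the experiments:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Per-class precision/recall/F1, macro averages, accuracy and MCC
    from binary confusion counts (class 1 treated as positive)."""
    def prf(tp_, fp_, fn_):
        p = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        r = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f
    pos = prf(tp, fp, fn)        # class 1
    neg = prf(tn, fn, fp)        # class 0 (error roles swap)
    macro = tuple((a + b) / 2 for a, b in zip(pos, neg))
    acc = (tp + tn) / (tp + fp + fn + tn)
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0
    return {"class1": pos, "class0": neg, "macro": macro,
            "accuracy": acc, "mcc": mcc}

# Illustrative counts, not taken from the paper's experiments.
m = binary_metrics(tp=87, fp=9, fn=13, tn=91)
print(round(m["accuracy"], 3), round(m["mcc"], 3))  # 0.89 0.781
```

Unlike accuracy, the MCC numerator rewards agreement on both classes at once, which is why it is preferred for imbalanced data.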
Next, the same classifiers were applied again after balancing the dataset distribution using SMOTE. This experiment was carried out to demonstrate how SMOTE can improve the performance of classifiers previously applied to an imbalanced dataset. The same evaluation measures employed earlier are reported in Table 2.
Table 2
Results of ML models after applying SMOTE
| Model | Class | Precision | Recall | F-Score | Accuracy | MCC |
|---|---|---|---|---|---|---|
| KNN | 0 | 0.62 | 0.99 | 0.76 | 0.69 | 0.423 |
| | 1 | 0.97 | 0.40 | 0.57 | | |
| | Macro avg | 0.79 | 0.69 | 0.66 | | |
| RF | 0 | 0.95 | 0.96 | 0.96 | 0.96 | 0.910 |
| | 1 | 0.96 | 0.95 | 0.96 | | |
| | Macro avg | 0.96 | 0.96 | 0.96 | | |
| DT | 0 | 0.95 | 0.94 | 0.94 | 0.94 | 0.884 |
| | 1 | 0.94 | 0.95 | 0.94 | | |
| | Macro avg | 0.94 | 0.94 | 0.94 | | |
| LR | 0 | 0.97 | 0.97 | 0.97 | 0.97 | 0.944 |
| | 1 | 0.97 | 0.97 | 0.97 | | |
| | Macro avg | 0.97 | 0.97 | 0.97 | | |
| NB | 0 | 0.86 | 0.91 | 0.89 | 0.88 | 0.767 |
| | 1 | 0.91 | 0.85 | 0.88 | | |
| | Macro avg | 0.88 | 0.88 | 0.88 | | |
| XGBoost | 0 | 0.95 | 0.90 | 0.92 | 0.93 | 0.853 |
| | 1 | 0.90 | 0.95 | 0.93 | | |
| | Macro avg | 0.93 | 0.93 | 0.93 | | |
| AdaBoost | 0 | 0.92 | 0.59 | 0.72 | 0.77 | 0.577 |
| | 1 | 0.70 | 0.95 | 0.80 | | |
| | Macro avg | 0.81 | 0.77 | 0.76 | | |
| MLP | 0 | 0.99 | 0.98 | 0.99 | 0.99 | 0.970 |
| | 1 | 0.98 | 0.99 | 0.98 | | |
| | Macro avg | 0.99 | 0.99 | 0.99 | | |
In the second experiment, the best performance was attained by MLP with 0.99 accuracy, followed by LR, RF, and DT with accuracies of 0.97, 0.96, and 0.94, respectively. The worst accuracy was attributed to AdaBoost with 0.77, then KNN with 0.69. As for the MCC results, MLP was the highest-scoring classifier with 0.970, followed by LR with 0.944 and RF with 0.910, while the worst MCC of 0.423 was observed for the KNN classifier. The results indicate that some classifiers' accuracies improved after SMOTE was applied to the imbalanced dataset, while the performance of others degraded. Notably, the MCC improved for every classifier except KNN, which supports its suitability as an evaluation measure after balancing the dataset.
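SMOTE balances the training data by generating synthetic minority-class samples, interpolating between a minority sample and one of its k nearest minority neighbours. A minimal numpy sketch of this idea (illustrative only, not the implementation used in the experiments):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating a random base sample toward one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # indices of the k nearest neighbours (position 0 is the sample itself)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                 # random base sample
        j = nn[i, rng.integers(k)]          # one of its neighbours
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Usage: grow a 10-sample minority class by 30 synthetic samples to
# match a hypothetical 40-sample majority class.
X_min = np.random.default_rng(0).normal(size=(10, 3))
X_new = smote_oversample(X_min, n_new=30, rng=1)
print(X_new.shape)  # (30, 3)
```

Because each synthetic point lies on a segment between two real minority samples, SMOTE enlarges the minority region without duplicating existing rows, which is what lets most classifiers above generalize better after balancing.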
4.1. Comparative Analysis
This subsection compares the accuracy and MCC values of the results for all ML classifiers before and after using SMOTE. The comparison is illustrated in Figs. 3 and 4, respectively.
Accuracy and MCC are among the most important measures used to evaluate the performance of ML classifiers. Based on the analysis carried out in this research, it is evident that applying the SMOTE technique increased the accuracy of four classifiers: RF from 0.95 to 0.96, DT from 0.93 to 0.94, XGBoost from 0.91 to 0.93, and MLP from 0.98 to 0.99. Only LR kept the same accuracy of 0.97 before and after SMOTE, while the remaining three classifiers, AdaBoost, KNN, and NB, did not demonstrate any gain in accuracy. These results indicate the suitability of the SMOTE technique in terms of accuracy. However, the MCC, introduced in the literature as more robust and trustworthy than balanced accuracy and the F1 score in binary classification analysis [47], is also important. From the MCC data, it is evident that most classifiers exhibited an increase after implementing SMOTE, with the greatest improvement reported for MLP (0.970), followed by LR (0.944); the MCC scores for RF, DT, and XGBoost are 0.910, 0.884, and 0.853, respectively. The AdaBoost classifier showed only a minor gain, bringing its MCC to 0.577, whereas the MCC of the KNN classifier decreased after applying SMOTE, which is consistent with its accuracy result.
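The claimed robustness of MCC over accuracy is easiest to see on a hypothetical extreme case: on a 90/10 test split, a degenerate classifier that always predicts the majority class achieves 0.9 accuracy, yet its MCC collapses to 0.

```python
import math

# Hypothetical 90/10 test set; the classifier predicts the majority
# class for every sample, so tn and fn are both zero.
tp, fp, fn, tn = 90, 10, 0, 0
acc = (tp + tn) / (tp + fp + fn + tn)
den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / den if den else 0.0
print(acc, mcc)  # 0.9 0.0
```

This is why a high accuracy paired with a low MCC, as seen for KNN after SMOTE, signals that one class is being systematically misclassified.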
4.2. Result Discussion
In classifying Russian-Ukrainian conflict-related discussions on Twitter, most basic ML classifiers improved their performance, as confirmed by the MCC score, which the literature identifies as one of the best measures for classification problems, particularly when the data are balanced using techniques such as SMOTE. The only classifier not enhanced by this approach was KNN, as validated by both its MCC and accuracy results. In fact, KNN performed considerably better on the imbalanced data than on the balanced data. This demonstrates that, despite the promise of data balancing methodologies, producing a balanced dataset may not benefit every ML classifier. It is therefore worthwhile to investigate additional data balancing methods and to evaluate and compare the performance of these classifiers under them.