The distribution of case difficulty for each dataset is displayed in Figure 2 (simulated datasets) and Figure 3 (real-world datasets).
Figure 2 shows that the simulated datasets were composed mainly of low-difficulty cases. The number of high-difficulty cases increased when a dataset had a larger class-overlap area and was not linearly separable.
The real-world datasets required dimensionality reduction before they could be plotted in a two-dimensional feature space. Therefore, t-distributed Stochastic Neighbour Embedding (t-SNE) was applied to the breast cancer data with nine features, and Factor Analysis of Mixed Data (FAMD) was used for the Telco and Customer data with nineteen and nine features, respectively [21, 22].
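For reference, the projection step can be sketched as follows. The library choices (scikit-learn for t-SNE and the prince package for FAMD) and the file names are illustrative assumptions, not the exact implementation used in this study.

```python
# Illustrative sketch of the 2-D projections used for plotting
# (library choices and file names are assumptions, not the original pipeline).
import pandas as pd
from sklearn.manifold import TSNE   # t-SNE for the numeric breast cancer features
from prince import FAMD             # FAMD for the mixed-type Telco/Customer features

# Breast cancer: nine numeric features -> two t-SNE components
X_bc = pd.read_csv("breast_cancer_features.csv")          # hypothetical file
bc_2d = TSNE(n_components=2, random_state=0).fit_transform(X_bc)

# Telco: nineteen mixed numeric/categorical features -> two FAMD components
X_telco = pd.read_csv("telco_features.csv")                # hypothetical file
famd = FAMD(n_components=2, random_state=0).fit(X_telco)
telco_2d = famd.row_coordinates(X_telco)                   # per-case 2-D coordinates
```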
Figure 3 shows that the case difficulty calculated from CDmc is primarily concentrated around 0 and 1. Furthermore, the Telco and Customer datasets exhibit broader distributions of case difficulty compared to the breast cancer data. This difference is particularly noticeable in the CDdm and CDpu plots.
The prediction performance evaluation results for the simulated and real-world datasets are given in Table 2. The multi-class classification datasets c, d, and Customer were evaluated using micro-averaged metrics. See Supplementary Tables S2-S10 for the evaluation results based on CDdm, CDpu, macro-averaging, PPV, and NPV.
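For reference, micro-averaging pools the per-class confusion-matrix counts before computing each metric. The sketch below illustrates this pooling for sensitivity and specificity; it is a generic illustration using scikit-learn, not the evaluation code behind Table 2.

```python
# Generic illustration of micro-averaged sensitivity and specificity for a
# multi-class problem (one-vs-rest pooling of confusion-matrix counts).
from sklearn.metrics import multilabel_confusion_matrix

def micro_sensitivity_specificity(y_true, y_pred):
    # One 2x2 matrix per class, laid out as [[TN, FP], [FN, TP]]
    cms = multilabel_confusion_matrix(y_true, y_pred)
    tn, fp = cms[:, 0, 0].sum(), cms[:, 0, 1].sum()
    fn, tp = cms[:, 1, 0].sum(), cms[:, 1, 1].sum()
    return tp / (tp + fn), tn / (tn + fp)   # (micro sensitivity, micro specificity)

# Toy three-class example
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
print(micro_sensitivity_specificity(y_true, y_pred))   # approx. (0.833, 0.917)
```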
Table 2. Prediction performance of the classification algorithms across the datasets, evaluated using conventional metrics and case difficulty-based metrics derived from CDmc. Bold font indicates the highest value among the models for each evaluation metric.
| Dataset | Classifiers | accuracy | sensitivity | specificity | AUC | d_accuracy (CDmc) | d_sensitivity (CDmc) | d_specificity (CDmc) | d_AUC (CDmc) |
|---|---|---|---|---|---|---|---|---|---|
| a | knn | 0.975 | 0.967 | 0.983 | 0.995 | 0.839 | 0.758 | 0.942 | 0.955 |
|   | lr | 0.980 | 0.980 | 0.980 | 0.999 | 0.915 | 0.941 | 0.890 | 0.975 |
|   | svm | 0.980 | 0.980 | 0.980 | 0.999 | 0.915 | 0.941 | 0.890 | 0.975 |
|   | nb | 0.980 | 0.980 | 0.980 | 0.999 | 0.915 | 0.941 | 0.890 | 0.975 |
|   | rf | 0.975 | 0.977 | 0.973 | 0.992 | 0.842 | 0.888 | 0.799 | 0.947 |
|   | snn | 0.973 | 0.971 | 0.976 | 0.998 | 0.819 | 0.797 | 0.843 | 0.965 |
|   | dnn | 0.977 | 0.974 | 0.980 | 0.997 | 0.864 | 0.840 | 0.890 | 0.973 |
| b | knn | 0.990 | 0.983 | 0.997 | 0.993 | 0.968 | 0.968 | 0.969 | 0.983 |
|   | lr | 0.908 | 0.906 | 0.911 | 0.976 | 0.569 | 0.591 | 0.549 | 0.654 |
|   | svm | 0.990 | 0.983 | 0.997 | 0.998 | 0.968 | 0.968 | 0.969 | 0.988 |
|   | nb | 0.898 | 0.903 | 0.894 | 0.974 | 0.508 | 0.567 | 0.458 | 0.621 |
|   | rf | 0.988 | 0.977 | 1.000 | 0.995 | 0.983 | 0.965 | 1.000 | 0.978 |
|   | snn | 0.990 | 0.983 | 0.997 | 0.997 | 0.968 | 0.968 | 0.969 | 0.988 |
|   | dnn | 0.985 | 0.983 | 0.987 | 0.998 | 0.966 | 0.968 | 0.965 | 0.994 |
| Breast cancer | knn | 0.967 | 0.925 | 0.986 | 0.997 | 0.974 | 0.920 | 1.000 | 1.000 |
|   | lr | 0.952 | 0.881 | 0.986 | 0.998 | 0.922 | 0.769 | 1.000 | 1.000 |
|   | svm | 0.962 | 0.955 | 0.965 | 0.997 | 0.973 | 1.000 | 0.958 | 1.000 |
|   | nb | 0.962 | 0.970 | 0.958 | 0.984 | 0.948 | 1.000 | 0.919 | 0.958 |
|   | rf | 0.967 | 0.940 | 0.979 | 0.995 | 0.974 | 0.927 | 1.000 | 1.000 |
|   | snn | 0.971 | 0.955 | 0.979 | 0.996 | 0.934 | 0.889 | 0.962 | 0.972 |
|   | dnn | 0.981 | 0.985 | 0.979 | 0.996 | 0.954 | 0.942 | 0.962 | 0.982 |
| Telco | knn | 0.772 | 0.639 | 0.822 | 0.822 | 0.442 | 0.787 | 0.222 | 0.487 |
|   | lr | 0.811 | 0.528 | 0.916 | 0.856 | 0.737 | 0.529 | 0.875 | 0.810 |
|   | svm | 0.808 | 0.542 | 0.907 | 0.852 | 0.664 | 0.584 | 0.743 | 0.732 |
|   | nb | 0.700 | 0.869 | 0.637 | 0.785 | 0.347 | 1.000 | 0.034 | 0.530 |
|   | rf | 0.807 | 0.517 | 0.915 | 0.855 | 0.612 | 0.476 | 0.725 | 0.610 |
|   | snn | 0.807 | 0.551 | 0.903 | 0.856 | 0.682 | 0.657 | 0.704 | 0.783 |
|   | dnn | 0.805 | 0.563 | 0.895 | 0.853 | 0.653 | 0.738 | 0.590 | 0.740 |
| c | knn | 0.881 | 0.821 | 0.911 | 0.932 | 0.773 | 0.600 | 0.841 | 0.800 |
|   | lr | 0.864 | 0.796 | 0.898 | 0.932 | 0.776 | 0.508 | 0.855 | 0.802 |
|   | svm | 0.874 | 0.811 | 0.906 | 0.939 | 0.810 | 0.594 | 0.876 | 0.863 |
|   | nb | 0.864 | 0.797 | 0.898 | 0.932 | 0.780 | 0.515 | 0.858 | 0.801 |
|   | rf | 0.880 | 0.820 | 0.910 | 0.940 | 0.780 | 0.602 | 0.848 | 0.820 |
|   | snn | 0.871 | 0.807 | 0.903 | 0.944 | 0.778 | 0.560 | 0.852 | 0.834 |
|   | dnn | 0.870 | 0.806 | 0.903 | 0.944 | 0.764 | 0.550 | 0.840 | 0.830 |
| d | knn | 0.982 | 0.973 | 0.987 | 0.998 | 0.861 | 0.801 | 0.893 | 0.938 |
|   | lr | 0.527 | 0.291 | 0.646 | 0.465 | 0.106 | 0.049 | 0.157 | 0.051 |
|   | svm | 0.978 | 0.967 | 0.983 | 0.998 | 0.828 | 0.757 | 0.867 | 0.925 |
|   | nb | 0.955 | 0.932 | 0.966 | 0.913 | 0.680 | 0.576 | 0.743 | 0.313 |
|   | rf | 0.982 | 0.973 | 0.987 | 0.997 | 0.854 | 0.793 | 0.887 | 0.894 |
|   | snn | 0.979 | 0.968 | 0.984 | 0.998 | 0.834 | 0.764 | 0.871 | 0.925 |
|   | dnn | 0.972 | 0.958 | 0.979 | 0.997 | 0.791 | 0.708 | 0.838 | 0.893 |
| Customer | knn | 0.743 | 0.487 | 0.829 | 0.757 | 0.769 | 0.357 | 0.859 | 0.760 |
|   | lr | 0.752 | 0.505 | 0.835 | 0.776 | 0.827 | 0.326 | 0.901 | 0.823 |
|   | svm | 0.763 | 0.526 | 0.842 | 0.790 | 0.845 | 0.397 | 0.911 | 0.853 |
|   | nb | 0.747 | 0.494 | 0.831 | 0.741 | 0.790 | 0.349 | 0.875 | 0.768 |
|   | rf | 0.768 | 0.537 | 0.846 | 0.800 | 0.844 | 0.438 | 0.910 | 0.853 |
|   | snn | 0.761 | 0.523 | 0.841 | 0.791 | 0.831 | 0.397 | 0.902 | 0.831 |
|   | dnn | 0.766 | 0.532 | 0.844 | 0.797 | 0.834 | 0.428 | 0.903 | 0.837 |
The performance evaluation results from the conventional and case difficulty-based metrics are compared in Figures 4 and 5. The results for the binary classification datasets are shown in Figure 4, whereas the micro-averaging results for the multi-class classification datasets are shown in Figure 5.
Figure 4 shows that considering case difficulty increases the performance differences among the models. For example, all seven ML models exhibited similar accuracy, sensitivity, specificity, and AUC for dataset a.
However, when case difficulty was considered, the performance differences between the models became more pronounced. Similarly, the dataset b results showed that applying case difficulty increased the performance gaps between the models. Furthermore, considering case difficulty sometimes led to different performance rankings across the models. For example, conventional evaluation metrics identified dnn as the best model for the breast cancer dataset, while d_accuracy (CDmc) suggested that knn and rf were the best models.
Dataset c in Figure 5 shows how the best model changed when the case difficulty-based accuracies were used. While rf was the best model based on conventional accuracy, d_accuracy (CDmc), d_accuracy (CDdm), and d_accuracy (CDpu) identified knn, svm, and knn, respectively, as the best models.
A detailed example of how case difficulty affects prediction performance is shown in Figure 6. In Figure 6(a), both knn and rf made 11 incorrect predictions and therefore achieved the same accuracy of 0.982. When case difficulty was considered, however, knn attained a higher d_accuracy (CDmc) of 0.861 compared with 0.854 for rf. Similarly, in Figure 6(b), both rf and snn made 408 incorrect predictions, giving the same accuracy of 0.807, yet snn had a d_accuracy (CDmc) of 0.682 while rf's was 0.612.
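The mechanism behind these examples can be sketched with a simple weighting scheme in which correct predictions earn credit in proportion to case difficulty: two models with identical error counts then receive different scores depending on which cases they miss. The code below is an illustrative assumption only, not the d_accuracy formula defined in the Methods, and the helper name and toy values are hypothetical.

```python
# Illustrative sketch only: NOT the paper's d_accuracy definition.
# It shows why two models with the same number of errors can receive
# different difficulty-based scores.
import numpy as np

def difficulty_weighted_accuracy(y_true, y_pred, difficulty):
    """Toy metric: correct predictions earn credit equal to case difficulty."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    difficulty = np.asarray(difficulty, dtype=float)       # in [0, 1], higher = harder
    correct = (y_true == y_pred).astype(float)
    return float((correct * difficulty).sum() / difficulty.sum())

# Two hypothetical models, each with exactly one error on five cases
difficulty = [0.9, 0.8, 0.2, 0.1, 0.1]
y_true     = [1,   0,   1,   0,   1]
model_a    = [1,   0,   1,   0,   0]   # misses the easiest case
model_b    = [0,   0,   1,   0,   1]   # misses the hardest case
print(difficulty_weighted_accuracy(y_true, model_a, difficulty))  # approx. 0.95
print(difficulty_weighted_accuracy(y_true, model_b, difficulty))  # approx. 0.57
```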