The distribution of case difficulty for each dataset is displayed in Figure 2 (simulated datasets) and Figure 3 (real-world datasets).
Figure 2 shows that the simulated datasets were composed mainly of low-difficulty cases. The number of high-difficulty cases increased when a dataset had a larger class-overlap area and was not linearly separable.
The real-world datasets required dimensionality reduction before they could be plotted in a two-dimensional feature space. Therefore, t-distributed Stochastic Neighbour Embedding (t-SNE) was applied to the breast cancer data with nine features, and Factor Analysis of Mixed Data (FAMD) was used for the Telco and Customer data with nineteen and nine features, respectively [21, 22].
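For reference, the projection step can be sketched as follows. The library choices (scikit-learn for t-SNE and the prince package for FAMD) and the file names are illustrative assumptions, not the exact implementation used in this study.

```python
# Illustrative sketch of the 2-D projections used for plotting
# (library choices and file names are assumptions, not the original pipeline).
import pandas as pd
from sklearn.manifold import TSNE   # t-SNE for the numeric breast cancer features
from prince import FAMD             # FAMD for the mixed-type Telco/Customer features

# Breast cancer: nine numeric features -> two t-SNE components
X_bc = pd.read_csv("breast_cancer_features.csv")          # hypothetical file
bc_2d = TSNE(n_components=2, random_state=0).fit_transform(X_bc)

# Telco: nineteen mixed numeric/categorical features -> two FAMD components
X_telco = pd.read_csv("telco_features.csv")                # hypothetical file
famd = FAMD(n_components=2, random_state=0).fit(X_telco)
telco_2d = famd.row_coordinates(X_telco)                   # per-case 2-D coordinates
```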
Figure 3 shows that the case difficulty calculated from CDmc is primarily concentrated around 0 and 1. Furthermore, the Telco and Customer datasets exhibit broader distributions of case difficulty compared to the breast cancer data. This difference is particularly noticeable in the CDdm and CDpu plots.
The prediction performance evaluation results for the simulated and real-world datasets are given in Table 2. The multi-class classification datasets c, d, and Customer were evaluated using micro-averaged metrics. See Supplementary Tables S2-S10 for the evaluation results based on CDdm, CDpu, macro-averaging, PPV, and NPV.
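For reference, micro-averaging pools the per-class confusion-matrix counts before computing each metric. The sketch below illustrates this pooling for sensitivity and specificity; it is a generic illustration using scikit-learn, not the evaluation code behind Table 2.

```python
# Generic illustration of micro-averaged sensitivity and specificity for a
# multi-class problem (one-vs-rest pooling of confusion-matrix counts).
from sklearn.metrics import multilabel_confusion_matrix

def micro_sensitivity_specificity(y_true, y_pred):
    # One 2x2 matrix per class, laid out as [[TN, FP], [FN, TP]]
    cms = multilabel_confusion_matrix(y_true, y_pred)
    tn, fp = cms[:, 0, 0].sum(), cms[:, 0, 1].sum()
    fn, tp = cms[:, 1, 0].sum(), cms[:, 1, 1].sum()
    return tp / (tp + fn), tn / (tn + fp)   # (micro sensitivity, micro specificity)

# Toy three-class example
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
print(micro_sensitivity_specificity(y_true, y_pred))   # approx. (0.833, 0.917)
```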
Table 2. Prediction performance of the classification algorithms across the datasets, evaluated using conventional metrics and case difficulty-based metrics derived from CDmc. Bold font indicates the highest value among the models for each evaluation metric.
| Dataset | Classifiers | accuracy | sensitivity | specificity | AUC | d_accuracy (CDmc) | d_sensitivity (CDmc) | d_specificity (CDmc) | d_AUC (CDmc) |
|---|---|---|---|---|---|---|---|---|---|
| a | knn | 0.975 | 0.967 | 0.983 | 0.995 | 0.839 | 0.758 | 0.942 | 0.955 |
|   | lr | 0.980 | 0.980 | 0.980 | 0.999 | 0.915 | 0.941 | 0.890 | 0.975 |
|   | svm | 0.980 | 0.980 | 0.980 | 0.999 | 0.915 | 0.941 | 0.890 | 0.975 |
|   | nb | 0.980 | 0.980 | 0.980 | 0.999 | 0.915 | 0.941 | 0.890 | 0.975 |
|   | rf | 0.975 | 0.977 | 0.973 | 0.992 | 0.842 | 0.888 | 0.799 | 0.947 |
|   | snn | 0.973 | 0.971 | 0.976 | 0.998 | 0.819 | 0.797 | 0.843 | 0.965 |
|   | dnn | 0.977 | 0.974 | 0.980 | 0.997 | 0.864 | 0.840 | 0.890 | 0.973 |
| b | knn | 0.990 | 0.983 | 0.997 | 0.993 | 0.968 | 0.968 | 0.969 | 0.983 |
|   | lr | 0.908 | 0.906 | 0.911 | 0.976 | 0.569 | 0.591 | 0.549 | 0.654 |
|   | svm | 0.990 | 0.983 | 0.997 | 0.998 | 0.968 | 0.968 | 0.969 | 0.988 |
|   | nb | 0.898 | 0.903 | 0.894 | 0.974 | 0.508 | 0.567 | 0.458 | 0.621 |
|   | rf | 0.988 | 0.977 | 1.000 | 0.995 | 0.983 | 0.965 | 1.000 | 0.978 |
|   | snn | 0.990 | 0.983 | 0.997 | 0.997 | 0.968 | 0.968 | 0.969 | 0.988 |
|   | dnn | 0.985 | 0.983 | 0.987 | 0.998 | 0.966 | 0.968 | 0.965 | 0.994 |
| Breast cancer | knn | 0.967 | 0.925 | 0.986 | 0.997 | 0.974 | 0.920 | 1.000 | 1.000 |
|   | lr | 0.952 | 0.881 | 0.986 | 0.998 | 0.922 | 0.769 | 1.000 | 1.000 |
|   | svm | 0.962 | 0.955 | 0.965 | 0.997 | 0.973 | 1.000 | 0.958 | 1.000 |
|   | nb | 0.962 | 0.970 | 0.958 | 0.984 | 0.948 | 1.000 | 0.919 | 0.958 |
|   | rf | 0.967 | 0.940 | 0.979 | 0.995 | 0.974 | 0.927 | 1.000 | 1.000 |
|   | snn | 0.971 | 0.955 | 0.979 | 0.996 | 0.934 | 0.889 | 0.962 | 0.972 |
|   | dnn | 0.981 | 0.985 | 0.979 | 0.996 | 0.954 | 0.942 | 0.962 | 0.982 |
| Telco | knn | 0.772 | 0.639 | 0.822 | 0.822 | 0.442 | 0.787 | 0.222 | 0.487 |
|   | lr | 0.811 | 0.528 | 0.916 | 0.856 | 0.737 | 0.529 | 0.875 | 0.810 |
|   | svm | 0.808 | 0.542 | 0.907 | 0.852 | 0.664 | 0.584 | 0.743 | 0.732 |
|   | nb | 0.700 | 0.869 | 0.637 | 0.785 | 0.347 | 1.000 | 0.034 | 0.530 |
|   | rf | 0.807 | 0.517 | 0.915 | 0.855 | 0.612 | 0.476 | 0.725 | 0.610 |
|   | snn | 0.807 | 0.551 | 0.903 | 0.856 | 0.682 | 0.657 | 0.704 | 0.783 |
|   | dnn | 0.805 | 0.563 | 0.895 | 0.853 | 0.653 | 0.738 | 0.590 | 0.740 |
| c | knn | 0.881 | 0.821 | 0.911 | 0.932 | 0.773 | 0.600 | 0.841 | 0.800 |
|   | lr | 0.864 | 0.796 | 0.898 | 0.932 | 0.776 | 0.508 | 0.855 | 0.802 |
|   | svm | 0.874 | 0.811 | 0.906 | 0.939 | 0.810 | 0.594 | 0.876 | 0.863 |
|   | nb | 0.864 | 0.797 | 0.898 | 0.932 | 0.780 | 0.515 | 0.858 | 0.801 |
|   | rf | 0.880 | 0.820 | 0.910 | 0.940 | 0.780 | 0.602 | 0.848 | 0.820 |
|   | snn | 0.871 | 0.807 | 0.903 | 0.944 | 0.778 | 0.560 | 0.852 | 0.834 |
|   | dnn | 0.870 | 0.806 | 0.903 | 0.944 | 0.764 | 0.550 | 0.840 | 0.830 |
| d | knn | 0.982 | 0.973 | 0.987 | 0.998 | 0.861 | 0.801 | 0.893 | 0.938 |
|   | lr | 0.527 | 0.291 | 0.646 | 0.465 | 0.106 | 0.049 | 0.157 | 0.051 |
|   | svm | 0.978 | 0.967 | 0.983 | 0.998 | 0.828 | 0.757 | 0.867 | 0.925 |
|   | nb | 0.955 | 0.932 | 0.966 | 0.913 | 0.680 | 0.576 | 0.743 | 0.313 |
|   | rf | 0.982 | 0.973 | 0.987 | 0.997 | 0.854 | 0.793 | 0.887 | 0.894 |
|   | snn | 0.979 | 0.968 | 0.984 | 0.998 | 0.834 | 0.764 | 0.871 | 0.925 |
|   | dnn | 0.972 | 0.958 | 0.979 | 0.997 | 0.791 | 0.708 | 0.838 | 0.893 |
| Customer | knn | 0.743 | 0.487 | 0.829 | 0.757 | 0.769 | 0.357 | 0.859 | 0.760 |
|   | lr | 0.752 | 0.505 | 0.835 | 0.776 | 0.827 | 0.326 | 0.901 | 0.823 |
|   | svm | 0.763 | 0.526 | 0.842 | 0.790 | 0.845 | 0.397 | 0.911 | 0.853 |
|   | nb | 0.747 | 0.494 | 0.831 | 0.741 | 0.790 | 0.349 | 0.875 | 0.768 |
|   | rf | 0.768 | 0.537 | 0.846 | 0.800 | 0.844 | 0.438 | 0.910 | 0.853 |
|   | snn | 0.761 | 0.523 | 0.841 | 0.791 | 0.831 | 0.397 | 0.902 | 0.831 |
|   | dnn | 0.766 | 0.532 | 0.844 | 0.797 | 0.834 | 0.428 | 0.903 | 0.837 |
The performance evaluation results from the conventional and case difficulty-based metrics are compared in Figures 4 and 5. The results for the binary classification datasets are shown in Figure 4, whereas the micro-averaging results for the multi-class classification datasets are shown in Figure 5.
Figure 4 shows that considering case difficulty increases the performance differences among the models. For example, all seven ML models exhibited similar accuracy, sensitivity, specificity, and AUC for dataset a.
However, when case difficulty was considered, the performance differences between the models became more pronounced. Similarly, the dataset b results showed that applying case difficulty increased the performance gaps between the models. Furthermore, considering case difficulty sometimes led to different performance rankings across the models. For example, conventional evaluation metrics identified dnn as the best model for the breast cancer dataset, while d_accuracy (CDmc) suggested that knn and rf were the best models.
Dataset c in Figure 5 shows how the best model changed when the case difficulty-based accuracies were used. While rf was the best model based on conventional accuracy, d_accuracy (CDmc), d_accuracy (CDdm), and d_accuracy (CDpu) identified knn, svm, and knn, respectively, as the best models.
A detailed example of how case difficulty affects prediction performance is shown in Figure 6. In Figure 6(a), both knn and rf made 11 incorrect predictions and therefore achieved the same accuracy of 0.982. When case difficulty was considered, however, knn attained a higher d_accuracy (CDmc) of 0.861 compared with 0.854 for rf. Similarly, in Figure 6(b), both rf and snn made 408 incorrect predictions, giving the same accuracy of 0.807, yet snn had a d_accuracy (CDmc) of 0.682 while rf's was 0.612.
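The mechanism behind these examples can be sketched with a simple weighting scheme in which correct predictions earn credit in proportion to case difficulty: two models with identical error counts then receive different scores depending on which cases they miss. The code below is an illustrative assumption only, not the d_accuracy formula defined in the Methods, and the helper name and toy values are hypothetical.

```python
# Illustrative sketch only: NOT the paper's d_accuracy definition.
# It shows why two models with the same number of errors can receive
# different difficulty-based scores.
import numpy as np

def difficulty_weighted_accuracy(y_true, y_pred, difficulty):
    """Toy metric: correct predictions earn credit equal to case difficulty."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    difficulty = np.asarray(difficulty, dtype=float)       # in [0, 1], higher = harder
    correct = (y_true == y_pred).astype(float)
    return float((correct * difficulty).sum() / difficulty.sum())

# Two hypothetical models, each with exactly one error on five cases
difficulty = [0.9, 0.8, 0.2, 0.1, 0.1]
y_true     = [1,   0,   1,   0,   1]
model_a    = [1,   0,   1,   0,   0]   # misses the easiest case
model_b    = [0,   0,   1,   0,   1]   # misses the hardest case
print(difficulty_weighted_accuracy(y_true, model_a, difficulty))  # approx. 0.95
print(difficulty_weighted_accuracy(y_true, model_b, difficulty))  # approx. 0.57
```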