In this section we first recap the formulations for our benefit-based performance metric and benefit-based Logistic Regression algorithm, first presented in \cite{Sooklal}, and then show how these can be adapted to the multiclass scenario. We also propose an approach for multiclass cost-based Naive Bayes. \cite{Xu}'s hierarchical cost-sensitive kernel Logistic Regression (HCSKLR) is then described, since we will compare our algorithms to this approach. Finally, we explain how the HCSKLR can be altered to include our benefit-based LR for improved results.
\section{Benefit-Based Performance Metric}
Let us consider a binary classification problem where \(N\) samples belong to either class 0 or class 1. Let \(\vec{x}\) represent the feature vector of a given sample. The classifier will generate a continuous score \(s(\vec{x})\) which will be used to place the sample in either class 0 or class 1. Assuming that the classifier produces scores that are lower for samples of class 0 than class 1, we can define a threshold \(t\), where instances with scores \(s(\vec{x}) \leq t\) are placed in class 0 and instances with scores \(s(\vec{x}) > t\) are placed in class 1. Let us denote the probability density function of the scores for class 0 and class 1 instances by \(f_0(s)\) and \(f_1(s)\), respectively. Similarly, we can denote the cumulative distribution functions for both classes by \(F_0(s)\) and \(F_1(s)\).
We can also define the costs and benefits associated with each class. If we denote \(b_{ij}\) as the benefit of classifying a sample which belongs to class \(i\) as class \(j\), then we achieve a positive benefit (\(b \geq 0\)) when \(i=j\), since the sample is correctly classified. On the other hand, if \(i \neq j\) then we incur a negative benefit, or cost (\(b < 0\)), since the sample is incorrectly classified. We can also let \(\pi_j\) denote the prior probability of class \(j \in \{0,1\}\). Hence, \(\pi_0 F_0(t)\) is the probability that a sample belongs to class 0 and is correctly classified, so \(\pi_0 F_0(t) N\) is the expected number of class 0 samples classified correctly, given a threshold \(t\), where \(N\) is the total number of samples. Therefore, the overall benefit is calculated as
\begin{equation}\label{B}\begin{aligned} B(t) = & \; \pi_0 F_0(t) b_{00} + \pi_0 (1 - F_0(t)) b_{01} + \\ & \; \pi_1 F_1(t) b_{10} + \pi_1 (1 - F_1(t)) b_{11} \end{aligned}\end{equation}
given threshold \(t\).
\(b_{00}\) and \(b_{11}\) are the only benefit values which are positive, therefore we can maximize the expected benefit when \(F_0(t)=1\) and \(F_1(t)=0\). This only occurs when there is no overlap between the two distributions. Hence, we can define an upper bound for benefit as \(\pi_0 b_{00} + \pi_1 b_{11}\). If \(B_\gamma\) represents the expected benefit for the classifier \(\gamma\) then its performance metric can be defined as
$$\mu_{\gamma} \equiv \frac{B_\gamma}{\pi_0 b_{00} + \pi_1 b_{11}}$$
One should note that if \(\mu_\gamma \approx 1\) then the performance of the classifier is approximately optimal. Hence, in general, a classifier should maximize the benefit \(B(t)\) for the given benefit values. We can find the optimal threshold by differentiating $B(t)$ with respect to \(t\) and setting the derivative to zero, which yields
$$f_0(t^*) \pi_0 (b_{00} - b_{01}) = f_1(t^*) \pi_1 (b_{11} - b_{10})$$
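In practice, \(B(t)\) and the metric \(\mu\) can be estimated empirically from held-out classifier scores. The sketch below is our own illustrative helper (not code from \cite{Sooklal}): it sweeps candidate thresholds, evaluates \(B(t)\) using empirical CDFs, and normalizes the best benefit by the upper bound \(\pi_0 b_{00} + \pi_1 b_{11}\).

```python
import numpy as np

# Illustrative sketch: estimate the normalized benefit metric mu from
# scores and true labels by sweeping thresholds. The function name and
# interface are our own assumptions, not from the cited work.
def benefit_metric(scores, labels, b00, b01, b10, b11):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pi0 = np.mean(labels == 0)          # empirical prior of class 0
    pi1 = 1.0 - pi0
    best_B = -np.inf
    for t in np.unique(scores):
        # F_i(t): empirical CDF, i.e. fraction of class-i scores <= t
        F0 = np.mean(scores[labels == 0] <= t)
        F1 = np.mean(scores[labels == 1] <= t)
        B = (pi0 * F0 * b00 + pi0 * (1 - F0) * b01 +
             pi1 * F1 * b10 + pi1 * (1 - F1) * b11)
        best_B = max(best_B, B)
    # normalize by the upper bound pi0*b00 + pi1*b11
    return best_B / (pi0 * b00 + pi1 * b11)
```

For perfectly separable score distributions the best threshold achieves \(F_0(t)=1\) and \(F_1(t)=0\), so the helper returns exactly 1.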
\section{Benefit Objective with Logistic Regression (Binary Classification)}
Although in Sect. 4 we described how varying the threshold of a classifier can optimize the expected benefit it produces, the LR classifier itself does not aim to optimize the benefit function during training. Therefore, we need to modify the Logistic Regression cost function to accommodate costs and benefits. Hence, let us consider the posterior probability of the positive class as calculated by LR, that is, the logistic sigmoid of a linear function of the feature vector. For feature vector $\vec{x}_i$, this probability is
$$\begin{aligned} p_i = P(y=1 \vert \vec{x}_i) = h_\theta(\vec{x}_i) = g(\vec{\theta}^T \vec{x}_i) \end{aligned}$$
where \(h_\theta(\vec{x}_i)\) represents the classification outcome for \(\vec{x}_i\) based on \(\vec{\theta}\), the parameter vector. \(g(\cdot)\) refers to the logistic sigmoid function which is defined as
$$g(z) = \frac{1}{1 + e^{-z}}.$$
The LR cost function
$$J(\vec{\theta}) \equiv \frac{1}{N} \sum_{i=1}^N J_i(\vec{\theta})$$
is minimized in order to establish parameters \(\vec{\theta}\). \(J_i(\vec{\theta})\) is defined as
$$J_i(\vec{\theta}) = -y_i \log(h_{\theta}(\vec{x}_i)) - (1 - y_i) \log (1 - h_{\theta} (\vec{x}_i))$$
This cost function, however, does not take into consideration different costs for different types of errors, that is, false negatives and false positives. Therefore, we instead consider this cost function
$$J^B(\vec{\theta}) \equiv \frac{1}{N} \sum_{i=1}^N J^B_i(\vec{\theta})$$
which maximizes benefits \(b_{ij}\). \(J^B_i(\vec{\theta})\) is defined as
\begin{equation}\label{JBi}\begin{aligned} J^B_i(\vec{\theta}) = & \;\;y_i[h_{\theta}(\vec{x}_i) b_{11} + (1 - h_{\theta}(\vec{x}_i))b_{10}] + \\ & \;\;(1 - y_i)[h_{\theta}(\vec{x}_i) b_{01} + (1 - h_{\theta}(\vec{x}_i))b_{00}]. \end{aligned}\end{equation}
Equation \ref{JBi} can be rewritten as
$$\begin{aligned} J^B_i(\vec{\theta}) = & \;\; y_ih_{\theta}(\vec{x}_i) (b_{11} - b_{10}) + y_i b_{10} \; + \\&\;\; (1 - y_i)(1 -h_{\theta}(\vec{x}_i))(b_{00} - b_{01}) + (1 - y_i) b_{01}. \end{aligned}$$
This function will be maximized with respect to \(\vec{\theta}\); therefore, we can safely drop the terms involving \(b_{10}\) and \(b_{01}\) since they do not depend on \(\vec{\theta}\). We can then multiply by $-1$ to convert the problem into a minimization problem. The resulting function can also be divided by \((b_{11} - b_{10})\), which is positive and therefore does not affect the optimal $\vec{\theta}$. Once these changes are applied, we get the following function
\begin{equation}\label{benCostFunction} J^B_i = -y_i h_{\theta}(\vec{x}_i) - (1 - y_i)(1 - h_{\theta}(\vec{x}_i))\eta \end{equation}
where
\begin{equation}\label{eta} \eta \equiv \frac{b_{00} - b_{01}}{b_{11} - b_{10}}. \end{equation}
Equation \ref{benCostFunction} takes a form comparable to that of the LR cost function; however, we now scale the error for instances of class 0 by the factor \(\eta\). Therefore, we can use an altered version of the LR cost function which includes the scaling by $\eta$, as shown below,
$$J_i(\vec{\theta}) = -y_i \log(h_{\theta}(\vec{x}_i)) - \eta (1 - y_i) \log (1 - h_{\theta} (\vec{x}_i))$$
in order to account for benefits while training.
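A minimal sketch of training with this \(\eta\)-weighted cost is shown below. This is our own illustrative implementation (plain batch gradient descent; function names and learning settings are assumptions, not the paper's code): errors on class-0 samples are weighted by \(\eta = (b_{00} - b_{01}) / (b_{11} - b_{10})\), which is exactly the gradient of the weighted cross-entropy above.

```python
import numpy as np

# Sketch: Logistic Regression trained with the eta-weighted cost function.
# Class-0 errors are scaled by eta; class-1 errors keep weight 1.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_benefit_lr(X, y, b00, b01, b10, b11, lr=0.1, iters=2000):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    eta = (b00 - b01) / (b11 - b10)
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend bias column
    theta = np.zeros(Xb.shape[1])
    w = np.where(y == 1, 1.0, eta)               # per-sample weight: 1 or eta
    for _ in range(iters):
        h = sigmoid(Xb @ theta)
        grad = Xb.T @ (w * (h - y)) / len(y)     # gradient of weighted cost
        theta -= lr * grad
    return theta

def predict_proba(theta, X):
    Xb = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    return sigmoid(Xb @ theta)
```

When \(\eta = 1\) (symmetric benefits) this reduces to ordinary Logistic Regression; \(\eta > 1\) pushes the decision boundary toward the class-1 side, penalizing false positives more heavily.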
\section{Multiclass Benefit-Based Logistic Regression}
One common approach for dealing with multiclass LR is the one-vs-all (or one-vs-rest) method, where for a dataset with \(K\) classes, $K$ binary LR classifiers are used \cite{mlm, tds}. For example, if a dataset contains 3 classes, \(A\), \(B\) and \(C\), then 3 binary LR classifiers would be created as follows:
LR classifier 1: A vs \{B,C\}
LR classifier 2: B vs \{A,C\}
LR classifier 3: C vs \{A,B\}
If \(\vec{x}\) represents the feature vector for a given sample, each classifier generates a score \(s_i(\vec{x})\), where \(i \in \{1,...,K\}\). This score represents the probability of the sample belonging to class \(i\). For example, for LR classifier 1 above, \(s_1(\vec{x})\) represents the probability of $\vec{x}$ belonging to class A.
The scores from all \(K\) classifiers are then compared and the classifier with the maximum score is chosen in order to determine the class with the highest probability and hence the predicted class for \(\vec{x}\) . That is, if the classifier for class \(j\) has the highest probability, then $\vec{x}$ will be classified as class $j$.
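The decision step above is simply an argmax over the $K$ per-class scores. A minimal sketch (with hypothetical per-class scoring functions, not tied to any particular library):

```python
# Sketch of the one-vs-rest decision: each binary classifier scores the
# sample, and the class whose classifier reports the highest probability
# wins. `classifiers` is a list of hypothetical scoring functions, where
# classifiers[i](x) approximates the probability that x belongs to class i.
def ovr_predict(classifiers, x):
    scores = [s(x) for s in classifiers]
    return max(range(len(scores)), key=lambda i: scores[i])
```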
In order to incorporate benefits and costs into this approach we first need to replace the traditional binary LR with our benefit-based Logistic Regression. Let us now consider the benefit matrix in Table \ref{table:benefit-matrix}.
\begin{table}\centering
\caption{Benefit Matrix for Benefit-Based Classification}
\label{table:benefit-matrix}
\renewcommand{\arraystretch}{1.2}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
\backslashbox{\bf A\footnotemark[1]}{\bf P\footnotemark[2]} & {\bf 0} & {\bf 1} & {\bf 2} & {\bf ...} & {\bf K}\\
\hline
{\bf 0} & \(b_{00}\) & \(b_{01}\) & \(b_{02}\) & ... & \(b_{0K}\)\\
\hline
{\bf 1} & \(b_{10}\) & \(b_{11}\) & \(b_{12}\) & ... & \(b_{1K}\)\\
\hline
{\bf 2} & \(b_{20}\) & \(b_{21}\) & \(b_{22}\) & ... & \(b_{2K}\)\\
\hline
{\bf ...} & ... & ... & ... & ... & ...\\
\hline
{\bf K} & \(b_{K0}\) & \(b_{K1}\) & \(b_{K2}\) & ... & \(b_{KK}\)\\
\hline
\end{tabular}
\footnotetext[1]{Actual Class}
\footnotetext[2]{Predicted Class}
\end{table}
All classes have associated benefits for correct classification (\(b_{ij} \geq 0\), \(i = j\)) and different costs for misclassification (\(b_{ij} < 0\), \(i \neq j\)), depending on the class an instance is misclassified as. For example, following from the scenario above, if an instance of class A is correctly classified then there would be a benefit associated with it. However, if the sample is misclassified as either B or C then there would be two different costs for the two types of misclassification errors. If misclassifying an instance of class A as class B is more severe than misclassifying it as class C, then the cost for misclassifying the sample as B would be higher than the cost of misclassifying it as C, and vice versa. Similarly, there are different costs for misclassifying instances of B and C as class A.
Hence, when applying benefits and costs to the one-vs-all method, we must combine the benefits and costs of the classes in the ``all/rest'' part of the classifier in order to determine values for \(b_{00}, b_{01}\) and \(b_{10}\). These values can be calculated for all K classifiers as follows
$$b_{00} = \sum_{i=1}^{K} \pi_i b_{ii}, \text{ for } i \neq k,$$
$$b_{01} = \sum_{i=1}^{K} \pi_i b_{ik}, \text{ for } i \neq k,$$
and
$$b_{10} = \sum_{i=1}^{K} \pi_i b_{ki}, \text{ for } i \neq k$$
where \(k\) denotes the classifier for class $k$ and \(k \in \{1,...,K\}\). Note that \(b_{11} = b_{kk}\). \(\eta\) and \(B\) can then be calculated using Equations \ref{eta} and \ref{B}, respectively.
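The collapse of the $K \times K$ benefit matrix into the four binary benefits for classifier $k$ can be written directly. The sketch below is our own helper (zero-based indexing; `B` has actual classes as rows and predicted classes as columns, `pi` holds the priors) and also returns the resulting \(\eta\):

```python
# Sketch: collapse the full benefit matrix into the binary benefits
# (and eta) for the one-vs-rest classifier of class k, following
# b00 = sum_i pi_i b_ii, b01 = sum_i pi_i b_ik, b10 = sum_i pi_i b_ki
# (all sums over i != k), and b11 = b_kk.
def binary_benefits(B, pi, k):
    idx = [i for i in range(len(B)) if i != k]
    b00 = sum(pi[i] * B[i][i] for i in idx)   # "rest" correctly kept as rest
    b01 = sum(pi[i] * B[i][k] for i in idx)   # "rest" misclassified as k
    b10 = sum(pi[i] * B[k][i] for i in idx)   # class k misclassified as rest
    b11 = B[k][k]                             # class k correctly classified
    eta = (b00 - b01) / (b11 - b10)
    return b00, b01, b10, b11, eta
```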
Note that for the overall benefit of the multiclass benefit-based LR classifier, Equation \ref{B} can be adjusted to include all options from Table \ref{table:benefit-matrix} as follows
$$\begin{aligned} B(t) = \sum_{i=1}^{K} \sum_{j=1}^{K} \pi_i F_{ij}(t) b_{ij} \end{aligned}$$
where \(F_{ij}(t)\) denotes the fraction of class-\(i\) samples that are assigned to class \(j\).
\section{Multiclass Cost-Based Naive Bayes}
We consider a dataset consisting of \(N\) samples with each sample having \(M\) attributes. Each sample belongs to exactly one of \(K\) classes. Let \(V_{ij}\) be used to denote the cost of predicting a category \(j\) when the true category is \(i\). We assume categorical attributes and that continuous attributes are clustered to form categories. Suppose that we need to classify some new sample with attributes \(\vec{x}\). Using Naive Bayes, the probability of this sample lying in category \(C_k\) is given by
$$\begin{aligned} p(C_k \vert \vec{x}) = \frac{p(C_k)\prod_{i=1}^{M} p(x_i \vert C_k)}{p(\vec{x})} \end{aligned}$$
The probability \(p(x_i \vert C_k)\) is computed as the frequency with which attribute value \(x_i\) occurs among the given samples of category \(C_k\). We now compute the expected cost \(F(C_k)\) of classifying the sample in category $C_k$. This is given by
$$F(C_k) = \sum_{q=1}^{K} p(C_q \vert \vec{x})V_{qk}$$
Therefore, if we want to minimize the expected cost, we minimize \(F(C_k)\) over \(k\), so the prediction is
\begin{equation}\label{ck} C_{k*} = \underset{k}{\arg \min} \mspace{5mu} F(C_k) = \underset{k}{\arg \min} \sum_{q=1}^{K} p(C_{q} \vert \vec{x})V_{qk} \end{equation}
This can be simplified to
$$C_{k*} = \underset{k}{\arg \min} \sum_{q=1}^{K} \left\{ V_{qk} \mspace{5mu} p(C_q) \prod_{i=1}^{M} p(x_i \vert C_q) \right\}$$
Let us consider a special case: \(V_{ij} = 0\) for \(i = j\) and \(V_{ij} = 1\) otherwise, which corresponds to maximizing accuracy. Consider Equation \ref{ck}:
$$\begin{aligned} C_{k\ast} &= \underset{k}{\arg \min} \sum_{q=1}^{K} p(C_q \vert \vec{x})V_{qk} \\ &= \underset{k}{\arg \min} \sum_{q \neq k} p(C_q \vert \vec{x}) \\ &= \underset{k}{\arg \min} \{1 - p(C_k \vert \vec{x})\} \\ &= \underset{k}{\arg \max} \; p(C_k \vert \vec{x}) \end{aligned}$$
Hence the category with the highest probability is chosen, as one would expect.
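The expected-cost decision can be sketched in a few lines. This is our own illustrative helper (names and data structures are assumptions): `prior[q]` and `cond[q][i][v]`, the probability of attribute $i$ taking value $v$ in class $q$, are assumed to be estimated from training counts, and the common $p(\vec{x})$ factor cancels across classes.

```python
from math import prod

# Sketch of cost-based Naive Bayes prediction: choose the class C_k
# minimizing sum_q V[q][k] * p(C_q) * prod_i p(x_i | C_q).
def predict_min_cost(x, prior, cond, V):
    K = len(prior)
    # unnormalized posterior for each class (p(x) cancels in the argmin)
    post = [prior[q] * prod(cond[q][i][v] for i, v in enumerate(x))
            for q in range(K)]
    # expected cost of predicting each class k
    cost = [sum(post[q] * V[q][k] for q in range(K)) for k in range(K)]
    return min(range(K), key=lambda k: cost[k])
```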
\section{Hierarchical Cost-Sensitive Kernel Logistic Regression}
\cite{Xu} presented a hierarchical cost-sensitive kernel Logistic Regression, which they tested on a face recognition system. For generality, we describe their approach independently of this use case. Let \(K\) denote the number of minority classes (the minority group) and \(M\) the number of majority classes (the majority group). The labels for the two groups are \(y = D_1, D_2, ..., D_K\) for the minority group and \(y = H_1, H_2, ..., H_M\) for the majority group. The cost of misclassifying an instance \(\vec{x}\) can be one of the following:
\(C_{HD}\) - cost of misclassifying a majority group instance as minority group
\(C_{DH}\) - cost of misclassifying a minority group instance as majority group
\(C_{DD}\) - cost of misclassifying a minority group instance as the wrong one of the \(K\) minority groups
There is no cost for correctly classifying instances, and the cost \(C_{DD}\) is the same for misclassifications between any of the \(K\) minority groups. Since all majority instances form a single group, the total number of classes is \((K + 1)\). For a cost-sensitive scenario with $(K + 1)$ classes and hypothesis \(\phi(\vec{x})\), the loss function
\begin{equation}\label{Xu_eq}
\text{loss}(\vec{x}, \phi(\vec{x}))=
\begin{cases}
\sum\limits_{\substack{k=1 \\ k \neq \nu}}^{K} \boldsymbol{P}\left(D_{k}\vert\vec{x}\right) C_{DD} + \boldsymbol{P}(H \vert \vec{x})\, C_{HD} & \text{ if } \phi(\vec{x})=D_{\nu} \\
\sum\limits_{k=1}^{K} \boldsymbol{P}\left(D_{k} \vert \vec{x}\right) C_{DH} & \text{ if } \phi(\vec{x})=H
\end{cases}
\end{equation}
where \(\boldsymbol{P}(D_k \vert \vec{x})\) and \(\boldsymbol{P}(H \vert \vec{x})\) represent \(\boldsymbol{P}(y = D_k \vert \vec{x})\) and \(\boldsymbol{P}(y = H \vert \vec{x})\), respectively, should be minimized. If the misclassification costs happen to be equal, then Equation \ref{Xu_eq} essentially reduces to classifying an instance based on the highest posterior probability, that is, the same as traditional classification approaches.
Let us now consider the multiclass cost-sensitive kernel Logistic Regression proposed by \cite{Zhang}. If
$$\boldsymbol{P}(D \vert \vec{x}) = \sum\limits_{k=1}^{K} \boldsymbol{P}(D_{k} \vert \vec{x})$$
then, from Equation \ref{Xu_eq} we get
\begin{equation}\label{loss}\begin{aligned} \text{loss} (\vec{x},D_{\nu})= &\;\;\boldsymbol{P}(D\vert \vec{x})C_{DD}-\boldsymbol{P}(D_{\nu}\vert \vec{x})C_{DD} \\& \;\; +\boldsymbol{P}(H\vert \vec{x})C_{HD}. \end{aligned}\end{equation}
Considering \(\boldsymbol{P}(D \vert \vec{x}) + \boldsymbol{P}(H \vert \vec{x}) = 1\), then Equation \ref{loss} becomes
$$\begin{aligned} \text{loss} (\vec{x}, D_{\nu}) = &\;\;(1-\boldsymbol{P}(H\vert \vec{x}))C_{DD}-\boldsymbol{P}(D_{\nu}\vert \vec{x})C_{DD}\\& \;\; +\boldsymbol{P}(H\vert \vec{x})C_{HD}\\ = & \;\;C_{DD}+\boldsymbol{P}(H\vert \vec{x})(C_{HD}-C_{DD}) \\& \;\; -\boldsymbol{P}(D_{\nu}\vert \vec{x})C_{DD}. \end{aligned}$$
We also get
$$\text{loss} (\vec{x},H)= \sum_{k=1}^{K}\boldsymbol{P}(D_{k}\vert \vec{x})C_{DH}=\boldsymbol{P}(D\vert \vec{x})C_{DH}.$$
Consequently, for \((K + 1)\) classes, the objective functions become
\begin{equation}\label{obj_fn}
\begin{cases} C_{DD}+\boldsymbol{P}(H\vert \vec{x})(C_{HD}-C_{DD})-\boldsymbol{P}(D_{1}\vert \vec{x})C_{DD}\\ \vdots\\ C_{DD}+\boldsymbol{P}(H\vert \vec{x})(C_{HD}-C_{DD})-\boldsymbol{P}(D_{K}\vert \vec{x})C_{DD}\\ \boldsymbol{P}(D\vert \vec{x})C_{DH} \end{cases}
\end{equation}
If from every option listed in Equation \ref{obj_fn} we subtract \(C_{DD}+\boldsymbol{P}(H\vert \vec{x})(C_{HD}-C_{DD})\) and then divide by \(-C_{DD}\), then we can solve the classification problem by selecting the maximum value from
\begin{equation}\label{cost_sen}
\begin{cases} \boldsymbol{P}(D_{1} \vert \vec{x}) \\ \vdots \\ \boldsymbol{P}(D_{K} \vert \vec{x}) \\ \beta \boldsymbol{P}(H \vert \vec{x})-\Delta \end{cases}
\end{equation}
where we define \(\beta\) as
$$\beta=\frac{C_{DH}+C_{HD}-C_{DD}}{C_{DD}}$$
and \(\Delta\) as
$$\Delta=\frac{C_{DH}-C_{DD}}{C_{DD}}.$$
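The selection among these \((K+1)\) options can be written directly. The sketch below is our own helper (names are illustrative): `p_minor` lists \(\boldsymbol{P}(D_k \vert \vec{x})\), `p_major` is \(\boldsymbol{P}(H \vert \vec{x})\), and the function returns the chosen label.

```python
# Sketch of the cost-sensitive decision: choose the largest value among
# P(D_1|x), ..., P(D_K|x) and beta*P(H|x) - delta, where beta and delta
# are derived from the misclassification costs as above.
def cs_decision(p_minor, p_major, C_DD, C_DH, C_HD):
    beta = (C_DH + C_HD - C_DD) / C_DD
    delta = (C_DH - C_DD) / C_DD
    options = list(p_minor) + [beta * p_major - delta]
    best = max(range(len(options)), key=lambda i: options[i])
    # the last option corresponds to the majority group H
    return 'H' if best == len(p_minor) else 'D' + str(best + 1)
```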
Let us now consider the cost-blind approach to multiclass Logistic Regression proposed by \cite{Zhu}. For a given sample \(\vec{x}\), the conditional probability is defined as
$$\boldsymbol{P}(D_{\nu}\vert \vec{x})= \frac{e^{f_{\nu}(\vec{x})}}{1+\sum_{k=1}^{K}e^{f_{k}(\vec{x})}}, \text{ for } \nu = 1,...,K$$
and
$$\boldsymbol{P}(H\vert \vec{x})= \frac{1}{1+\sum_{k=1}^{K}e^{f_{k}(\vec{x})}}.$$
For each class, the decision function now becomes
\begin{equation}\label{dec_fn}
\begin{cases} f_{\nu}= \ln\frac{\boldsymbol{P}(D_{\nu}\vert \vec{x})}{\boldsymbol{P}(H\vert \vec{x})}, & \text{ for } \nu = 1,...,K\\ f_{0}= \ln\frac{\boldsymbol{P}(H\vert \vec{x})}{\boldsymbol{P}(H\vert \vec{x})}=0. \end{cases}
\end{equation}
In order to classify an instance, the maximum value from Equation \ref{dec_fn} is selected. In order to include cost-sensitivity, the function \(f_{h}^{\ast}\) is defined as
$$f_{h}^{\ast}= \ln\frac{\beta \boldsymbol{P}(H\vert \vec{x})-\Delta}{\boldsymbol{P}(H\vert \vec{x})}.$$
The Bayes decision rule
\begin{equation}\label{bayes}\phi(\vec{x})=\begin{cases} D_{\nu} & \text{if}\ f_{\nu}\ \text{is the maximum} \\ H & \text{if}\ f_{h}^{\ast}\ \text{is the maximum} \end{cases}\end{equation}
can then be used to classify an instance.
It should be noted that selecting the maximum value from Equation \ref{cost_sen} is equivalent to selecting the maximum value from Equation \ref{bayes}. Therefore, the training steps are the same for both cost-sensitive and cost-blind Logistic Regression, and cost-sensitivity is achieved in the testing step using Equation \ref{bayes}.
However, using the observations from Equation \ref{Xu_eq} and the fact that the minority instances are analyzed in the later stages of the above algorithms, these approaches can be simplified into a hierarchical approach. In particular, a binary cost-sensitive classification algorithm can first be used to separate minority group samples from majority group samples. Then, since the costs of misclassifying minority group instances as other minority groups are all the same, a cost-blind multiclass classifier can be used on the minority group samples.
For binary cost-sensitive classification, a sample \(\vec{x}\) is classified as minority group if \(\boldsymbol{P}(D\vert \vec{x}) \geq p^\ast\), where \(p^\ast\) is defined as
$$p^{\ast}= \frac{C_{HD}}{C_{HD}+C_{DH}}.$$
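The two-stage decision can be sketched as follows. This is our own illustration (names are assumptions): `p_minor_group(x)` estimates \(\boldsymbol{P}(D \vert \vec{x})\) from the first-stage binary classifier, and `minor_model(x)` is the second-stage cost-blind multiclass classifier over the minority labels.

```python
# Sketch of the hierarchical scheme: a binary cost-sensitive stage
# separates minority from majority using the threshold p*, then a
# cost-blind multiclass stage labels the minority samples.
def hierarchical_classify(x, p_minor_group, minor_model, C_HD, C_DH):
    p_star = C_HD / (C_HD + C_DH)
    if p_minor_group(x) >= p_star:
        return minor_model(x)   # second, cost-blind multiclass stage
    return 'H'                  # majority group
```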
\section{Hierarchical Approach with Benefit-Based Logistic Regression}
In order to merge the hierarchical approach proposed in Sect. 8 with the benefit-based Logistic Regression from Sect. 5, we can perform the following steps:
First, the binary cost-sensitive separation of minority group samples from majority group samples can be performed using the benefit-based Logistic Regression in place of the cost-sensitive kernel Logistic Regression. The minority group samples can then be classified using traditional cost-blind multiclass Logistic Regression.