In this section we first recap the formulations for our benefit-based performance metric and benefit-based Logistic Regression algorithm, first presented in \cite{Sooklal}, and then show how these can be adapted to the multiclass scenario. We also propose an approach for multiclass cost-based Naive Bayes. \cite{Xu}'s hierarchical cost-sensitive kernel Logistic Regression (HCSKLR) is then described, since we will compare our algorithms to this approach. Finally, we explain how the HCSKLR can be altered to include our benefit-based LR for improved results.
\section{Benefit-Based Performance Metric}
Let us consider a binary classification problem where \(N\) samples belong to either class 0 or class 1. Let \(\vec{x}\) represent the feature vector of a given sample. The classifier will generate a continuous score \(s(\vec{x})\) which will be used to place the sample in either class 0 or class 1. Assuming that the classifier produces scores that are lower for samples of class 0 than class 1, we can define a threshold \(t\), where instances with scores \(s(\vec{x}) \leq t\) are placed in class 0 and instances with scores \(s(\vec{x}) > t\) are placed in class 1. Let us denote the probability density function of the scores for class 0 and class 1 instances by \(f_0(s)\) and \(f_1(s)\), respectively. Similarly, we can denote the cumulative distribution functions for both classes by \(F_0(s)\) and \(F_1(s)\).
We can also define the costs and benefits associated with each class. If we denote \(b_{ij}\) as the benefit of classifying a sample which belongs to class \(i\) as class \(j\), then we achieve a positive benefit (\(b \geq 0\)) when \(i=j\), since the sample is correctly classified. On the other hand, if \(i \neq j\) then we incur a negative benefit, or cost (\(b < 0\)), since the sample is incorrectly classified. We can also let \(\pi_j\) denote the prior probability of class \(j \in \{0,1\}\). Hence, \(\pi_0 F_0(t)\) is the probability that a sample belongs to class 0 and is correctly classified, so \(\pi_0 F_0(t) N\) is the expected number of class 0 samples classified correctly, given a threshold \(t\), where \(N\) is the total number of samples. Therefore, the overall benefit is calculated as
\begin{equation}\label{B}\begin{aligned} B(t) = & \; \pi_0 F_0(t) b_{00} + \pi_0 (1 - F_0(t)) b_{01} + \\ & \; \pi_1 F_1(t) b_{10} + \pi_1 (1 - F_1(t)) b_{11} \end{aligned}\end{equation}
given threshold \(t\).
\(b_{00}\) and \(b_{11}\) are the only benefit values which are positive, therefore we can maximize the expected benefit when \(F_0(t)=1\) and \(F_1(t)=0\). This only occurs when there is no overlap between the two distributions. Hence, we can define an upper bound for benefit as \(\pi_0 b_{00} + \pi_1 b_{11}\). If \(B_\gamma\) represents the expected benefit for the classifier \(\gamma\) then its performance metric can be defined as
$$\mu_{\gamma} \equiv \frac{B_\gamma}{\pi_0 b_{00} + \pi_1 b_{11}}$$
One should note that if \(\mu_\gamma \approx 1\) then the performance of the classifier is approximately optimal. Hence, in general, a classifier should maximize the benefit \(B(t)\) for the given benefit values. We can find the optimal threshold by differentiating $B(t)$ with respect to \(t\) and setting the derivative to zero, which yields
$$f_0(t^*) \pi_0 (b_{00} - b_{01}) = f_1(t^*) \pi_1 (b_{11} - b_{10})$$
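In practice, \(B(t)\) and the metric \(\mu\) can be estimated empirically from held-out classifier scores. The sketch below is our own illustrative helper (not code from \cite{Sooklal}): it sweeps candidate thresholds, evaluates \(B(t)\) using empirical CDFs, and normalizes the best benefit by the upper bound \(\pi_0 b_{00} + \pi_1 b_{11}\).

```python
import numpy as np

# Illustrative sketch: estimate the normalized benefit metric mu from
# scores and true labels by sweeping thresholds. The function name and
# interface are our own assumptions, not from the cited work.
def benefit_metric(scores, labels, b00, b01, b10, b11):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pi0 = np.mean(labels == 0)          # empirical prior of class 0
    pi1 = 1.0 - pi0
    best_B = -np.inf
    for t in np.unique(scores):
        # F_i(t): empirical CDF, i.e. fraction of class-i scores <= t
        F0 = np.mean(scores[labels == 0] <= t)
        F1 = np.mean(scores[labels == 1] <= t)
        B = (pi0 * F0 * b00 + pi0 * (1 - F0) * b01 +
             pi1 * F1 * b10 + pi1 * (1 - F1) * b11)
        best_B = max(best_B, B)
    # normalize by the upper bound pi0*b00 + pi1*b11
    return best_B / (pi0 * b00 + pi1 * b11)
```

For perfectly separable score distributions the best threshold achieves \(F_0(t)=1\) and \(F_1(t)=0\), so the helper returns exactly 1.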
\section{Benefit Objective with Logistic Regression (Binary Classification)}
Although in Sect. 4 we described how varying the threshold of a classifier can optimize the expected benefit it produces, the LR classifier itself does not aim to optimize the benefit function during training. Therefore, we need to modify the Logistic Regression cost function to accommodate costs and benefits. Hence, let us consider the posterior probability of the positive class as calculated by LR, that is, the logistic sigmoid of a linear function of the feature vector. For feature vector $\vec{x}_i$, this probability is
$$\begin{aligned} p_i = P(y=1 \vert \vec{x}_i) = h_\theta(\vec{x}_i) = g(\vec{\theta}^T \vec{x}_i) \end{aligned}$$
where \(h_\theta(\vec{x}_i)\) represents the classification outcome for \(\vec{x}_i\) based on \(\vec{\theta}\), the parameter vector. \(g(\cdot)\) refers to the logistic sigmoid function which is defined as
$$g(z) = \frac{1}{1 + e^{-z}}.$$
The LR cost function
$$J(\vec{\theta}) \equiv \frac{1}{N} \sum_{i=1}^N J_i(\vec{\theta})$$
is minimized in order to establish parameters \(\vec{\theta}\). \(J_i(\vec{\theta})\) is defined as
$$J_i(\vec{\theta}) = -y_i \log(h_{\theta}(\vec{x}_i)) - (1 - y_i) \log (1 - h_{\theta} (\vec{x}_i))$$
This cost function, however, does not take into consideration different costs for different types of errors, that is, false negatives and false positives. Therefore, we instead consider this cost function
$$J^B(\vec{\theta}) \equiv \frac{1}{N} \sum_{i=1}^N J^B_i(\vec{\theta})$$
which maximizes benefits \(b_{ij}\). \(J^B_i(\vec{\theta})\) is defined as
\begin{equation}\label{JBi}\begin{aligned} J^B_i(\vec{\theta}) = & \;\;y_i[h_{\theta}(\vec{x}_i) b_{11} + (1 - h_{\theta}(\vec{x}_i))b_{10}] + \\ & \;\;(1 - y_i)[h_{\theta}(\vec{x}_i) b_{01} + (1 - h_{\theta}(\vec{x}_i))b_{00}]. \end{aligned}\end{equation}
Equation \ref{JBi} can be rewritten as
$$\begin{aligned} J^B_i(\vec{\theta}) = & \;\; y_ih_{\theta}(\vec{x}_i) (b_{11} - b_{10}) + y_i b_{10} \; + \\&\;\; (1 - y_i)(1 -h_{\theta}(\vec{x}_i))(b_{00} - b_{01}) + (1 - y_i) b_{01}. \end{aligned}$$
This function will be maximized with respect to \(\vec{\theta}\); therefore, we can safely drop the terms involving \(b_{10}\) and \(b_{01}\) since they do not depend on \(\vec{\theta}\). We can then multiply by $-1$ to convert the problem into a minimization problem. The resulting function can also be divided by \((b_{11} - b_{10})\), which is positive and therefore does not affect the optimal $\vec{\theta}$. Once these changes are applied, we get the following function
\begin{equation}\label{benCostFunction} J^B_i = -y_i h_{\theta}(\vec{x}_i) - (1 - y_i)(1 - h_{\theta}(\vec{x}_i))\eta \end{equation}
where
\begin{equation}\label{eta} \eta \equiv \frac{b_{00} - b_{01}}{b_{11} - b_{10}}. \end{equation}
Equation \ref{benCostFunction} takes a form comparable to that of the LR cost function; however, we now scale the error for instances of class 0 by the factor \(\eta\). Therefore, we can use an altered version of the LR cost function which includes the scaling by $\eta$, as shown below,
$$J_i(\vec{\theta}) = -y_i \log(h_{\theta}(\vec{x}_i)) - \eta (1 - y_i) \log (1 - h_{\theta} (\vec{x}_i))$$
in order to account for benefits while training.
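A minimal sketch of training with this \(\eta\)-weighted cost is shown below. This is our own illustrative implementation (plain batch gradient descent; function names and learning settings are assumptions, not the paper's code): errors on class-0 samples are weighted by \(\eta = (b_{00} - b_{01}) / (b_{11} - b_{10})\), which is exactly the gradient of the weighted cross-entropy above.

```python
import numpy as np

# Sketch: Logistic Regression trained with the eta-weighted cost function.
# Class-0 errors are scaled by eta; class-1 errors keep weight 1.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_benefit_lr(X, y, b00, b01, b10, b11, lr=0.1, iters=2000):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    eta = (b00 - b01) / (b11 - b10)
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend bias column
    theta = np.zeros(Xb.shape[1])
    w = np.where(y == 1, 1.0, eta)               # per-sample weight: 1 or eta
    for _ in range(iters):
        h = sigmoid(Xb @ theta)
        grad = Xb.T @ (w * (h - y)) / len(y)     # gradient of weighted cost
        theta -= lr * grad
    return theta

def predict_proba(theta, X):
    Xb = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    return sigmoid(Xb @ theta)
```

When \(\eta = 1\) (symmetric benefits) this reduces to ordinary Logistic Regression; \(\eta > 1\) pushes the decision boundary toward the class-1 side, penalizing false positives more heavily.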
\section{Multiclass Benefit-Based Logistic Regression}
One common approach for dealing with multiclass LR is the one-vs-all (or one-vs-rest) method, where for a dataset with \(K\) classes, $K$ binary LR classifiers are used \cite{mlm, tds}. For example, if a dataset contains 3 classes, \(A\), \(B\) and \(C\), then 3 binary LR classifiers would be created as follows:
LR classifier 1: A vs \{B,C\}
LR classifier 2: B vs \{A,C\}
LR classifier 3: C vs \{A,B\}
If \(\vec{x}\) represents the feature vector for a given sample, each classifier generates a score \(s_i(\vec{x})\), where \(i \in \{1,...,K\}\). This score represents the probability of the sample belonging to class \(i\). For example, for LR classifier 1 above, \(s_1(\vec{x})\) represents the probability of $\vec{x}$ belonging to class A.
The scores from all \(K\) classifiers are then compared and the classifier with the maximum score is chosen in order to determine the class with the highest probability and hence the predicted class for \(\vec{x}\) . That is, if the classifier for class \(j\) has the highest probability, then $\vec{x}$ will be classified as class $j$.
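The decision step above is simply an argmax over the $K$ per-class scores. A minimal sketch (with hypothetical per-class scoring functions, not tied to any particular library):

```python
# Sketch of the one-vs-rest decision: each binary classifier scores the
# sample, and the class whose classifier reports the highest probability
# wins. `classifiers` is a list of hypothetical scoring functions, where
# classifiers[i](x) approximates the probability that x belongs to class i.
def ovr_predict(classifiers, x):
    scores = [s(x) for s in classifiers]
    return max(range(len(scores)), key=lambda i: scores[i])
```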
In order to incorporate benefits and costs into this approach we first need to replace the traditional binary LR with our benefit-based Logistic Regression. Let us now consider the benefit matrix in Table \ref{table:benefit-matrix}.
\begin{table}\centering
\caption{Benefit Matrix for Benefit-Based Classification}
\label{table:benefit-matrix}
\renewcommand{\arraystretch}{1.2}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
\backslashbox{\bf A\footnotemark[1]}{\bf P\footnotemark[2]} & {\bf 0} & {\bf 1} & {\bf 2} & {\bf ...} & {\bf K}\\
\hline
{\bf 0} & \(b_{00}\) & \(b_{01}\) & \(b_{02}\) & ... & \(b_{0K}\)\\
\hline
{\bf 1} & \(b_{10}\) & \(b_{11}\) & \(b_{12}\) & ... & \(b_{1K}\)\\
\hline
{\bf 2} & \(b_{20}\) & \(b_{21}\) & \(b_{22}\) & ... & \(b_{2K}\)\\
\hline
{\bf ...} & ... & ... & ... & ... & ...\\
\hline
{\bf K} & \(b_{K0}\) & \(b_{K1}\) & \(b_{K2}\) & ... & \(b_{KK}\)\\
\hline
\end{tabular}
\footnotetext[1]{Actual Class}
\footnotetext[2]{Predicted Class}
\end{table}
All classes have associated benefits for correct classification (\(b_{ij} \geq 0\), \(i = j\)) and different costs for misclassification (\(b_{ij} < 0\), \(i \neq j\)), depending on the class an instance is misclassified as. For example, following from the scenario above, if an instance of class A is correctly classified then there would be a benefit associated with it. However, if the sample is misclassified as either B or C then there would be two different costs for the two types of misclassification errors. If misclassifying an instance of class A as class B is more severe than misclassifying it as class C, then the cost for misclassifying the sample as B would be higher than the cost of misclassifying it as C, and vice versa. Similarly, there are different costs for misclassifying instances of B and C as class A.
Hence, when applying benefits and costs to the one-vs-all method, we must combine the benefits and costs of the classes in the ``all/rest'' part of the classifier in order to determine values for \(b_{00}, b_{01}\) and \(b_{10}\). These values can be calculated for all K classifiers as follows
$$b_{00} = \sum_{i=1}^{K} \pi_i b_{ii}, \text{ for } i \neq k,$$
$$b_{01} = \sum_{i=1}^{K} \pi_i b_{ik}, \text{ for } i \neq k,$$
and
$$b_{10} = \sum_{i=1}^{K} \pi_i b_{ki}, \text{ for } i \neq k$$
where \(k\) denotes the classifier for class $k$ and \(k \in \{1,...,K\}\). Note that \(b_{11} = b_{kk}\). \(\eta\) and \(B\) can then be calculated using Equations \ref{eta} and \ref{B}, respectively.
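The collapse of the $K \times K$ benefit matrix into the four binary benefits for classifier $k$ can be written directly. The sketch below is our own helper (zero-based indexing; `B` has actual classes as rows and predicted classes as columns, `pi` holds the priors) and also returns the resulting \(\eta\):

```python
# Sketch: collapse the full benefit matrix into the binary benefits
# (and eta) for the one-vs-rest classifier of class k, following
# b00 = sum_i pi_i b_ii, b01 = sum_i pi_i b_ik, b10 = sum_i pi_i b_ki
# (all sums over i != k), and b11 = b_kk.
def binary_benefits(B, pi, k):
    idx = [i for i in range(len(B)) if i != k]
    b00 = sum(pi[i] * B[i][i] for i in idx)   # "rest" correctly kept as rest
    b01 = sum(pi[i] * B[i][k] for i in idx)   # "rest" misclassified as k
    b10 = sum(pi[i] * B[k][i] for i in idx)   # class k misclassified as rest
    b11 = B[k][k]                             # class k correctly classified
    eta = (b00 - b01) / (b11 - b10)
    return b00, b01, b10, b11, eta
```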
Note that for the overall benefit of the multiclass benefit-based LR classifier, Equation \ref{B} can be adjusted to include all options from Table \ref{table:benefit-matrix} as follows
$$\begin{aligned} B(t) = \sum_{i=1}^{K} \sum_{j=1}^{K} \pi_i F_{ij}(t) b_{ij} \end{aligned}$$
where \(F_{ij}(t)\) denotes the fraction of class-\(i\) samples that are assigned to class \(j\).
\section{Multiclass Cost-Based Naive Bayes}
We consider a dataset consisting of \(N\) samples with each sample having \(M\) attributes. Each sample belongs to exactly one of \(K\) classes. Let \(V_{ij}\) be used to denote the cost of predicting a category \(j\) when the true category is \(i\). We assume categorical attributes and that continuous attributes are clustered to form categories. Suppose that we need to classify some new sample with attributes \(\vec{x}\). Using Naive Bayes, the probability of this sample lying in category \(C_k\) is given by
$$\begin{aligned} p(C_k \vert \vec{x}) = \frac{p(C_k)\prod_{i=1}^{M} p(x_i \vert C_k)}{p(\vec{x})} \end{aligned}$$
The probability \(p(x_i \vert C_k)\) is computed as the frequency with which attribute value \(x_i\) occurs among the given samples of category \(C_k\). We now compute the expected cost \(F(C_k)\) of classifying the sample in category $C_k$. This is given by
$$F(C_k) = \sum_{q=1}^{K} p(C_q \vert \vec{x})V_{qk}$$
Therefore, if we want to minimize the expected cost, we minimize \(F(C_k)\) over \(k\), so the prediction is
\begin{equation}\label{ck} C_{k*} = \underset{k}{\arg \min} \mspace{5mu} F(C_k) = \underset{k}{\arg \min} \sum_{q=1}^{K} p(C_{q} \vert \vec{x})V_{qk} \end{equation}
This can be simplified to
$$C_{k*} = \underset{k}{\arg \min} \sum_{q=1}^{K} \left\{ V_{qk} \mspace{5mu} p(C_q) \prod_{i=1}^{M} p(x_i \vert C_q) \right\}$$
Let us consider a special case: \(V_{ij} = 0\) for \(i = j\) and \(V_{ij} = 1\) otherwise, which corresponds to maximizing accuracy. Consider Equation \ref{ck}:
$$\begin{aligned} C_{k\ast} &= \underset{k}{\arg \min} \sum_{q=1}^{K} p(C_q \vert \vec{x})V_{qk} \\ &= \underset{k}{\arg \min} \sum_{q \neq k} p(C_q \vert \vec{x}) \\ &= \underset{k}{\arg \min} \{1 - p(C_k \vert \vec{x})\} \\ &= \underset{k}{\arg \max} \; p(C_k \vert \vec{x}) \end{aligned}$$
Hence the category with the highest probability is chosen, as one would expect.
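The expected-cost decision can be sketched in a few lines. This is our own illustrative helper (names and data structures are assumptions): `prior[q]` and `cond[q][i][v]`, the probability of attribute $i$ taking value $v$ in class $q$, are assumed to be estimated from training counts, and the common $p(\vec{x})$ factor cancels across classes.

```python
from math import prod

# Sketch of cost-based Naive Bayes prediction: choose the class C_k
# minimizing sum_q V[q][k] * p(C_q) * prod_i p(x_i | C_q).
def predict_min_cost(x, prior, cond, V):
    K = len(prior)
    # unnormalized posterior for each class (p(x) cancels in the argmin)
    post = [prior[q] * prod(cond[q][i][v] for i, v in enumerate(x))
            for q in range(K)]
    # expected cost of predicting each class k
    cost = [sum(post[q] * V[q][k] for q in range(K)) for k in range(K)]
    return min(range(K), key=lambda k: cost[k])
```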
\section{Hierarchical Cost-Sensitive Kernel Logistic Regression}
\cite{Xu} presented a hierarchical cost-sensitive kernel Logistic Regression, which they tested on a face recognition system. For generality, we describe their approach independently of this use case. Let \(K\) denote the number of minority classes (the minority group) and \(M\) the number of majority classes (the majority group). The labels for the two groups are \(y = D_1, D_2, ..., D_K\) for the minority group and \(y = H_1, H_2, ..., H_M\) for the majority group. The cost of misclassifying an instance \(\vec{x}\) can be one of the following:
\(C_{HD}\) - cost of misclassifying a majority group instance as minority group
\(C_{DH}\) - cost of misclassifying a minority group instance as majority group
\(C_{DD}\) - cost of misclassifying a minority group instance as the wrong one of the \(K\) minority groups
There is no cost for correctly classifying instances, and the cost \(C_{DD}\) is the same for misclassifications between any of the \(K\) minority groups. Since all majority instances form a single group, the total number of classes is \((K + 1)\). For a cost-sensitive scenario with $(K + 1)$ classes and hypothesis \(\phi(\vec{x})\), the loss function
\begin{equation}\label{Xu_eq}
\text{loss}(\vec{x}, \phi(\vec{x}))=
\begin{cases}
\sum\limits_{\substack{k=1 \\ k \neq \nu}}^{K} \boldsymbol{P}\left(D_{k}\vert\vec{x}\right) C_{DD} + \boldsymbol{P}(H \vert \vec{x})\, C_{HD} & \text{ if } \phi(\vec{x})=D_{\nu} \\
\sum\limits_{k=1}^{K} \boldsymbol{P}\left(D_{k} \vert \vec{x}\right) C_{DH} & \text{ if } \phi(\vec{x})=H
\end{cases}
\end{equation}
where \(\boldsymbol{P}(D_k \vert \vec{x})\) and \(\boldsymbol{P}(H \vert \vec{x})\) represent \(\boldsymbol{P}(y = D_k \vert \vec{x})\) and \(\boldsymbol{P}(y = H \vert \vec{x})\), respectively, should be minimized. If the misclassification costs happen to be equal, then Equation \ref{Xu_eq} essentially reduces to classifying an instance based on the highest posterior probability, that is, the same as traditional classification approaches.
Let us now consider the multiclass cost-sensitive kernel Logistic Regression proposed by \cite{Zhang}. If
$$\boldsymbol{P}(D \vert \vec{x}) = \sum\limits_{k=1}^{K} \boldsymbol{P}(D_{k} \vert \vec{x})$$
then, from Equation \ref{Xu_eq} we get
\begin{equation}\label{loss}\begin{aligned} \text{loss} (\vec{x},D_{\nu})= &\;\;\boldsymbol{P}(D\vert \vec{x})C_{DD}-\boldsymbol{P}(D_{\nu}\vert \vec{x})C_{DD} \\& \;\; +\boldsymbol{P}(H\vert \vec{x})C_{HD}. \end{aligned}\end{equation}
Considering \(\boldsymbol{P}(D \vert \vec{x}) + \boldsymbol{P}(H \vert \vec{x}) = 1\), then Equation \ref{loss} becomes
$$\begin{aligned} \text{loss} (\vec{x}, D_{\nu}) = &\;\;(1-\boldsymbol{P}(H\vert \vec{x}))C_{DD}-\boldsymbol{P}(D_{\nu}\vert \vec{x})C_{DD}\\& \;\; +\boldsymbol{P}(H\vert \vec{x})C_{HD}\\ = & \;\;C_{DD}+\boldsymbol{P}(H\vert \vec{x})(C_{HD}-C_{DD}) \\& \;\; -\boldsymbol{P}(D_{\nu}\vert \vec{x})C_{DD}. \end{aligned}$$
We also get
$$\text{loss} (\vec{x},H)= \sum_{k=1}^{K}\boldsymbol{P}(D_{k}\vert \vec{x})C_{DH}=\boldsymbol{P}(D\vert \vec{x})C_{DH}.$$
Consequently, for \((K + 1)\) classes, the objective functions become
\begin{equation}\label{obj_fn}
\begin{cases} C_{DD}+\boldsymbol{P}(H\vert \vec{x})(C_{HD}-C_{DD})-\boldsymbol{P}(D_{1}\vert \vec{x})C_{DD}\\ \vdots\\ C_{DD}+\boldsymbol{P}(H\vert \vec{x})(C_{HD}-C_{DD})-\boldsymbol{P}(D_{K}\vert \vec{x})C_{DD}\\ \boldsymbol{P}(D\vert \vec{x})C_{DH} \end{cases}
\end{equation}
If from every option listed in Equation \ref{obj_fn} we subtract \(C_{DD}+\boldsymbol{P}(H\vert \vec{x})(C_{HD}-C_{DD})\) and then divide by \(-C_{DD}\), then we can solve the classification problem by selecting the maximum value from
\begin{equation}\label{cost_sen}
\begin{cases} \boldsymbol{P}(D_{1} \vert \vec{x}) \\ \vdots \\ \boldsymbol{P}(D_{K} \vert \vec{x}) \\ \beta \boldsymbol{P}(H \vert \vec{x})-\Delta \end{cases}
\end{equation}
where we define \(\beta\) as
$$\beta=\frac{C_{DH}+C_{HD}-C_{DD}}{C_{DD}}$$
and \(\Delta\) as
$$\Delta=\frac{C_{DH}-C_{DD}}{C_{DD}}.$$
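The selection among these \((K+1)\) options can be written directly. The sketch below is our own helper (names are illustrative): `p_minor` lists \(\boldsymbol{P}(D_k \vert \vec{x})\), `p_major` is \(\boldsymbol{P}(H \vert \vec{x})\), and the function returns the chosen label.

```python
# Sketch of the cost-sensitive decision: choose the largest value among
# P(D_1|x), ..., P(D_K|x) and beta*P(H|x) - delta, where beta and delta
# are derived from the misclassification costs as above.
def cs_decision(p_minor, p_major, C_DD, C_DH, C_HD):
    beta = (C_DH + C_HD - C_DD) / C_DD
    delta = (C_DH - C_DD) / C_DD
    options = list(p_minor) + [beta * p_major - delta]
    best = max(range(len(options)), key=lambda i: options[i])
    # the last option corresponds to the majority group H
    return 'H' if best == len(p_minor) else 'D' + str(best + 1)
```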
Let us now consider the cost-blind approach to multiclass Logistic Regression proposed by \cite{Zhu}. For a given sample \(\vec{x}\), the conditional probability is defined as
$$\boldsymbol{P}(D_{\nu}\vert \vec{x})= \frac{e^{f_{\nu}(\vec{x})}}{1+\sum_{k=1}^{K}e^{f_{k}(\vec{x})}}, \text{ for } \nu = 1,...,K$$
and
$$\boldsymbol{P}(H\vert \vec{x})= \frac{1}{1+\sum_{k=1}^{K}e^{f_{k}(\vec{x})}}.$$
For each class, the decision function now becomes
\begin{equation}\label{dec_fn}
\begin{cases} f_{\nu}= \ln\frac{\boldsymbol{P}(D_{\nu}\vert \vec{x})}{\boldsymbol{P}(H\vert \vec{x})}, & \text{ for } \nu = 1,...,K\\ f_{0}= \ln\frac{\boldsymbol{P}(H\vert \vec{x})}{\boldsymbol{P}(H\vert \vec{x})}=0. \end{cases}
\end{equation}
In order to classify an instance, the maximum value from Equation \ref{dec_fn} is selected. In order to include cost-sensitivity, the function \(f_{h}^{\ast}\) is defined as
$$f_{h}^{\ast}= \ln\frac{\beta \boldsymbol{P}(H\vert \vec{x})-\Delta}{\boldsymbol{P}(H\vert \vec{x})}.$$
The Bayes decision rule
\begin{equation}\label{bayes}\phi(\vec{x})=\begin{cases} D_{\nu} & \text{if}\ f_{\nu}\ \text{is the maximum} \\ H & \text{if}\ f_{h}^{\ast}\ \text{is the maximum} \end{cases}\end{equation}
can then be used to classify an instance.
It should be noted that selecting the maximum value from Equation \ref{cost_sen} is equivalent to selecting the maximum value from Equation \ref{bayes}. Therefore, the training steps are the same for both cost-sensitive and cost-blind Logistic Regression, and cost-sensitivity is achieved in the testing step using Equation \ref{bayes}.
However, using the observations from Equation \ref{Xu_eq} and the fact that the minority instances are analyzed in the later stages of the above algorithms, these approaches can be simplified into a hierarchical approach. In particular, a binary cost-sensitive classification algorithm can first be used to separate minority group samples from majority group samples. Then, since the costs of misclassifying minority group instances as other minority groups are all the same, a cost-blind multiclass classifier can be used on the minority group samples.
For binary cost-sensitive classification, a sample \(\vec{x}\) is classified as minority group if \(\boldsymbol{P}(D\vert \vec{x}) \geq p^\ast\), where \(p^\ast\) is defined as
$$p^{\ast}= \frac{C_{HD}}{C_{HD}+C_{DH}}.$$
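The two-stage decision can be sketched as follows. This is our own illustration (names are assumptions): `p_minor_group(x)` estimates \(\boldsymbol{P}(D \vert \vec{x})\) from the first-stage binary classifier, and `minor_model(x)` is the second-stage cost-blind multiclass classifier over the minority labels.

```python
# Sketch of the hierarchical scheme: a binary cost-sensitive stage
# separates minority from majority using the threshold p*, then a
# cost-blind multiclass stage labels the minority samples.
def hierarchical_classify(x, p_minor_group, minor_model, C_HD, C_DH):
    p_star = C_HD / (C_HD + C_DH)
    if p_minor_group(x) >= p_star:
        return minor_model(x)   # second, cost-blind multiclass stage
    return 'H'                  # majority group
```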
\section{Hierarchical Approach with Benefit-Based Logistic Regression}
In order to merge the hierarchical approach proposed in Sect. 8 with the benefit-based Logistic Regression from Sect. 5, we can perform the following steps:
First, the binary cost-sensitive separation of minority group samples from majority group samples can be performed using the benefit-based Logistic Regression in place of the cost-sensitive kernel Logistic Regression. The minority group samples can then be classified using traditional cost-blind multiclass Logistic Regression.