2.2. Prediction accuracy metrics
To measure predictive performance (i.e., calibration and discrimination ability) under the various sampling strategies, several classification accuracy measures are adopted in this study. One limitation of log-likelihood (LL) based measures is that they do not provide an absolute measure of accuracy: a model with a comparatively higher LL can still have low predictive power in absolute terms. The absolute accuracy measures discussed and used in our study are therefore useful, as they can produce benchmark values against which other models with similar characteristics can be evaluated. We provide a brief overview of each accuracy measure in this section.
One of the simplest and most common measures used in the transportation literature is McFadden’s pseudo rho-squared, which is computed on the log-likelihood scale. It measures the share of variation explained by the fitted MNL model for a given dataset and lies in the interval [0,1]; the higher the value, the better the model. In our study, the rho-squared measure is estimated with the initial log-likelihood taken with respect to the “equal shares for all alternatives” model; other choices of the initial log-likelihood would not allow the split-sample and cross-validation strategies to be undertaken. It can be expressed by the following equation:
$${\rho ^2}_{{McFadden}}\,=\,\,1\,\, - \,\,\frac{{LL\left( {\hat {\theta }} \right)}}{{LL\left( 0 \right)}}$$
1
where \(LL\left( {\hat {\theta }} \right)\) is the log-likelihood of the fitted model and \(LL\left( 0 \right)\) is the initial log-likelihood of the equal-shares model.
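As an illustration, a minimal sketch of this computation in Python, assuming a matrix of predicted choice probabilities `P` (one row per observation, one column per alternative) and a vector `y` of chosen-alternative indices; both names are hypothetical and not from the original study:

```python
import numpy as np

def mcfadden_rho_squared(P, y):
    """McFadden's pseudo rho-squared against the equal-shares null model.

    P : (T, N) array of predicted choice probabilities from the fitted model.
    y : (T,) array of chosen-alternative indices (0 .. N-1).
    """
    T, N = P.shape
    # Log-likelihood of the fitted model: sum of log probabilities of the chosen alternatives.
    ll_fitted = np.sum(np.log(P[np.arange(T), y]))
    # Initial log-likelihood of the "equal shares for all alternatives" model (probability 1/N each).
    ll_null = T * np.log(1.0 / N)
    return 1.0 - ll_fitted / ll_null
```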
Another very popular classification accuracy measure is the percentage correctly classified by the model, which evaluates the calibration ability of the model by comparing the overall predicted market shares against the overall observed shares for the given alternatives. In this approach, the alternative with the highest predicted probability is taken as the predicted choice and is then compared with the actually chosen alternative for that observation; the pair is considered correctly classified if the predicted alternative is the same as the observed one. The percentage correctly classified for a given model is thus the number of correct pairs divided by the number of observations. Consider a cross-classification table in which entry Nij, for row i and column j, denotes the number of observations for which the model predicts alternative i while the actually chosen alternative is j. The row-wise sums then give the predicted (expected) shares and the column-wise sums give the observed shares for each alternative, and the sum of the diagonal entries gives the total number of correctly classified observations, which divided by the sample size yields the proportion of correct predictions. One of the primary limitations of this approach is that it does not take the predicted probability into account when matching the pairs: if the predicted probability of an outcome is substantially higher or lower than the observed/actual probability, the approach cannot differentiate between the different probabilities assigned to a chosen alternative (de Luca & Cantarella, 2009).
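A brief sketch of this calculation, again with the hypothetical `P` and `y` arrays introduced above:

```python
import numpy as np

def percent_correctly_classified(P, y):
    """Percentage correctly classified and the cross-classification table.

    P : (T, N) array of predicted choice probabilities.
    y : (T,) array of observed chosen-alternative indices (0 .. N-1).
    """
    T, N = P.shape
    predicted = P.argmax(axis=1)          # alternative with the highest predicted probability
    table = np.zeros((N, N), dtype=int)   # rows: predicted alternative, columns: observed alternative
    for pred, obs in zip(predicted, y):
        table[pred, obs] += 1
    pct_correct = np.trace(table) / T * 100.0
    return pct_correct, table
```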
A more robust way to approximate correct classification is defined by McFadden (1978), which examines the proportion of successful predictions (i.e., calibration ability), by alternative and overall. This measure can be calculated as follows:
The index for alternative i is:
$${I_i}\,=\,\frac{{\sum\nolimits_{t} {{y_{ti}}{P_{ti}}} }}{{\sum\nolimits_{t} {{P_{ti}}} }}$$
2
where t indexes the observations, \({P_{ti}}\) is the predicted probability of choosing alternative i in observation t, and \({y_{ti}}=1\) when alternative i is chosen in observation t and 0 otherwise.
The numerator, \(\sum\nolimits_{t} {{y_{ti}}{P_{ti}}}\), is the model’s expected number of successful predictions for alternative i, summed over the observations in which i was actually chosen, while the denominator, \(\sum\nolimits_{t} {{P_{ti}}}\), is the model’s expected number of individuals choosing alternative i (which, in MNL, also equals the observed number of those who choose i).
The overall index is:
$$I\,=\,\frac{{\sum\nolimits_{i} {\left( {\sum\nolimits_{t} {{y_{ti}}{P_{ti}}} } \right)} }}{{\sum\nolimits_{i} {\left( {\sum\nolimits_{t} {{P_{ti}}} } \right)} }}$$
3
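A short sketch of the per-alternative index \(I_i\) and the overall index \(I\), using the same hypothetical `P` and `y` arrays as above:

```python
import numpy as np

def prediction_success_indices(P, y):
    """McFadden's proportion of successful predictions, by alternative and overall.

    P : (T, N) array of predicted choice probabilities.
    y : (T,) array of observed chosen-alternative indices (0 .. N-1).
    """
    T, N = P.shape
    Y = np.zeros((T, N))
    Y[np.arange(T), y] = 1.0                          # y_ti indicator matrix
    per_alternative = (Y * P).sum(axis=0) / P.sum(axis=0)   # I_i for each alternative
    overall = (Y * P).sum() / P.sum()                        # overall index I
    return per_alternative, overall
```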
Furthermore, we consider the Brier score for evaluating the calibration ability of the model outcomes (Brier, 1950; Kruppa et al., 2014). This approach is applicable when predictions assign probabilities to a set of mutually exclusive discrete outcomes and the probabilities assigned across the alternatives sum to one; in this way, the Brier score can be used to evaluate the outcomes at the disaggregate level. The Brier score measures the mean squared difference between the predicted probability \({P_{ti}}\) assigned to the possible outcome for observation t and the actual outcome \({y_{ti}}\) (both at an individual level), and varies in the interval [0,2]. Hence, the lower the Brier score, the better the model’s forecasts, meaning that the predicted probabilities correspond more closely to the observed outcomes. The Brier score is computed as follows:
$$BS\,=\,\frac{1}{{NT}}\sum\limits_{{i=1}}^{N} {\sum\limits_{{t=1}}^{T} {{{\left( {{y_{ti}} - {P_{ti}}} \right)}^2}} }$$
4
where T is the total number of observations, N is the total number of alternatives, and \({y_{ti}}=1\) when alternative i is chosen in observation t and 0 otherwise.
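A minimal sketch of this computation, following the normalization of the equation above and using the hypothetical `P` and `y` arrays introduced earlier:

```python
import numpy as np

def brier_score(P, y):
    """Multiclass Brier score, normalized by N*T as in the equation above.

    P : (T, N) array of predicted choice probabilities.
    y : (T,) array of observed chosen-alternative indices (0 .. N-1).
    """
    T, N = P.shape
    Y = np.zeros((T, N))
    Y[np.arange(T), y] = 1.0          # one-hot encoding of the observed choices
    return np.sum((Y - P) ** 2) / (N * T)
```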
Next, we include the Polytomous Discrimination Index (PDI), a set-based approach built on the probability of correctly discriminating between a set of alternatives. It evaluates all possible sets of k cases (each case containing the predicted probabilities for every category), with one case drawn from each outcome category/alternative (Van Calster et al., 2012). It is desirable that, for the alternative under consideration, the predicted probability is highest for the case that actually belongs to that category; hence, PDI can provide a more robust evaluation of predictability in terms of the clarity of the predictions. Each set is assigned a score equal to the number of categories for which this condition holds; denoting this number by kc, the score assigned to the set is kc / k, which can vary between 0 and 1 in steps of 1 / k. For illustration, consider a choice variable with four alternatives, as in our study. Taking one case from each chosen-alternative group, assume the set of probabilities from the case where alternative 1 is chosen is {0.45, 0.35, 0.05, 0.15}, for alternative 2: {0.05, 0.45, 0.4, 0.1}, for alternative 3: {0.20, 0.30, 0.35, 0.15}, and for alternative 4: {0.3, 0.05, 0.1, 0.55}. The probabilities for alternatives 1, 2 and 4 are highest for their corresponding cases, but the probability for alternative 3 is highest in the case corresponding to alternative 2 (0.4 versus 0.35). Thus, the model correctly identifies the cases for alternatives 1, 2 and 4 but not for alternative 3, resulting in a score of 3/4 for this set. The category-specific PDI is defined as the probability of correctly identifying the case from category i within a set of k cases, and averaging the category-specific PDI values over all categories gives the overall PDI for the model’s predictions. In mathematical formulation:
$$PD{I_i}\,=\,\frac{1}{{{N_1} \cdots {N_k}}}\sum\limits_{{{n_1}=1}}^{{{N_1}}} \cdots \sum\limits_{{{n_k}=1}}^{{{N_k}}} {{C_i}\left( {{p_{{n_1}}}, \ldots ,{p_{{n_k}}}} \right)}$$
5
and the PDI as
$$PDI\,\,=\,\,\frac{1}{k}\sum\limits_{{i=1}}^{k} {PD{I_i}}$$
6
where \({C_i}\left( {{p_{{n_1}}}, \ldots ,{p_{{n_k}}}} \right)\) takes the value 1 if \({p_{i,{n_i}}}>{p_{i,{n_j}}}\) for all \(j \ne i\), and 0 otherwise.
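As an illustrative sketch (not the implementation used in the cited studies), the category-specific and overall PDI can be estimated by Monte Carlo sampling of sets, again with the hypothetical `P` and `y` arrays:

```python
import numpy as np

def pdi(P, y, n_sets=100_000, seed=0):
    """Monte Carlo estimate of the Polytomous Discrimination Index.

    P : (T, k) array of predicted choice probabilities.
    y : (T,) array of observed chosen-alternative indices (0 .. k-1).
    Each sampled set contains one randomly drawn case per category; the case
    from category i counts as correctly identified when its probability for
    category i is the strict maximum of that category's probabilities in the set.
    """
    rng = np.random.default_rng(seed)
    k = P.shape[1]
    cases_by_cat = [np.where(y == i)[0] for i in range(k)]
    correct = np.zeros(k)
    for _ in range(n_sets):
        idx = [rng.choice(c) for c in cases_by_cat]   # one case per category
        probs = P[idx, :]                             # (k, k): row j is the case from category j
        for i in range(k):
            # C_i = 1 if the case from category i has the highest probability for category i
            if probs[i, i] > np.max(np.delete(probs[:, i], i)):
                correct[i] += 1
    pdi_per_category = correct / n_sets
    return pdi_per_category, pdi_per_category.mean()
```

An exact computation would enumerate all N1 × … × Nk combinations of cases, which quickly becomes expensive; the sampling above trades exactness for tractability.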
Lastly, we study the Hypervolume Under the ROC Manifold (HUM) measure, which has been proposed as a generalization of the area under the ROC (receiver operating characteristic) curve (AUC) to multiclass classification (Li & Fine, 2008). Similar to PDI, HUM evaluates the discrimination ability of the model and can be interpreted as the probability that all alternatives are correctly classified by the model. However, PDI measures accuracy at the level of an individual randomly selected from a sample of individuals, while HUM measures accuracy at the aggregate level of the individuals in a selected sample. Originally defined for binary classification in terms of the ROC curve and AUC, the measure was later extended to multinomial classification models. HUM does not depend on the class prevalence and thus reflects the intrinsic accuracy of the model (Li & Fine, 2008). Suppose a choice variable Y has N alternatives, i.e., the response \(Y \in \left\{ {1, \ldots ,N} \right\}\). Let the probability of the ith category be \({\rho _i}=P\left( {Y=i} \right)\), \(i=1, \ldots ,N\), with the \({\rho _i}\) summing to one over the N alternatives. With \({X_i}\) denoting the probability value for the ith category, the general HUM can be defined as \(P\left( {{X_1}< \ldots <{X_N}} \right)\). To account for ties, we define an N-dimensional indicator function \({C_{{N_1}, \ldots ,{N_i}}}\left( {{x_1}, \ldots ,{x_N}} \right)\) for a sequence of real numbers \({x_1}, \ldots ,{x_N}\), which equals 1 only if \({x_1}= \cdots ={x_{{N_1}}}<{x_{{N_1}+1}}= \cdots ={x_{{N_1}+{N_2}}}< \cdots <{x_{{N_1}+ \cdots +{N_{i-1}}+1}}= \cdots ={x_N}\), and 0 otherwise. The HUM can be mathematically defined as
$$HUM=\sum\limits_{{i=1}}^{N} {\sum\nolimits_{{{N_1}+ \ldots +{N_i}=N}} {\frac{{P\left( {{C_{{N_1}, \ldots ,{N_i}}}\left( {{X_1}, \ldots ,{X_N}} \right)=1} \right)}}{{{N_1}! \cdots {N_i}!}}} }$$
7
In particular, for N = 4, which is the case in our choice model in Section 3, the HUM can be given by
$$\begin{gathered} HUM=P\left( {{X_1}<{X_2}<{X_3}<{X_4}} \right)\,\,+\,\,\frac{1}{2}P\left( {{X_1}={X_2}<{X_3}<{X_4}} \right) \hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+\,\,\frac{1}{2}P\left( {{X_1}<{X_2}={X_3}<{X_4}} \right)\,\,+\,\,\frac{1}{2}P\left( {{X_1}<{X_2}<{X_3}={X_4}} \right) \hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+\,\,\frac{1}{4}P\left( {{X_1}={X_2}<{X_3}={X_4}} \right)\,\,+\,\,\frac{1}{6}P\left( {{X_1}={X_2}={X_3}<{X_4}} \right) \hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+\,\,\frac{1}{6}P\left( {{X_1}<{X_2}={X_3}={X_4}} \right)\,\,+\,\,\frac{1}{{24}}\,P\left( {{X_1}={X_2}={X_3}={X_4}} \right) \hfill \\ \end{gathered}$$
8
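Similarly to PDI, a Monte Carlo sketch can approximate HUM under the “all cases correctly identified” reading stated above: one case is sampled from each alternative, and the set counts as a success only when every case has the strictly highest probability for its own alternative. Ties are ignored here for simplicity, so the fractional-credit terms of the equation above are not applied; `P` and `y` are the same hypothetical arrays as before:

```python
import numpy as np

def hum(P, y, n_sets=100_000, seed=0):
    """Monte Carlo estimate of the Hypervolume Under the ROC Manifold.

    P : (T, N) array of predicted choice probabilities.
    y : (T,) array of observed chosen-alternative indices (0 .. N-1).
    A sampled set (one case per alternative) counts as a success only when
    every case has the strictly highest probability for its own alternative.
    """
    rng = np.random.default_rng(seed)
    N = P.shape[1]
    cases_by_alt = [np.where(y == i)[0] for i in range(N)]
    successes = 0
    for _ in range(n_sets):
        idx = [rng.choice(c) for c in cases_by_alt]   # one case per alternative
        probs = P[idx, :]                             # (N, N): row j is the case from alternative j
        if all(probs[i, i] > np.max(np.delete(probs[:, i], i)) for i in range(N)):
            successes += 1
    return successes / n_sets
```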