2.2. Prediction accuracy metrics
To measure predictive performance (i.e., calibration and discrimination ability) under the various sampling strategies, several classification accuracy measures are adopted in this study. One limitation of log-likelihood (LL) based measures is that they do not provide an absolute measure of accuracy: a model with a comparatively higher LL can still have low predictive power in absolute terms. The absolute accuracy measures discussed and used in our study are therefore useful, as they can produce benchmark values against which other models with similar characteristics can be evaluated. We provide a brief overview of each accuracy measure in this section.
One of the simplest and most common measures used in the transportation literature is McFadden’s pseudo rho-squared, which is computed on the log-likelihood scale. It measures the share of variation explained by the fitted MNL model for a given dataset and lies in the interval [0,1]; the higher the value, the better the model. In our study, the rho-squared measure is estimated with the initial log-likelihood taken with respect to the “equal shares for all alternatives” model; other choices of the initial log-likelihood would not allow the split-sample and cross-validation strategies to be undertaken. It can be expressed by the following equation:
$${\rho ^2}_{{McFadden}}\,=\,\,1\,\, - \,\,\frac{{LL\left( {\hat {\theta }} \right)}}{{LL\left( 0 \right)}}$$
1
where \(LL\left( {\hat {\theta }} \right)\) is the log-likelihood of the fitted model and \(LL\left( 0 \right)\) is the initial log-likelihood of the equal-shares model.
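As an illustration, a minimal sketch of this computation in Python, assuming a matrix of predicted choice probabilities `P` (one row per observation, one column per alternative) and a vector `y` of chosen-alternative indices; both names are hypothetical and not from the original study:

```python
import numpy as np

def mcfadden_rho_squared(P, y):
    """McFadden's pseudo rho-squared against the equal-shares null model.

    P : (T, N) array of predicted choice probabilities from the fitted model.
    y : (T,) array of chosen-alternative indices (0 .. N-1).
    """
    T, N = P.shape
    # Log-likelihood of the fitted model: sum of log probabilities of the chosen alternatives.
    ll_fitted = np.sum(np.log(P[np.arange(T), y]))
    # Initial log-likelihood of the "equal shares for all alternatives" model (probability 1/N each).
    ll_null = T * np.log(1.0 / N)
    return 1.0 - ll_fitted / ll_null
```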
Another very popular classification accuracy measure is the percentage correctly classified by the model, which evaluates the calibration ability of the model by comparing the overall predicted market shares against the overall observed shares for the given alternatives. In this approach, the alternative with the highest predicted probability is taken as the predicted choice and is then compared with the actually chosen alternative for that observation; the pair is considered correctly classified if the predicted alternative is the same as the observed one. The percentage correctly classified for a given model is thus the number of correct pairs divided by the number of observations. Consider a cross-classification table in which entry Nij, for row i and column j, denotes the number of observations for which the model predicts alternative i while the actually chosen alternative is j. The row-wise sums then give the predicted (expected) shares and the column-wise sums give the observed shares for each alternative, and the sum of the diagonal entries gives the total number of correctly classified observations, which divided by the sample size yields the proportion of correct predictions. One of the primary limitations of this approach is that it does not take the predicted probability into account when matching the pairs: if the predicted probability of an outcome is substantially higher or lower than the observed/actual probability, the approach cannot differentiate between the different probabilities assigned to a chosen alternative (de Luca & Cantarella, 2009).
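A brief sketch of this calculation, again with the hypothetical `P` and `y` arrays introduced above:

```python
import numpy as np

def percent_correctly_classified(P, y):
    """Percentage correctly classified and the cross-classification table.

    P : (T, N) array of predicted choice probabilities.
    y : (T,) array of observed chosen-alternative indices (0 .. N-1).
    """
    T, N = P.shape
    predicted = P.argmax(axis=1)          # alternative with the highest predicted probability
    table = np.zeros((N, N), dtype=int)   # rows: predicted alternative, columns: observed alternative
    for pred, obs in zip(predicted, y):
        table[pred, obs] += 1
    pct_correct = np.trace(table) / T * 100.0
    return pct_correct, table
```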
A more robust way to approximate correct classification is defined by McFadden (1978), which examines the proportion of successful predictions (i.e., calibration ability), by alternative and overall. This measure can be calculated as follows:
The index for alternative i is:
$${I_i}\,=\,\frac{{\sum\nolimits_{t} {{y_{ti}}{P_{ti}}} }}{{\sum\nolimits_{t} {{P_{ti}}} }}$$
2
where t indexes the observations, \({P_{ti}}\) is the predicted probability of choosing alternative i in observation t, and \({y_{ti}}=1\) when alternative i is chosen in observation t and 0 otherwise.
The numerator, \(\sum\nolimits_{t} {{y_{ti}}{P_{ti}}}\), is the model’s expected number of successful predictions for alternative i, summed over the observations in which i was actually chosen, while the denominator, \(\sum\nolimits_{t} {{P_{ti}}}\), is the model’s expected number of individuals choosing alternative i (which, in MNL, also equals the observed number of those who choose i).
The overall index is:
$$I\,=\,\frac{{\sum\nolimits_{i} {\left( {\sum\nolimits_{t} {{y_{ti}}{P_{ti}}} } \right)} }}{{\sum\nolimits_{i} {\left( {\sum\nolimits_{t} {{P_{ti}}} } \right)} }}$$
3
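A short sketch of the per-alternative index \(I_i\) and the overall index \(I\), using the same hypothetical `P` and `y` arrays as above:

```python
import numpy as np

def prediction_success_indices(P, y):
    """McFadden's proportion of successful predictions, by alternative and overall.

    P : (T, N) array of predicted choice probabilities.
    y : (T,) array of observed chosen-alternative indices (0 .. N-1).
    """
    T, N = P.shape
    Y = np.zeros((T, N))
    Y[np.arange(T), y] = 1.0                          # y_ti indicator matrix
    per_alternative = (Y * P).sum(axis=0) / P.sum(axis=0)   # I_i for each alternative
    overall = (Y * P).sum() / P.sum()                        # overall index I
    return per_alternative, overall
```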
Furthermore, we consider the Brier score for evaluating the calibration ability of the model outcomes (Brier, 1950; Kruppa et al., 2014). This approach is applicable when predictions assign probabilities to a set of mutually exclusive discrete outcomes and the probabilities assigned across the alternatives sum to one; in this way, the Brier score can be used to evaluate the outcomes at the disaggregate level. The Brier score measures the mean squared difference between the predicted probability \({P_{ti}}\) assigned to the possible outcome for observation t and the actual outcome \({y_{ti}}\) (both at an individual level), and varies in the interval [0,2]. Hence, the lower the Brier score, the better the model’s forecasts, meaning that the predicted probabilities correspond more closely to the observed outcomes. The Brier score is computed as follows:
$$BS\,=\,\frac{1}{{NT}}\sum\limits_{{i=1}}^{N} {\sum\limits_{{t=1}}^{T} {{{\left( {{y_{ti}} - {P_{ti}}} \right)}^2}} }$$
4
where T is the total number of observations, N is the total number of alternatives, and \({y_{ti}}=1\) when alternative i is chosen in observation t and 0 otherwise.
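A minimal sketch of this computation, following the normalization of the equation above and using the hypothetical `P` and `y` arrays introduced earlier:

```python
import numpy as np

def brier_score(P, y):
    """Multiclass Brier score, normalized by N*T as in the equation above.

    P : (T, N) array of predicted choice probabilities.
    y : (T,) array of observed chosen-alternative indices (0 .. N-1).
    """
    T, N = P.shape
    Y = np.zeros((T, N))
    Y[np.arange(T), y] = 1.0          # one-hot encoding of the observed choices
    return np.sum((Y - P) ** 2) / (N * T)
```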
Next, we include the Polytomous Discrimination Index (PDI), a set-based approach built on the probability of correctly discriminating between a set of alternatives. It evaluates all possible sets of k cases (each case containing the predicted probabilities for every category), with one case drawn from each outcome category/alternative (Van Calster et al., 2012). It is desirable that, for the alternative under consideration, the predicted probability is highest for the case that actually belongs to that category; hence, PDI can provide a more robust evaluation of predictability in terms of the clarity of the predictions. Each set is assigned a score equal to the number of categories for which this condition holds; denoting this number by kc, the score assigned to the set is kc / k, which can vary between 0 and 1 in steps of 1 / k. For illustration, consider a choice variable with four alternatives, as in our study. Taking one case from each chosen-alternative group, assume the set of probabilities from the case where alternative 1 is chosen is {0.45, 0.35, 0.05, 0.15}, for alternative 2: {0.05, 0.45, 0.4, 0.1}, for alternative 3: {0.20, 0.30, 0.35, 0.15}, and for alternative 4: {0.3, 0.05, 0.1, 0.55}. The probabilities for alternatives 1, 2 and 4 are highest for their corresponding cases, but the probability for alternative 3 is highest in the case corresponding to alternative 2 (0.4 versus 0.35). Thus, the model correctly identifies the cases for alternatives 1, 2 and 4 but not for alternative 3, resulting in a score of 3/4 for this set. The category-specific PDI is defined as the probability of correctly identifying the case from category i within a set of k cases, and averaging the category-specific PDI values over all categories gives the overall PDI for the model’s predictions. In mathematical formulation:
$$PD{I_i}\,=\,\frac{1}{{{N_1} \cdots {N_k}}}\sum\limits_{{{n_1}=1}}^{{{N_1}}} \cdots \sum\limits_{{{n_k}=1}}^{{{N_k}}} {{C_i}\left( {{p_{{n_1}}}, \ldots ,{p_{{n_k}}}} \right)}$$
5
and the PDI as
$$PDI\,\,=\,\,\frac{1}{k}\sum\limits_{{i=1}}^{k} {PD{I_i}}$$
6
where \({C_i}\left( {{p_{{n_1}}}, \ldots ,{p_{{n_k}}}} \right)\) takes the value 1 if \({p_{i,{n_i}}}>{p_{i,{n_j}}}\) for all \(j \ne i\), and 0 otherwise.
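As an illustrative sketch (not the implementation used in the cited studies), the category-specific and overall PDI can be estimated by Monte Carlo sampling of sets, again with the hypothetical `P` and `y` arrays:

```python
import numpy as np

def pdi(P, y, n_sets=100_000, seed=0):
    """Monte Carlo estimate of the Polytomous Discrimination Index.

    P : (T, k) array of predicted choice probabilities.
    y : (T,) array of observed chosen-alternative indices (0 .. k-1).
    Each sampled set contains one randomly drawn case per category; the case
    from category i counts as correctly identified when its probability for
    category i is the strict maximum of that category's probabilities in the set.
    """
    rng = np.random.default_rng(seed)
    k = P.shape[1]
    cases_by_cat = [np.where(y == i)[0] for i in range(k)]
    correct = np.zeros(k)
    for _ in range(n_sets):
        idx = [rng.choice(c) for c in cases_by_cat]   # one case per category
        probs = P[idx, :]                             # (k, k): row j is the case from category j
        for i in range(k):
            # C_i = 1 if the case from category i has the highest probability for category i
            if probs[i, i] > np.max(np.delete(probs[:, i], i)):
                correct[i] += 1
    pdi_per_category = correct / n_sets
    return pdi_per_category, pdi_per_category.mean()
```

An exact computation would enumerate all N1 × … × Nk combinations of cases, which quickly becomes expensive; the sampling above trades exactness for tractability.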
Lastly, we study the Hypervolume Under the ROC Manifold (HUM) measure, which has been proposed as a generalization of the area under the ROC (receiver operating characteristic) curve (AUC) to multiclass classification (Li & Fine, 2008). Similar to PDI, HUM evaluates the discrimination ability of the model and can be interpreted as the probability that all alternatives are correctly classified by the model. However, PDI measures accuracy at the level of an individual randomly selected from a sample of individuals, while HUM measures accuracy at the aggregate level of the individuals in a selected sample. Originally defined for binary classification in terms of the ROC curve and AUC, the measure was later extended to multinomial classification models. HUM does not depend on the class prevalence and thus reflects the intrinsic accuracy of the model (Li & Fine, 2008). Suppose a choice variable Y has N alternatives, i.e., the response \(Y \in \left\{ {1, \ldots ,N} \right\}\). Let the probability of the ith category be \({\rho _i}=P\left( {Y=i} \right)\), \(i=1, \ldots ,N\), with the \({\rho _i}\) summing to one over the N alternatives. With \({X_i}\) denoting the probability value for the ith category, the general HUM can be defined as \(P\left( {{X_1}< \ldots <{X_N}} \right)\). To account for ties, we define an N-dimensional indicator function \({C_{{N_1}, \ldots ,{N_i}}}\left( {{x_1}, \ldots ,{x_N}} \right)\) for a sequence of real numbers \({x_1}, \ldots ,{x_N}\), which equals 1 only if \({x_1}= \cdots ={x_{{N_1}}}<{x_{{N_1}+1}}= \cdots ={x_{{N_1}+{N_2}}}< \cdots <{x_{{N_1}+ \cdots +{N_{i-1}}+1}}= \cdots ={x_N}\), and 0 otherwise. The HUM can be mathematically defined as
$$HUM=\sum\limits_{{i=1}}^{N} {\sum\nolimits_{{{N_1}+ \ldots +{N_i}=N}} {\frac{{P\left( {{C_{{N_1}, \ldots ,{N_i}}}\left( {{X_1}, \ldots ,{X_N}} \right)=1} \right)}}{{{N_1}! \cdots {N_i}!}}} }$$
7
In particular, for N = 4, which is the case in our choice model in Section 3, the HUM can be given by
$$\begin{gathered} HUM=P\left( {{X_1}<{X_2}<{X_3}<{X_4}} \right)\,\,+\,\,\frac{1}{2}P\left( {{X_1}={X_2}<{X_3}<{X_4}} \right) \hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+\,\,\frac{1}{2}P\left( {{X_1}<{X_2}={X_3}<{X_4}} \right)\,\,+\,\,\frac{1}{2}P\left( {{X_1}<{X_2}<{X_3}={X_4}} \right) \hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+\,\,\frac{1}{4}P\left( {{X_1}={X_2}<{X_3}={X_4}} \right)\,\,+\,\,\frac{1}{6}P\left( {{X_1}={X_2}={X_3}<{X_4}} \right) \hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+\,\,\frac{1}{6}P\left( {{X_1}<{X_2}={X_3}={X_4}} \right)\,\,+\,\,\frac{1}{{24}}\,P\left( {{X_1}={X_2}={X_3}={X_4}} \right) \hfill \\ \end{gathered}$$
8
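Similarly to PDI, a Monte Carlo sketch can approximate HUM under the “all cases correctly identified” reading stated above: one case is sampled from each alternative, and the set counts as a success only when every case has the strictly highest probability for its own alternative. Ties are ignored here for simplicity, so the fractional-credit terms of the equation above are not applied; `P` and `y` are the same hypothetical arrays as before:

```python
import numpy as np

def hum(P, y, n_sets=100_000, seed=0):
    """Monte Carlo estimate of the Hypervolume Under the ROC Manifold.

    P : (T, N) array of predicted choice probabilities.
    y : (T,) array of observed chosen-alternative indices (0 .. N-1).
    A sampled set (one case per alternative) counts as a success only when
    every case has the strictly highest probability for its own alternative.
    """
    rng = np.random.default_rng(seed)
    N = P.shape[1]
    cases_by_alt = [np.where(y == i)[0] for i in range(N)]
    successes = 0
    for _ in range(n_sets):
        idx = [rng.choice(c) for c in cases_by_alt]   # one case per alternative
        probs = P[idx, :]                             # (N, N): row j is the case from alternative j
        if all(probs[i, i] > np.max(np.delete(probs[:, i], i)) for i in range(N)):
            successes += 1
    return successes / n_sets
```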