In the case of the MDRM dataset, the training, validation, and testing splits are provided with the dataset [36]. We use these splits to train our binary classifiers with hand-crafted features and deep bidirectional Transformer neural networks. All splits are created so as to keep a suitable balance between positive and negative samples for the target categories ‘disaster’, ‘medical’, ‘humanitarian standards’, and ‘severity’, controlled by an empirically set parameter.
For the multiclassification approach, however, we use the entire MDRM dataset and evaluate with splits of 33% for testing and 66% for training. We then train our classifiers using 5-fold cross validation and report the average score for each metric. This makes our results comparable with [39], since none of the results we found for this dataset follow the splits provided with the dataset for the multiclassification task. Moreover, those results do not specify any validation strategy, such as the k-fold cross validation we perform to obtain more reliable results. We therefore use this evaluation only to compare against existing reported results for this dataset. Figure 2 plots the learning curve to show the scalability of the SVC multiclassification approach. Note the rapid improvement during training up to 8000 samples, after which performance saturates and adding more samples yields no further gains. A minimal sketch of this evaluation protocol is given below.
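The following sketch illustrates the split-then-cross-validate protocol described above, assuming a scikit-learn LinearSVC over TF-IDF features; the variable names (`texts`, `labels`) and the feature pipeline are illustrative placeholders, not the exact configuration used in our experiments.

```python
# Illustrative sketch: 33%/66% split followed by 5-fold cross validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# texts, labels: the MDRM messages and their category labels (loaded elsewhere).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=42)

model = make_pipeline(TfidfVectorizer(), LinearSVC())

# 5-fold cross validation on the training portion; scores are averaged per metric.
scores = cross_validate(model, X_train, y_train, cv=5,
                        scoring=["precision_micro", "recall_micro", "f1_micro"])
for metric, values in scores.items():
    if metric.startswith("test_"):
        print(metric, values.mean())
```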
We perform an additional evaluation with the original splits of the MDRM dataset to strengthen the current results and to enable future comparisons. To the best of our knowledge, no results have previously been published for this dataset, even though it is also publicly available with its original splits on competition platforms such as Kaggle7. We therefore provide, for the first time, a reliable machine learning evaluation of this dataset in the disaster response domain. Figure 3 plots the learning curves to show the performance and scalability of our NB-based approaches and the SVC multiclassification approach. In this case, improvement slows progressively as more training samples are added, until performance on the validation data approaches performance on the training data. This is due to the smaller number of validation samples in this setting with the original splits compared with the custom splits above. In this setting we see a considerable improvement in the learning process for all our methods, especially our SVC model.
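The learning curves in Figures 2 and 3 can be obtained with a standard utility such as scikit-learn's learning_curve; the sketch below reuses the pipeline and training split from the previous sketch, and the training-size grid is an assumption for illustration.

```python
# Hedged sketch of the learning-curve analysis: mean validation score per
# training-set size shows where performance saturates.
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 8), scoring="f1_micro")

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(size, round(score, 3))
```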
Since there are no splits provided for the other datasets, in all remaining settings we generate the splits via random sampling with 40% for testing and 60% for training the binary classifiers.
The results are reported in terms of precision, recall, F1-score, and, for the binary classifiers, accuracy. We do not report accuracy on the test data for the multiclassification task because, when the class distribution is unbalanced, accuracy is a poor choice: it rewards models that simply predict the most frequent class.
To calculate the above metrics we use the implementation of [30]. These metrics are essentially defined for binary classification, where by default only the positive label (assumed to be ‘1’) is evaluated. To extend a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one per class, and the per-class results can then be averaged in several ways, each of which may be useful in some scenario. We select the ‘micro’ average because it gives each sample-class pair an equal contribution to the overall metric. Rather than summing the metric per class, it sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.
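As a small illustration of ‘micro’ averaging, the toy labels below are pooled across classes before dividing, so every sample-class pair contributes equally; the numbers are invented solely for this example.

```python
# 'micro' pools per-class true/false positives and negatives before dividing.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(p, r, f1)
```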
Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples. The F-measure can be interpreted as a weighted harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. With the F1-score, recall and precision are equally important. All values lie between 0 and 1, and higher is better. The average precision (AP) is computed from prediction scores as:
$$AP=\sum _{n}({R}_{n}-{R}_{n-1}){P}_{n} \left(1\right)$$
where \({P}_{n}\) and \({R}_{n}\) are the precision and recall at the n-th threshold, calculated from the true positive, false positive, and false negative predictions [30]. With random predictions, the AP is simply the fraction of positive samples.
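For concreteness, Eq. (1) can be evaluated with scikit-learn's average_precision_score; the binary labels and scores below are toy values for illustration only.

```python
# AP = sum over thresholds of (R_n - R_{n-1}) * P_n, as in Eq. (1).
from sklearn.metrics import average_precision_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

print(average_precision_score(y_true, y_scores))  # ≈ 0.83
```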
The accuracy metric computes either the fraction or the count of correct predictions. If the entire set of predicted labels for a sample strictly matches the true set of labels, the subset accuracy is 1.0; otherwise it is 0.0. If \(\widehat{{y}_{i}}\) is the predicted value of the \(i\)-th sample and \({y}_{i}\) is the corresponding true value, then the fraction of correct predictions over \({n}_{samples}\) is defined as:
$$acc(y,\widehat{y})=\frac{1}{{n}_{samples}}\sum _{i=0}^{{n}_{samples}-1}1({\widehat{y}}_{i}={y}_{i}) \left(2\right)$$
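A minimal sketch of Eq. (2) using scikit-learn's accuracy_score is given below; the labels are toy values, and normalize=False returns the raw count instead of the fraction.

```python
# Fraction (default) or count (normalize=False) of exactly matching predictions.
from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 3]
y_pred = [0, 2, 2, 3]

print(accuracy_score(y_true, y_pred))                   # 0.75 (fraction)
print(accuracy_score(y_true, y_pred, normalize=False))  # 3 (count)
```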
[7] https://www.kaggle.com/landlord/multilingual-disaster-response-messages