This section is split into three subsections: Results from the Entire Dataset, Results from Most Important 25 Features, and Results from Most Important Feature (IAT). In each, we go over the results from all tested machine learning classifiers and discuss our findings.
Results from the Entire Dataset
Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Prediction Time (s)
------|----------|-----------|--------|----------|-------------------|--------------------
DTC   | 99.29%   | 83.71%    | 84.14% | 83.81%   | 1.48e+1           | 2.77e-7
RFC   | 99.23%   | 79.73%    | 71.49% | 73.02%   | 3.83e+1           | 4.29e-6
MLP   | 98.53%   | 71.10%    | 67.05% | 67.32%   | 1.05e+3           | 2.03e-6
DL    | 96.44%   | 67.88%    | 59.28% | 63.29%   | 2.83e+2           | 4.95e-5
W-KNN | 94.67%   | 66.72%    | 61.47% | 62.13%   | 1.61e-1           | 1.89e-3
KNN   | 93.73%   | 65.81%    | 60.40% | 61.12%   | 1.18e-1           | 2.39e-3
Table 7: Machine Learning Results Without Data Preprocessing
Table 7 shows the machine learning results from all tested classifiers, including training time (in seconds) and prediction time per entry (in seconds). The models are listed in descending order of accuracy.
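To make the reported metrics concrete, the snippet below is a minimal sketch of the evaluation loop implied by Table 7: fit a classifier, time training and per-entry prediction, and compute precision, recall and F1. The synthetic data and all parameter values are illustrative assumptions standing in for the actual IoT traffic dataset, and the macro averaging is our assumption (per-class averaging over imbalanced classes would explain precision and recall sitting well below accuracy).

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic stand-in data; the study itself uses the full IoT traffic dataset.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=10,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier(random_state=42)

start = time.perf_counter()
model.fit(X_train, y_train)                 # training time, as in Table 7
train_time = time.perf_counter() - start

start = time.perf_counter()
y_pred = model.predict(X_test)
pred_time = (time.perf_counter() - start) / len(X_test)  # time per entry

print(f"Accuracy : {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, average='macro'):.4f}")
print(f"Recall   : {recall_score(y_test, y_pred, average='macro'):.4f}")
print(f"F1 score : {f1_score(y_test, y_pred, average='macro'):.4f}")
print(f"Training time (s): {train_time:.2e}, pred time per entry (s): {pred_time:.2e}")
```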
DTC achieved the highest accuracy, precision, recall and F1 score, which were 99.29%, 83.71%, 84.14% and 83.81% respectively, along with the fastest prediction time (2.77e-7 s). DTC also had the quickest training time (1.48e+1 s) among the algorithms that require a training process, unlike KNN and W-KNN, which do not have such a process. The superior performance of DTC may be attributed to its decision-making structure, which focuses on important attributes and creates clear decision boundaries. This simple approach maximizes computational efficiency, resulting in DTC's superior performance among the evaluated algorithms.
RFC came in second with an accuracy, precision, recall, F1 score and training time (again, among the algorithms that require a training process) of 99.23%, 79.73%, 71.49%, 73.02% and 3.83e+1 respectively. It also had a relatively low prediction time of 4.29e-6 s. Unlike DTC's single-tree decision-making process, RFC employs numerous Decision Trees and aggregates their predictions to enhance generalization. Although less computationally efficient than DTC, its integration of multiple trees better models complex relationships within the data. These attributes may have helped RFC achieve this performance, positioning it just below DTC in the rankings of the evaluated algorithms.
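As a sketch of the ensemble idea just described, the snippet below shows RFC aggregating many Decision Trees, reusing the splits from the previous sketch; the tree count is sklearn's default, not a value reported in this paper.

```python
from sklearn.ensemble import RandomForestClassifier

# Reuses X_train / y_train from the previous sketch. n_estimators=100 is
# sklearn's default, not a setting reported in this paper.
rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rfc.fit(X_train, y_train)
# Each fitted tree votes; the forest predicts the majority class per sample.
print(f"{len(rfc.estimators_)} trees contribute to every prediction")
```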
MLP ranked third in terms of accuracy, precision, recall and F1 score, which were 98.53%, 71.10%, 67.05% and 67.32% respectively. It had the highest training time of 1.05e+3 s, but a relatively low prediction time of 2.03e-6 s. Unlike the straightforward decision-making structures found in DTC and RFC, MLP utilizes multiple interconnected layers of nodes. This architecture, while powerful in capturing non-linear relationships within the data, demands more computational resources and careful tuning of parameters. The increased complexity and need for precise fine-tuning might have contributed to MLP's extended training time and lower ranking among the evaluated algorithms. Nevertheless, its ability to learn non-linear relationships contributed to its commendable performance, placing it in the third position among the evaluated models.
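For contrast with the tree-based models, a hedged sketch of an MLP follows; the hidden-layer width and iteration budget are illustrative assumptions rather than the configuration used in this study.

```python
from sklearn.neural_network import MLPClassifier

# Reuses the splits from the first sketch. One hidden layer of 100 nodes and
# a 300-iteration budget are illustrative choices only.
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
mlp.fit(X_train, y_train)  # iterative weight updates explain the long fit time
```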
Our DL model was the 4th best in terms of accuracy, precision and F1 score, which were 96.44%, 67.88% and 63.29% respectively (its recall was 59.28%). The DL model had the second highest training time of 2.83e+2 s and a prediction time of 4.95e-5 s. The DL model utilizes "deep" neural networks, so called because they have three or more hidden layers. This complex architecture, capable of discovering patterns and relationships within the data, can require substantial computational resources and may therefore lead to a longer training time. Its flexibility and adaptability to various data types, notwithstanding its computational demands, contribute to its recognized performance among the evaluated models.
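The "three or more hidden layers" definition can be made concrete with a minimal Keras sketch; the layer widths, activations and epoch count are assumptions for illustration, not the architecture used in this study.

```python
import tensorflow as tf

# Reuses X_train / y_train from the first sketch. Widths, activations and
# epochs are illustrative assumptions, not the paper's architecture.
dl = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 3: "deep"
    tf.keras.layers.Dense(len(set(y_train)), activation="softmax"),
])
dl.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
           metrics=["accuracy"])
dl.fit(X_train, y_train, epochs=10, batch_size=256, verbose=0)
```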
W-KNN ranked 5th best with an accuracy, precision, recall and F1 score of 94.67%, 66.72%, 61.47% and 62.13% respectively, while KNN was the 6th best model with an accuracy, precision, recall and F1 score of 93.73%, 65.81%, 60.40% and 61.12% respectively. The training times of W-KNN and KNN cannot meaningfully be compared with those of the other algorithms, since these neighbor methods have no training process. Because each prediction is based on the k nearest neighbors, prediction can be computationally inefficient on large datasets, causing W-KNN to have the second slowest prediction time and KNN the slowest. For the task at hand, detecting intrusions, both W-KNN and KNN are therefore relatively inefficient compared to the other algorithms. This inefficiency, along with KNN's relative insensitivity to the underlying data distribution and its sensitivity to irrelevant features, placed them at the lower end of the rankings among the evaluated models.
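The W-KNN / KNN distinction comes down to how neighbor votes are weighted, and the absence of a training phase is visible in the API: fit merely stores the data. A minimal sketch follows, with k = 5 an assumed (sklearn default) value, not one reported here.

```python
from sklearn.neighbors import KNeighborsClassifier

# Reuses the splits from the first sketch. k=5 is sklearn's default, not a
# value reported in this paper.
knn = KNeighborsClassifier(n_neighbors=5)                       # uniform votes
wknn = KNeighborsClassifier(n_neighbors=5, weights="distance")  # closer = heavier vote
knn.fit(X_train, y_train)   # "training" only memorizes the data
wknn.fit(X_train, y_train)
# Every prediction searches the stored samples for the k nearest neighbors,
# so prediction, not training, dominates the cost on large datasets.
```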
Results from Most Important 25 Features
Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Prediction Time (s)
------|----------|-----------|--------|----------|-------------------|--------------------
RFC   | 99.31%   | 89.20%    | 74.32% | 76.63%   | 4.49e+1           | 4.58e-6
DTC   | 99.29%   | 83.86%    | 84.91% | 84.26%   | 1.20e+1           | 2.39e-7
MLP   | 98.43%   | 69.88%    | 66.01% | 66.26%   | 1.37e+3           | 2.37e-6
DL    | 96.32%   | 64.26%    | 56.23% | 60.00%   | 3.22e+2           | 5.68e-5
W-KNN | 96.03%   | 64.13%    | 61.28% | 61.62%   | 8.70e-2           | 1.78e-3
KNN   | 95.45%   | 62.99%    | 59.97% | 60.42%   | 1.22e-1           | 1.74e-3
Table 8: Machine Learning Results Utilizing the Top 25 Features Identified by Random Forest Feature Importance
Table 8 shows the machine learning results from all tested classifiers using the 25 most important features identified by the Random Forest algorithm.
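A minimal sketch of how such a top-25 subset can be derived from Random Forest's impurity-based feature importances is shown below, reusing the splits from the earlier sketches; this is illustrative, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Reuses X_train / X_test / y_train from the first sketch.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Indices of the 25 features with the largest impurity-based importance.
top25 = np.argsort(rf.feature_importances_)[::-1][:25]
X_train_25, X_test_25 = X_train[:, top25], X_test[:, top25]
```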
RFC ranked as the best algorithm with an accuracy, precision, recall and F1 score of 99.31%, 89.20%, 74.32% and 76.63% respectively, which were slightly higher than those of RFC trained on the entire dataset. Surprisingly, RFC's training and prediction times with the 25 most important features were higher than with the entire dataset. The specific interaction of the chosen features might have necessitated more intricate calculations, leading to this unexpected increase in both training and prediction times.
DTC also yielded a high accuracy, precision, recall and F1 score of 99.29%, 83.86%, 84.91% and 84.26% respectively, marginally higher than those of DTC trained on the whole dataset. Its training and prediction times were also lower, making it more efficient.
MLP performed worse than MLP trained on the entire dataset, with an accuracy, precision, recall and F1 score of 98.43%, 69.88%, 66.01% and 66.26% respectively. Reducing the input to the 25 most important features likely removed information that was beneficial for MLP's learning process. While these 25 features were deemed most important, they might not have captured the full complexity of the patterns MLP exploited when trained on the entire dataset.
Our DL model recorded a slight decline in its performance metrics when restricted to the 25 most important features, yielding an accuracy of 96.32%, a precision of 64.26%, a recall of 56.23% and an F1 score of 60.00%. The narrowed feature set appears to have stripped away some of the complexity that contributed to its higher performance on the complete dataset.
W-KNN and KNN both followed a common trend. W-KNN achieved an accuracy of 96.03%, a precision of 64.13%, a recall of 61.28% and an F1 score of 61.62%, while KNN lagged just behind with 95.45% accuracy, 62.99% precision, 59.97% recall and an F1 score of 60.42%. This suggests that the 25-feature dataset may have emphasized subtle details vital for these neighbor-based models, marginally increasing their performance.
Results from Most Important Feature (IAT)
Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Prediction Time (s)
------|----------|-----------|--------|----------|-------------------|--------------------
DTC   | 99.06%   | 99.10%    | 86.00% | 90.46%   | 8.32e-1           | 1.76e-7
RFC   | 99.04%   | 99.09%    | 80.58% | 85.76%   | 1.26e+1           | 3.87e-6
KNN   | 99.03%   | 89.59%    | 77.84% | 81.74%   | 7.94e-1           | 6.08e-5
W-KNN | 98.91%   | 90.32%    | 80.91% | 84.10%   | 7.63e-1           | 1.95e-5
DL    | 91.27%   | 40.37%    | 41.73% | 41.04%   | 3.18e+2           | 4.55e-5
MLP   | 89.06%   | 37.89%    | 38.80% | 37.72%   | 1.55e+3           | 2.38e-6
Table 9: Machine Learning Results Utilizing Only the IAT Feature
Table 9 outlines the machine learning results from all tested classifiers using only the IAT feature. A stark transformation in the relative performance of the models can be observed when considering only this single feature.
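Restricting a model to one feature mainly requires keeping the input two-dimensional; below is a minimal sketch, where the IAT column index is a hypothetical placeholder rather than the dataset's actual column position.

```python
# Reuses the splits and the DecisionTreeClassifier import from the first sketch.
IAT_COLUMN = 0  # hypothetical index of the IAT feature in the matrix

# Keep a (n_samples, 1) matrix rather than a flat (n_samples,) vector,
# since sklearn estimators expect 2-D inputs.
X_train_iat = X_train[:, [IAT_COLUMN]]
X_test_iat = X_test[:, [IAT_COLUMN]]

dtc_iat = DecisionTreeClassifier(random_state=42).fit(X_train_iat, y_train)
```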
DTC takes the lead, achieving an impressive accuracy of 99.06%, a precision of 99.10%, a recall of 86.00% and an F1 score of 90.46%. Its training time drops dramatically to 8.32e-1 s, and its prediction time stays extremely low at 1.76e-7 s. Focusing only on IAT, DTC's inherently decisive approach seems to leverage the crucial information encoded within this single feature, yielding a substantial improvement in precision and recall.
Close behind is RFC, achieving 99.04% accuracy, 99.09% precision, 80.58% recall and an 85.76% F1 score. Its training time is 1.26e+1 s, with a prediction time of 3.87e-6 s. By exploiting only the IAT feature, RFC's ensemble of Decision Trees appears to have harnessed the essential characteristics of the data, almost mirroring DTC's performance.
KNN and W-KNN show remarkable performance as well, with KNN reaching 99.03% accuracy, 89.59% precision, 77.84% recall and an 81.74% F1 score, and W-KNN at 98.91%, 90.32%, 80.91% and 84.10% for the same metrics. Interestingly, both models, which were previously lower in the rankings, climbed up when focusing solely on the IAT feature. IAT, being a significant factor in distinguishing patterns, appears to resonate well with the neighbor-based decision-making process, enhancing the effectiveness of both KNN and W-KNN.
Our DL model experiences a drop in performance with the IAT-only approach, recording an accuracy of 91.27%, a precision of 40.37%, a recall of 41.73% and an F1 score of 41.04%. Its training and prediction times remain similar to those in the previous evaluations. The intricate architecture of deep neural networks seems to require a richer set of features to capture the underlying complexities of the data, and isolating a single feature diminishes the model's capacity to generalize well.
MLP experiences the most significant decline, standing at 89.06% accuracy, 37.89% precision, 38.80% recall and a 37.72% F1 score. The extended training time of 1.55e+3 s reflects MLP's struggle to adapt to the IAT-only feature set. Similar to DL, the underlying complexity of MLP appears to demand a more comprehensive feature set; reducing the input to a single characteristic seems to restrain MLP's capability to discern non-linear relationships within the data, leading to its diminished performance.
Figure 2 presents the confusion matrix for the results obtained from a Decision Tree Classifier utilizing the Inter-Arrival Time (IAT) feature to distinguish between various classes of cybersecurity threats. The key to the classes is provided above the matrix, with each cybersecurity threat assigned a unique numerical identifier ranging from 0 to 33.
For example, 'DDoS-RSTFINFlood' is assigned '0', 'DoS-TCP Flood' is assigned '1', and so on, up to 'Uploading Attack', which is assigned '33'. The identifier '12' represents 'BenignTraffic', which serves as a crucial baseline for identifying anomalous, harmful network behavior.
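A Figure 2-style confusion matrix can be produced as sketched below, continuing from the single-feature sketch above; the numeric class key (e.g. 0 = 'DDoS-RSTFINFlood', 12 = 'BenignTraffic') would be supplied alongside the plot.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Continues from the IAT-only sketch above. With many classes (0-33 in
# Figure 2), per-cell values are typically suppressed for readability.
y_pred_iat = dtc_iat.predict(X_test_iat)
cm = confusion_matrix(y_test, y_pred_iat)
ConfusionMatrixDisplay(cm).plot(include_values=False)
plt.show()
```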