This section is split into three subsections: Results from the Entire Dataset, Results from Most Important 25 Features, and Results from Most Important Feature (IAT). In each, we go over the results from all tested machine learning classifiers and discuss our findings.
Results from the Entire Dataset
Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Prediction Time (s)
------|----------|-----------|--------|----------|-------------------|--------------------
DTC   | 99.29%   | 83.71%    | 84.14% | 83.81%   | 1.48e+1           | 2.77e-7
RFC   | 99.23%   | 79.73%    | 71.49% | 73.02%   | 3.83e+1           | 4.29e-6
MLP   | 98.53%   | 71.10%    | 67.05% | 67.32%   | 1.05e+3           | 2.03e-6
DL    | 96.44%   | 67.88%    | 59.28% | 63.29%   | 2.83e+2           | 4.95e-5
W-KNN | 94.67%   | 66.72%    | 61.47% | 62.13%   | 1.61e-1           | 1.89e-3
KNN   | 93.73%   | 65.81%    | 60.40% | 61.12%   | 1.18e-1           | 2.39e-3
Table 7: Machine Learning Results Without Data Preprocessing
Table 7 shows the machine learning results from all tested classifiers, including training time (in seconds) and prediction time per entry (in seconds). The models are listed in descending order of accuracy.
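To make the reported metrics concrete, the snippet below is a minimal sketch of the evaluation loop implied by Table 7: fit a classifier, time training and per-entry prediction, and compute precision, recall and F1. The synthetic data and all parameter values are illustrative assumptions standing in for the actual IoT traffic dataset, and the macro averaging is our assumption (per-class averaging over imbalanced classes would explain precision and recall sitting well below accuracy).

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic stand-in data; the study itself uses the full IoT traffic dataset.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=10,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier(random_state=42)

start = time.perf_counter()
model.fit(X_train, y_train)                 # training time, as in Table 7
train_time = time.perf_counter() - start

start = time.perf_counter()
y_pred = model.predict(X_test)
pred_time = (time.perf_counter() - start) / len(X_test)  # time per entry

print(f"Accuracy : {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, average='macro'):.4f}")
print(f"Recall   : {recall_score(y_test, y_pred, average='macro'):.4f}")
print(f"F1 score : {f1_score(y_test, y_pred, average='macro'):.4f}")
print(f"Training time (s): {train_time:.2e}, pred time per entry (s): {pred_time:.2e}")
```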
DTC achieved the highest accuracy, precision, recall and F1 score, which were 99.29%, 83.71%, 84.14% and 83.81% respectively, along with the fastest prediction time (2.77e-7 s). DTC also had the quickest training time (1.48e+1 s) among the algorithms that require a training process, unlike KNN and W-KNN, which do not have such a process. The superior performance of DTC may be attributed to its decision-making structure, which focuses on important attributes and creates clear decision boundaries. This simple approach maximizes computational efficiency, resulting in DTC's superior performance among the evaluated algorithms.
RFC came in second with an accuracy, precision, recall, F1 score and training time (again, among the algorithms that require a training process) of 99.23%, 79.73%, 71.49%, 73.02% and 3.83e+1 respectively. It also had a relatively low prediction time of 4.29e-6 s. Unlike DTC's single-tree decision-making process, RFC employs numerous Decision Trees and aggregates their predictions to enhance generalization. Although less computationally efficient than DTC, its integration of multiple trees better models complex relationships within the data. These attributes may have helped RFC achieve this performance, positioning it just below DTC in the rankings of the evaluated algorithms.
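As a sketch of the ensemble idea just described, the snippet below shows RFC aggregating many Decision Trees, reusing the splits from the previous sketch; the tree count is sklearn's default, not a value reported in this paper.

```python
from sklearn.ensemble import RandomForestClassifier

# Reuses X_train / y_train from the previous sketch. n_estimators=100 is
# sklearn's default, not a setting reported in this paper.
rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rfc.fit(X_train, y_train)
# Each fitted tree votes; the forest predicts the majority class per sample.
print(f"{len(rfc.estimators_)} trees contribute to every prediction")
```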
MLP ranked third in terms of accuracy, precision, recall and F1 score, which were 98.53%, 71.10%, 67.05% and 67.32% respectively. It had the highest training time of 1.05e+3 s, but a relatively low prediction time of 2.03e-6 s. Unlike the straightforward decision-making structures found in DTC and RFC, MLP utilizes multiple interconnected layers of nodes. This architecture, while powerful in capturing non-linear relationships within the data, demands more computational resources and careful tuning of parameters. The increased complexity and need for precise fine-tuning might have contributed to MLP's extended training time and lower ranking among the evaluated algorithms. Nevertheless, its ability to learn non-linear relationships contributed to its commendable performance, placing it in the third position among the evaluated models.
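For contrast with the tree-based models, a hedged sketch of an MLP follows; the hidden-layer width and iteration budget are illustrative assumptions rather than the configuration used in this study.

```python
from sklearn.neural_network import MLPClassifier

# Reuses the splits from the first sketch. One hidden layer of 100 nodes and
# a 300-iteration budget are illustrative choices only.
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
mlp.fit(X_train, y_train)  # iterative weight updates explain the long fit time
```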
Our DL model was the 4th best in terms of accuracy, precision and F1 score, which were 96.44%, 67.88% and 63.29% respectively (its recall was 59.28%). The DL model had the second highest training time of 2.83e+2 s and a prediction time of 4.95e-5 s. The DL model utilizes "deep" neural networks, so called because they have three or more hidden layers. This complex architecture, capable of discovering patterns and relationships within the data, can require substantial computational resources and may therefore lead to a longer training time. Its flexibility and adaptability to various data types, notwithstanding its computational demands, contribute to its recognized performance among the evaluated models.
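The "three or more hidden layers" definition can be made concrete with a minimal Keras sketch; the layer widths, activations and epoch count are assumptions for illustration, not the architecture used in this study.

```python
import tensorflow as tf

# Reuses X_train / y_train from the first sketch. Widths, activations and
# epochs are illustrative assumptions, not the paper's architecture.
dl = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 3: "deep"
    tf.keras.layers.Dense(len(set(y_train)), activation="softmax"),
])
dl.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
           metrics=["accuracy"])
dl.fit(X_train, y_train, epochs=10, batch_size=256, verbose=0)
```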
W-KNN ranked 5th best with an accuracy, precision, recall and F1 score of 94.67%, 66.72%, 61.47% and 62.13% respectively, while KNN was the 6th best model with an accuracy, precision, recall and F1 score of 93.73%, 65.81%, 60.40% and 61.12% respectively. The training times of W-KNN and KNN cannot meaningfully be compared with those of the other algorithms, since these neighbor methods have no training process. Because each prediction is based on the k nearest neighbors, prediction can be computationally inefficient on large datasets, causing W-KNN to have the second slowest prediction time and KNN the slowest. For the task at hand, detecting intrusions, both W-KNN and KNN are therefore relatively inefficient compared to the other algorithms. This inefficiency, along with KNN's relative insensitivity to the underlying data distribution and its sensitivity to irrelevant features, placed them at the lower end of the rankings among the evaluated models.
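The W-KNN / KNN distinction comes down to how neighbor votes are weighted, and the absence of a training phase is visible in the API: fit merely stores the data. A minimal sketch follows, with k = 5 an assumed (sklearn default) value, not one reported here.

```python
from sklearn.neighbors import KNeighborsClassifier

# Reuses the splits from the first sketch. k=5 is sklearn's default, not a
# value reported in this paper.
knn = KNeighborsClassifier(n_neighbors=5)                       # uniform votes
wknn = KNeighborsClassifier(n_neighbors=5, weights="distance")  # closer = heavier vote
knn.fit(X_train, y_train)   # "training" only memorizes the data
wknn.fit(X_train, y_train)
# Every prediction searches the stored samples for the k nearest neighbors,
# so prediction, not training, dominates the cost on large datasets.
```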
Results from Most Important 25 Features
Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Prediction Time (s)
------|----------|-----------|--------|----------|-------------------|--------------------
RFC   | 99.31%   | 89.20%    | 74.32% | 76.63%   | 4.49e+1           | 4.58e-6
DTC   | 99.29%   | 83.86%    | 84.91% | 84.26%   | 1.20e+1           | 2.39e-7
MLP   | 98.43%   | 69.88%    | 66.01% | 66.26%   | 1.37e+3           | 2.37e-6
DL    | 96.32%   | 64.26%    | 56.23% | 60.00%   | 3.22e+2           | 5.68e-5
W-KNN | 96.03%   | 64.13%    | 61.28% | 61.62%   | 8.70e-2           | 1.78e-3
KNN   | 95.45%   | 62.99%    | 59.97% | 60.42%   | 1.22e-1           | 1.74e-3
Table 8: Machine Learning Results Utilizing the Top 25 Features Identified by Random Forest Feature Importance
Table 8 shows the machine learning results from all tested classifiers using the 25 most important features identified by the Random Forest algorithm.
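A minimal sketch of how such a top-25 subset can be derived from Random Forest's impurity-based feature importances is shown below, reusing the splits from the earlier sketches; this is illustrative, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Reuses X_train / X_test / y_train from the first sketch.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Indices of the 25 features with the largest impurity-based importance.
top25 = np.argsort(rf.feature_importances_)[::-1][:25]
X_train_25, X_test_25 = X_train[:, top25], X_test[:, top25]
```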
RFC ranked as the best algorithm with an accuracy, precision, recall and F1 score of 99.31%, 89.20%, 74.32% and 76.63% respectively, which were slightly higher than those of RFC trained on the entire dataset. Surprisingly, RFC's training and prediction times with the 25 most important features were higher than with the entire dataset. The specific interaction of the chosen features might have necessitated more intricate calculations, leading to this unexpected increase in both training and prediction times.
DTC also yielded a high accuracy, precision, recall and F1 score of 99.29%, 83.86%, 84.91% and 84.26% respectively, marginally higher than those of DTC trained on the whole dataset. Its training and prediction times were also lower, making it more efficient.
MLP performed worse than MLP trained on the entire dataset, with an accuracy, precision, recall and F1 score of 98.43%, 69.88%, 66.01% and 66.26% respectively. Reducing the input to the 25 most important features likely removed information that was beneficial for MLP's learning process. While these 25 features were deemed most important, they might not have captured the full complexity of the patterns MLP exploited when trained on the entire dataset.
Our DL model recorded a slight decline in its performance metrics when restricted to the 25 most important features, yielding an accuracy of 96.32%, a precision of 64.26%, a recall of 56.23% and an F1 score of 60.00%. The narrowed feature set appears to have stripped away some of the complexity that contributed to its higher performance on the complete dataset.
W-KNN and KNN both followed a common trend. W-KNN achieved an accuracy of 96.03%, a precision of 64.13%, a recall of 61.28% and an F1 score of 61.62%, while KNN lagged just behind with 95.45% accuracy, 62.99% precision, 59.97% recall and an F1 score of 60.42%. This suggests that the 25-feature dataset may have emphasized subtle details vital for these neighbor-based models, marginally increasing their performance.
Results from Most Important Feature (IAT)
Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Prediction Time (s)
------|----------|-----------|--------|----------|-------------------|--------------------
DTC   | 99.06%   | 99.10%    | 86.00% | 90.46%   | 8.32e-1           | 1.76e-7
RFC   | 99.04%   | 99.09%    | 80.58% | 85.76%   | 1.26e+1           | 3.87e-6
KNN   | 99.03%   | 89.59%    | 77.84% | 81.74%   | 7.94e-1           | 6.08e-5
W-KNN | 98.91%   | 90.32%    | 80.91% | 84.10%   | 7.63e-1           | 1.95e-5
DL    | 91.27%   | 40.37%    | 41.73% | 41.04%   | 3.18e+2           | 4.55e-5
MLP   | 89.06%   | 37.89%    | 38.80% | 37.72%   | 1.55e+3           | 2.38e-6
Table 9: Machine Learning Results Utilizing Only the IAT Feature
Table 9 outlines the machine learning results from all tested classifiers using only the IAT feature. A stark transformation in the relative performance of the models can be observed when considering only this single feature.
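Restricting a model to one feature mainly requires keeping the input two-dimensional; below is a minimal sketch, where the IAT column index is a hypothetical placeholder rather than the dataset's actual column position.

```python
# Reuses the splits and the DecisionTreeClassifier import from the first sketch.
IAT_COLUMN = 0  # hypothetical index of the IAT feature in the matrix

# Keep a (n_samples, 1) matrix rather than a flat (n_samples,) vector,
# since sklearn estimators expect 2-D inputs.
X_train_iat = X_train[:, [IAT_COLUMN]]
X_test_iat = X_test[:, [IAT_COLUMN]]

dtc_iat = DecisionTreeClassifier(random_state=42).fit(X_train_iat, y_train)
```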
DTC takes the lead, achieving an impressive accuracy of 99.06%, a precision of 99.10%, a recall of 86.00% and an F1 score of 90.46%. Its training time drops dramatically to 8.32e-1 s, and its prediction time stays extremely low at 1.76e-7 s. Focusing only on IAT, DTC's inherently decisive approach seems to leverage the crucial information encoded within this single feature, yielding a substantial improvement in precision and recall.
Close behind is RFC, achieving 99.04% accuracy, 99.09% precision, 80.58% recall and an 85.76% F1 score. Its training time is 1.26e+1 s, with a prediction time of 3.87e-6 s. By exploiting only the IAT feature, RFC's ensemble of Decision Trees appears to have harnessed the essential characteristics of the data, almost mirroring DTC's performance.
KNN and W-KNN show remarkable performance as well, with KNN reaching 99.03% accuracy, 89.59% precision, 77.84% recall and an 81.74% F1 score, and W-KNN at 98.91%, 90.32%, 80.91% and 84.10% for the same metrics. Interestingly, both models, which were previously lower in the rankings, climbed up when focusing solely on the IAT feature. IAT, being a significant factor in distinguishing patterns, appears to resonate well with the neighbor-based decision-making process, enhancing the effectiveness of both KNN and W-KNN.
Our DL model experiences a drop in performance with the IAT-only approach, recording an accuracy of 91.27%, a precision of 40.37%, a recall of 41.73% and an F1 score of 41.04%. Its training and prediction times remain similar to those in the previous evaluations. The intricate architecture of deep neural networks seems to require a richer set of features to capture the underlying complexities of the data, and isolating a single feature diminishes the model's capacity to generalize well.
MLP experiences the most significant decline, standing at 89.06% accuracy, 37.89% precision, 38.80% recall and a 37.72% F1 score. The extended training time of 1.55e+3 s reflects MLP's struggle to adapt to the IAT-only feature set. Similar to DL, the underlying complexity of MLP appears to demand a more comprehensive feature set; reducing the input to a single characteristic seems to restrain MLP's capability to discern non-linear relationships within the data, leading to its diminished performance.
Figure 2 presents the confusion matrix for the results obtained from a Decision Tree Classifier utilizing the Inter-Arrival Time (IAT) feature to distinguish between various classes of cybersecurity threats. The key to the classes is provided above the matrix, with each cybersecurity threat assigned a unique numerical identifier ranging from 0 to 33.
For example, 'DDoS-RSTFINFlood' is assigned '0', 'DoS-TCP Flood' is assigned '1', and so on, up to 'Uploading Attack', which is assigned '33'. The identifier '12' represents 'BenignTraffic', which serves as a crucial baseline for identifying anomalous, harmful network behavior.
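A Figure 2-style confusion matrix can be produced as sketched below, continuing from the single-feature sketch above; the numeric class key (e.g. 0 = 'DDoS-RSTFINFlood', 12 = 'BenignTraffic') would be supplied alongside the plot.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Continues from the IAT-only sketch above. With many classes (0-33 in
# Figure 2), per-cell values are typically suppressed for readability.
y_pred_iat = dtc_iat.predict(X_test_iat)
cm = confusion_matrix(y_test, y_pred_iat)
ConfusionMatrixDisplay(cm).plot(include_values=False)
plt.show()
```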