The research methodology adopted in this study follows the design science approach outlined by Alomari et al. (2023), encompassing three pivotal stages. Firstly, the selection, preprocessing, and partitioning of datasets, namely the Malware and Android Dataset, were meticulously executed to facilitate subsequent analysis. Secondly, feature selection was conducted based on correlations with the target attribute, followed by training deep learning models, including Dense and LSTM, using various feature subsets and data splitting configurations. Lastly, the performance of the models was rigorously assessed using metrics such as accuracy, training time, and precision.
The study incorporates two datasets with contrasting characteristics to provide comprehensive insights. The Malware Dataset, sourced from Kaggle, comprises 100,000 observations, evenly split between malware and benign samples. This dataset, created in a Unix/Linux-based virtual machine, features thirty-five attributes tailored for Android device classification. In contrast, the Android Dataset, originally curated by Tiwari (2018), contains feature vectors extracted from 15,036 Android applications: 5,560 apps are categorized as malware from the Drebin project, while the remaining 9,476 apps are benign. Figure 1 depicts the distribution of data points across the malware and benign classes within the Android malware dataset.
In this study, a variety of deep learning techniques were introduced and utilized. To train the deep learning models effectively on the two datasets, a crucial preprocessing step was undertaken: label-encoding the target (classification) column and addressing any special characters or missing values in the data. Given the distinct characteristics of each dataset, the preprocessing steps applied to them necessarily differed. After preprocessing, each dataset was split into separate training and test sets, with multiple splitting scenarios incorporated to allow comprehensive analysis. Before training the deep learning models, a feature selection process was applied to improve computational efficiency. The study employed various training scenarios, encompassing different splitting criteria, deep learning architectures, and the option to include or exclude feature selection during training. Figure 2 outlines the proposed methodology for both datasets.
During the preprocessing stage, several steps were undertaken to ensure data quality (see the sketch following this list):
- Special characters and missing values were addressed by replacing them with "NaN."
- As a result, the "Hash" attribute, which exclusively contained special characters, was dropped from the dataset.
- The target class (classification) was designated with labels: zero for benign instances and one for malware instances.
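For concreteness, a minimal pandas sketch of these preprocessing steps is given below; the file name, the set of special characters, and the raw label strings are assumptions, while the "Hash" and "classification" column names follow the description above.

```python
import numpy as np
import pandas as pd

# Load the raw data (file name is illustrative).
df = pd.read_csv("malware_dataset.csv")

# Replace entries consisting of special characters with NaN
# (the character set shown here is an assumption).
df = df.replace(r"^[?#@&*]+$", np.nan, regex=True)

# The "Hash" attribute contained only special characters, so it is dropped.
df = df.drop(columns=["Hash"])

# Encode the target column: 0 for benign, 1 for malware
# (the raw label strings are assumptions).
df["classification"] = df["classification"].map({"benign": 0, "malware": 1})
```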
Following the preprocessing phase, the dataset was partitioned into training and testing sets according to various scenarios (sketched in code after this list):
- 80% of the data was allocated for training, with the remaining 20% reserved for testing.
- Another scenario involved a split of 75% for training and 25% for testing.
- Lastly, a split of 70% for training and 30% for testing was also considered.
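The three splitting scenarios can be expressed compactly with scikit-learn, as sketched below; stratification and the fixed random seed are assumptions added for reproducibility.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["classification"])
y = df["classification"]

# One train/test pair per splitting scenario: 80/20, 75/25, and 70/30.
splits = {
    test_size: train_test_split(X, y, test_size=test_size,
                                stratify=y, random_state=42)
    for test_size in (0.20, 0.25, 0.30)
}
X_train, X_test, y_train, y_test = splits[0.20]
```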
Dealing with high-dimensional data presents a significant challenge in machine learning, increasing computational complexity and storage requirements. To mitigate this challenge, feature selection techniques prove invaluable by eliminating irrelevant and redundant data. This not only reduces computational overhead but also enhances learning accuracy and provides deeper insights into the model or data. Thus, a feature selection technique was employed in this study to identify the most pertinent features that bolster the model's predictive capabilities. Figure 3 visually illustrates the correlation-based feature selection approach adopted in this research (Dhal & Azad, 2021; Jie, Jiawei, Shulin, & Sheng, 2018).
Equation 1 is applied to calculate the correlations between all independent attributes and the target or dependent feature.
$$Corr_{x,y}=\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i}\left(x_{i}-\bar{x}\right)^{2}\sum_{i}\left(y_{i}-\bar{y}\right)^{2}}} \quad (1)$$
where \(Corr_{x,y}\) is the correlation between feature \(x\) and the target feature \(y\), and \(\bar{x}\) and \(\bar{y}\) are the mean values of \(x\) and \(y\), respectively. After the required \(K\) features were obtained, a candidate list of features to drop was prepared. Several selection scenarios were created, considering that the absolute correlation values range between 0 and 1. The same methodology was applied to the second dataset, except for the selection step: owing to its larger number of columns (215), specific correlation thresholds were used to eliminate columns from consideration.
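A minimal pandas sketch of this correlation-based selection is given below; the threshold value is illustrative, since the study varied thresholds across scenarios, and the variable names follow the earlier sketches.

```python
# Absolute Pearson correlation of every feature with the target (Eq. 1).
corr_with_target = X_train.corrwith(y_train).abs()

# Keep features above an illustrative threshold; the remainder form
# the candidate drop list described above.
threshold = 0.1
selected = corr_with_target[corr_with_target > threshold].index
to_drop = corr_with_target[corr_with_target <= threshold].index

X_train_sel = X_train[selected]
X_test_sel = X_test[selected]
```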
Dense Layer Model. The dense layer model was constructed with varying configurations based on the characteristics of the datasets used. For scenarios involving the first dataset, hidden layers were designed with fifty neurons, while for the second dataset, with its larger attribute count of 215, one hundred neurons were employed. The input layer of the model was adaptable to different input sizes corresponding to the number of features retained by the feature selection process. Five hidden layers were incorporated, as no further improvement was observed beyond this configuration. Activation functions (AFs) play a crucial role in neural networks by enabling the learning of abstract features through non-linear transformations (Dubey, Singh, & Chaudhuri, 2022); hence, the non-linear "relu" activation function was applied in each of the five hidden layers. Finally, the output layer, the last dense layer, employed the "softmax" activation function. The model was intentionally kept simple to minimize the number of learnable parameters. Unlike previous studies that conducted experiments to determine the optimal number of neurons and hidden layers for a specific problem, the focus here was on streamlining the architecture and avoiding unnecessary complexity.
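For illustration, the described dense architecture could be sketched in Keras as follows; fifty neurons per hidden layer match the first dataset's scenarios, while the two-unit softmax output, optimizer, and loss are assumptions not specified above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dense_model(n_features: int, n_neurons: int = 50) -> keras.Model:
    """Dense model sketch: an input sized to the selected features,
    five "relu" hidden layers, and a "softmax" output layer."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for _ in range(5):
        model.add(layers.Dense(n_neurons, activation="relu"))
    # Two-unit softmax output for the benign/malware classes (assumption).
    model.add(layers.Dense(2, activation="softmax"))
    model.compile(optimizer="adam",  # optimizer and loss are assumptions
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```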
Long Short-Term Memory (LSTM) Model. When training deep neural networks, a significant challenge arises from the vanishing or exploding gradient problem, which impedes the learning of long-term dependencies. In response to this issue, the LSTM (Long Short-Term Memory) architecture was introduced. LSTM networks are specifically engineered to alleviate the vanishing or exploding gradient problem and excel at capturing and learning long-term dependencies in sequential data (Houdt, Mosquera, & Nápoles, 2020). In the proposed LSTM model, the initial dense layer was substituted with an LSTM layer, utilizing a "relu" activation function. However, the remaining dense layers and the output layer remained unchanged from the previous dense model. This substitution led to an increase in the number of learnable parameters and consequently extended the training time.
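A corresponding sketch of the LSTM variant is shown below; treating each tabular row as a length-one sequence is an assumption of this sketch, as the original input shaping is not described.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm_model(n_features: int, n_neurons: int = 50) -> keras.Model:
    """LSTM variant sketch: the first dense layer is replaced by an
    LSTM layer with "relu" activation; the remaining dense layers and
    the output layer match the dense model above."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(1, n_features)))  # length-1 sequences
    model.add(layers.LSTM(n_neurons, activation="relu"))
    for _ in range(4):
        model.add(layers.Dense(n_neurons, activation="relu"))
    model.add(layers.Dense(2, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Inputs must be reshaped to (samples, timesteps, features), e.g.:
# X_lstm = X_train_sel.to_numpy().reshape(-1, 1, X_train_sel.shape[1])
```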
Training Scenarios. During the training phase, a series of experiments were conducted, exploring various scenarios that involved feature selection, the inclusion of an LSTM layer, different feature selection thresholds, and varied data splitting criteria. For the first dataset, the training scenarios included:
- Training a deep learning model with a dense layer, using both the original dataset and different subsets of selected features. This involved four feature groups in addition to the original dataset, resulting in five unique scenarios.
- Modifying the deep learning model by adding an LSTM layer and training it with the first set of features.
- Training a deep learning model with three different data splitting criteria, resulting in two additional scenarios.
For the Android Malware dataset, the training scenarios were as follows:
- Training a dense layer model using the original dataset and three distinct subsets of selected features, resulting in four scenarios.
- Incorporating an LSTM layer into the deep learning model and training it with the original dataset.
- Training a deep learning model with various data splitting criteria.
In total, twelve different training scenarios were executed to evaluate the effects of utilizing different datasets, employing various feature selection methods, implementing diverse data splitting criteria, and employing different deep learning architectures.
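A simple driver loop over such a scenario grid might look like the sketch below; the scenario dictionary, epoch count, and validation split are all illustrative assumptions, reusing the model builders from the earlier sketches.

```python
import time

# Hypothetical scenario grid; each entry maps a name to one
# train/test configuration produced by the sketches above.
scenarios = {"80/20, all features": (X_train, X_test, y_train, y_test)}

results = []
for name, (X_tr, X_te, y_tr, y_te) in scenarios.items():
    model = build_dense_model(X_tr.shape[1])
    start = time.time()
    model.fit(X_tr.to_numpy(), y_tr.to_numpy(),
              epochs=10, validation_split=0.1, verbose=0)  # epochs assumed
    train_time = time.time() - start
    _, test_acc = model.evaluate(X_te.to_numpy(), y_te.to_numpy(), verbose=0)
    results.append((name, test_acc, train_time))
```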
Evaluation Criteria. The evaluation step serves as the final phase, where the performance of the model is assessed using various metrics, including validation accuracy, training time, precision, recall, and F1-score. During training, validation accuracy is computed on a holdout (validation) set to monitor the model's performance. Test accuracy, by contrast, is calculated after training is complete and measures the trained model's ability to correctly classify new instances as either malware or benign. Precision, recall, and F1-score were derived from four basic counts, where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively. These metrics are computed as follows:
- Precision: The precision is determined by dividing the number of true positives (TP) by the sum of true positives and false positives (TP + FP). It is computed using the equation below:
$$Precision=\frac{TP}{TP+FP} \quad (2)$$
- Recall: The recall is calculated by dividing the number of true positives (TP) by the sum of true positives and false negatives (TP + FN). It is computed using the equation below:
$$Recall=\frac{TP}{TP+FN} \quad (3)$$
- F1-score: The F1-score is the harmonic mean of precision and recall. It is computed using the equation below:
$$F1\text{-}score=2\times\frac{Precision\times Recall}{Precision+Recall} \quad (4)$$
TP (true positives) refers to cases where the model correctly identifies malware samples as malware. FN (false negatives) represents cases where the model incorrectly labels malware samples as benign. TN (true negatives) denotes instances where the model correctly identifies benign samples as benign. FP (false positives) signifies cases where the model incorrectly labels benign samples as malware. Optimal performance is achieved when TP and TN are at their highest values and, equivalently, when FP and FN are at their lowest, indicating that the model accurately classifies both malware and benign samples.

Precision, also known as confidence in the data mining literature, is the proportion of predicted malware samples that are truly malware; it indicates the model's accuracy in identifying malware instances among the predicted positives. Recall, also referred to as sensitivity in psychology, is the proportion of actual malware samples that are correctly predicted as malware; it measures the model's ability to find all true malware instances among the actual positives (Powers, 2011). A high precision value therefore indicates that the model rarely misclassifies benign samples as malware, i.e., it produces few false positives.

When both precision and recall are high, the F1-score is also high. The F1-score combines precision and recall into a single, balanced metric that accounts for both false positives and false negatives; a high F1-score indicates a model that achieves a good balance between precision and recall, effectively managing both types of misclassification.
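These four counts and the three derived metrics map directly onto a confusion matrix, as the short sketch below illustrates; `model`, `X_test`, and `y_test` are assumed to come from the earlier sketches.

```python
from sklearn.metrics import confusion_matrix

# Predicted class labels from the softmax output.
y_pred = model.predict(X_test.to_numpy(), verbose=0).argmax(axis=1)

# For binary labels, ravel() yields the four basic counts in this order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision = tp / (tp + fp)                          # Eq. 2
recall = tp / (tp + fn)                             # Eq. 3
f1 = 2 * precision * recall / (precision + recall)  # Eq. 4
```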