Antivirus software must be able to detect harmful files on Android devices, yet the situation is far from ideal. Malware detection software often relies on the classic method of comparing suspect executable files against a database of known signatures, but malware authors are skilled enough to evade it, and the method also demands considerable time and resources. Machine learning models have been used to address this problem, with different researchers employing different features for malware identification and categorization. A researcher with domain expertise is more likely to select characteristics that precisely capture hazardous behaviour and yield superior results; otherwise, the outcome can be unsatisfactory.
Deep learning was proposed as an answer to these issues. In several earlier investigations, the APK was reverse engineered to extract properties [43] [44] for model construction. That technique produced more accurate results, but it is difficult and time-consuming to apply. Our research addresses this gap: we wanted to use a simpler model with less training data. Before training deep learning models, we preprocess the input executable files (APKs) of benign and malicious software into images.
Converting malware binaries into images and applying machine learning to those images enables Android malware to be detected successfully. In existing research, the target apps' entire executable files (e.g., DEX files in Android application packages) are converted into images and used for machine learning. However, the complete DEX file, which consists of a header section, an identifier section, a data section, an optional link data region, and so on, may carry noisy information that makes malware harder to detect. In this study, we convert only the data portions of DEX files into grayscale images and apply CNN-based machine learning to them.
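To illustrate the conversion step, the following is a minimal sketch that maps a raw byte sequence to a grayscale image, one byte per pixel. It assumes the relevant byte range (e.g., the DEX data section) has already been sliced out; the row width of 256 is an illustrative choice, not necessarily the value used in our experiments.

```python
import numpy as np
from PIL import Image

def bytes_to_grayscale(data: bytes, width: int = 256) -> Image.Image:
    """Map a raw byte sequence to a 2-D grayscale image.

    Each byte (0-255) becomes one pixel intensity; the byte stream is
    wrapped into rows of `width` pixels and zero-padded to fill the
    last row.
    """
    arr = np.frombuffer(data, dtype=np.uint8)
    height = int(np.ceil(len(arr) / width))
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: len(arr)] = arr
    return Image.fromarray(padded.reshape(height, width))

# Hypothetical usage: "data_section.bin" stands for an already-extracted
# DEX data section.
# with open("data_section.bin", "rb") as f:
#     bytes_to_grayscale(f.read()).save("sample.png")
```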
a) Performance of the customized CNN model
Initially, we built a basic Convolutional Neural Network from scratch following the approach of [45], trained it on the training image dataset, and tested it. The training progress curves for accuracy and loss are shown in Fig. 7; they indicate that the model is well fitted, with no sign of overfitting. Table 2 displays the resulting CNN's performance. The confusion matrix in Fig. 8 shows that all benign samples in the test set were predicted correctly (recall: 100%). For the malignant samples, however, only 155 out of 176 images were predicted correctly (recall: 88.1%); the model misclassified the remaining malignant images as benign (11.9%).
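The exact architecture of [45] is not reproduced here, but a minimal Keras sketch of this kind of from-scratch CNN for binary (benign vs. malignant) image classification might look as follows; the layer sizes and the 256x256 grayscale input resolution are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE = (256, 256)  # assumed input resolution

# A small stack of conv/pool blocks followed by a dense classifier.
model = models.Sequential([
    layers.Input(shape=(*IMG_SIZE, 1)),      # single-channel grayscale input
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),   # benign vs. malignant
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```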
b) Performance of the transfer learning
We used a VGG-16 model pretrained on ImageNet to extract features and passed the result to a new classifier. Setting weights="imagenet" retrieves the VGG-16 weights learned on the ImageNet dataset, and include_top=False is required to avoid downloading the pretrained model's fully connected layers. An early stopping callback monitors the model's validation loss and decides when to halt training based on its parameterization: with the patience set to 5, training stops if the validation loss fails to improve by at least the minimum delta for five consecutive epochs. A learning rate scheduler is responsible for lowering the learning rate as the epochs progress and the validation loss plateaus. The total number of parameters across the network, comprising the VGG-16 weights, the trainable weights in the top layers, and the fully connected layers, is 14,976,834. Because the VGG-16 layers were frozen, only 262,146 parameters remain trainable, while 14,714,688 are non-trainable. The loss and accuracy plots (Fig. 9) indicate that the model is well optimized, and its performance is shown in Table 3. Compared with the confusion matrix of the previous CNN model, the recall of both classes improved: the benign class reached 95.8% true predictions, while the malignant class reached 99.4%.
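A Keras sketch of this transfer learning setup is given below. The callback thresholds and the input resolution are assumptions; that said, a 512x512x3 input with a single dense softmax head on the frozen convolutional base would reproduce the reported parameter counts (16x16x512 = 131,072 flattened features, times 2 classes plus 2 biases = 262,146 trainable; 14,714,688 frozen). Grayscale inputs would need to be replicated across 3 channels, since the ImageNet weights expect RGB.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Convolutional base pretrained on ImageNet; include_top=False drops the
# original fully connected classifier head.
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(512, 512, 3))  # assumed input resolution
base.trainable = False  # freeze the 14,714,688 pretrained parameters

model = models.Sequential([
    base,
    layers.Flatten(),                       # 16x16x512 -> 131,072 features
    layers.Dense(2, activation="softmax"),  # new trainable classifier head
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Halt training if val_loss fails to improve by min_delta for 5 epochs.
    EarlyStopping(monitor="val_loss", patience=5, min_delta=1e-4,
                  restore_best_weights=True),
    # Lower the learning rate when the validation loss plateaus.
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3),
]
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
```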
Table 3: Performance of the developed models.

Model            | Accuracy (%) | F1 score (%) | Precision (%) | Recall (%)
Customised CNN   | 93.43        | 93.42        | 93.63         | 94.03
Pretrained VGG16 | 97.81        | 97.78        | 97.98         | 97.63
The results indicate that the transfer-learned VGG-16 model outperforms the customized CNN model on the malware detection problem. As the images grow larger, a more complex model, such as one initialized with ImageNet weights, may be necessary to learn the informative features [46]. We developed the custom CNN to understand how a less complicated model performs on the malware detection task. However, because the images were created directly from the APKs, their size was too large, and the convolutional layers we designed could not handle the feature extraction as efficiently as the VGG-16 model. If the training images were generated only from the .dex content inside the APK file [47], they would be smaller and contain only the required information, so a smaller model might be able to classify them properly. However, extracting the relevant information from the .dex file requires more time and computation, since the APKs must be reverse engineered, as the sketch below illustrates.
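Because an APK is an ordinary ZIP archive, pulling out classes.dex itself is straightforward; the costly part is the subsequent parsing of the DEX structure into its sections. A minimal sketch follows; note that multi-dex apps also ship classes2.dex, classes3.dex, and so on, which this sketch ignores.

```python
import zipfile

def extract_dex(apk_path: str, out_path: str = "classes.dex") -> bytes:
    """Pull the primary DEX file out of an APK (APKs are ZIP archives).

    Returns the raw DEX bytes, which would then need to be parsed into
    sections (header, identifiers, data, ...) before image conversion.
    """
    with zipfile.ZipFile(apk_path) as apk:
        data = apk.read("classes.dex")
    with open(out_path, "wb") as f:
        f.write(data)
    return data
```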
c) Comparison of performance
As shown in Table 4, a number of works have trained CNNs on images to recognise malware by converting APKs or APK components into images. [25] created a variety of CNNs, including VGG, GoogleNet, AlexNet, and Inception-v3, using the bytecode of classes.dex extracted from the Android ZIP archive and converted to RGB images. That study proposes a detection method that can identify both known and unknown Android malware. The authors also intended to integrate the method into the backend of their main product to provide convenient usage scenarios for consumers and enterprises, so reducing computational cost mattered more to them than accuracy.
[27] used grayscale images taken directly from the executable of a mobile app sample. The feature set is fed into a three-layer DNN to determine whether the sample under examination is malware and, if so, to which family and variant it belongs. Their neural network was far more basic than our model, which would account for its lower accuracy.
[26] converted the DEX file into RGB images and plain text based on the section characteristics of the DEX file, then extracted image and text properties to classify Android malware. A variety of properties were examined in the images, and the results were used to train the classifier. With manual feature extraction and conventional machine learning it achieved 96% accuracy, but at the cost of considerable effort.
Similar to our work, [48] converted raw malware binaries into colour images that an optimised CNN architecture used to detect and classify malware families; that model's performance is comparable to ours.
The findings show that our model can match the performance of a variety of cutting-edge models that decide whether a file is benign or malignant using extracted data, such as API calls [49] recovered and turned into sequences [50]. To reduce false positives and false negatives in the prediction, it is fundamental to learn which characteristics contribute to malware's malignant nature, or which sequences are present across malware samples; combining attention networks with a standard CNN can achieve this [51].
Table 4: Comparison of performance to previous works.

Works                              | Accuracy
Hsien-De Huang and Kao (2018) [25] | 90%
Mercaldo and Santone (2020) [27]   | 91.8%
Fang et al. (2020) [26]            | 96%
Vasan et al. (2020) [48]           | 97.35%
Our model                          | 97.81%
Antivirus software must be able to detect harmful files on Android devices, and machine learning models have been applied to this problem. We create images by preprocessing both malicious and benign input files (APKs). Although earlier reverse-engineering-based procedures were more accurate, they were difficult and time-consuming to put into practice. To better understand how a simpler model performs on the malware detection task, we created a custom CNN; but because the images were made directly from the APKs, they were too large, and the convolutional layers we developed could not handle the feature extraction as effectively as the VGG-16 model. With the transfer-learned VGG-16, the benign class achieved 95.8% correct predictions, while the malignant class achieved 99.4%.