Accuracy alone is not sufficient to evaluate our model. We therefore assessed it with additional metrics: precision-recall, prediction error, learning curve, manifold learning, calibration curve, validation curve, lift chart, gain chart, and KS plot.
5.1. PRECISION-RECALL
The precision-recall curve illustrates the trade-off between recall, a measure of completeness, and precision, a measure of result relevance, for a classifier. Precision is the ratio of true positives to the sum of true and false positives, whereas recall is the ratio of true positives to the sum of true positives and false negatives, computed for each class.
Precision: Precision measures how exact a classifier is. For each class, it is defined as the ratio of true positives to the sum of true and false positives.
Recall: Recall measures how complete a classifier is, i.e., its capacity to correctly identify every positive instance. For each class, it is defined as the ratio of true positives to the sum of true positives and false negatives.
Both precision and recall take values between 0 and 1. When selecting and tuning machine learning models, we usually aim to maximize both, producing a model that correctly identifies the majority of the instances it selects. Graphically, this corresponds to a precision-recall curve with a large area under it.
Fig. 14 compares the precision-recall curves of the machine learning algorithms in our study; the light gradient boosting machine achieves the highest average precision, 0.98.
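As a concrete illustration, the sketch below plots a precision-recall curve with scikit-learn. The synthetic data from make_classification and the RandomForestClassifier are placeholders standing in for our permissions dataset and the compared models.

```python
# Minimal sketch: plotting a precision-recall curve with scikit-learn.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]      # probability of the positive class

precision, recall, _ = precision_recall_curve(y_te, scores)
ap = average_precision_score(y_te, scores)  # area-style summary of the curve

plt.plot(recall, precision, label=f"AP = {ap:.2f}")
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend()
plt.show()
```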
5.2. PREDICTION ERROR
Prediction error assesses how well samples are classified into their respective categories. The goal is to find a rule that predicts the outcome or category well for new cases where the response is unknown.
Fig. 15 shows the prediction error of the machine learning algorithms we compared; the light gradient boosting machine has the lowest error.
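A minimal sketch of how prediction error can be summarized numerically: the off-diagonal entries of the confusion matrix are the misclassified samples per true class (a class prediction error chart is essentially this matrix drawn as stacked bars). The data and classifier are placeholders.

```python
# Minimal sketch: per-class prediction error via a confusion matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

y_pred = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)

cm = confusion_matrix(y_te, y_pred)
# Off-diagonal entries count samples assigned to the wrong class.
errors = cm.sum(axis=1) - np.diag(cm)
error_rate = errors / cm.sum(axis=1)
print("per-class error rate:", error_rate)
```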
5.3. LEARNING CURVE
A learning curve is a graph that shows how the training and cross-validation accuracy scores change as the number of training samples increases.
The curve therefore shows both the training and cross-validation scores. Increasing the number of examples has little effect on the training score, but it clearly improves the cross-validation score. Once we reach about 1000-1200 examples, performance changes very little, so adding further examples to the ones we already have is unlikely to be necessary, as shown in Fig. 16.
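A minimal sketch of how such a curve is computed with scikit-learn's learning_curve function; the synthetic data stands in for our permissions dataset.

```python
# Minimal sketch: computing and plotting a learning curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1500, random_state=0)  # placeholder data

sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, cv_scores.mean(axis=1), label="cross-validation score")
plt.xlabel("Training examples"); plt.ylabel("Accuracy"); plt.legend()
plt.show()
```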
5.4. MANIFOLD LEARNING
Manifold: A d-dimensional manifold is a region of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane.
Manifold learning builds a model of the manifold on which the training instances lie. It rests on the manifold hypothesis: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
The manifold learning visualization for the machine learning algorithms compared in our research is shown in Fig. 17, where ransomware permissions are classified much more accurately than normal permissions.
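As one hedged example of a manifold learning technique, the sketch below embeds high-dimensional feature vectors into two dimensions with t-SNE (the specific algorithm behind Fig. 17 may differ); the synthetic data is a placeholder for the permission vectors.

```python
# Minimal sketch: 2-D manifold embedding of feature vectors with t-SNE.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

X, y = make_classification(n_samples=500, n_features=30, random_state=0)  # placeholder

emb = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="coolwarm", s=10)
plt.title("t-SNE embedding (0 = normal, 1 = ransomware)")
plt.show()
```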
5.5. CALIBRATION CURVE
Calibration curves are used to assess how well a classifier is calibrated, i.e., how closely its predicted probabilities for each class label match the observed outcomes. The x-axis shows the mean predicted probability in each bin, and the y-axis shows the fraction of positives (the proportion of samples in the bin that are actually positive). The curve of an ideally calibrated model is the straight diagonal line from (0, 0) to (1, 1).
The calibration curves for the machine learning algorithms compared in our research are shown in Fig. 18; the random forest classifier is calibrated much more accurately for ransomware than for normal permissions.
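A minimal sketch of computing a calibration curve with scikit-learn's calibration_curve; the data and classifier are placeholders.

```python
# Minimal sketch: drawing a calibration (reliability) curve.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], "--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability"); plt.ylabel("Fraction of positives")
plt.legend(); plt.show()
```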
5.6. VALIDATION CURVE
A validation curve is a helpful diagnostic tool that shows how sensitive a model's accuracy is to changes in a model parameter. It typically plots the relationship between a parameter's value and the model's score, and consists of two curves: one for the training-set score and one for the cross-validation score. By default, the validation curve function in the scikit-learn library performs 3-fold cross-validation.
The model's hyperparameters must be chosen so that the model performs well in the designated feature space, maximizing the score. A grid search is the most effective method for choosing among the many hyperparameter combinations present in most models. To determine whether the estimator is underfitting or overfitting for some hyperparameter values, it is helpful to plot the influence of a single hyperparameter on the training and validation scores.
The validation curves for the machine learning algorithms compared in our study are shown in Fig. 19, where the random forest classifier, extra trees classifier, and light gradient boosting machine classify ransomware permissions much more accurately than normal permissions.
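A minimal sketch using scikit-learn's validation_curve to sweep a single hyperparameter (here max_depth of a random forest, chosen only for illustration) with 3-fold cross-validation.

```python
# Minimal sketch: a validation curve over one hyperparameter.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data
depths = np.arange(1, 11)

train_scores, cv_scores = validation_curve(
    RandomForestClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=3)

plt.plot(depths, train_scores.mean(axis=1), label="training score")
plt.plot(depths, cv_scores.mean(axis=1), label="cross-validation score")
plt.xlabel("max_depth"); plt.ylabel("Accuracy"); plt.legend()
plt.show()
```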
5.7. LIFT CHART
Lift at decile i is the ratio of the cumulative positive observations captured by the model up to decile i to the positive observations that a random model would capture up to that decile. A lift chart plots the lift on the vertical axis against the corresponding decile on the horizontal axis.
Fig. 20 depicts the lift charts for the machine learning algorithms compared in our study; the decision tree classifier shows a more pronounced lift for ransomware than for normal permissions.
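Since lift is straightforward to compute by hand, the sketch below derives cumulative lift per decile from predicted probabilities; the labels and scores are synthetic placeholders.

```python
# Minimal sketch: computing cumulative decile lift from predicted probabilities.
import numpy as np

def decile_lift(y_true, y_score, n_bins=10):
    """Lift per decile: positives captured by the model vs. a random model."""
    order = np.argsort(y_score)[::-1]          # rank samples by descending score
    y_sorted = np.asarray(y_true)[order]
    bins = np.array_split(y_sorted, n_bins)    # roughly equal deciles
    overall_rate = y_sorted.mean()             # base positive rate (random model)
    cum_pos = np.cumsum([b.sum() for b in bins])
    cum_n = np.cumsum([len(b) for b in bins])
    return cum_pos / cum_n / overall_rate      # cumulative lift per decile

# Hypothetical labels/scores; in practice y_score = clf.predict_proba(X_te)[:, 1]
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
y_score = np.clip(y_true * 0.5 + rng.random(500) * 0.5, 0, 1)
print(decile_lift(y_true, y_score))            # approaches 1.0 at the last decile
```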
5.8. GAIN CHART
Gain at a decile is the proportion of the cumulative positive observations captured up to that decile out of all positive observations. As seen in Fig. 21, the gain chart plots the gain on the vertical axis against the decile on the horizontal axis.
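The corresponding cumulative gain can be computed the same way as lift, but dividing by the total number of positives rather than by the random-model baseline; again with placeholder data.

```python
# Minimal sketch: cumulative gain per decile (share of all positives captured).
import numpy as np

def decile_gain(y_true, y_score, n_bins=10):
    """Cumulative share of all positives captured up to each decile."""
    order = np.argsort(y_score)[::-1]        # rank samples by descending score
    y_sorted = np.asarray(y_true)[order]
    bins = np.array_split(y_sorted, n_bins)  # roughly equal deciles
    cum_pos = np.cumsum([b.sum() for b in bins])
    return cum_pos / y_sorted.sum()

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)                 # hypothetical labels
y_score = y_true * 0.5 + rng.random(500) * 0.5   # hypothetical scores
print(decile_gain(y_true, y_score))  # rises toward 1.0; a random model rises linearly
```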
5.9. KS STATISTIC PLOT
The Kolmogorov-Smirnov (KS) plot is a variant of the probability plot correlation coefficient (PPCC) plot. A PPCC plot is a graphical data analysis technique for determining which member of a given distributional family provides the best distributional fit to the data. The KS plot modifies the PPCC plot by replacing the probability plot correlation coefficient with the value of the Kolmogorov-Smirnov goodness-of-fit statistic as the measure of distributional fit; we seek the shape parameter value that minimizes this statistic. The KS plot is created by choosing shape parameter values and computing the Kolmogorov-Smirnov goodness-of-fit statistic for each. The KS plot then includes the following:
The value of the distributional parameter (on the horizontal axis) that corresponds to the minimum of the KS plot curve (on the vertical axis) indicates the family member that best fits the data.
The KS plots for the machine learning algorithms compared in our study are depicted in Fig. 22, where all of the implemented machine learning algorithms show values between 0.8 and 0.98 for normal and ransomware permissions.
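As a hedged sketch of the classifier-evaluation variant of the KS statistic (the maximum separation between the two classes' predicted-score distributions), the example below uses scipy's two-sample test on hypothetical score distributions.

```python
# Minimal sketch: two-sample KS statistic between the score distributions
# of the two classes; larger values indicate better class separation.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical predicted probabilities; in practice from clf.predict_proba
scores_normal = rng.beta(2, 5, 300)   # scores for normal permissions
scores_ransom = rng.beta(5, 2, 300)   # scores for ransomware permissions

stat, p_value = ks_2samp(scores_ransom, scores_normal)
print(f"KS statistic = {stat:.2f}")
```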
5.10. COMPARISON WITH PRIOR ART
Table 1 contrasts the performance of our proposed work with several comparable works. It demonstrates that, compared with earlier detection techniques, the identified structures of requested permissions and the characteristics used offer effective performance. The comparison also shows that the proposed method uses fewer features than other, comparable methods while still achieving high detection accuracy.
Table 1
Performance comparison with similar work (B = benign, R = ransomware, M = malware)

Detection Methods     | Type           | Accuracy | Dataset                | Balanced
Alzahrani et al. [19] | Static/Dynamic | 91%      | 200 B / 100 R          | No
Alsoghyer et al. [20] | Static         | 97%      | Not specified          | Yes
Singh et al. [23]     | Static         | 93.92%   | 1147 B / 905 M         | No
Proposed Work         | Static         | 97.30%   | 331-permission dataset | Yes