To provide a comprehensive overview of the LLMs use in the field of software vulnerabilities and cyber security, it is important to fully comprehend how these models are currently being applied, the challenges they face, and their potential. Thus, we aim to provide a systematic literature review of the application of LLMs in this fields. This study thus aims to answer the following research questions:
- RQ1: Why should we use LLMs for software vulnerabilities and cyber security threats detection?
- RQ2: What specific types of LLMs are utilized in software vulnerabilities and cyber security threats detection?
- RQ3: How can LLMs be used to detect and handle software vulnerabilities and cyber security threats?
- RQ4: What is the workflow of an LLM model in the context of detecting and handling software vulnerabilities and cyber security threats?
- RQ5: What is the best type of data sets to train LLMs for software vulnerability detection?
- In comparison to traditional methods/tools, how do LLMs perform in detecting and handling software vulnerabilities and cyber security threats?
- What metrics are used to assess LLMs in addressing software vulnerabilities and cyber threats?
- RQ8: What are the challenges of using LLMs in cybersecurity tasks?
- How to enhance LLM effectiveness in software vulnerability and cyber threat detection?
7.1 RQ1: Why should we use LLMs for software vulnerabilities and cyber security threats detection?
As businesses increasingly integrate with the digital realm, the landscape of cyber threats evolves rapidly, growing more intricate and severe. This surge in connectivity amplifies the urgency for robust cybersecurity measures. Amidst this challenge, emerging research highlights the potential of natural language processing (NLP) techniques in fortifying cyber defense mechanisms. Specifically, NLP applications exhibit promise in detecting vulnerabilities within software code, a critical aspect in preventing cyber-attacks. It's well-documented that software bugs serve as prime entry points for malicious actors, precipitating cyber crimes. Despite advancements, the persistence of software vulnerabilities remains evident, as illustrated by the annual update of the Common Vulnerabilities and Exposures (CVE) list. Traditional error identification methods, once relied upon, now face scrutiny due to their susceptibility to inaccuracies and misdiagnoses. This underscores the pressing need for innovative approaches, including those grounded in NLP, to enhance cyber resilience in an increasingly interconnected world. [17, 24]. Artificial intelligence's ability to process vast datasets in real-time, extract insights, and anticipate potential threats has the potential to revolutionize proactive cybersecurity measures [57]. By using large language models like GPT along with suitable datasets such as the SARD benchmark dataset and SeVC dataset, we can analyze to find weak code in different programming languages including C++/C and Java [59].
While machine learning has been used for vulnerability detection, traditional methods require complex and error-prone manual feature engineering to define what information the machine should analyze. Deep learning approaches using Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been explored, but they necessitate specifically formatted code data, which can be challenging. Giving rise to the Transformer-based neural network architectures, known for their success in natural language processing. The goal is to develop a model that automatically learns both syntactic (structure) and semantic (meaning) information from various programming languages, paving the way for even more powerful vulnerability detection systems using large language models. [17, 18, 24] Moreover, transformer-based language models hold appeal and promise over RNNs due to their capacity for parallelization in model computation, facilitating quicker processing compared to RNNs. This is crucial for decreasing the time required for model training and testing, especially when dealing with large-sized models like transformer-based ones. Additionally, their capability to transition from natural language processing tasks to related tasks, via transfer learning, allows for broadening their applicability across various domains. [22]
Leveraging transformer-based models such as GPT4 for vulnerability detection, presents numerous benefits, notably enhanced accuracy and natural language processing capabilities. These models obviate the necessity for manual input in static analysis tools, streamlining the detection process into a faster, more automated procedure. [17, 24]
RQ1 Answer: Large language models, like GPT-4, should be utilized for software vulnerabilities and cybersecurity threat detection due to their ability to automate the identification of code vulnerabilities across various programming languages. They offer enhanced accuracy and natural language processing capabilities, streamlining the detection process and facilitating quicker processing compared to traditional methods.
|
7.2 RQ2 What specific types of LLMs are utilized in software vulnerabilities and cyber security threats detection?
Numerous large language models (LLMs) have been explored for their potential in detecting software vulnerabilities and cybersecurity threats. These models include BERT, GPT-3, RoBERTa, XLNet, ALBERT, T5, BART, ELECTRA, Longformer, CodeBERT, GraphCodeBERT, CodeT5, PLBART, and others. These LLMs, with their ability to process and understand code, have been utilized in tasks such as vulnerability detection, malware classification, code summarization, and code generation, contributing to the advancement of cybersecurity research and practice.
BERT: Researchers are exploring the application of BERT and the adaptation of pretrained language models like CodeBERT, GraphCodeBERT across a range of cybersecurity domains, encompassing malware detection in Android apps, spam email identification, intrusion detection in automotive systems, and anomaly detection in system logs. The Bidirectional Encoder Representations from Transformers (BERT) model has garnered significant interest for its remarkable capability to grasp contextual nuances in text sequences. As a pretrained transformer-based language model, BERT has demonstrated remarkable proficiency in various NLP tasks. Its inherent ability to comprehend intricate dependencies and variations within sequences has spurred investigations into its utility for cyber threat detection. Leveraging BERT's contextual comprehension, security experts have devised novel approaches to address diverse cybersecurity challenges [4, 17, 24].
CodeBERT: Feng et al. [54] introduced CodeBERT, a pretrained model, trained on source codes from diverse programming languages. Through this pretraining, CodeBERT acquires an understanding of both programming languages (PL) and natural language (NL). This model is tailored to support various NL-PL applications, including natural language code search and code documentation generation. Built upon a transformer-based neural architecture, CodeBERT is trained using a specialized objective function that combines the task of detecting replaced tokens with pretraining. This methodology enables the utilization of both NL-PL pairs and unimodal data during training, with NL-PL pairs serving as input tokens and unimodal data enhancing the performance of the generators. CodeBERT's architecture comprises 12 layers and 125 million parameters.
FalconLLM: Falcon 40B and its more advanced counterpart, Falcon 180B, hold significant potential in identifying and managing software vulnerabilities and cybersecurity risks. With their autoregressive decoder-only architecture and impressive parameter counts of 40 billion and 180 billion (180B is trained on 3.5 trillion tokens), FalconLLM and stands out as a strong solution in incident response and recovery systems, as evidenced by its outstanding performance on the Open Source LLM Leaderboard. Leveraging advanced language comprehension capabilities, FalconLLM meticulously analyzes textual logs and incident reports, extracting relevant details and identifying patterns indicative of potential threats. Additionally, FalconLLM excels in evaluating the severity and potential impact of incidents, providing customized mitigation strategies and recovery plans to response teams. Its adaptive learning mechanism continuously integrates new data, enabling retrospective analysis of past incidents and iterative enhancement of response procedures. This proactive and iterative approach empowers organizations to mount quicker, more effective responses to cyber threats, thus minimizing potential damages [4].
SecureFalcon: Researchers trained a large language model called FalconLLM40B using an extensive training procedure on 384 GPUs (A100 40GB), giving rise to SecureFalcon, an innovative model architecture constructed upon the foundation of FalconLLM. The model’s training procedure incorporated a 3D parallelism strategy that involved tensor parallelism of 8 (TP=8), pipeline parallelism of 4 (PP=4), and data parallelism of 12 (DP=12). They employed specific hyperparameters to optimize the model’s learning process, balancing efficiency and precision [7].
OpenAI's GPT: GPT stands for "generative pre-trained Transformer," which is a family of Transformer-based models developed by OpenAI. These models have demonstrated remarkable performance across various language-related tasks. ChatGPT, a commercial product by OpenAI, is based on this family of large language models (LLMs). GPT-4, the latest model in this series, has achieved exceptional performance, surpassing humans in standardized exams and outperforming other models in academic benchmarks. ChatGPT, despite being trained primarily on human language, is capable of generating valid source code and assisting in debugging tasks, highlighting its versatility. Traditional rule-based static code analyzers, while effective in identifying software vulnerabilities, can sometimes miss nuanced or evolving threats due to their rule-based nature. Large Language Models (LLMs), such as OpenAI's ChatGPT, offer a novel approach to addressing this challenge by leveraging vast textual data to understand and generate code, potentially improving the identification and rectification of software vulnerabilities. [9, 13, 17, 24, 28, 38, 40].
RQ2 Answer: Various types of LLMs can be utilized in software vulnerabilities and cybersecurity threats detection include BERT, GPT, RoBERTa, XLNet, ALBERT, T5, BART, ELECTRA, Longformer, CodeBERT, GraphCodeBERT, CodeT5, PLBART, FalconLLM, SecureFalcon, and OpenAI's GPT family. These LLMs demonstrate proficiency in tasks such as vulnerability detection, malware classification, code summarization, and code generation, contributing significantly to cybersecurity research and practice.
|
7.3 RQ3. How can LLMs be used to detect and handle software vulnerabilities and cyber security threats?
We have discussed applications of LLMs in detecting and handling software vulnerabilities in Section 6. Now, in this section, through a literature review of past works, we will show you some examples where LLMs have been used for detection and handling of software vulnerabilities.
Authors in [4] have use SecurityBERT and FalconLLM parallel to each other to create a new cyber threat detection model called, SecurityLLM. SecurityBERT operates as a cyber threat detection mechanism, while FalconLLM is an incident response and recovery system. The integration of FalconLLM and Security-BERT, two distinct techniques, can improve the identification of network-based threats. SecurityLLM model can identify fourteen (14) different types of attacks with an overall accuracy of 98%.
Researchers have employed large language models, specifically, GPT3.5, to enhance the efficacy of penetration testing endeavors. This study focuses on two primary applications: leveraging these models for high-level strategic planning in security assessments and utilizing them for finding weak spots a simulated vulnerable computing environment. In the latter context, they have established a closed-feedback loop between the actions generated by the large language model and the vulnerable virtual machine accessed via Secure Shell (SSH). This framework enables the model to scrutinize the state of the virtual machine for vulnerabilities and propose specific attack vectors, subsequently executed automatically within the virtual environment [5].
SecureFalcon, a new model based on a fine-tuned version of FalconLLM, which can effectively differentiate vulnerable and non-vulnerable C code samples. Tested on a unique dataset Form AI, SecureFalcon achieved an impressive 94% accuracy rate, demonstrating its effectiveness. This approach not only reduces false positives compared to traditional static analysis methods, but also offers a groundbreaking solution for software vulnerability detection. Notably, SecureFalcon accomplishes this with a relatively small number of parameters (121 million and 44 million) [7].
Finding and fixing bugs in software code is a time-consuming task for developers, and automated program repair (APR) techniques aim to lessen this burden. Researchers in [8] proposes a new approach that builds on LLM-based repair techniques. It leverages a recently developed interactive prompting method called Tree of Thoughts (ToT). request a Large Language Model (LLM), GPT-4, to suggest various potential locations for a software bug. By gathering and analyzing the model's collective responses, they then prompt it to provide suggestions for fixing the identified bugs. An initial assessment indicates that their method successfully resolves several intricate bugs that were previously unsolved by GPT-4, even when considering prompt customization.
Authors of [9] examined the effectiveness of Large Language Models (LLMs), specifically focusing on OpenAI's GPT-4, in identifying software vulnerabilities compared to traditional static code analyzers like Snyk and Fortify. The evaluation encompassed various repositories, including those from NASA and the Department of Defense. Findings revealed that GPT-4 detected approximately four times more vulnerabilities than its counterparts and proposed viable solutions for each, demonstrating a low false positive rate. Analysis of 129 code samples across eight programming languages highlighted PHP and JavaScript as having the highest vulnerability rates. GPT-4's suggested code modifications resulted in a significant 90% decrease in vulnerabilities, with a minimal 11% increase in code lines. Notably, the study emphasized LLMs' capability for self-auditing and providing potential fixes for identified vulnerabilities, underscoring their precision.
In [10] the authors goal was to investigate the potential of these LLMs for zero-shot vulnerability repair. They used several Large Language Models (LLMs) for their experiments. Specifically, OpenAI’s Codex and AI21’s Jurassic J-11. They also evaluated the performance of five commercially available, black-box, “off-the-shelf” LLMs, as well as an open-source model and their own locally-trained model. LLMs exhibited promise in addressing real-world code vulnerabilities, successfully repairing projects, a performance comparable to that of the state-of-the-art repair tool ExtractFix. However, the quality of repairs varied, with some fixes effectively resolving bugs while others introduced new issues or appeared implausible. Manual inspection of highly-rated repairs indicated that a notable portion of the "successful" fixes might be unreliable. Despite operating in a zero-shot setting without specific training for repair tasks and limited context from prompts, LLMs performed admirably, even outperforming ExtractFix in certain scenarios. However, LLMs faced challenges when addressing vulnerabilities requiring extensive code changes or complex semantic modifications, highlighting limitations in their understanding and context comprehension.
The authors in [12] propose leveraging Large Pre-Trained Language Models (PLMs) for Automated Program Repair (APR) to address limitations in traditional and learning-based APR techniques. They emphasize the potential of PLMs, trained on vast amounts of text/code tokens, to generate patches without relying on bug-fixing datasets. The study evaluates 9 recent state-of-the-art PLMs across different repair settings and programming languages, demonstrating their effectiveness in fixing real-world bugs. The research highlights the scalability of PLMs, with larger models generally achieving better performance. Furthermore, the authors explore the importance of suffix code in infilling-style APR and suggest practical guidelines for improving PLM-based APR, such as increasing sample size and incorporating fix template information.
In their study [13], the authors evaluate four LLMs (GPT-3.5-Turbo and GPT-3.5-Turbo-0613 for direct prompting, Davinci, and Codegen-2B-multi for fine-tuning.) on two software vulnerability (SQL injection and buffer overflow) detection tasks. To simulate a real-world scenario, the researchers designed their experiments such that a developer submits a code excerpt to an LLM, prompting it to identify potential security vulnerabilities. Both fine-tuned and zero-shot models are evaluated to simulate a variety of real-world situations. Table II shows the result of their work. We can clearly see that LLMs outperform other methods, but they also have a higher FPR.
Table II: Comparing different approaches based on the Code Gadget Database [13]
System
|
Technique
|
FPR (%)
|
FNR (%)
|
TPR (%)
|
P (%)
|
F1 (%)
|
Flawfinder
|
Static Analysis
|
44.7
|
69.0
|
31.0
|
25.0
|
27.7
|
RATS
|
Static Analysis
|
42.2
|
78.9
|
21.1
|
19.4
|
20.2
|
Checkmarx
|
Static Analysis
|
43.1
|
41.1
|
58.9
|
39.6
|
47.3
|
VulDeePecker
|
Deep Learning
|
5.7
|
7.0
|
93.0
|
88.1
|
90.5
|
CodeGen
|
LLM
|
68.67
|
32
|
68
|
49.75
|
57.46
|
Davinci
|
LLM
|
62.67
|
6
|
94
|
60
|
73.23
|
Ensembled CodeGen+Davinci
|
LLM
|
74.22
|
3.96
|
96.04
|
57.4
|
71.85
|
Authors of [14] tried to tackle the challenge of detecting logic vulnerabilities in smart contracts, which have resulted in significant financial losses. They identify a gap in existing analysis tools, which struggle to audit about 80% of Web3 security bugs due to a lack of domain-specific property description and checking. To address this issue, the authors propose GPTScan, a tool that combines GPT4 with static analysis for smart contract logic vulnerability detection. Unlike existing approaches that solely rely on GPT and suffer from high false positives, GPTScan utilizes GPT4 as a versatile code understanding tool. It breaks down each logic vulnerability type into scenarios and properties, enabling GPTScan to match candidate vulnerabilities with GPT4 and instruct it to intelligently recognize key variables and statements. Evaluation on diverse datasets demonstrates that GPTScan achieves high precision and recall rates, effectively detecting ground-truth logic vulnerabilities, including new ones missed by human auditors. Furthermore, GPTScan is shown to be fast, cost-effective, and capable of reducing false positives through static confirmation. We will explain the workflow of GPTScan in the next section.
In [15], the authors used five popular Large Language Models of Code (LLMCs) with representative pre-training architectures. These models include: CodeBERT, GraphCodeBERT, PLBART, CodeT5, and UniXcoder. The authors used these models in the context of Automated Program Repair (APR). They considered three typical program repair scenarios involving three programming languages (Java, C/C++, and JavaScript). They took into account both single-hunk and multi-hunk bugs/vulnerabilities. The LLMCs were fine-tuned on widely-used datasets (BFP, SequenceR, CPatMiner, VulRepair, and TFix) and compared with existing state-of-the-art APR tools. The authors also investigated the impact of different design choices, which include code abstractions, code representations, and model evaluation metrics.
The study found that LLMCs in the fine-tuning paradigm can significantly outperform previous state-of-the-art APR tools. The authors provided insights into choosing appropriate strategies to guide LLMCs for better performance. They also revealed several limitations of LLMCs for APR and made suggestions for future research on LLMC-based APR.
Researchers in [17, 24] Created VulDetect, which is developed as a classification model based on the large language model GPT-2. utilize a significantly large dataset and explore various model architectures, including MegatronBERT and GPT-2. They also incorporate code gadgets for input data, instead of removing labels and comments from the source file.
Authors of [53] utilized the BERTBase model for detecting software vulnerabilities. They fine-tuned it using a dataset comprising 100,000 C/C++ source files and evaluated its performance with 123 vulnerabilities. Comparing it to standard LSTM and BiLSTM models, they found that the BERTBase model and BERT with RNN heads surpassed the performance of the conventional models. Their dataset and model achieved the highest detection accuracy of 93.49%.
Researchers in [18] address the longstanding goal of repairing software bugs using automated solutions, particularly focusing on the task of vulnerability repair. They note that while some automated program repair (APR) tools leverage natural language processing (NLP) techniques, the significant differences between natural languages (NL) and programming languages (PL) may hinder their effectiveness in handling PL tasks. Moreover, existing tools primarily focus on bug repair tasks, with limited exploration into vulnerability repair. To tackle these issues, the authors propose leveraging large-scale pre-trained PL models, such as CodeBERT and GraphCodeBERT, specifically tailored for vulnerability repair based on PL characteristics.
The authors explore the real-world performance of state-of-the-art data-driven approaches for vulnerability repair using these pre-trained PL models. Their approach involves fine-tuning the pre-trained models for vulnerability repair tasks, allowing them to better capture PL features and handle multi-line vulnerability repair scenarios. Through their experimentation, they demonstrate that their approach achieves advanced results, with high accuracy rates for both single-line and multi-line vulnerability repair tasks. Specifically, their solution achieves a maximum accuracy of 95.47% for single-line vulnerability repair and 90.06% for multi-line vulnerability repair. They also evaluate their approach across various types of vulnerabilities, including CWE-121/190/369/401/457, and compare its performance with existing APR tools such as Tufano et al., DLFix, SequenceR, and CoCoNut, highlighting its effectiveness and generalization capabilities.
Researchers in [21] address the evolving cyber-attack landscape faced by enterprises, caused by new vulnerabilities and attack techniques. They emphasize the necessity for security management tools to accurately assess cyber-risks by identifying associations among attack techniques, weaknesses, and vulnerabilities. Existing repositories often lack completeness and rely on manual interpretations, which are slow and ineffective. To address these challenges, the authors propose a framework called VWC-MAP (Vulnerabilities and Weakness to Common Attack Pattern Mapping). VWC-MAP leverages natural language processing (NLP) techniques to automatically associate vulnerabilities with relevant attack techniques based on their textual descriptions. The framework employs a two-tiered classification approach, classifying vulnerabilities to weaknesses and weaknesses to attack techniques.
The authors introduce novel automated approaches for mapping weaknesses to attack techniques, utilizing Text-to-Text and link prediction techniques. They enhance the scalability of existing tools like V2W-BERT, which maps vulnerabilities to weaknesses, using Distributed Data-Parallel (DDP) technique for faster training. For associating weaknesses to attack patterns, they employ a Text-to-Text model (Google T5) and incorporate link prediction techniques considering the hierarchical relationships of attack patterns. Experimental results demonstrate the effectiveness of VWC-MAP in associating vulnerabilities to weakness-types and weaknesses to new attack patterns with high accuracy. This work contributes a comprehensive automated mapping of CVE-CWE-CAPEC associations, facilitated by large language models, aiming to impact both research and practical applications in cyber-defense.
The author's work in [25] focuses on enhancing software vulnerability detection through deep learning methods, specifically addressing the challenge of detecting vulnerabilities across long code slices with contextual dependencies. They introduced VulD-Transformer, a novel approach utilizing Transformer models tailored for code slice-level vulnerability detection. Unlike previous methods, VulD-Transformer aims to capture remote contextual dependencies between code statements effectively. To achieve this, the authors first extract code slices containing data and control dependencies using vulnerability syntax features and Program Dependency Graphs (PDGs). Then, they design a Transformer-based vulnerability detection model to enhance feature learning, particularly for remote code statements.
The experimental evaluation on synthetic and real datasets demonstrates the effectiveness of VulD-Transformer compared to existing approaches, showcasing improvements in accuracy, recall, and F1-measure, especially for code slices longer than 256 tokens. In terms of performance, the authors states that compared to the VulDeePecker, SySeVR-BGRU, SySeVR-ABGRU, and Russell approaches, VulD-Transformer achieves 6.12%, 8.01%, and 7.63% improvement on average.
Research done in [28] aim to assess the effectiveness of vulnerability detection using ChatGPT 4 by exploring various prompt designs tailored specifically for this purpose. They propose improvements to the basic prompt and incorporate structural and sequential auxiliary information from the source code to enhance ChatGPT's vulnerability detection capabilities. Leveraging ChatGPT's ability to remember multi-round dialogue, they introduce a chain-of-thought prompting approach to further improve detection performance. The study involves extensive experimentation on two vulnerability datasets, where they evaluate the effectiveness of prompt-enhanced vulnerability detection using ChatGPT. Additionally, the authors analyze the strengths and weaknesses of using ChatGPT for vulnerability detection, providing insights into the potential of prompt engineering for large language models (LLMs) in this domain. The paper outlines a workflow starting from prompt design enhancements to experimental validation and concludes with discussions on the implications and validity threats of their findings.
The study presents several key findings regarding the performance and capabilities of ChatGPT in vulnerability detection. Firstly, it demonstrates that ChatGPT outperforms two baseline methods (CFGNN [71] and Bugram [32]) in terms of both accuracy and coverage, indicating its effectiveness in identifying vulnerabilities within code snippets.
The inclusion of a task role in the prompt has shown potential to enhance ChatGPT's performance in vulnerability detection, albeit with programming-language-specific improvements. However, the use of a simple basic prompt leads to ChatGPT's response being biased towards the keywords present in the prompt, affecting its ability to provide comprehensive vulnerability detection, especially in C/C++ programs.
The study reveals that while ChatGPT exhibits better proficiency in identifying vulnerabilities in Java programs compared to C/C++ programs with the basic prompt, it struggles to comprehend vulnerabilities comprehensively across both languages. The effectiveness of incorporating different auxiliary information varies between programming languages, with API calls being more effective for Java functions and data flow information contributing slightly to the understanding of C/C++ vulnerable programs. Furthermore, the application of chain-of-thought prompting yields differing effects on Java and C/C++ datasets, with significant improvements observed in the latter but a degradation in detection performance noted in the former. Despite this, ChatGPT demonstrates accurate understanding of code functionality in vulnerability detection scenarios.
Additionally, augmenting prompts with high-quality code summaries has been found to enhance ChatGPT's detection performance, although the impact varies depending on the programming language. Moreover, strategically placing API calls before the code and data flow information after the code has been shown to improve performance, with API call information contributing more to correct predictions of non-vulnerable samples and data flow information aiding in accurate predictions of vulnerable samples.
Lastly, ChatGPT exhibits proficiency in detecting vulnerabilities related to grammar or certain boundary-related types, but struggles with types that are contextually irrelevant or require a deeper understanding of the context. Overall, these findings highlight both the strengths and limitations of ChatGPT in vulnerability detection and offer insights into optimizing its performance through prompt design and auxiliary information incorporation.
In [64] The authors conducted a thorough survey of a wide range of LLMs (include GPT-4, Gemini 1.0 Pro, Wizard Coder, Code LLAMA, GPT-3.5, Mixtral-MoE, Mistral, StarCoder, LLAMA 2, StarChat-β, and MagiCoder) and prompts in various scenarios of vulnerability detection. They analyze a larger number of LLM responses with multiple raters compared to previous studies. Secondly, they investigated whether LLMs can accurately identify the types, locations, and causes of vulnerabilities, akin to industry-standard static analysis-based detectors. Additionally, they pinpoint the capabilities and code structures with which LLMs struggle.
The study conducted a comprehensive assessment of Language Model-based vulnerability detection, focusing on both its performance and error analysis. Results revealed that the models exhibited only marginal improvement over random guessing, with balanced accuracy ranging from 0.5 to 0.63. Notably, the models struggled to distinguish between buggy and fixed code versions, often making identical predictions for 76% of pairs. To address these issues, the paper introduced CoT, a method combining Static Analysis and Contrastive pairs, which showed enhancements in certain model performances. Error analysis indicated that Language Models frequently made errors in Code Understanding, Common Knowledge, Hallucination, and Logic when explaining vulnerabilities, with 57% of responses containing errors. Additionally, when subjected to complex debugging tasks from DbgBench, Language Models significantly underperformed compared to humans, accurately identifying only 6 out of 27 bugs. These findings underscore the limitations of Language Models in vulnerability detection, and the dataset of errors identified offers insights for potential enhancements in future Language Model-based vulnerability detection methods.
RQ3 Answer: Large language models (LLMs) are employed in diverse ways to detect and handle software vulnerabilities and cybersecurity threats. Examples include the creation of new cyber threat detection models like SecurityLLM or SecureFalcon for differentiating vulnerable code samples, and leveraging LLMs such as GPT-4 for automated program repair. These approaches demonstrate the versatility and effectiveness of LLMs in enhancing cybersecurity practices.
|
RQ4. What is the workflow of an LLM model in the context of detecting and handling software vulnerabilities and cyber security threats?
The potential for leveraging Large Language Models (LLMs) in the domain of software security and vulnerability is substantial. To enhance our readers' understanding, we will elucidate the workflow of several researchers who have employed LLMs for this specific purpose.
Automated code repair framework: The authors in [2], introduced a novel model that merges the capabilities of Large Language Models (LLMs) with Formal Verification strategies, presenting an automated code repair framework illustrated in Fig. 2. In this method, users input a test code to the Bounded Model Checker (BMC) module for initial verification or falsification. If the initial verification proves unsuccessful, the original code, along with details of the property violation generated by the BMC module, is transferred to the LLM module. The LLM module then generates modified code, which undergoes another round of verification by the BMC module in an iterative fashion. This collaborative process enables the model to automatically verify and repair software vulnerabilities, integrating formal verification techniques with the capabilities of Large Language Models.
Assessing over 1000 specifically generated C programs for this study, the results demonstrate that the integration of Bounded Model Checking (BMC) and Large Language Models (LLM) effectively identifies software vulnerabilities and suggests corrective patches. Notably, the devised approach exhibits the ability to rectify vulnerable code, addressing issues such as buffer overflow and pointer dereference failures with a commendable success rate of up to 80%.
SecurityLLM: In figure 3 we have the workflow of SecurityLLM [4]. It involves two primary components: SecurityBERT and FalconLLM. Initially, SecurityBERT functions as a cyber threat detection mechanism by collecting cybersecurity data from various open-source databases and repositories. It then extracts relevant features from network traffic logs, transforms the data into textual representation using Fixed-Length Language Encoding (FLLE), and tokenizes it using ByteLevelBPETokenizer. Subsequently, SecurityBERT embeds the data using a transformer-based architecture, leveraging self-attention mechanisms to capture contextual representations of the text. Meanwhile, FalconLLM serves as an incident response and recovery system, integrating with SecurityBERT to enhance the identification of network-based threats. Together, these techniques enable SecurityLLM to detect and manage cybersecurity threats effectively.
GPT-4 as a code analyzer: Authors of [9] utilized the latest OpenAI models, particularly GPT-4, accessed through a chat interface with a system context set to “act as the world's greatest static code analyzer for all major programming languages. I will give you a code snippet, and you will identify the language and analyze it for vulnerabilities. Give the output in a format: filename, vulnerabilities detected as a numbered list, and proposed fixes as a separate numbered list.” Seven different LLMs from OpenAI are employed, ranging in parameter sizes from 350M to 1.7 trillion, covering a wide spectrum of capabilities. The models are queried automatically using the API to identify vulnerabilities and propose fixes in sample code snippets across eight popular programming languages (C, Ruby, PHP, Java, Javascript, C#, Go, and Python). Additionally, a Single Codebase of Security Vulnerabilities is utilized, consisting of 128 code snippets representing thirty-three vulnerable categories across different programming languages. Six public repositories from GitHub are submitted to the automated static code scanner, Snyk, to illustrate identifiable vulnerabilities and language problems addressed by LLMs. Snyk provides comprehensive vulnerability intelligence metrics for evaluation. The study concludes by submitting corrected code samples from GPT-4 to Snyk for comparison against the vulnerable codebase, aiming to assess the self-correction capabilities of LLMs objectively validated by a third-party static code scanner.
The study's results indicated that both the static code analyzer (HP Fortify) and the Large Language Model (LLM) OpenAI's GPT-4 2023AUG3 version successfully identified three vulnerabilities in an Objective-C method. However, the LLM provided a detailed explanation of the vulnerabilities and proposed three fixes for each, enhancing the understanding and mitigation process. Comparing vulnerability detection between GPT-3 and Snyk, GPT-3 identified significantly more vulnerabilities, with a low false positive rate observed. Furthermore, GPT-4 identified nearly twice as many vulnerabilities as GPT-3 and four times as many as Snyk, indicating its effectiveness in vulnerability detection. GPT-4 proposed a comparable number of code fixes to the identified vulnerabilities, supporting its reliability in addressing security flaws.
The first step in the GPTScan model is to break down each logic vulnerability type into scenarios and properties. This involves understanding the context and characteristics of different vulnerabilities, which can range from reentrancy to timestamp dependence. Once these vulnerabilities are broken down, GPTScan then matches these candidate vulnerabilities with the Generative Pre-training Transformer (GPT).
GPTScan: In figure 4 we can see a high-level overview of GPTScan workflow [14]. GPTScan's workflow begins with a thorough analysis of the smart contract project. It first parses the project, which can consist of standalone Solidity files or complex frameworks containing multiple Solidity files. Through call graph analysis, GPTScan identifies the functions that are reachable within the project, considering both direct accessibility and potential indirect access via other functions.
Once the candidate functions are identified, GPTScan employs a multi-dimensional filtering approach to narrow down the functions for further analysis. This filtering process is essential to manage the complexity of large projects and to focus on functions that are most likely to contain vulnerabilities. It includes project-wide file filtering, which excludes non-Solidity files and third-party library files, and filtering out functions from common libraries like OpenZeppelin to reduce false positives.
After the initial filtering, GPTScan matches candidate functions with pre-abstracted scenarios and properties of relevant vulnerability types using Generative Pre-training Transformer (GPT). Unlike existing approaches that rely on high-level vulnerability descriptions, GPTScan breaks down vulnerabilities into code-level scenarios and properties. This approach enables GPT to interpret code-level semantics directly, improving the accuracy of vulnerability detection.
Once potential vulnerabilities are identified through GPT matching, GPTScan proceeds to recognize key variables and statements within the matched functions using GPT. These variables and statements are then subjected to static analysis modules for further validation. The static analysis tools employed by GPTScan include methods such as static data flow tracing, value comparison checks, order checks, and function call argument checks. These techniques help confirm the existence of vulnerabilities by analyzing the data flow, value comparisons, execution order, and function call arguments within the code.
Throughout the workflow, GPTScan addresses three main challenges: handling complex project structures, enabling effective GPT recognition, and ensuring reliable confirmation of potential vulnerabilities. By employing multi-dimensional function filtering, breaking down vulnerabilities into scenarios and properties, and utilizing static confirmation techniques, GPTScan achieves high precision in detecting logic vulnerabilities in smart contracts.
The evaluation of GPTScan reveals a low false positive rate of 4.39% when analyzing non-vulnerable top contracts like Top200. Additionally, it demonstrates a precision of 90.91% when assessing DefiHacks, indicating its suitability for extensive scanning of on-chain token contracts. Moreover, even when scrutinizing sizable contract projects within Web3Bugs, GPTScan maintains a satisfactory precision of 57.14%. These results, presented in Table III, provide insights into GPTScan's performance in identifying false positives and its precision across various contract datasets.
Table III: GPTScan False Positive Rate Analysis Results
Dataset Name
|
TP
|
TN
|
FP
|
FN
|
Sum
|
Top200
|
0
|
283
|
13
|
0
|
296
|
Web3Bugs
|
40
|
154
|
30
|
8
|
232
|
DefiHacks
|
10
|
19
|
1
|
4
|
34
|
LLMs for APR: The workflow for fine-tuning LLMs for Automated Program Repair (APR) [15], as shown in figure 5, involves a series of steps aimed at optimizing the model's ability to understand and generate code fixes. Initially, data pre-processing transforms raw source code into a format suitable for LLM processing, employing techniques like code abstraction and different code representations to enhance the model's understanding of fixing patterns. Subsequently, model training and tuning extend LLMs into the Neural Machine Translation (NMT) architecture, focusing on encoder-only and encoder-decoder models for their superior performance. Through iterative training on the dataset, the model learns domain knowledge for defect repair. During this process, checkpoints are evaluated using various metrics to identify the best-performing model for patch generation.
Following model evaluation, the patch generation phase employs the beam search strategy to synthesize patches from multiple repair models. Plausible patches are filtered using test cases, and manual validation is conducted to assess the correctness of the generated patches. This validation process helps ensure the accuracy of the generated fixes. Overall, this workflow aims to systematically explore preprocessing techniques, model architectures, evaluation metrics, and patch generation strategies to enhance the LLM's capability for automated program repair.
The paper demonstrates that Large Language Models of Code (LLMCs) exhibit strong repair capabilities across various scenarios under the Neural Machine Translation (NMT) fine-tuning paradigm. Even without employing post-processing strategies, LLMCs achieve impressive results, surpassing many existing Automated Program Repair (APR) approaches. The study provides practical guidelines for optimizing LLMC designs to enhance their repair capabilities and addresses complex defects effectively. Despite these successes, the paper also identifies limitations during evaluation, highlighting areas for improvement and suggesting future research directions. Additionally, the results presented in the paper establish valuable benchmarks for subsequent research in LLMC-based APR, underscoring the significant potential of LLMCs for practical application in software repair tasks.
VulDetect: In figure 6 we have the workflow of VulDetect [17, 24]. The authors introduce, a classification model designed for automatic software vulnerability detection, leveraging the extensive language model GPT-2. VulDetect employs a fine-tuned GPT-2 model to identify vectors associated with vulnerable code segments extracted from the target source code. Their Natural Language Processing (NLP) vulnerability model accepts a lengthy character string as input, typically a C source file. In the subsequent step, a tokenizer partitions the string into individual words and sub-words. Notably, syntax characters such as periods, semicolons, parentheses, and brackets are treated as distinct entities. Following this, the encoder transforms these words into vector representations. These vectorized words are then inputted into the model either sequentially (token by token) or in larger segments. In the implementation of this vulnerability detection technique, the output vector corresponds to the number of vulnerability classes present in the training dataset. Specifically, with 124 unique vulnerability classes, the output vector has a dimension of 124. This output vector is further processed through a Softmax function, which normalizes it into a probability distribution where the sum of all probabilities equals 1. Each element within the output vector indicates the predicted probability of the corresponding vulnerability class being present in the analyzed code file.
In the defense performance evaluation, the researchers opted to assess the efficacy of their technique by conducting tests on three classifiers (GPT-2, CodeBERT, and LSTM) using two standardized benchmark datasets (SARD and SeVC). The objective was to gauge the ability of their technique in detecting vulnerabilities. Results presented in Table IV illustrate a notably higher classification accuracy when implementing the VulDetect technique. Notably, the GPT-2 classifier demonstrated the highest accuracy, reaching up to 92.59% when tested on the SARD dataset, while the LSTM classifier exhibited the lowest accuracy, achieving only 65.78% when evaluated on the SeVC dataset. Overall, GPT-2 consistently outperformed the other architectures (CodeBERT and LSTM), which is in line with its reputation as a transformer-based model renowned for achieving state-of-the-art performance in various linguistic tasks, including sentiment analysis and sentence classification.
Table IV: Comparison of Model Detection Accuracy on SeVC and SARD Datasets.
Dataset
|
Model
|
Detection Accuracy
|
SeVC
|
GPT-2
|
87.63
|
|
CodeBERT
|
83.23
|
|
LSTM
|
71.59
|
SARD
|
GPT-2
|
92.59
|
|
CodeBERT
|
89.28
|
|
LSTM
|
76.86
|
Vulnerability repair solution: The workflow of vulnerability repair solution created in [19] is shown in Fig. 7. It employs pre-trained programming language (PL) models for automated vulnerability repair. Initially, the vulnerability code and its corresponding fixed code are extracted as bug-fix pairs (BFPs), serving as training data. Subsequently, the pre-trained PL model undergoes fine-tuning for the downstream task of vulnerability repair. Throughout the experiments, an optimal model is identified based on the evaluation using BLEU scores. In the experimental testing phase, the vulnerability code, comprising the code at the vulnerability location along with its contextual content, is inputted to enable the model to generate the predicted fix code.
Due to constraints regarding space and time, the authors opt to employ BERT-style pre-trained models for their current experiments. Initial investigations into these models reveal that not all BERT-style variants align with the automated program repair (APR) task at hand. For instance, CuBERT, trained solely on the Python corpus, proves unsuitable for the authors' C-based test dataset. Eventually, CodeBERT and GraphCodeBERT emerge as potential candidates for the APR task. Both models support multiple programming languages and offer downstream tasks akin to vulnerability repair. Notably, GraphCodeBERT integrates data flow information, enhancing its ability to manage intricate code logic relationships. This feature facilitates a comparative analysis between GraphCodeBERT and CodeBERT, enabling an evaluation of the advantages conferred by data flow features.
The experimental outcomes presented in table V, demonstrate the effectiveness of different repair solutions in addressing vulnerabilities. The repair accuracy, representing the repair capability of each solution, is analyzed alongside the percentage of successfully repaired vulnerabilities out of the total number. Notably, SequenceR emerges as the top performer in single-line vulnerability repair, closely followed by the authors' approach, which achieves comparable results. Specifically, CodeBERTFix and GraphCodeBERTFix exhibit high repair accuracy rates of 95.47% and 94.04% on single-line repairs, respectively. Moreover, the authors' approach excels in multi-line vulnerability repair, surpassing the performance of other solutions such as DLFix, SequenceR, CoCoNut, and the approach by Tufano et al. GraphCodeBERT, leveraging data flow graph information, demonstrates superior capability in multi-line repair compared to CodeBERT.
Table V: Experiment results
|
CWE 121
|
CWE 190
|
CWE 369
|
CWE 401
|
CWE 457
|
ALL
|
|
Single
|
Multi
|
Single
|
Multi
|
Single
|
Multi
|
Single
|
Multi
|
Single
|
Multi
|
Single
|
Multi
|
Tufano et al.
|
69/198
|
256/451
|
136/627
|
124/412
|
31/102
|
7/74
|
37/54
|
64/134
|
111/144
|
30/46
|
384/1125
|
481/1117
|
DLFix
|
70/198
|
0/451
|
351/627
|
0/412
|
33/102
|
0/74
|
48/54
|
0/134
|
76/144
|
0/46
|
578/1125
|
0/1117
|
SequenceR
|
183/198
|
0/451
|
623/627
|
0/412
|
100/102
|
0/74
|
51/54
|
0/134
|
136/144
|
0/46
|
1093/1125
|
0/1117
|
CoCoNut
|
176/198
|
0/451
|
410/627
|
0/412
|
82/102
|
0/74
|
49/54
|
0/134
|
112/144
|
0/46
|
829/1125
|
0/1117
|
CodeBERT
|
178/198
|
395/451
|
620/627
|
333/412
|
97/102
|
54/74
|
52/54
|
90/134
|
127/144
|
34/46
|
1074/1125
|
906/1117
|
GraphCodeBERT
|
181/198
|
410/451
|
600/627
|
379/412
|
91/102
|
60/74
|
48/54
|
124/134
|
138/144
|
33/46
|
1058/1125
|
1006/1117
|
VWC-MAP: The VWC-MAP [21] (Vulnerabilities-Weakness-Common Attack Pattern Mapping) is a two-tiered framework designed for automated classification of vulnerabilities into attack patterns via weaknesses based on their text descriptions.
First Tier - Classifying Vulnerabilities to Weakness: In the first tier, the model classifies vulnerabilities to weaknesses. This means it maps a given vulnerability to the corresponding weakness type. This is done by applying natural language processing (NLP) techniques to the text descriptions of the vulnerabilities.
Second Tier - Classifying Weakness to Attack Techniques: In the second tier, the model classifies weaknesses to attack techniques. This means it maps a given weakness to the corresponding attack technique. This is also done by applying NLP techniques to the text descriptions of the weaknesses.
The VWC-MAP framework uses CWE as an intermediate step because most CAPECs focus on exploiting CWEs, while CVEs are real-world instances of CWEs.
The authors have also presented two novel automated approaches for mapping weakness to attack techniques by applying Text-to-Text and link prediction techniques.
Link Prediction network: The Link Prediction network (shown in figure 8) in the VWC-MAP framework employs a Siamese architecture of a Neural Network. It begins by taking TF-IDF vectors of the CWE and CAPEC as input features. These features are then transformed into new dimensional vectors by a Feature Transformer Network. The transformed vectors are combined using a concatenation of feature subtraction and multiplication operations. The combined representation is then fed into a Link Classifier Network, which makes a binary prediction about the associations. Although pre-trained language models like BERT could be used as the Feature Transformer Network, the authors found that a basic Neural Network model quickly overfits the given data based on a few keywords. Therefore, it does not provide any extra benefit with the added complexity of BERT.
Text-to-text model: The task of mapping CWE to CAPEC can be conceptualized as a text generation challenge. The text-to-text model, known as Google’s T5, offers an alternative to link prediction for CWE-CAPEC mapping. While link prediction methods depend on negative examples and keyword-based decisions, T5 leverages transfer learning capabilities to generate CAPEC descriptions directly from CWE text descriptions. This task is more challenging as it requires generating entire sequences of CAPEC text, demanding a deeper understanding than keyword-based decision-making.
During training, T5 utilizes attention networks to prioritize keywords and understand their relationship with CWEs and CAPECs. Handling many-to-many relationships between CWEs and CAPECs presents a challenge, which is addressed by incorporating special commands during training. These commands, such as 'One Weakness to Attack' and 'Two Weakness to Attack', aid in modeling the relationships between CWEs and multiple CAPECs. Additionally, relationships between CWEs themselves are modeled using commands like 'Weakness Child of Weakness'. During inference, commands are passed along with CWE descriptions to generate corresponding CAPEC descriptions. The generated texts are then vectorized and compared with existing CAPECs to find the best match using cosine similarity. The T5 model's capabilities extend to few-shot learning and allow for the generation of CAPEC definitions corresponding to user-specified CWE descriptions.
The experimental results, which were cross-validated by cybersecurity experts, demonstrate that VWC-MAP can associate vulnerabilities to weakness types with up to 87% accuracy, and weaknesses to new attack patterns with up to 80% accuracy.
In summary, VWC-MAP is a novel framework that automatically maps CVEs to CWEs and CAPECs. This is highly beneficial for cyber risk management tools that require automated association among CVEs, CWEs, and CAPECs to cope with the rapid emergence of new vulnerabilities, weaknesses, or attack techniques.
VulD-Transformer: VulD-Transformer [25], as shown in Figure 9, is a comprehensive framework for source code vulnerability detection. It comprises four main components: the input module, code parser, vulnerability detector, and output module. Each part plays a crucial role in the detection process.
1. Input Module: This module accepts the source code as input and initiates the detection process.
2. Code Parser: The code parser processes the source code according to specific syntax rules to generate code slices, which are segments of code containing potential vulnerabilities. This process involves several steps:
2.1. Generating a program dependency graph (PDG) using the Joern tool.
2.2. Identifying vulnerability candidates within the PDG based on predefined syntax rules.
2.3. Generating code slices from the identified vulnerability candidates.
2.4. Cleaning and normalizing the code slices to remove irrelevant characters and standardizing identifiers and variable names.
3. Vulnerability Detector: The vulnerability detector assesses whether the code slices contain vulnerabilities. It includes two main operations:
3.1. Vector conversion: Utilizing FastText, the code slices are converted into vectors. FastText's character-level embedding allows capturing subtle relationships between identifiers in the source code, enhancing the detection accuracy.
3.2. Vulnerability detector: The vulnerability detector learns the vector representation of the code slices, obtains the global features of the code slice, and obtains the final vulnerability detection result at the last layer. The vulnerability detection model leverages the Transformer architecture to learn vector representations of the code slices. Only the encoding part of the standard Transformer structure is used in the vulnerability detection task in this model. The decoder, which belongs to the generative model often used in natural language generation, is not used.
This model comprises several key components:
3.2.1. Position encoding: Adding positional information to the code slice vectors to preserve their sequential order.
3.2.2. Multi-head attention mechanism: Capturing long-distance dependencies between code tokens by computing mutual attention.
3.2.3. Feed-forward layer: Consisting of fully connected layers with ReLU activation to extract higher-level features from the code slice representations.
3.2.4. Add & Norm layer: Adding and normalizing the output of the attention and feed-forward layers.
3.2.5 Fully connected layer with softmax activation: Producing the final vulnerability detection results, categorizing code slices as either non-vulnerable or vulnerable.
4. Output Module: This module presents the vulnerability detection results to the user, indicating whether vulnerabilities are detected in the analyzed source code.
The framework's effectiveness relies on its ability to accurately generate code slices and leverage advanced deep learning techniques, such as the Transformer architecture, to detect vulnerabilities within these slices. By integrating techniques for code parsing, vectorization, and vulnerability detection, VulD-Transformer offers a comprehensive solution for identifying vulnerabilities in source code, contributing to enhanced software security.
The authors investigated the effectiveness of VulD-Transformer for vulnerability detection and whether incorporating FastText word vectors and a Transformer encoder improves its performance. To assess this, they employed evaluation metrics including Accuracy (A), Recall (R), and F1-measure (F1).
In RQ1, experiment 1 assessed vulnerability detection across various code slice lengths, where VulD-Transformer excelled, particularly in longer slices (> 128 tokens), showcasing its adeptness at learning contextual information. Experiment 2 investigated vulnerability detection for different syntax rules, with VulD-Transformer showcasing superior performance, notably on AE and PU datasets. Experiment 3 tested VulD-Transformer's detection capability on real software vulnerability datasets, achieving higher accuracy, recall, and F1-measure compared to other methods (VulDeePecker, SySeVR-BGRU, SySeVR-ABGRU, and Russell).
In RQ2, the impact of incorporating FastText word vectors and a Transformer encoder was examined, revealing improvements in VulD-Transformer's vulnerability detection. Models utilizing FastText vectors exhibited enhanced accuracy, recall, and F1-score, especially when combined with the Transformer encoder. In conclusion, VulD-Transformer emerges as an effective vulnerability detection method, particularly adept at longer code slices, and its effectiveness is further enhanced by accommodating FastText word vectors and a Transformer encoder.
Using LLM to create dataset: Researchers in [30] introduced the FormAI dataset, a comprehensive collection of 112,000 AI-generated C programs, each labeled with vulnerabilities, to foster research in AI-driven code generation and security. The authors utilize Large Language Models (LLMs), particularly the GPT-3.5-turbo model, to dynamically prompt the generation of diverse C programs, varying in complexity and task types. Leveraging formal verification through the Efficient SMT-based Bounded Model Checker (ESBMC), vulnerabilities within the generated code are identified, labeled, and associated with Common Weakness Enumeration (CWE) numbers. The dataset aims to provide valuable resources for training LLMs and machine learning algorithms while addressing critical concerns regarding the safety and security of AI-generated code. STELOCODER A DECODER-ONLY LLM FOR MULTI-LANGUAGE
Their methodology, shown in Figure 11, involves several steps to construct and classify the FormAI dataset. Initially, GPT-3.5-turbo is prompted to generate C programs for diverse tasks, ranging from complex network management to simple string manipulation. Each output program is then subjected to compilation using the GNU C compiler to ensure compilability. Subsequently, the ESBMC module performs formal verification to detect vulnerabilities within the compiled programs. Detected vulnerabilities, along with their specific details such as line numbers and function names, are recorded in a .csv file, facilitating further analysis. This process ensures that vulnerabilities are conclusively identified, minimizing the risk of false positives and providing a formal counterexample for each vulnerability detected.
7.4 RQ5: What is the best type of data sets to train LLMs for software vulnerability detection?
For detecting and handling software vulnerabilities and cybersecurity threats with LLMs, a combination of text-based and code-based datasets tends to be most effective
Text-Based Datasets: These are crucial for tasks involving bug fixing, code comprehension, and understanding textual content related to vulnerabilities. They aid LLMs in grasping the context around security issues, identifying patterns in security-related texts (such as bug reports or security advisories), and enhancing their ability to generate secure code or identify vulnerabilities in code [1, 7, 12].
Code-Based Datasets: Essential for LLMs to comprehend and analyze code, especially for identifying vulnerabilities within the codebase itself. This type of dataset helps LLMs understand the structure, logic, and potential flaws in software code, enabling them to identify vulnerabilities, suggest fixes, or even generate secure code [1, 61].
By combining these datasets, LLMs can learn to correlate textual information (like security advisories or bug reports) with the actual code, which is crucial in cybersecurity. Understanding the context within which vulnerabilities are reported and how they manifest in code allows LLMs to provide more comprehensive and accurate support in detecting, addressing, and potentially preventing security threats.
7.4.1 Examples of datasets used to train LLMs for software security and cybersecurity purposes.
Data collection for this purpose entails acquiring he training data from various open-source databases and repositories such as CVEfixes, Big-Vul, Draper, SARD, Juliet, Devign, REVEAL, DiverseVul, and many others, encompassing different security aspects [4, 30].
CVE dataset: The Common Vulnerabilities and Exposures (CVE) dataset is a comprehensive vulnerability dataset that is automatically collected and curated from CVE records in the public U.S. National Vulnerability Database (NVD). CVE datasets contain information on publicly disclosed cybersecurity vulnerabilities. These datasets are valuable for researchers, security professionals, and anyone interested in staying up-to-date on the latest threats. As of now, there are 228,713 CVE Records. It lists reported vulnerabilities in software systems.
CVE dataset typically includes:
- CVE ID: A unique identifier for the vulnerability assigned by Mitre, the CVE Program authority.
- Description: A detailed explanation of the vulnerability, including the affected software, potential impact, and how it can be exploited.
- CVSS Score: A scoring system (Common Vulnerability Scoring System) that reflects the severity of the vulnerability based on its exploitability, impact, and scope.
- Published Date: The date the vulnerability was publicly disclosed.
- References: Links to additional resources such as patches, workarounds, and exploit code.
- Affected Products: A list of software programs or systems that are vulnerable to the exploit.
To help researchers, we compiled a list of resources where CVE datasets can be found:
- The National Institute of Standards and Technology (NIST)[1]
- MITRE CVE (This is the official source for CVE data and provides downloads in various formats).[2]
- New CVE official website[3].
- Download page in new CVE website[4]. (This page offers downloads of the CVE List in legacy formats until June 30th, 2024, and the new recommended JSON 5.0 format).
- CVE Details[5] / (This website offers a searchable CVE database with additional information like exploit details and vendor risk scores).
CWE dataset: The CWE (Common Weakness Enumeration) dataset, sourced from MITRE, focuses on classifying software weaknesses. The CWE provides a provides a standardized way to classify software weaknesses. This is useful to developers, system analysts, software testers, and security researchers. The CWE dataset is organized as a relational database and covers a wide range of vulnerabilities [21].
Each CWE entry provides details about a specific weakness, including:
- ID: A unique identifier for the weakness.
- Name: A concise description of the weakness.
- Description: A more elaborate explanation of the weakness, its potential consequences, and how it can be exploited.
- Extended Description: Additional in-depth information about the weakness.
- Relationships: Connections to other CWEs and relevant security concepts.
- Hierarchies: Organization of CWEs within a structured classification scheme.
CAPEC dataset: The Common Attack Pattern Enumerations and Classifications (CAPEC) dataset, also from MITRE, focuses on classifying cyber-attack patterns. Similar to CWE, CAPEC provides a standardized language for understanding attacker methods. Entries are organized hierarchically and can be linked to show relationships between different attack patterns.
CVEfixes: The CVEfixes dataset was first introduced by Bhandari et al. [63]. This comprehensive vulnerability dataset is automatically collected and curated from Common Vulnerabilities and Exposures (CVE) records in the public U.S. National Vulnerability Database (NVD). It aims to support data-driven security research based on source code and source code metrics related to fixes for CVEs by providing detailed information at different interlinked levels of abstraction, such as the commit-, file-, and method level, as well as the repository- and CVE level. The initial release of the dataset covers all published CVEs up to June 9, 2021, and includes information from 5,495 vulnerability fixing commits in 1,754 open-source projects, covering a total of 5,365 CVEs across 180 different Common Weakness Enumeration (CWE) types. Additionally, the dataset includes the source code before and after fixing for 18,249 files and 50,322 functions1 [13, 15, 63].
DiverseVul: DiverseVul was first introduced by [27], designed for the detection of software vulnerabilities through deep learning techniques. DiverseVul comprises 18,945 vulnerable functions across 155 Common Weakness Enumerations (CWEs) and 330,492 non-vulnerable functions, sourced from 7,514 commits. Notably, this dataset offers greater diversity and is twice the size of the previous largest and most diverse dataset, CVEFixes. Utilizing DiverseVul, the study investigates the efficacy of various deep learning architectures in vulnerability detection. It explores 11 distinct deep learning architectures from four model families: Graph Neural Networks (GNN), RoBERTa, GPT-2, and T5. Findings suggest that the increased diversity and volume of training data contribute to enhanced vulnerability detection, particularly for large language models.
The software assurance reference dataset (SARD): SARD is particularly attractive due to its inclusion of both security vulnerabilities and non-vulnerable alternatives. This feature enables the model to discern between these distinct categories effectively. Subsequently, it is possible to implement a preprocessing stage to eliminate any undesired artifacts that could potentially cause overfitting in the model [7, 17, 22, 24, 27, 28, 30, 39].
Semantics-based Vulnerability Candidate (SeVC): SeVC dataset contains 1,591 C/C++ open-source programs sourced from the National Vulnerability Database (NVD), along with 14,000 open-source programs from SARD. Within this dataset, there are a total of 420,627 SeVCs, among which 56,395 are identified as vulnerable, while 364,232 are deemed non-vulnerable. Additionally, the dataset encompasses four types of SeVCs: Library/API Function Calls, Array Usage, Pointer Usage, and Arithmetic Expression [7, 17, 22, 24].
Devign: The Devign dataset, initially presented in [62], stands as a real-world dataset designed for the identification of software vulnerabilities. It comprises function-level C/C++ source code extracted from two extensively utilized open-source software projects, namely QEMU and FFmpeg. The labeling and verification procedures were conducted manually by a team of security researchers across two distinct rounds [7, 17, 22, 24, 27, 28, 30, 43].
Big-Vul: The Big-Vul dataset is a collection of C/C++ vulnerabilities across several project repositories. It is a comprehensive dataset that includes code changes and Common Vulnerabilities and Exposures (CVE) summaries. Each entry in the dataset covers the period from 2002 to 2019 and consists of 21 features. The dataset is used for various research topics, such as detecting and fixing vulnerabilities, and analyzing the vulnerability-related code changes. It can be particularly useful for training and evaluating models designed for vulnerability detection [34, 73].
RQ5 Answer: For optimal training of Large Language Models (LLMs) in software vulnerability detection, a mix of text-based and code-based datasets is ideal. Text datasets aid in understanding security context, while code datasets help analyze vulnerabilities. Examples include CVE, CWE, CAPEC CVEfixes, DiverseVul, SARD, SeVC, and Devign, enhancing LLMs' effectiveness.
|
RQ6. In comparison to traditional methods/tools, how do LLMs perform in detecting and handling software vulnerabilities and cyber security threats?
Our research provides compelling evidence of the effectiveness of Large Language Models (LLMs) in detecting and handling software vulnerabilities. Several studies have demonstrated the superiority of LLMs over traditional methods:
SecurityLLM: Integration of SecurityBERT and FalconLLM resulted in the creation of a cyber threat detection model with an overall accuracy of 98%, capable of identifying fourteen different types of attacks.
GPT3.5 for Penetration Testing: Utilizing GPT3.5, researchers enhanced penetration testing by leveraging high-level strategic planning and identifying weak spots in vulnerable computing environments, achieving a closed-feedback loop between model-generated actions and vulnerable virtual machines.
GPT-4 vs. Static Code Analyzers: A study comparing GPT-4 with traditional static code analyzers like Snyk and Fortify found that GPT-4 detected approximately four times more vulnerabilities with a low false positive rate. GPT-4 also provided potential fixes for identified vulnerabilities, resulting in a significant decrease in vulnerabilities with minimal increase in code lines.
BERTBase for Vulnerability Detection: Fine-tuning the BERTBase model for vulnerability detection resulted in surpassing the performance of standard LSTM and BiLSTM models, achieving the highest detection accuracy of 93.49%.
VulDetect with GPT-2: The creation of VulDetect, a classification model based on GPT-2, demonstrated the effectiveness of LLMs when applied to a significantly large dataset, achieving superior performance compared to other model architectures.
GitHub Copilot for Code Generation: Models trained on source code rather than natural language, often referred to as code-based models or Large Language Models of Code (LLMCs), have emerged as powerful tools in software engineering and related fields. These models leverage algorithms and techniques tailored to analyze and understand programming languages and their syntax. By training on vast repositories of code, they acquire an understanding of programming logic, enabling them to generate functional code snippets and even entire programs that meet specified criteria. This capability holds immense promise for automating software development tasks, such as code completion, bug detection, and program synthesis. Furthermore, these code-based models can undergo rigorous testing procedures to ensure the reliability and robustness of the generated code. Researchers are continually advancing these models, exploring new architectures, training methodologies, and applications across various domains within computer science and software engineering.
GitHub Copilot and LaMDA Code are prominent example of large language models highly regarded within the developer community. GitHub Copilot developed in collaboration with OpenAI, leverages the power of OpenAI Codex, a sophisticated AI system trained on a vast dataset of public source code. Furthermore, models primarily trained on human language, exemplified by OpenAI's ChatGPT, demonstrate proficiency in similar tasks [31, 35, 72].
Transformer-Based Language Models (LLMs) vs. Traditional Methods: This study the superior performance of transformer-based language models (LLMs) over code analyzers or traditional static and recurrent neural network (RNN)-based methods was showcases in [22]. software vulnerability detection. Through a systematic evaluation framework, it was demonstrated that LLMs, particularly GPT-2 Large and GPT-2 XL, consistently outperformed BiLSTM and BiGRU models across various metrics, including false positive rate (FPR), false negative rate (FNR), and F1-score, in both binary and multi-class classification tasks. Furthermore, when compared to BERTBase and GPT-2 Base, LLMs exhibited better performance in identifying vulnerabilities across different categories, reinforcing their efficacy in software vulnerability detection tasks. These findings underscore the significance of utilizing LLMs as powerful tools for enhancing the security of software systems by effectively identifying potential vulnerabilities.
However, not all studies demonstrate superior performance from Large Language Models (LLMs). For instance [13] evaluations indicate that Large Language Models (LLMs) struggle in detecting software vulnerabilities. This is primarily due to high numbers of false positives identified by the models. While preprocessing techniques like constructing code gadgets may improve the LLMs' ability to recall actual vulnerabilities (recall rate), the number of false positives remains persistently high. The study also found that LLMs, particularly when fine-tuned, exhibit proficiency in recognizing common patterns associated with vulnerable code. For example, they can effectively identify code patterns using object-relational model libraries that could be susceptible to SQL injection vulnerabilities. Additionally, the authors observed that ChatGPT 4.0 can potentially understand the "intention" behind a given code snippet. They hypothesize that LLMs' strong recall performance might be partly attributed to their ability to identify vulnerable code patterns across multiple lines of code. This is a significant advantage compared to traditional static analysis methods, where manually crafting such rules is a time-consuming and expensive process.
Also, the results of the experiment conducted in [64, 75] showed that while the prompting methods improved the models’ performance, LLMs generally struggled with vulnerability detection. They reported a Balanced Accuracy of 0.5-0.63 and failed to distinguish between buggy and fixed versions of programs in 76% of cases on average.
Over-all, these findings collectively underscore the capability of Large Language Models in detecting and handling software vulnerabilities, outperforming traditional methods such as and recurrent neural network models.
RQ6 Answer: Large Language Models (LLMs) show significant promise in detecting software vulnerabilities compared to traditional methods, offering high accuracy rates and the ability to recognize complex code patterns. Despite some challenges, such as high FPR and FNR, and struggles in distinguishing between buggy and fixed versions, with more advancement in this field, LLMs can be valuable tools for enhancing software security.
|
7.6 RQ7. What metrics assess large language Models in addressing software vulnerabilities and cyber threats?
Evaluation metrics are crucial for assessing the effectiveness and success of LLMs in software vulnerabilities and cybersecurity applications. These metrics provide a framework for quantifying the performance of these models across various tasks related to software vulnerabilities and cybersecurity threats, such as vulnerability detection, threat prediction, and automated patching. Given the diverse nature of these tasks, employing a range of evaluation metrics tailored to specific problem types is common practice.
Classification Tasks: For tasks involving classification, such as identifying types of vulnerabilities or predicting potential threats, F1-score, Precision, and Recall are commonly used metrics. They gauge the model's ability to classify code snippets accurately or identify specific security properties [1, 4, 17, 18, 22, 27, 36, 37].
Recommendation Tasks: Mean Reciprocal Rank (MRR) is a prevalent metric for recommendation systems related to code completion. Precision@k and F1-score@k are also employed in evaluating the precision and F1-score of recommended code snippets or completions [1]
Generation Tasks: BLEU (including its variants BLEU-4 and BLEU-DC) and Pass@k are widely used metrics for code-to-code translation models and code generation assessment. These metrics evaluate the quality and accuracy of generated code snippets compared to reference solutions. Additionally, other metrics like ROUGE/ROUGE-L, METEOR, EM (Exact Match), and ES (Edit Similarity) are utilized in specific studies to assess the quality of generated code or natural language code descriptions [1, 20, 58].
AUC-ROC: The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is a metric that assesses the model's ability to distinguish between positive and negative cases across different classification thresholds. A higher AUC-ROC value indicates better overall performance.
Code Coverage: In the context of vulnerability detection, code coverage metrics assess the extent to which the LLM has analyzed the source code. Higher code coverage generally implies a more thorough analysis and potentially better vulnerability detection.
False Positive Rate and False Negative Rate: These metrics measure the rates at which the model incorrectly identifies non-vulnerable code as vulnerable (false positive) or fails to detect actual vulnerabilities (false negative). Minimizing both rates is crucial for reliable vulnerability detection.
These approach to evaluating LLM performance recognizes the nuanced nature of software engineering tasks, employing specific metrics tailored to the task at hand, whether it's classification, recommendation, or generation, to comprehensively measure model effectiveness and accuracy.
RQ7 Answer: Evaluation metrics for assessing LLMs in addressing software vulnerabilities and cyber threats include F1-score, Precision, Recall for classification tasks, MRR for recommendation tasks, BLEU and Pass@k for generation tasks, AUC-ROC for discrimination ability, and metrics like code coverage, false positive rate, and false negative rate for vulnerability detection. These metrics offer a comprehensive assessment of LLM performance across various tasks in software security.
|
7.7. RQ8: What are the challenges of using LLMs in cybersecurity tasks?
What are the limitations or challenges associated with using LLMs for software security, and how can they be mitigated? In this section we will try to shed some light to this question.
Prompt injection attacks: One of the significant security concerns related to LLMs, is prompt injection attacks [68, 69, 70]. Prompt injection attacks involve manipulating the input prompts given to a Large Language Model (LLM) in order to coax it into generating responses that are harmful, unauthorized, or disclose sensitive information. Malicious users may attempt to deceive the LLM into producing outputs that could compromise security, breach privacy, or cause harm in various ways. These attacks exploit vulnerabilities in the LLM's processing of prompts, potentially leading to disruptive outcomes or security breaches. This form of attack, especially potent within LLM-integrated applications, has recently been identified as the primary LLM-related threat by OWASP (Open Web Application Security Project) Foundation. Such manipulation can result in adverse consequences, such as providing incorrect guidance or unauthorized divulgence of confidential data [65, 66, 67, 74].
Prompt injection is important because it highlights the need to address issues related to prompt abuse and prompt leak as we transition further into the era of Large Language Models (LLMs). It is crucial to protect LLM-integrated applications from prompt injection threats, a fact recognized by many developers who have demonstrated increasing vigilance in the implementation of prompt protection systems and the quest for novel solutions [65, 66, 67, 74]. There are two main categories of prompt injections:
Direct Prompt Injections: Commonly referred to as "jailbreaking," direct prompt injections involve altering or exposing the system prompt, often resulting in partial loss of intellectual property. This process may entail creating prompts with the specific goal of bypassing safety and moderation measures implemented by creators of Large Language Models (LLMs) [74].
Indirect Prompt Injections: indirect prompt injections occur when an LLM accepts input from external sources that can be manipulated by an attacker, such as websites or files. In this scenario, attackers can trick the LLM into interpreting its input as "commands" rather than "data" for processing, leading to unexpected behavior in LLM-based applications or compromising the security of the entire system [74].
Vulnerability Detection Challenges: Detecting vulnerabilities with LLMs can be problematic due to their ability to generate various alternative responses for the same issue. This diversity, beneficial in language processing and text generation, may pose challenges in pinpointing the actual vulnerability. The presence of multiple solutions, although advantageous in certain contexts, can complicate the identification of even the simplest software security vulnerabilities [2].
LLM Code Generation Challenges: LLMs may struggle to generate accurate code when faced with multiple valid solutions, leading to functionally correct but contextually inappropriate code. They might perform well on specific tasks they were trained on but struggle with different tasks, languages, or domains outside of their training scope. Their performance can deteriorate significantly when inputs undergo semantic-preserving transformations [1, 20].
The study done in [29] analyzed 2033 programming tasks and 4066 ChatGPT-generated code snippets implemented in two popular programming languages: Java and Python, with 2,556 code with quality issues to comprehensively assess AI-generated code quality and uncover performance-influencing factors. While ChatGPT3.5 could produce functional code for various tasks, their research findings uncovered a range of code quality concerns in ChatGPT3.5-generated code, spanning from compilation and runtime errors to incorrect outputs and issues with maintainability. This discovery underscores the critical need to tackle these issues diligently to safeguard the sustained efficacy of AI-driven code generation and uphold the standards of high-quality software systems.
Hallucinations: LLMs frequently generate false information, known as "hallucinations," which appear statistically plausible. Studies indicate that incorporating external knowledge and automated feedback mechanisms can mitigate these hallucinations [5, 64, 74].
Code Quality Challenges: Code quality issues pose significant concerns due to their potential to incur financial and reputational losses.
LLM Deployment Challenges: Large Language Models (LLMs) have become instrumental in software development, but they come with their own set of challenges. Their enormous size demands significant resources, making deployment challenging in resource-limited scenarios. They rely heavily on large and diverse datasets for training, and limited or biased data can lead to inaccurate predictions. There are also concerns about privacy leaks with Personally Identifiable Information (PII) in training data [1].
LLM Evaluation: Existing evaluation metrics might not capture all aspects of model performance, such as interpretability, robustness, or sensitivity to certain errors. LLMs often lack interpretability and transparency in their decision-making processes, leading to uncertainty among developers. Concerns exist around the ownership of training data, derivative data, and potential adversarial attacks by seeding vulnerabilities into LLMs [1].
When comparing the performance of Large Language Models (LLMs) in software vulnerability and cybersecurity domains, a pertinent question arises regarding the fairness of such comparisons, given the variability in testing conditions across different scenarios. For instance, the robustness of the system could significantly impact LLM performance when both training and testing the model on the same dataset. Notably, not all researchers have access to supercomputers or high-performance workstations, which could affect the scalability and efficiency of model training and evaluation. Furthermore, changes made to the dataset over time or the selective use of specific portions of the dataset during testing may introduce additional variability, potentially influencing the outcomes of comparative analyses.
Insecure output management: Developers need to be careful since LLMs may produce harmful outputs. Insecure output management occurs when LLM outputs are not properly validated or sanitized before use, which can lead to security risks like Cross-Site Scripting and Cross-Site Request Forgery in web browsers. Furthermore, neglecting to validate LLM outputs may lead to downstream security exploits, including code execution that compromises systems and exposes data. Attackers can also exploit these outputs for privilege escalation and remote code execution on backend systems [74].
Poisoning of educational data: Training data poisoning refers to the deliberate manipulation of the data used to train models with malicious intent. Adversaries insert deceptive or biased examples into the training dataset during pre-training or fine-tuning to influence the model's learning process. This can involve introducing backdoors, biases, or other vulnerabilities that compromise the security, performance, and reliability of the model [74]. Manipulated training data can disrupt LLM models, leading to responses that may compromise security, accuracy, or ethical behavior.
Denial of service model: Training and running Large Models (LMs) requires substantial resources. An attacker can engage with LMs in a way that causes them to use resources excessively, thereby reducing the quality of service or even denying service to other users, and increasing compute costs. Attackers can create prompts that are computationally demanding in terms of context length or language patterns [74].
Supply chain vulnerabilities: The supply chain encompasses the complete journey from gathering data and training the model to its deployment. This process includes different elements like the training data, pre-trained models, and deployment infrastructure. Each element is susceptible to vulnerabilities: the crowd-sourced training data might be tainted, the pre-trained model could be compromised, or the third-party packages employed in LLM development could be insecure. Depending on the compromised components, services or datasets, they undermine system integrity, causing data breaches and system crashes [74].
Disclosure of sensitive information: Large Language Models (LLMs) are initially trained on varied datasets containing snippets of real-world information. When generating responses, these models may unintentionally disclose sensitive details. For instance, conversational agents like OpenAI’s ChatGPT and Google’s Gemini gather user prompts during interactions to improve their performance. However, this approach poses a security and privacy risk, as the model might generate outputs inadvertently revealing confidential or private information. Additionally, by employing meticulously constructed prompts, an attacker could deliberately exploit this vulnerability to reveal or expose sensitive details. Failure to protect against disclosure of sensitive information in LLM outputs can lead to legal consequences or loss of competitive advantage [74].
Insecure plugin design: Those LLM plugins lacking proper access control or input validation may lead to vulnerabilities such as SQL injection, and remote code execution. Frequently, these plugins accept user input as unrestricted text, making them susceptible to exploitation by attackers [74].
Too much agency: Large Language Model powered systems base their decisions on user prompts or inputs received from integrated components. Excessive autonomy or authorization granted to LLMs can introduce vulnerabilities susceptible to exploitation by malicious actors, potentially compromising the entire system. However, even without deliberate attacks, unintended user prompts, or wrong actions from connected systems can lead LLMs to generate misleading or unforeseen outputs, causing system malfunctions. As an illustrative example, consider an LLM-based file summarizer that utilizes a third-party plugin for user file access. This plugin, beyond reading capabilities, might also possess functionalities for file modification and deletion. If a user encounters discrepancies in the LLM's summary, their attempt to report the error to the application could inadvertently trigger the LLM to modify or delete the original files, highlighting the potential for unintended consequences. Ultimately, giving LLMs unchecked autonomy to act can lead to unintended consequences that compromise reliability, privacy and trust [74].
Overconfidence: The utilization of these models for source code generation presents a potential avenue for the inadvertent introduction of security vulnerabilities. These vulnerabilities can pose significant threats to the safety and security of applications and their users. The uncritical application of information or code generated by LLMs, without appropriate scrutiny, can lead to a cascade of negative consequences. These consequences include security breaches, the dissemination of misinformation, communication disruptions, legal issues, and reputational damage [74].
Model theft: The unauthorized copying or extraction of weights, parameters, or data from closed-source LLMs constitutes a form of intellectual property theft. This illicit practice can inflict significant economic losses on developers and damage brand reputation, ultimately jeopardizing a company's competitive edge. Perpetrators may exploit the purloined proprietary information for their own gain or utilize the stolen model for malicious purposes [74].
It's important to recognize both the opportunities and threats, presented by artificial intelligence (AI) technologies. We must acknowledge the transformative potential of LLMs, but we must also accept that there are emerging risks. As LLMs continue to advance, there is a simultaneous need to understand and control the associated risks. The key to maximizing automation's benefits and enhancing functionalities through LLMs across various fields, lies in mitigating the inherent risks associated with AI systems.
To effectively reduce risks, increased awareness and preparation are crucial. Software engineers play a vital role by analyzing past methods and incorporating those learnings into secure data and system mechanisms. New technologies are constantly emerging to monitor threats and mitigate risks. As pioneers in AI design and development, software engineers have a responsibility to identify new vulnerabilities and implement safeguards to protect users.
RQ8 Answer: Challenges of using LLMs in cybersecurity tasks includes but not limited to: prompt injection attacks, vulnerability detection difficulties, code generation challenges, hallucinations, deployment hurdles, evaluation limitations, insecure output management, data poisoning risks, denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure plugin design, unchecked autonomy, overconfidence, and model theft. These challenges underscore the importance of understanding and mitigating risks associated with LLMs to maximize their benefits while ensuring security and reliability in various applications.
|
7.8 RQ9. How to enhance LLMs effectiveness in software vulnerability and cyber threat detection?
Integration with Existing Tools: To enhance software vulnerability and cyber threat detection, Large Language Models (LLMs) can be effectively combined with complementary methods and tools. Static code analyzers excel at detecting known vulnerabilities and coding errors, while dynamic analysis tools like fuzzers uncover runtime vulnerabilities. Integrating LLMs into security testing frameworks and threat intelligence platforms allows for comprehensive vulnerability detection and proactive threat mitigation.
Human expertise remains crucial for domain-specific knowledge and validation of LLM-generated results. LLMs also can be combined with BMC to create automated code repair frameworks.
Continuous monitoring systems with LLM integration enable prompt incident detection and response. Collaborative platforms facilitate knowledge sharing and coordinated efforts in addressing security issues. By combining these approaches, organizations can achieve comprehensive coverage and improve the effectiveness of software vulnerability and cyber threat detection.
Larger and More Diverse Datasets: Training LLMs on a wider range of vulnerable and non-vulnerable code across various programming languages can enhance their ability to generalize and identify vulnerabilities in unseen code. However, given the challenges associated with training LLMs on extensive datasets, there are endeavors aimed at enhancing LLM accuracy by training them on smaller datasets. We believe that achieving this objective could substantially enhance the public usability of LLMs.
Focus on Specific Vulnerabilities: In some scenarios, training LLMs on datasets focused on specific types of vulnerabilities can improve their accuracy in detecting those vulnerabilities, but this method is most suitable for scenarios where a system is only prone to specific attacks.
Preprocessing Techniques: Techniques like constructing code gadgets can help LLMs distinguish actual vulnerabilities from irrelevant patterns.
Prompt Engineering: Craft effective prompts to guide LLMs towards vulnerability detection tasks, improving focus and accuracy [28].
Continuous Learning: Regularly update LLMs with new vulnerability and threat data to enhance detection of emerging security risks.
Efforts toward optimizing model size, enhancing data diversity and quality, improving code generation in ambiguous scenarios, enhancing generalizability, developing better evaluation methodologies, and focusing on interpretability and ethical use of LLMs. By combining these strategies, we can unlock the full potential of LLMs and make them even more powerful tools in the fight against software vulnerabilities and cyber threats.
RQ9Answer: To enhance LLMs' effectiveness in software vulnerability and cyber threat detection, organizations can integrate them with existing tools like static code analyzers and dynamic analysis tools. Additionally, training LLMs on larger and more diverse datasets, focusing on specific vulnerabilities, and employing preprocessing techniques can improve their accuracy. Crafting effective prompts, ensuring continuous learning by updating LLMs with new data, and optimizing model size and data quality are essential. Developing better evaluation methodologies and prioritizing interpretability and ethical use further unlock the full potential of LLMs in addressing software vulnerabilities and cyber threats.
|
[1] https://www.kaggle.com/datasets/andrewkronser/cve-common-vulnerabilities-and-exposures
[2] https://cve.mitre.org/data/downloads/index.html
[3] https://www.cve.org/
[4] https://www.cve.org/Downloads#current-format
[5] https://www.cvedetails.com/