Using Large Language Models to Better Detect and Handle Software Vulnerabilities and Cyber Security Threats

doi:10.21203/rs.3.rs-4387414/v1

Download PDF

Research Article

Using Large Language Models to Better Detect and Handle Software Vulnerabilities and Cyber Security Threats

https://doi.org/10.21203/rs.3.rs-4387414/v1

This work is licensed under a CC BY 4.0 License

Version 1

posted

You are reading this latest preprint version

Large Language Models (LLMs) have emerged as powerful tools in the domain of software vulnerability and cybersecurity tasks, offering promising capabilities in detecting and handling security threats. This article explores the utilization of LLMs in various aspects of cybersecurity, including vulnerability detection, threat prediction, and automated code repair. We explain the concept of LLMs, highlighting their various applications, and evaluates their effectiveness and challenges through literature review. We explore the effectiveness of various LLMs across different cybersecurity domains, showcasing their proficiency in tasks like malware detection and code summarization. Comparing LLMs to traditional methods, our work highlights their superior performance in identifying vulnerabilities and proposing fixes. Furthermore, we outline the workflow of LLM models, emphasizing their integration into cyber threat detection frameworks and incident response systems. We also discuss complementary methods and tools that enhance LLMs' capabilities, including static and dynamic code analyzers. Additionally, we synthesize findings from previous research, demonstrating how the utilization of LLMs has significantly enhanced productivity in identifying and addressing software vulnerabilities and cybersecurity threats. Finally, the study offers insights into optimizing the implementation of LLMs based on the lessons learned from existing literature.

Large Language Models (LLMs)

Software vulnerabilities

Cyber security

Software Vulnerability detection

Cyber Threat Detection.

The presence of implementation bugs within software applications can manifest as program crashes, data corruption, performance Decline, and exploitable security vulnerabilities. This underscores the criticality of early bug identification and rectification. Established methodologies, such as static analysis, possess limitations in their ability to accurately detect faults, and recent forays into deep learning-based approaches may not comprehensively capture the intricacies of software code. While static analysis and DL techniques offer demonstrable value in the initial stages of bug detection, they are prone to generating false positives and other challenges (Data Requirement, Interpretability, Feature Selection). These false detections can negatively impact developer productivity [36, 44, 45].

The recent breakthroughs in deep learning (DL) have garnered significant attention within the software engineering domain, prompting exploration of their potential to resolve longstanding issues. [46, 47]. Several approaches have been proposed to leverage Deep Learning (DL) techniques in addressing program defects. Some prominent examples include DLFix [48] and DeepRepair [49], which utilize DL to directly repair these defects. CURE [50] employs a deep learning model trained on a corpus of source code to parse, analyze, and model code structure. DEAR [51] takes a hybrid approach, combining spectrum-based fault localization with deep learning to enhance contextual understanding of the code. It is important to acknowledge that each of these methods possesses its own strengths and weaknesses, necessitating further research to determine the most effective strategies for DL-based program defect mitigation.

But the rise of Large Language Models (LLMs), powered by revolutionary Trans-formers architecture, has significantly impacted the landscape of Natural Language Processing (NLP), ushering in a new era of capabilities and applications resulting into unprecedented advancements [3, 4, 9]. recent studies have demonstrated the effectiveness of Large Language Models (LLMs) in detecting security vulnerabilities, sometimes surpassing deep learning (DL) and traditional methods [9].

1.1. Problem Statement (challenges)

As cyber dangers evolve rapidly in both complexity and severity, doubts arise regarding organizations' capability to fend off large-scale assaults and individuals' capacity to safeguard themselves. The 2023 Verizon Cost of Data Breach Report highlights that on average, it takes companies 197 days to detect a security breach and 69 days to contain it. Extending the incident response duration exposes companies to significant financial and operational setbacks, including unplanned outages and decreased productivity. The need for computers to handle and interpret vast volumes of language data, especially in natural language interactions and cybersecurity tasks like identifying flaws in software code, has become paramount [17,24, 28].

Conventional methods for detecting vulnerabilities, which hinge on human experts defining weaknesses, are slow and laborious, therefore, there is an ever-increasing need for automated tools to detect software flaws. Machine learning methods exhibit demonstrably superior capabilities in these areas, achieving more accurate detection and proposing more effective solutions. However, it is crucial to acknowledge that even LLMs are susceptible to factual hallucinations and solution misinterpretations, potentially leading to false positive or negative outputs. Therefore, significant research efforts are still required to develop a unified method that integrates LLMs with other techniques to create a robust and comprehensive system for detecting and handling software vulnerabilities and security threats.

1.2 Research goals

The main goal of this paper is to explore the potential applications of LLMs in the domain of software vulnerabilities and cybersecurity threat detection. Our secondary objective is to furnish researchers with foundational knowledge of LLMS and their practical applications. Having examined prior research, we will provide a comprehensive set of guidelines and recommendations to assist researchers in effectively utilizing LLMs to identify and address software vulnerabilities and cybersecurity threats. We will also address the challenges inherent in employing LLMs within the realm of software vulnerability and cybersecurity threat detection, offering practical suggestions for overcoming these obstacles. By tackling these challenges, our aim is to help unleash the complete capabilities of LLMs in fostering a safer and more effective software development environment.

1.3 Overview

In the forthcoming sections, we begin with the methodology section, outlining our approach to source material gathering, and analyzing them. Then we have the background section laying the groundwork to provide an overview of LLMs, their categories, and their applications within the domain of software engineering and cyber security.

Next, we articulate the challenges faced in effectively employing LLMs for cybersecurity and software vulnerability management, setting the stage for our research goals and objectives. Following this, the research questions are posed. The focus of our study is revealed in the subsequent section, where we explain how Large Language Models (LLMs) are employed for detecting and managing software vulnerabilities and cybersecurity threats through a review of relevant literature. Subsequently, the research questions are systematically addressed and answered.

To comprehensively explore the landscape of using Large Language Models (LLMs) in the detection and handling of software vulnerabilities and cybersecurity threats, a systematic literature review was conducted. The following steps outline the methodological approach employed:

2.1 Identification of Keywords:

We started our research by using "LLMs" AND "cyber security” as the first keywords. After finding the related articles and examining them, the following keywords were found and used to find the next articles and guide the search process:

Large Language Models, SecurityBERT, FalconLLM, Pre-trained, Cyber Threat Detection, Software Fix, Security Testing, Penetration Testing, Software Vulnerability detection, Code Languages, Pre-trained Transformers, Natural Language Processing, Quality Assurance, ChatGPT, GPT-4, Bard, Code Generation, Vulnerability Discovery, transformer, Overfitting, prompt, and code analysis.

2.2 Search in Google Scholar:

Searches were conducted on Google Scholar using the identified keywords to retrieve a broad range of articles related to LLMs, software vulnerabilities, and cybersecurity threats.

2.3 Review of Search Results:

The search results were meticulously reviewed, with a focus on titles, abstracts, and keywords to identify articles aligned with the research objectives.

2.4 Snowballing Technique:

The snowballing technique was employed by examining the references cited in the selected articles to identify additional sources that may contribute to the study.

2.5 Iterative Refinement:

The search queries were iteratively refined, and the process was repeated to ensure a thorough and up-to-date review of the literature.

2.6 Inclusion and Exclusion Criteria:

The following criteria were used for inclusion and exclusion of the articles:

Irrelevance: Sources that do not directly contribute to addressing the research questions or objectives.
Non-English Publications: Excluding sources not published in the English language, unless language diversity is a specific focus.
Incomplete Data: Eliminating sources with insufficient or unclear information relevant to the study.
Non-Academic Sources: Non-academic or non-professional sources, such as blog posts or promotional materials, unless they provide unique insights.

2.7 Critical Review:

Selected articles underwent a critical review, with a focus on extracting key insights, methodologies, and findings. The quality of the research was assessed to ensure the inclusion of robust and relevant contributions.

2.8 Synthesis and Integration:

Information gathered from the selected articles was synthesized to identify common themes, trends, and gaps in the existing literature, forming the basis for the subsequent sections of the research paper.

2.9 Use of AI Assistance

In cases where there were still unanswered questions after reviewing all the articles, we utilized Microsoft Copilot to ask our questions. We then used the following prompt to find our answers from articles on the arXiv.org website:

“Please give a list of articles from arXiv.org that can help answer the following question. The question is: …”

This method allowed us to leverage the capabilities of AI and the extensive database of arXiv.org to fill in any gaps in our research.

Citation and Referencing:

The selected articles were cited using the ACM citation style, ensuring proper acknowledgment of the sources and inclusion in the reference section of the paper.

3.1 Introduction to Large Language Models (LLMs)

What is a LLM? Large Language Models (LLMs) are a subset of Pre-trained Language Models (PLMs) that are trained on a large corpus of text data. Examples of LLMs include models like GPT-4, BERT [23], and RoBERTa. They have demonstrated remarkable capabilities across various Natural Language Processing (NLP) tasks and stand as the current state-of-the-art in AI models, constructed upon neural network architectures, particularly deep learning. [23, 55]. The term "LLM" is used to distinguish language models based on their size, specifically referring to large-sized PLMs. However, there isn't a formal consensus on the minimum parameter scale for LLMs, as the model's capacity is intertwined with both data size and total computing resources [1, 6, 13].

Celebrated for their extensive training datasets, intricate neural network structures, and advanced linguistic abilities, LLMs showcase remarkable proficiency in mimicking, generating, and fine-tuning natural language. Developers employ a comprehensive body of texts to train LLMs, covering diverse linguistic styles, domains, and subjects. This training process enables LLMs to cultivate a statistical grasp of language, producing coherent, context-aware text and excelling in various NLP tasks. The substantial neural networks within LLMs, comprising tens or hundreds of billions of parameters, facilitate the discernment of subtle semantic links and effective generalization across diverse language tasks. The scale and architecture of LLMs augment their capacity to encompass vast linguistic knowledge, showcasing a form of artificial creativity reflective of their training data [41].

Additionally, LLMs are finding applications in various real-world tasks such as machine translation, content creation, chatbots, and factual topic summarization. While these models offer impressive capabilities, it's important to be aware of potential biases inherited from their training data and the need for responsible development. The field of LLM research is rapidly evolving, with ongoing efforts to improve their capabilities, address limitations, and explore new applications.

3.2 LLMs categorization

To provide an overview of the vast world of Large Language Models (LLMs) and their categorization, we draw upon the comprehensive work conducted by [1, 3, 6, 12]. The authors have made a commendable contribution to categorizing LLMs, shedding light on their diverse applications and characteristics. The most powerful Large Language Models (LLMs) currently utilize the Transformer architecture, which was introduced and published in 2017.[43] Depending on the application, transformer architectures can be used within encoder-only, decoder-only, or encoder-decoder networks. The mentioned paper, categorizes the mainstream LLMs into encoder-only, encoder-decoder, decoder-only [1], and Sparse Model architecture [3].

3.2.1 Encoder-only

Encoder-only models are adept at understanding and summarizing information from input sequences. They don't generate new sequences but produce fixed-size representations suitable for tasks like classification. Despite their effectiveness in tasks requiring comprehensive understanding, like SoC security verification, their inability to generate outputs based on the learned context limits their application in scenarios demanding further generation. [3, 6, 12, 31]

LLMs, particularly encoder-only types like BERT (Bidirectional Encoder Representations from Transformers) and its variants, focus on encoding input sentences to capture word relationships and contextual information [1, 6, 23, 31].

BERT, based on transformer's encoder architecture, is known for its bidirectional attention mechanism that considers both left and right context during training. Specialized models such as CodeBERT, GraphCodeBERT, RoBERTa, ALBERT, BERTOverflow, and CodeRetriever have tailored features for software engineering (SE) tasks, incorporating program structures, new pre-training tasks, or engaging different modalities for improved application in code-related tasks [1, 6, 23].

CodeBERT improves code understanding by predicting subsequent tokens, aiding tasks like code completion and bug detection [1]. GraphCodeBERT recognizes code element relationships as a graph, enhancing code summarization and program analysis tasks [1]. These models excel in tasks like code review, bug report understanding, and named entity recognition within code entities [1, 6, 12].

3.2.2. Encoder-decoder LLMs

Encoder-decoder Large Language Models (LLMs) are a type of language model that utilizes both an encoder and a decoder module. The encoder module is responsible for encoding the input sentence into a hidden-space, and the decoder is used to generate the target output text. This structure allows for more flexible training strategies [42]. One of the most well-known implementations of the encoder-decoder architecture is the transformer, introduced in [43].

Models like BART, PLBART, T5, CodeT5, AlphaCode, and CoTexT showcase flexibility in tasks such as summarization, translation, and question-answering [1, 3, 6, 12, 31].

3.2.3 Decoder-only LLMs

Decoder-only LLMs exclusively use the decoder module for sequential token prediction, contrasting the encoder-decoder architecture. Models like the GPT series (GPT-1, GPT-2, GPT-3, GPT-3.5, GPT-4, ChatGPT), CodeGPT, InstructGPT, Codex, Copilot, GPT-J, GPT-Neo, GPT-NeoX, LLaMA, and Vicuna follow this architecture. They excel in downstream tasks without additional prediction heads or fine-tuning, making them valuable tools in SE tasks [1, 6, 12]. Decoder-only LLMs have established impressive benchmarks in numerous NLP tasks, especially in the generation of free-form text.[3]

2022 saw a surge in decoder-only LLM development, further accelerated in 2023 with major companies like Google, Meta, and Microsoft launching products such as Bard, LLaMA, Llama 2, Bing Chat, etc. These newer models haven't found widespread application in SE yet, presenting untapped potential for specific tasks. The ongoing progress in decoder-only architectures signifies active exploration and innovation in this domain.

In 2023, there was a notable surge in the utilization of decoder-only LLMs in research. A significant shift was made in focus and resources toward exploring and leveraging the decoder-only architecture as the primary approach in LLMs for software engineering research and applications [1].

3.6 Sparse Model

Sparse models, particularly those inspired by the mixture-of-experts approach, are emerging as a prominent area of advancement in large language model (LLM) architecture. These models focus on activating a select subset of parameters for each input, leading to improved computational efficiency without sacrificing model capacity. Unlike dense models where every input activates all parameters, sparse models, especially MoE-based ones, activate specific "expert" parameters tailored to each input. Examples like Switch Transformer and GShard demonstrate how such sparse activation methods can achieve, or even surpass, the performance of dense models while requiring less computation. In the context of SoC security, where data can be both voluminous and complex, sparse models offer a valuable balance between computational efficiency and task-specific accuracy [3].

The distinct capabilities and limitations inherent to each Large Language Model (LLM) architecture underscore the significance of judiciously selecting the most suitable model architecture. This selection is pivotal in enhancing precision for a given task. The choice of architecture should be guided by a comprehensive understanding of the specific requirements of the task and the strengths and weaknesses of the available LLM architectures. We've shown the evolutionary journey of Large Language Models (LLMs) over time by extracting Figure1 from the thorough and comprehensive research conducted in [72], where a comprehensive survey of Large Language Models was carried out.

Given the importance of comprehending the distinctions and similarities among the premier methodologies for software security and vulnerability detection, table (I) has been constructed to facilitate a more profound understanding of the subject matter. This comparative analysis serves as a valuable resource in navigating the complexities of software security, thereby enabling more informed decision-making in the selection and implementation of appropriate security measures.

Table I: comparing the characteristics of Large Language Models (LLMs) with other established Software Vulnerability Detection Methods.

	Purpose and Design	Code Representation	Learning and Adaptation	Generalization	Feedback and Iteration	Coverage	Basis of Operation	Adaptability	Primary Use Case
LLMs	understanding and generating human-like text, including code	Represent code as sequences of tokens	Continuously learn from training data; adapt based on seen patterns	Can generalize across various coding languages, patterns and styles	Can provide explanations and justifications for generated code	Broad range of functionalities	Statistical analysis and Pattern recognition	Highly adaptable and flexible due to learning from training data	Text understanding, generation, and contextual reasoning
Manual Code Review	Identify security vulnerabilities through human expertise	Uses source code	Static knowledge and experience	Limited to reviewer's expertise	Manual, qualitative feedback	Limited	Human expertise and knowledge	Limited to reviewer's skills	Security vulnerability detection and code review
DAST (Dynamic Application Security Testing)	Designed to identify security vulnerabilities in running applications through simulated attacks	Running application	Rely on predefined rules and signatures; don’t traditionally learn (Limited learning from observed application behavior)	Precise and specific; based on known patterns/signatures (Specific to tested attack vectors)	Automated reports with limited context	Broader coverage, but limited control	Rule-based (Simulating attacker behavior)	limited but generally designed to be adaptable to different applications	Security vulnerability detection
Fuzz Testing (Fuzzing)	Designed to identify unknown vulnerabilities by providing invalid, unexpected, or random data as inputs	Inputs and program behavior	Rely on predefined rules and signatures; don’t traditionally learn	Can generalize across various coding patterns/styles	Limited feedback on root cause	Broad input coverage, but limited code paths	Sending unexpected inputs	Limited to input variations	Security vulnerability detection through unexpected inputs
IAST (Interactive Application Security Testing)	Designed to identify vulnerabilities in running applications by combining static and dynamic analysis	Running application with instrumentation	Learns from application behavior during testing	Tailored to specific application behavior	Detailed, context-aware feedback	Comprehensive application behavior coverage	Monitoring application behavior and interactions	Learns from application behavior	Security vulnerability detection in deployed applications
Code Sniffers	Designed to detect and flag programming errors, bugs, stylistic errors, and suspicious constructs	Source code	Predefined rules	Specific to coding standards	Automated reports based on rule violations	Focuses on specific coding rules	Predefined coding rules	Fixed set of rules	Enforcing coding standards and best practices
Software Composition Analysis (SCA)	Designed to identify known vulnerabilities in open-source components	Bill of Materials (BOM), component metadata	Analyzes vulnerability databases	Focuses on known vulnerabilities in components	List of vulnerable components	Identifies known vulnerabilities in components	Matching component versions against vulnerability databases	Static vulnerability databases	Security vulnerability detection
Runtime Application Self-Protection (RASP)	Designed to detect and prevent real-time attacks	Running application	Learns from attack patterns	Adapts to application behavior and attack patterns	Real-time alerts and mitigation suggestions	Protects against known and some unknown attacks	Analyzing application behavior for suspicious activity	Adapts to application behavior and attacks	Real-time attack prevention
Manual Penetration Testing	Identify security vulnerabilities through simulated attacker actions	Source code	Static knowledge and experience	Specific to tester's skills and knowledge	Detailed reports and recommendations	Comprehensive, but depends on tester's skills	Human expertise and tools	Depends on tester's skills and tools	Security vulnerability assessment and exploitation
Taxonomical Analysis	Classify software based on features and functionalities	Doesn’t directly deal with code representation; focuses on organizing artifacts.	Limited (manual analysis)	High (applicable to various systems)	Indirect (through classification)	Wide range of software components	Relies on predefined criteria (functionality, purpose, domain).	High (flexible categorization)	Organizing software repositories, aiding in comprehension.
Static Analysis Tools	Identify issues (bugs, vulnerabilities) by analyzing source code or binaries.	Analyzes code directly (syntax, semantics, control flow, data flow).	Doesn’t learn (Predefined rules and patterns)	Limited (specific to implemented rules)	Immediate and Direct (reports specific issues)	Deep (focuses on code details)	Relies on program structure, control flow graphs, data flow analysis.	High (customizable rules)	Code quality assurance, bug detection
Lexical Analysis	Create a token stream by processing source code.	Operates at the level of individual tokens (e.g., keywords, identifiers, operators).	Limited (predefined token types)	High (applicable to all programming languages)	Direct (identifies individual tokens)	Shallow (focuses on basic elements)	Language grammar, individual tokens	Low (fixed token types)	Front-end for compilers and interpreters
Software Metrics	Quantify code quality, maintainability, complexity, and other characteristics.	Analyzes code structure, size, complexity, relationships.	Limited (predefined metrics)	High (applicable to various projects)	Indirect (metrics need interpretation)	Varied (different metrics cover different aspects)	Relies on mathematical formulas, measurement theory and statistical analysis.	High (customizable metrics)	Quality assessment, project management, identifying areas for improvement.

Utilization of Large Language Models (LLMs) to detect and handle software vulnerabilities and security threats, can be considered a part of Software Development Life Cycle (SDLC). To better understand the role of LLMs in SDLC, we will first take a look at the different stages of SDLC, which can be subcategorized into 6 stages [1].

Requirements engineering

Software design

Software development

Software quality assurance (include software vulnerabilities)

Software management

Software maintenance (include cyber security threats)

In the context of the Software Development Life Cycle (SDLC), Large Language Models (LLMs) primarily contribute to the detection and management of software vulnerabilities and cybersecurity threats during stages 4 (Software Quality Assurance) and 6 (Software Maintenance). However, the utility of LLMs is not confined to these stages alone; they also find application in stages 1 (Requirements Engineering), 2 (Software Design), and 3 (Software Development). For the purpose of this study, our primary focus will be on stages 4 and 6, providing a comprehensive exploration of the role of LLMs in these stages. A brief introduction and examples will be provided for stages 1, 2, 3, and 5 to offer a holistic view of the LLMs role in SDLC. The subsequent sections of this article will delve into a detailed examination of stages 4 and 6.

5.1 Utilization of LLMs in Requirements Engineering

Large Language Models (LLMs) are very helpful in various tasks related to planning and defining software requirements.

Clearing Up Confusion

ChatGPT and ELECTRA are good at figuring out what unclear software requirements actually mean, making them easier to understand.

Sorting Requirements

Models based on BERT are great at separating functional (what the software should do) and non-functional (how the software should work) requirements. This helps in understanding the project at its early stages.

Identifying Terms

BERT-based methods, when used with K-means clustering (a type of grouping method), can successfully find terms that are used in different ways across various fields. XLNet has also shown promise in this area due to its permutation-based training objective.

Finding Connections

BERT and T5 models have been creatively used to find connections between entities in Requirement Engineering, showing promising results.

Automating Traceability

T-BERT is effective in transferring knowledge for traceability (the ability to trace the history, usage, or location of an item) between natural language artifacts and programming language artifacts. This shows potential for practical use and reliability in tracking software and system changes [1].

5.2 Utilization of LLMs in software design

In the field of Graphical User Interface (GUI) retrieval, BERT-based learning-to-rank (LTR) models are used. These models face the challenge of ranking GUI documents based on text, and their performance is evaluated based on natural language (NL)-based GUI ranking. Large Language Models (LLMs) also play a role in rapid prototyping, where they enhance the process with prompt design techniques [28]. This provides a structured methodology for addressing challenges in LLM for Software Engineering (LLM4SE). In addition, the SpecSyn Framework uses LLMs for automatic software specification synthesis. This approach has shown to outperform previous tools in both single and multiple sentence analysis, as measured by the F1 score, a measure of a test’s accuracy [1].

5.3 Utilization of LLMs in Software development

In the software development landscape, Large Language Models (LLMs) like GPT series, BERT series, Codex, InCoder, Copilot, and CodeGeeX have become instrumental in enhancing productivity in various code-related tasks, including program understanding, analysis, and generation. The era of LLMs in code generation began with the debut of CodeX, which forms the basis for GitHub Copilot. Following this, other models like InCoder, Google Alphacode, and Amazon CodeWhisperer have further advanced the effectiveness of code generation. Recently, OpenAI unveiled ChatGPT, an AI-powered chatbot renowned for its exceptional language understanding and human-like responses. Despite ChatGPT's success in generating code accurately and solving programming problems, concerns about the quality of its generated code persist among leading software companies.

LLMs can translate natural language descriptions into code, aiding in code completion, automatic generation, and annotation-to-code conversion. These models offer context-based code suggestions, semantic understanding, and personalized recommendations. They also generate human-readable descriptions from source code, improving code readability, documentation, and comprehension. Furthermore, they interpret code comments, semantics, and dependencies, thereby improving efficiency in code maintenance and integration [1, 11, 29].

LLMs also play a significant role in API recommendations, synthesis, and documentation. Models like GPT-4 and LLaMA-based architectures accurately generate API calls and adapt in real-time to documentation changes. They also assist in detecting and warning about poor API documentation quality, thereby enabling automated monitoring and improvement of API documentation. In addition, LLMs suggest performance-enhancing code edits, leading to significant speedups in program execution. They also recommend code examples by leveraging open-source projects, thereby improving developer efficiency. Lastly, they aid in identifier normalization, aligning identifier vocabulary with natural language, and enhancing code comprehension [1, 11].

5.4 Utilization of LLMs in software management

Research papers describing the utilization of LLMs in software management are still limited. Alhamed et al. [52] conduct an evaluation of the application of BERT in the task of effort estimation for software maintenance. Their study underscores BERT’s potential to offer valuable insights and aid in the decision-making process while also highlighting the associated challenges and need for further investigation.

6.1 Utilization of LLMs in Software quality assurance

In software quality assurance, Large Language Models (LLMs) are leveraged across various tasks that are mostly related to software vulnerabilities detection. Such as:

Test Generation: Automated test case creation: LLMs aid in generating diverse test cases, improving coverage, and identifying potential defects [1].

Natural language-based test case generation

Collaboration between developers and testers is fostered by generating test cases from natural language descriptions [1].

Vulnerability Detection

LLMs by leveraging their understanding of code structures and semantics, contribute to detecting vulnerabilities, offering improved accuracy compared to traditional methods [1, 9, 33]

Function-level vulnerability detection

Fine-tuning models like BERT for vulnerability detection has proven effective, as seen in studies that combine sequence and graph embedding for function-level vulnerability detection [1].

Test Automation: LLMs enhance automated testing techniques like mutation testing and fuzzing. They introduce faults in the codebase to evaluate the effectiveness of test suites and generate diverse input programs to uncover vulnerabilities and bugs. Incorporating LLMs into testing techniques improves test coverage, detects bugs efficiently, and aids in building more robust software systems [1, 9]

Formal verification and repair

LLMs combined with formal methods assist in automatically repairing software based on formal verification.

Formal verification methods are bolstered by LLMs, particularly when combined with bounded model checking. These models can automatically repair software based on formal methods, showcasing their ability to understand intricate software structures and generate accurate repairs, ensuring stable and secure performance [1].

Bug Localization

Augmenting bug reports with token and paragraph-level operations and training BERT-based models with augmented data significantly enhances bug localization. These techniques expand the training data and improve models' precision in identifying the specific source code segments responsible for reported bugs or defects.

Failure-inducing Test Identification: LLMs can assist in identifying fault-inducing test cases by leveraging their understanding of expected behavior from erroneous programs. Combining LLMs with difference testing techniques helps pinpoint subtle code differences, leading to more accurate identification of fault-inducing test cases [1].

Flaky Test Prediction

environments where test cases exhibit non-deterministic behavior, LLMs like CodeBERT assist in predicting flaky tests. Their predictions aid developers in focusing debugging efforts on potentially problematic test cases, reducing human effort and execution time spent on debugging [1].

6.2 Utilization of LLMs in in software maintenance

In the field of software maintenance, Large Language Models (LLMs) are extensively utilized for a variety of tasks. These tasks include, but are not limited to, the detection and mitigation of cybersecurity threats and software vulnerabilities.

Program Repair: LLMs like BERT, CodeBERT, and ChatGPT have shown effectiveness in generating accurate patches for bugs and defects. ChatGPT's interactive design enables continuous feedback loops, enhancing accuracy in program repair [1].

Code Review

LLMs aid reviewers in understanding code intent, detecting errors, and suggesting improvements, thereby enhancing code quality [1, 9, 12, 31]

Debugging

LLMs simulate scientific debugging processes, generate hypotheses about code problems, and debug their own generated code [1].

Bug Report Analysis

LLMs analyze bug reports, provide repair suggestions, and aid in better understanding the underlying causes, expediting error-fixing processes [1, 31].

Code Clone Detection: BERT's application in code clone detection, particularly in identifiers, enhances clone detection across all layers [1]. Logging: Models like T5 assist in automatically generating and summarizing logs, aiding in understanding software behavior and identifying issues [1].

Bug Prediction and Triage

LLMs like BERT effectively predict long-lived bugs and assist in triaging bugs, improving error detection and resolution [1].

Bug Report Replay and Duplicate Detection

LLMs aid in automated error replay based on natural language understanding and are used for detecting duplicate bug reports, reducing redundancy in the software development process [1].

Decompilation and Merge Conflicts Repair

LLMs like ChatGPT aid in recovering symbolic names during decompilation and partially automate program merge conflicts repair [1].

Sentiment Analysis and Tag Recommendation

Transformer models outperform existing tools in sentiment analysis related to software products and help recommend tags for software Q&A sites [1].

Vulnerability Repair and Traceability Recovery

Challenges persist in LLMs generating functionally correct code for zero-point vulnerability remediation. Additionally, LLMs enhance traceability link predictions, refining traceability recovery [1].

To provide a comprehensive overview of the LLMs use in the field of software vulnerabilities and cyber security, it is important to fully comprehend how these models are currently being applied, the challenges they face, and their potential. Thus, we aim to provide a systematic literature review of the application of LLMs in this fields. This study thus aims to answer the following research questions:

RQ1: Why should we use LLMs for software vulnerabilities and cyber security threats detection?
RQ2: What specific types of LLMs are utilized in software vulnerabilities and cyber security threats detection?
RQ3: How can LLMs be used to detect and handle software vulnerabilities and cyber security threats?
RQ4: What is the workflow of an LLM model in the context of detecting and handling software vulnerabilities and cyber security threats?
RQ5: What is the best type of data sets to train LLMs for software vulnerability detection?
In comparison to traditional methods/tools, how do LLMs perform in detecting and handling software vulnerabilities and cyber security threats?
What metrics are used to assess LLMs in addressing software vulnerabilities and cyber threats?
RQ8: What are the challenges of using LLMs in cybersecurity tasks?
How to enhance LLM effectiveness in software vulnerability and cyber threat detection?

7.1 RQ1: Why should we use LLMs for software vulnerabilities and cyber security threats detection?

As businesses increasingly integrate with the digital realm, the landscape of cyber threats evolves rapidly, growing more intricate and severe. This surge in connectivity amplifies the urgency for robust cybersecurity measures. Amidst this challenge, emerging research highlights the potential of natural language processing (NLP) techniques in fortifying cyber defense mechanisms. Specifically, NLP applications exhibit promise in detecting vulnerabilities within software code, a critical aspect in preventing cyber-attacks. It's well-documented that software bugs serve as prime entry points for malicious actors, precipitating cyber crimes. Despite advancements, the persistence of software vulnerabilities remains evident, as illustrated by the annual update of the Common Vulnerabilities and Exposures (CVE) list. Traditional error identification methods, once relied upon, now face scrutiny due to their susceptibility to inaccuracies and misdiagnoses. This underscores the pressing need for innovative approaches, including those grounded in NLP, to enhance cyber resilience in an increasingly interconnected world. [17, 24]. Artificial intelligence's ability to process vast datasets in real-time, extract insights, and anticipate potential threats has the potential to revolutionize proactive cybersecurity measures [57]. By using large language models like GPT along with suitable datasets such as the SARD benchmark dataset and SeVC dataset, we can analyze to find weak code in different programming languages including C++/C and Java [59].

While machine learning has been used for vulnerability detection, traditional methods require complex and error-prone manual feature engineering to define what information the machine should analyze. Deep learning approaches using Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been explored, but they necessitate specifically formatted code data, which can be challenging. Giving rise to the Transformer-based neural network architectures, known for their success in natural language processing. The goal is to develop a model that automatically learns both syntactic (structure) and semantic (meaning) information from various programming languages, paving the way for even more powerful vulnerability detection systems using large language models. [17, 18, 24] Moreover, transformer-based language models hold appeal and promise over RNNs due to their capacity for parallelization in model computation, facilitating quicker processing compared to RNNs. This is crucial for decreasing the time required for model training and testing, especially when dealing with large-sized models like transformer-based ones. Additionally, their capability to transition from natural language processing tasks to related tasks, via transfer learning, allows for broadening their applicability across various domains. [22]

Leveraging transformer-based models such as GPT4 for vulnerability detection, presents numerous benefits, notably enhanced accuracy and natural language processing capabilities. These models obviate the necessity for manual input in static analysis tools, streamlining the detection process into a faster, more automated procedure. [17, 24]

RQ1 Answer: Large language models, like GPT-4, should be utilized for software vulnerabilities and cybersecurity threat detection due to their ability to automate the identification of code vulnerabilities across various programming languages. They offer enhanced accuracy and natural language processing capabilities, streamlining the detection process and facilitating quicker processing compared to traditional methods.

7.2 RQ2 What specific types of LLMs are utilized in software vulnerabilities and cyber security threats detection?

Numerous large language models (LLMs) have been explored for their potential in detecting software vulnerabilities and cybersecurity threats. These models include BERT, GPT-3, RoBERTa, XLNet, ALBERT, T5, BART, ELECTRA, Longformer, CodeBERT, GraphCodeBERT, CodeT5, PLBART, and others. These LLMs, with their ability to process and understand code, have been utilized in tasks such as vulnerability detection, malware classification, code summarization, and code generation, contributing to the advancement of cybersecurity research and practice.

BERT: Researchers are exploring the application of BERT and the adaptation of pretrained language models like CodeBERT, GraphCodeBERT across a range of cybersecurity domains, encompassing malware detection in Android apps, spam email identification, intrusion detection in automotive systems, and anomaly detection in system logs. The Bidirectional Encoder Representations from Transformers (BERT) model has garnered significant interest for its remarkable capability to grasp contextual nuances in text sequences. As a pretrained transformer-based language model, BERT has demonstrated remarkable proficiency in various NLP tasks. Its inherent ability to comprehend intricate dependencies and variations within sequences has spurred investigations into its utility for cyber threat detection. Leveraging BERT's contextual comprehension, security experts have devised novel approaches to address diverse cybersecurity challenges [4, 17, 24].

CodeBERT: Feng et al. [54] introduced CodeBERT, a pretrained model, trained on source codes from diverse programming languages. Through this pretraining, CodeBERT acquires an understanding of both programming languages (PL) and natural language (NL). This model is tailored to support various NL-PL applications, including natural language code search and code documentation generation. Built upon a transformer-based neural architecture, CodeBERT is trained using a specialized objective function that combines the task of detecting replaced tokens with pretraining. This methodology enables the utilization of both NL-PL pairs and unimodal data during training, with NL-PL pairs serving as input tokens and unimodal data enhancing the performance of the generators. CodeBERT's architecture comprises 12 layers and 125 million parameters.

FalconLLM: Falcon 40B and its more advanced counterpart, Falcon 180B, hold significant potential in identifying and managing software vulnerabilities and cybersecurity risks. With their autoregressive decoder-only architecture and impressive parameter counts of 40 billion and 180 billion (180B is trained on 3.5 trillion tokens), FalconLLM and stands out as a strong solution in incident response and recovery systems, as evidenced by its outstanding performance on the Open Source LLM Leaderboard. Leveraging advanced language comprehension capabilities, FalconLLM meticulously analyzes textual logs and incident reports, extracting relevant details and identifying patterns indicative of potential threats. Additionally, FalconLLM excels in evaluating the severity and potential impact of incidents, providing customized mitigation strategies and recovery plans to response teams. Its adaptive learning mechanism continuously integrates new data, enabling retrospective analysis of past incidents and iterative enhancement of response procedures. This proactive and iterative approach empowers organizations to mount quicker, more effective responses to cyber threats, thus minimizing potential damages [4].

SecureFalcon: Researchers trained a large language model called FalconLLM40B using an extensive training procedure on 384 GPUs (A100 40GB), giving rise to SecureFalcon, an innovative model architecture constructed upon the foundation of FalconLLM. The model’s training procedure incorporated a 3D parallelism strategy that involved tensor parallelism of 8 (TP=8), pipeline parallelism of 4 (PP=4), and data parallelism of 12 (DP=12). They employed specific hyperparameters to optimize the model’s learning process, balancing efficiency and precision [7].

OpenAI's GPT: GPT stands for "generative pre-trained Transformer," which is a family of Transformer-based models developed by OpenAI. These models have demonstrated remarkable performance across various language-related tasks. ChatGPT, a commercial product by OpenAI, is based on this family of large language models (LLMs). GPT-4, the latest model in this series, has achieved exceptional performance, surpassing humans in standardized exams and outperforming other models in academic benchmarks. ChatGPT, despite being trained primarily on human language, is capable of generating valid source code and assisting in debugging tasks, highlighting its versatility. Traditional rule-based static code analyzers, while effective in identifying software vulnerabilities, can sometimes miss nuanced or evolving threats due to their rule-based nature. Large Language Models (LLMs), such as OpenAI's ChatGPT, offer a novel approach to addressing this challenge by leveraging vast textual data to understand and generate code, potentially improving the identification and rectification of software vulnerabilities. [9, 13, 17, 24, 28, 38, 40].

RQ2 Answer: Various types of LLMs can be utilized in software vulnerabilities and cybersecurity threats detection include BERT, GPT, RoBERTa, XLNet, ALBERT, T5, BART, ELECTRA, Longformer, CodeBERT, GraphCodeBERT, CodeT5, PLBART, FalconLLM, SecureFalcon, and OpenAI's GPT family. These LLMs demonstrate proficiency in tasks such as vulnerability detection, malware classification, code summarization, and code generation, contributing significantly to cybersecurity research and practice.

7.3 RQ3. How can LLMs be used to detect and handle software vulnerabilities and cyber security threats?

We have discussed applications of LLMs in detecting and handling software vulnerabilities in Section 6. Now, in this section, through a literature review of past works, we will show you some examples where LLMs have been used for detection and handling of software vulnerabilities.

Authors in [4] have use SecurityBERT and FalconLLM parallel to each other to create a new cyber threat detection model called, SecurityLLM. SecurityBERT operates as a cyber threat detection mechanism, while FalconLLM is an incident response and recovery system. The integration of FalconLLM and Security-BERT, two distinct techniques, can improve the identification of network-based threats. SecurityLLM model can identify fourteen (14) different types of attacks with an overall accuracy of 98%.

Researchers have employed large language models, specifically, GPT3.5, to enhance the efficacy of penetration testing endeavors. This study focuses on two primary applications: leveraging these models for high-level strategic planning in security assessments and utilizing them for finding weak spots a simulated vulnerable computing environment. In the latter context, they have established a closed-feedback loop between the actions generated by the large language model and the vulnerable virtual machine accessed via Secure Shell (SSH). This framework enables the model to scrutinize the state of the virtual machine for vulnerabilities and propose specific attack vectors, subsequently executed automatically within the virtual environment [5].

SecureFalcon, a new model based on a fine-tuned version of FalconLLM, which can effectively differentiate vulnerable and non-vulnerable C code samples. Tested on a unique dataset Form AI, SecureFalcon achieved an impressive 94% accuracy rate, demonstrating its effectiveness. This approach not only reduces false positives compared to traditional static analysis methods, but also offers a groundbreaking solution for software vulnerability detection. Notably, SecureFalcon accomplishes this with a relatively small number of parameters (121 million and 44 million) [7].

Finding and fixing bugs in software code is a time-consuming task for developers, and automated program repair (APR) techniques aim to lessen this burden. Researchers in [8] proposes a new approach that builds on LLM-based repair techniques. It leverages a recently developed interactive prompting method called Tree of Thoughts (ToT). request a Large Language Model (LLM), GPT-4, to suggest various potential locations for a software bug. By gathering and analyzing the model's collective responses, they then prompt it to provide suggestions for fixing the identified bugs. An initial assessment indicates that their method successfully resolves several intricate bugs that were previously unsolved by GPT-4, even when considering prompt customization.

Authors of [9] examined the effectiveness of Large Language Models (LLMs), specifically focusing on OpenAI's GPT-4, in identifying software vulnerabilities compared to traditional static code analyzers like Snyk and Fortify. The evaluation encompassed various repositories, including those from NASA and the Department of Defense. Findings revealed that GPT-4 detected approximately four times more vulnerabilities than its counterparts and proposed viable solutions for each, demonstrating a low false positive rate. Analysis of 129 code samples across eight programming languages highlighted PHP and JavaScript as having the highest vulnerability rates. GPT-4's suggested code modifications resulted in a significant 90% decrease in vulnerabilities, with a minimal 11% increase in code lines. Notably, the study emphasized LLMs' capability for self-auditing and providing potential fixes for identified vulnerabilities, underscoring their precision.

In [10] the authors goal was to investigate the potential of these LLMs for zero-shot vulnerability repair. They used several Large Language Models (LLMs) for their experiments. Specifically, OpenAI’s Codex and AI21’s Jurassic J-11. They also evaluated the performance of five commercially available, black-box, “off-the-shelf” LLMs, as well as an open-source model and their own locally-trained model. LLMs exhibited promise in addressing real-world code vulnerabilities, successfully repairing projects, a performance comparable to that of the state-of-the-art repair tool ExtractFix. However, the quality of repairs varied, with some fixes effectively resolving bugs while others introduced new issues or appeared implausible. Manual inspection of highly-rated repairs indicated that a notable portion of the "successful" fixes might be unreliable. Despite operating in a zero-shot setting without specific training for repair tasks and limited context from prompts, LLMs performed admirably, even outperforming ExtractFix in certain scenarios. However, LLMs faced challenges when addressing vulnerabilities requiring extensive code changes or complex semantic modifications, highlighting limitations in their understanding and context comprehension.

The authors in [12] propose leveraging Large Pre-Trained Language Models (PLMs) for Automated Program Repair (APR) to address limitations in traditional and learning-based APR techniques. They emphasize the potential of PLMs, trained on vast amounts of text/code tokens, to generate patches without relying on bug-fixing datasets. The study evaluates 9 recent state-of-the-art PLMs across different repair settings and programming languages, demonstrating their effectiveness in fixing real-world bugs. The research highlights the scalability of PLMs, with larger models generally achieving better performance. Furthermore, the authors explore the importance of suffix code in infilling-style APR and suggest practical guidelines for improving PLM-based APR, such as increasing sample size and incorporating fix template information.

In their study [13], the authors evaluate four LLMs (GPT-3.5-Turbo and GPT-3.5-Turbo-0613 for direct prompting, Davinci, and Codegen-2B-multi for fine-tuning.) on two software vulnerability (SQL injection and buffer overflow) detection tasks. To simulate a real-world scenario, the researchers designed their experiments such that a developer submits a code excerpt to an LLM, prompting it to identify potential security vulnerabilities. Both fine-tuned and zero-shot models are evaluated to simulate a variety of real-world situations. Table II shows the result of their work. We can clearly see that LLMs outperform other methods, but they also have a higher FPR.

Table II: Comparing different approaches based on the Code Gadget Database [13]

System	Technique	FPR (%)	FNR (%)	TPR (%)	P (%)	F1 (%)
Flawfinder	Static Analysis	44.7	69.0	31.0	25.0	27.7
RATS	Static Analysis	42.2	78.9	21.1	19.4	20.2
Checkmarx	Static Analysis	43.1	41.1	58.9	39.6	47.3
VulDeePecker	Deep Learning	5.7	7.0	93.0	88.1	90.5
CodeGen	LLM	68.67	32	68	49.75	57.46
Davinci	LLM	62.67	6	94	60	73.23
Ensembled CodeGen+Davinci	LLM	74.22	3.96	96.04	57.4	71.85

Authors of [14] tried to tackle the challenge of detecting logic vulnerabilities in smart contracts, which have resulted in significant financial losses. They identify a gap in existing analysis tools, which struggle to audit about 80% of Web3 security bugs due to a lack of domain-specific property description and checking. To address this issue, the authors propose GPTScan, a tool that combines GPT4 with static analysis for smart contract logic vulnerability detection. Unlike existing approaches that solely rely on GPT and suffer from high false positives, GPTScan utilizes GPT4 as a versatile code understanding tool. It breaks down each logic vulnerability type into scenarios and properties, enabling GPTScan to match candidate vulnerabilities with GPT4 and instruct it to intelligently recognize key variables and statements. Evaluation on diverse datasets demonstrates that GPTScan achieves high precision and recall rates, effectively detecting ground-truth logic vulnerabilities, including new ones missed by human auditors. Furthermore, GPTScan is shown to be fast, cost-effective, and capable of reducing false positives through static confirmation. We will explain the workflow of GPTScan in the next section.

In [15], the authors used five popular Large Language Models of Code (LLMCs) with representative pre-training architectures. These models include: CodeBERT, GraphCodeBERT, PLBART, CodeT5, and UniXcoder. The authors used these models in the context of Automated Program Repair (APR). They considered three typical program repair scenarios involving three programming languages (Java, C/C++, and JavaScript). They took into account both single-hunk and multi-hunk bugs/vulnerabilities. The LLMCs were fine-tuned on widely-used datasets (BFP, SequenceR, CPatMiner, VulRepair, and TFix) and compared with existing state-of-the-art APR tools. The authors also investigated the impact of different design choices, which include code abstractions, code representations, and model evaluation metrics.

The study found that LLMCs in the fine-tuning paradigm can significantly outperform previous state-of-the-art APR tools. The authors provided insights into choosing appropriate strategies to guide LLMCs for better performance. They also revealed several limitations of LLMCs for APR and made suggestions for future research on LLMC-based APR.

Researchers in [17, 24] Created VulDetect, which is developed as a classification model based on the large language model GPT-2. utilize a significantly large dataset and explore various model architectures, including MegatronBERT and GPT-2. They also incorporate code gadgets for input data, instead of removing labels and comments from the source file.

Authors of [53] utilized the BERTBase model for detecting software vulnerabilities. They fine-tuned it using a dataset comprising 100,000 C/C++ source files and evaluated its performance with 123 vulnerabilities. Comparing it to standard LSTM and BiLSTM models, they found that the BERTBase model and BERT with RNN heads surpassed the performance of the conventional models. Their dataset and model achieved the highest detection accuracy of 93.49%.

Researchers in [18] address the longstanding goal of repairing software bugs using automated solutions, particularly focusing on the task of vulnerability repair. They note that while some automated program repair (APR) tools leverage natural language processing (NLP) techniques, the significant differences between natural languages (NL) and programming languages (PL) may hinder their effectiveness in handling PL tasks. Moreover, existing tools primarily focus on bug repair tasks, with limited exploration into vulnerability repair. To tackle these issues, the authors propose leveraging large-scale pre-trained PL models, such as CodeBERT and GraphCodeBERT, specifically tailored for vulnerability repair based on PL characteristics.

The authors explore the real-world performance of state-of-the-art data-driven approaches for vulnerability repair using these pre-trained PL models. Their approach involves fine-tuning the pre-trained models for vulnerability repair tasks, allowing them to better capture PL features and handle multi-line vulnerability repair scenarios. Through their experimentation, they demonstrate that their approach achieves advanced results, with high accuracy rates for both single-line and multi-line vulnerability repair tasks. Specifically, their solution achieves a maximum accuracy of 95.47% for single-line vulnerability repair and 90.06% for multi-line vulnerability repair. They also evaluate their approach across various types of vulnerabilities, including CWE-121/190/369/401/457, and compare its performance with existing APR tools such as Tufano et al., DLFix, SequenceR, and CoCoNut, highlighting its effectiveness and generalization capabilities.

Researchers in [21] address the evolving cyber-attack landscape faced by enterprises, caused by new vulnerabilities and attack techniques. They emphasize the necessity for security management tools to accurately assess cyber-risks by identifying associations among attack techniques, weaknesses, and vulnerabilities. Existing repositories often lack completeness and rely on manual interpretations, which are slow and ineffective. To address these challenges, the authors propose a framework called VWC-MAP (Vulnerabilities and Weakness to Common Attack Pattern Mapping). VWC-MAP leverages natural language processing (NLP) techniques to automatically associate vulnerabilities with relevant attack techniques based on their textual descriptions. The framework employs a two-tiered classification approach, classifying vulnerabilities to weaknesses and weaknesses to attack techniques.

The authors introduce novel automated approaches for mapping weaknesses to attack techniques, utilizing Text-to-Text and link prediction techniques. They enhance the scalability of existing tools like V2W-BERT, which maps vulnerabilities to weaknesses, using Distributed Data-Parallel (DDP) technique for faster training. For associating weaknesses to attack patterns, they employ a Text-to-Text model (Google T5) and incorporate link prediction techniques considering the hierarchical relationships of attack patterns. Experimental results demonstrate the effectiveness of VWC-MAP in associating vulnerabilities to weakness-types and weaknesses to new attack patterns with high accuracy. This work contributes a comprehensive automated mapping of CVE-CWE-CAPEC associations, facilitated by large language models, aiming to impact both research and practical applications in cyber-defense.

The author's work in [25] focuses on enhancing software vulnerability detection through deep learning methods, specifically addressing the challenge of detecting vulnerabilities across long code slices with contextual dependencies. They introduced VulD-Transformer, a novel approach utilizing Transformer models tailored for code slice-level vulnerability detection. Unlike previous methods, VulD-Transformer aims to capture remote contextual dependencies between code statements effectively. To achieve this, the authors first extract code slices containing data and control dependencies using vulnerability syntax features and Program Dependency Graphs (PDGs). Then, they design a Transformer-based vulnerability detection model to enhance feature learning, particularly for remote code statements.

The experimental evaluation on synthetic and real datasets demonstrates the effectiveness of VulD-Transformer compared to existing approaches, showcasing improvements in accuracy, recall, and F1-measure, especially for code slices longer than 256 tokens. In terms of performance, the authors states that compared to the VulDeePecker, SySeVR-BGRU, SySeVR-ABGRU, and Russell approaches, VulD-Transformer achieves 6.12%, 8.01%, and 7.63% improvement on average.

Research done in [28] aim to assess the effectiveness of vulnerability detection using ChatGPT 4 by exploring various prompt designs tailored specifically for this purpose. They propose improvements to the basic prompt and incorporate structural and sequential auxiliary information from the source code to enhance ChatGPT's vulnerability detection capabilities. Leveraging ChatGPT's ability to remember multi-round dialogue, they introduce a chain-of-thought prompting approach to further improve detection performance. The study involves extensive experimentation on two vulnerability datasets, where they evaluate the effectiveness of prompt-enhanced vulnerability detection using ChatGPT. Additionally, the authors analyze the strengths and weaknesses of using ChatGPT for vulnerability detection, providing insights into the potential of prompt engineering for large language models (LLMs) in this domain. The paper outlines a workflow starting from prompt design enhancements to experimental validation and concludes with discussions on the implications and validity threats of their findings.

The study presents several key findings regarding the performance and capabilities of ChatGPT in vulnerability detection. Firstly, it demonstrates that ChatGPT outperforms two baseline methods (CFGNN [71] and Bugram [32]) in terms of both accuracy and coverage, indicating its effectiveness in identifying vulnerabilities within code snippets.

The inclusion of a task role in the prompt has shown potential to enhance ChatGPT's performance in vulnerability detection, albeit with programming-language-specific improvements. However, the use of a simple basic prompt leads to ChatGPT's response being biased towards the keywords present in the prompt, affecting its ability to provide comprehensive vulnerability detection, especially in C/C++ programs.

The study reveals that while ChatGPT exhibits better proficiency in identifying vulnerabilities in Java programs compared to C/C++ programs with the basic prompt, it struggles to comprehend vulnerabilities comprehensively across both languages. The effectiveness of incorporating different auxiliary information varies between programming languages, with API calls being more effective for Java functions and data flow information contributing slightly to the understanding of C/C++ vulnerable programs. Furthermore, the application of chain-of-thought prompting yields differing effects on Java and C/C++ datasets, with significant improvements observed in the latter but a degradation in detection performance noted in the former. Despite this, ChatGPT demonstrates accurate understanding of code functionality in vulnerability detection scenarios.

Additionally, augmenting prompts with high-quality code summaries has been found to enhance ChatGPT's detection performance, although the impact varies depending on the programming language. Moreover, strategically placing API calls before the code and data flow information after the code has been shown to improve performance, with API call information contributing more to correct predictions of non-vulnerable samples and data flow information aiding in accurate predictions of vulnerable samples.

Lastly, ChatGPT exhibits proficiency in detecting vulnerabilities related to grammar or certain boundary-related types, but struggles with types that are contextually irrelevant or require a deeper understanding of the context. Overall, these findings highlight both the strengths and limitations of ChatGPT in vulnerability detection and offer insights into optimizing its performance through prompt design and auxiliary information incorporation.

In [64] The authors conducted a thorough survey of a wide range of LLMs (include GPT-4, Gemini 1.0 Pro, Wizard Coder, Code LLAMA, GPT-3.5, Mixtral-MoE, Mistral, StarCoder, LLAMA 2, StarChat-β, and MagiCoder) and prompts in various scenarios of vulnerability detection. They analyze a larger number of LLM responses with multiple raters compared to previous studies. Secondly, they investigated whether LLMs can accurately identify the types, locations, and causes of vulnerabilities, akin to industry-standard static analysis-based detectors. Additionally, they pinpoint the capabilities and code structures with which LLMs struggle.

The study conducted a comprehensive assessment of Language Model-based vulnerability detection, focusing on both its performance and error analysis. Results revealed that the models exhibited only marginal improvement over random guessing, with balanced accuracy ranging from 0.5 to 0.63. Notably, the models struggled to distinguish between buggy and fixed code versions, often making identical predictions for 76% of pairs. To address these issues, the paper introduced CoT, a method combining Static Analysis and Contrastive pairs, which showed enhancements in certain model performances. Error analysis indicated that Language Models frequently made errors in Code Understanding, Common Knowledge, Hallucination, and Logic when explaining vulnerabilities, with 57% of responses containing errors. Additionally, when subjected to complex debugging tasks from DbgBench, Language Models significantly underperformed compared to humans, accurately identifying only 6 out of 27 bugs. These findings underscore the limitations of Language Models in vulnerability detection, and the dataset of errors identified offers insights for potential enhancements in future Language Model-based vulnerability detection methods.

RQ3 Answer: Large language models (LLMs) are employed in diverse ways to detect and handle software vulnerabilities and cybersecurity threats. Examples include the creation of new cyber threat detection models like SecurityLLM or SecureFalcon for differentiating vulnerable code samples, and leveraging LLMs such as GPT-4 for automated program repair. These approaches demonstrate the versatility and effectiveness of LLMs in enhancing cybersecurity practices.

RQ4. What is the workflow of an LLM model in the context of detecting and handling software vulnerabilities and cyber security threats?

The potential for leveraging Large Language Models (LLMs) in the domain of software security and vulnerability is substantial. To enhance our readers' understanding, we will elucidate the workflow of several researchers who have employed LLMs for this specific purpose.

Automated code repair framework: The authors in [2], introduced a novel model that merges the capabilities of Large Language Models (LLMs) with Formal Verification strategies, presenting an automated code repair framework illustrated in Fig. 2. In this method, users input a test code to the Bounded Model Checker (BMC) module for initial verification or falsification. If the initial verification proves unsuccessful, the original code, along with details of the property violation generated by the BMC module, is transferred to the LLM module. The LLM module then generates modified code, which undergoes another round of verification by the BMC module in an iterative fashion. This collaborative process enables the model to automatically verify and repair software vulnerabilities, integrating formal verification techniques with the capabilities of Large Language Models.

Assessing over 1000 specifically generated C programs for this study, the results demonstrate that the integration of Bounded Model Checking (BMC) and Large Language Models (LLM) effectively identifies software vulnerabilities and suggests corrective patches. Notably, the devised approach exhibits the ability to rectify vulnerable code, addressing issues such as buffer overflow and pointer dereference failures with a commendable success rate of up to 80%.

SecurityLLM: In figure 3 we have the workflow of SecurityLLM [4]. It involves two primary components: SecurityBERT and FalconLLM. Initially, SecurityBERT functions as a cyber threat detection mechanism by collecting cybersecurity data from various open-source databases and repositories. It then extracts relevant features from network traffic logs, transforms the data into textual representation using Fixed-Length Language Encoding (FLLE), and tokenizes it using ByteLevelBPETokenizer. Subsequently, SecurityBERT embeds the data using a transformer-based architecture, leveraging self-attention mechanisms to capture contextual representations of the text. Meanwhile, FalconLLM serves as an incident response and recovery system, integrating with SecurityBERT to enhance the identification of network-based threats. Together, these techniques enable SecurityLLM to detect and manage cybersecurity threats effectively.

GPT-4 as a code analyzer: Authors of [9] utilized the latest OpenAI models, particularly GPT-4, accessed through a chat interface with a system context set to “act as the world's greatest static code analyzer for all major programming languages. I will give you a code snippet, and you will identify the language and analyze it for vulnerabilities. Give the output in a format: filename, vulnerabilities detected as a numbered list, and proposed fixes as a separate numbered list.” Seven different LLMs from OpenAI are employed, ranging in parameter sizes from 350M to 1.7 trillion, covering a wide spectrum of capabilities. The models are queried automatically using the API to identify vulnerabilities and propose fixes in sample code snippets across eight popular programming languages (C, Ruby, PHP, Java, Javascript, C#, Go, and Python). Additionally, a Single Codebase of Security Vulnerabilities is utilized, consisting of 128 code snippets representing thirty-three vulnerable categories across different programming languages. Six public repositories from GitHub are submitted to the automated static code scanner, Snyk, to illustrate identifiable vulnerabilities and language problems addressed by LLMs. Snyk provides comprehensive vulnerability intelligence metrics for evaluation. The study concludes by submitting corrected code samples from GPT-4 to Snyk for comparison against the vulnerable codebase, aiming to assess the self-correction capabilities of LLMs objectively validated by a third-party static code scanner.

The study's results indicated that both the static code analyzer (HP Fortify) and the Large Language Model (LLM) OpenAI's GPT-4 2023AUG3 version successfully identified three vulnerabilities in an Objective-C method. However, the LLM provided a detailed explanation of the vulnerabilities and proposed three fixes for each, enhancing the understanding and mitigation process. Comparing vulnerability detection between GPT-3 and Snyk, GPT-3 identified significantly more vulnerabilities, with a low false positive rate observed. Furthermore, GPT-4 identified nearly twice as many vulnerabilities as GPT-3 and four times as many as Snyk, indicating its effectiveness in vulnerability detection. GPT-4 proposed a comparable number of code fixes to the identified vulnerabilities, supporting its reliability in addressing security flaws.

The first step in the GPTScan model is to break down each logic vulnerability type into scenarios and properties. This involves understanding the context and characteristics of different vulnerabilities, which can range from reentrancy to timestamp dependence. Once these vulnerabilities are broken down, GPTScan then matches these candidate vulnerabilities with the Generative Pre-training Transformer (GPT).

GPTScan: In figure 4 we can see a high-level overview of GPTScan workflow [14]. GPTScan's workflow begins with a thorough analysis of the smart contract project. It first parses the project, which can consist of standalone Solidity files or complex frameworks containing multiple Solidity files. Through call graph analysis, GPTScan identifies the functions that are reachable within the project, considering both direct accessibility and potential indirect access via other functions.

Once the candidate functions are identified, GPTScan employs a multi-dimensional filtering approach to narrow down the functions for further analysis. This filtering process is essential to manage the complexity of large projects and to focus on functions that are most likely to contain vulnerabilities. It includes project-wide file filtering, which excludes non-Solidity files and third-party library files, and filtering out functions from common libraries like OpenZeppelin to reduce false positives.

After the initial filtering, GPTScan matches candidate functions with pre-abstracted scenarios and properties of relevant vulnerability types using Generative Pre-training Transformer (GPT). Unlike existing approaches that rely on high-level vulnerability descriptions, GPTScan breaks down vulnerabilities into code-level scenarios and properties. This approach enables GPT to interpret code-level semantics directly, improving the accuracy of vulnerability detection.

Once potential vulnerabilities are identified through GPT matching, GPTScan proceeds to recognize key variables and statements within the matched functions using GPT. These variables and statements are then subjected to static analysis modules for further validation. The static analysis tools employed by GPTScan include methods such as static data flow tracing, value comparison checks, order checks, and function call argument checks. These techniques help confirm the existence of vulnerabilities by analyzing the data flow, value comparisons, execution order, and function call arguments within the code.

Throughout the workflow, GPTScan addresses three main challenges: handling complex project structures, enabling effective GPT recognition, and ensuring reliable confirmation of potential vulnerabilities. By employing multi-dimensional function filtering, breaking down vulnerabilities into scenarios and properties, and utilizing static confirmation techniques, GPTScan achieves high precision in detecting logic vulnerabilities in smart contracts.

The evaluation of GPTScan reveals a low false positive rate of 4.39% when analyzing non-vulnerable top contracts like Top200. Additionally, it demonstrates a precision of 90.91% when assessing DefiHacks, indicating its suitability for extensive scanning of on-chain token contracts. Moreover, even when scrutinizing sizable contract projects within Web3Bugs, GPTScan maintains a satisfactory precision of 57.14%. These results, presented in Table III, provide insights into GPTScan's performance in identifying false positives and its precision across various contract datasets.

Table III: GPTScan False Positive Rate Analysis Results

Dataset Name	TP	TN	FP	FN	Sum
Top200	0	283	13	0	296
Web3Bugs	40	154	30	8	232
DefiHacks	10	19	1	4	34

LLMs for APR: The workflow for fine-tuning LLMs for Automated Program Repair (APR) [15], as shown in figure 5, involves a series of steps aimed at optimizing the model's ability to understand and generate code fixes. Initially, data pre-processing transforms raw source code into a format suitable for LLM processing, employing techniques like code abstraction and different code representations to enhance the model's understanding of fixing patterns. Subsequently, model training and tuning extend LLMs into the Neural Machine Translation (NMT) architecture, focusing on encoder-only and encoder-decoder models for their superior performance. Through iterative training on the dataset, the model learns domain knowledge for defect repair. During this process, checkpoints are evaluated using various metrics to identify the best-performing model for patch generation.

Following model evaluation, the patch generation phase employs the beam search strategy to synthesize patches from multiple repair models. Plausible patches are filtered using test cases, and manual validation is conducted to assess the correctness of the generated patches. This validation process helps ensure the accuracy of the generated fixes. Overall, this workflow aims to systematically explore preprocessing techniques, model architectures, evaluation metrics, and patch generation strategies to enhance the LLM's capability for automated program repair.

The paper demonstrates that Large Language Models of Code (LLMCs) exhibit strong repair capabilities across various scenarios under the Neural Machine Translation (NMT) fine-tuning paradigm. Even without employing post-processing strategies, LLMCs achieve impressive results, surpassing many existing Automated Program Repair (APR) approaches. The study provides practical guidelines for optimizing LLMC designs to enhance their repair capabilities and addresses complex defects effectively. Despite these successes, the paper also identifies limitations during evaluation, highlighting areas for improvement and suggesting future research directions. Additionally, the results presented in the paper establish valuable benchmarks for subsequent research in LLMC-based APR, underscoring the significant potential of LLMCs for practical application in software repair tasks.

VulDetect: In figure 6 we have the workflow of VulDetect [17, 24]. The authors introduce, a classification model designed for automatic software vulnerability detection, leveraging the extensive language model GPT-2. VulDetect employs a fine-tuned GPT-2 model to identify vectors associated with vulnerable code segments extracted from the target source code. Their Natural Language Processing (NLP) vulnerability model accepts a lengthy character string as input, typically a C source file. In the subsequent step, a tokenizer partitions the string into individual words and sub-words. Notably, syntax characters such as periods, semicolons, parentheses, and brackets are treated as distinct entities. Following this, the encoder transforms these words into vector representations. These vectorized words are then inputted into the model either sequentially (token by token) or in larger segments. In the implementation of this vulnerability detection technique, the output vector corresponds to the number of vulnerability classes present in the training dataset. Specifically, with 124 unique vulnerability classes, the output vector has a dimension of 124. This output vector is further processed through a Softmax function, which normalizes it into a probability distribution where the sum of all probabilities equals 1. Each element within the output vector indicates the predicted probability of the corresponding vulnerability class being present in the analyzed code file.

In the defense performance evaluation, the researchers opted to assess the efficacy of their technique by conducting tests on three classifiers (GPT-2, CodeBERT, and LSTM) using two standardized benchmark datasets (SARD and SeVC). The objective was to gauge the ability of their technique in detecting vulnerabilities. Results presented in Table IV illustrate a notably higher classification accuracy when implementing the VulDetect technique. Notably, the GPT-2 classifier demonstrated the highest accuracy, reaching up to 92.59% when tested on the SARD dataset, while the LSTM classifier exhibited the lowest accuracy, achieving only 65.78% when evaluated on the SeVC dataset. Overall, GPT-2 consistently outperformed the other architectures (CodeBERT and LSTM), which is in line with its reputation as a transformer-based model renowned for achieving state-of-the-art performance in various linguistic tasks, including sentiment analysis and sentence classification.

Table IV: Comparison of Model Detection Accuracy on SeVC and SARD Datasets.

Dataset	Model	Detection Accuracy
SeVC	GPT-2	87.63
	CodeBERT	83.23
	LSTM	71.59
SARD	GPT-2	92.59
	CodeBERT	89.28
	LSTM	76.86

Vulnerability repair solution: The workflow of vulnerability repair solution created in [19] is shown in Fig. 7. It employs pre-trained programming language (PL) models for automated vulnerability repair. Initially, the vulnerability code and its corresponding fixed code are extracted as bug-fix pairs (BFPs), serving as training data. Subsequently, the pre-trained PL model undergoes fine-tuning for the downstream task of vulnerability repair. Throughout the experiments, an optimal model is identified based on the evaluation using BLEU scores. In the experimental testing phase, the vulnerability code, comprising the code at the vulnerability location along with its contextual content, is inputted to enable the model to generate the predicted fix code.

Due to constraints regarding space and time, the authors opt to employ BERT-style pre-trained models for their current experiments. Initial investigations into these models reveal that not all BERT-style variants align with the automated program repair (APR) task at hand. For instance, CuBERT, trained solely on the Python corpus, proves unsuitable for the authors' C-based test dataset. Eventually, CodeBERT and GraphCodeBERT emerge as potential candidates for the APR task. Both models support multiple programming languages and offer downstream tasks akin to vulnerability repair. Notably, GraphCodeBERT integrates data flow information, enhancing its ability to manage intricate code logic relationships. This feature facilitates a comparative analysis between GraphCodeBERT and CodeBERT, enabling an evaluation of the advantages conferred by data flow features.

The experimental outcomes presented in table V, demonstrate the effectiveness of different repair solutions in addressing vulnerabilities. The repair accuracy, representing the repair capability of each solution, is analyzed alongside the percentage of successfully repaired vulnerabilities out of the total number. Notably, SequenceR emerges as the top performer in single-line vulnerability repair, closely followed by the authors' approach, which achieves comparable results. Specifically, CodeBERTFix and GraphCodeBERTFix exhibit high repair accuracy rates of 95.47% and 94.04% on single-line repairs, respectively. Moreover, the authors' approach excels in multi-line vulnerability repair, surpassing the performance of other solutions such as DLFix, SequenceR, CoCoNut, and the approach by Tufano et al. GraphCodeBERT, leveraging data flow graph information, demonstrates superior capability in multi-line repair compared to CodeBERT.

Table V: Experiment results

CWE 121

CWE 190

CWE 369

CWE 401

CWE 457

ALL

Single

Multi

Single

Multi

Single

Multi

Single

Multi

Single

Multi

Single

Multi

Tufano et al.

69/198

256/451

136/627

124/412

31/102

7/74

37/54

64/134

111/144

30/46

384/1125

481/1117

DLFix

70/198

0/451

351/627

0/412

33/102

0/74

48/54

0/134

76/144

0/46

578/1125

0/1117

SequenceR

183/198

0/451

623/627

0/412

100/102

0/74

51/54

0/134

136/144

0/46

1093/1125

0/1117

CoCoNut

176/198

0/451

410/627

0/412

82/102

0/74

49/54

0/134

112/144

0/46

829/1125

0/1117

CodeBERT

178/198

395/451

620/627

333/412

97/102

54/74

52/54

90/134

127/144

34/46

1074/1125

906/1117

GraphCodeBERT

181/198

410/451

600/627

379/412

91/102

60/74

48/54

124/134

138/144

33/46

1058/1125

1006/1117

VWC-MAP: The VWC-MAP [21] (Vulnerabilities-Weakness-Common Attack Pattern Mapping) is a two-tiered framework designed for automated classification of vulnerabilities into attack patterns via weaknesses based on their text descriptions.

First Tier - Classifying Vulnerabilities to Weakness: In the first tier, the model classifies vulnerabilities to weaknesses. This means it maps a given vulnerability to the corresponding weakness type. This is done by applying natural language processing (NLP) techniques to the text descriptions of the vulnerabilities.

Second Tier - Classifying Weakness to Attack Techniques: In the second tier, the model classifies weaknesses to attack techniques. This means it maps a given weakness to the corresponding attack technique. This is also done by applying NLP techniques to the text descriptions of the weaknesses.

The VWC-MAP framework uses CWE as an intermediate step because most CAPECs focus on exploiting CWEs, while CVEs are real-world instances of CWEs.

The authors have also presented two novel automated approaches for mapping weakness to attack techniques by applying Text-to-Text and link prediction techniques.

Link Prediction network: The Link Prediction network (shown in figure 8) in the VWC-MAP framework employs a Siamese architecture of a Neural Network. It begins by taking TF-IDF vectors of the CWE and CAPEC as input features. These features are then transformed into new dimensional vectors by a Feature Transformer Network. The transformed vectors are combined using a concatenation of feature subtraction and multiplication operations. The combined representation is then fed into a Link Classifier Network, which makes a binary prediction about the associations. Although pre-trained language models like BERT could be used as the Feature Transformer Network, the authors found that a basic Neural Network model quickly overfits the given data based on a few keywords. Therefore, it does not provide any extra benefit with the added complexity of BERT.

Text-to-text model: The task of mapping CWE to CAPEC can be conceptualized as a text generation challenge. The text-to-text model, known as Google’s T5, offers an alternative to link prediction for CWE-CAPEC mapping. While link prediction methods depend on negative examples and keyword-based decisions, T5 leverages transfer learning capabilities to generate CAPEC descriptions directly from CWE text descriptions. This task is more challenging as it requires generating entire sequences of CAPEC text, demanding a deeper understanding than keyword-based decision-making.

During training, T5 utilizes attention networks to prioritize keywords and understand their relationship with CWEs and CAPECs. Handling many-to-many relationships between CWEs and CAPECs presents a challenge, which is addressed by incorporating special commands during training. These commands, such as 'One Weakness to Attack' and 'Two Weakness to Attack', aid in modeling the relationships between CWEs and multiple CAPECs. Additionally, relationships between CWEs themselves are modeled using commands like 'Weakness Child of Weakness'. During inference, commands are passed along with CWE descriptions to generate corresponding CAPEC descriptions. The generated texts are then vectorized and compared with existing CAPECs to find the best match using cosine similarity. The T5 model's capabilities extend to few-shot learning and allow for the generation of CAPEC definitions corresponding to user-specified CWE descriptions.

The experimental results, which were cross-validated by cybersecurity experts, demonstrate that VWC-MAP can associate vulnerabilities to weakness types with up to 87% accuracy, and weaknesses to new attack patterns with up to 80% accuracy.

In summary, VWC-MAP is a novel framework that automatically maps CVEs to CWEs and CAPECs. This is highly beneficial for cyber risk management tools that require automated association among CVEs, CWEs, and CAPECs to cope with the rapid emergence of new vulnerabilities, weaknesses, or attack techniques.

VulD-Transformer: VulD-Transformer [25], as shown in Figure 9, is a comprehensive framework for source code vulnerability detection. It comprises four main components: the input module, code parser, vulnerability detector, and output module. Each part plays a crucial role in the detection process.

1. Input Module: This module accepts the source code as input and initiates the detection process.

2. Code Parser: The code parser processes the source code according to specific syntax rules to generate code slices, which are segments of code containing potential vulnerabilities. This process involves several steps:

2.1. Generating a program dependency graph (PDG) using the Joern tool.

2.2. Identifying vulnerability candidates within the PDG based on predefined syntax rules.

2.3. Generating code slices from the identified vulnerability candidates.

2.4. Cleaning and normalizing the code slices to remove irrelevant characters and standardizing identifiers and variable names.

3. Vulnerability Detector: The vulnerability detector assesses whether the code slices contain vulnerabilities. It includes two main operations:

3.1. Vector conversion: Utilizing FastText, the code slices are converted into vectors. FastText's character-level embedding allows capturing subtle relationships between identifiers in the source code, enhancing the detection accuracy.

3.2. Vulnerability detector: The vulnerability detector learns the vector representation of the code slices, obtains the global features of the code slice, and obtains the final vulnerability detection result at the last layer. The vulnerability detection model leverages the Transformer architecture to learn vector representations of the code slices. Only the encoding part of the standard Transformer structure is used in the vulnerability detection task in this model. The decoder, which belongs to the generative model often used in natural language generation, is not used.

This model comprises several key components:

3.2.1. Position encoding: Adding positional information to the code slice vectors to preserve their sequential order.

3.2.2. Multi-head attention mechanism: Capturing long-distance dependencies between code tokens by computing mutual attention.

3.2.3. Feed-forward layer: Consisting of fully connected layers with ReLU activation to extract higher-level features from the code slice representations.

3.2.4. Add & Norm layer: Adding and normalizing the output of the attention and feed-forward layers.

3.2.5 Fully connected layer with softmax activation: Producing the final vulnerability detection results, categorizing code slices as either non-vulnerable or vulnerable.

4. Output Module: This module presents the vulnerability detection results to the user, indicating whether vulnerabilities are detected in the analyzed source code.

The framework's effectiveness relies on its ability to accurately generate code slices and leverage advanced deep learning techniques, such as the Transformer architecture, to detect vulnerabilities within these slices. By integrating techniques for code parsing, vectorization, and vulnerability detection, VulD-Transformer offers a comprehensive solution for identifying vulnerabilities in source code, contributing to enhanced software security.

The authors investigated the effectiveness of VulD-Transformer for vulnerability detection and whether incorporating FastText word vectors and a Transformer encoder improves its performance. To assess this, they employed evaluation metrics including Accuracy (A), Recall (R), and F1-measure (F1).

In RQ1, experiment 1 assessed vulnerability detection across various code slice lengths, where VulD-Transformer excelled, particularly in longer slices (> 128 tokens), showcasing its adeptness at learning contextual information. Experiment 2 investigated vulnerability detection for different syntax rules, with VulD-Transformer showcasing superior performance, notably on AE and PU datasets. Experiment 3 tested VulD-Transformer's detection capability on real software vulnerability datasets, achieving higher accuracy, recall, and F1-measure compared to other methods (VulDeePecker, SySeVR-BGRU, SySeVR-ABGRU, and Russell).

In RQ2, the impact of incorporating FastText word vectors and a Transformer encoder was examined, revealing improvements in VulD-Transformer's vulnerability detection. Models utilizing FastText vectors exhibited enhanced accuracy, recall, and F1-score, especially when combined with the Transformer encoder. In conclusion, VulD-Transformer emerges as an effective vulnerability detection method, particularly adept at longer code slices, and its effectiveness is further enhanced by accommodating FastText word vectors and a Transformer encoder.

Using LLM to create dataset: Researchers in [30] introduced the FormAI dataset, a comprehensive collection of 112,000 AI-generated C programs, each labeled with vulnerabilities, to foster research in AI-driven code generation and security. The authors utilize Large Language Models (LLMs), particularly the GPT-3.5-turbo model, to dynamically prompt the generation of diverse C programs, varying in complexity and task types. Leveraging formal verification through the Efficient SMT-based Bounded Model Checker (ESBMC), vulnerabilities within the generated code are identified, labeled, and associated with Common Weakness Enumeration (CWE) numbers. The dataset aims to provide valuable resources for training LLMs and machine learning algorithms while addressing critical concerns regarding the safety and security of AI-generated code. STELOCODER A DECODER-ONLY LLM FOR MULTI-LANGUAGE

Their methodology, shown in Figure 11, involves several steps to construct and classify the FormAI dataset. Initially, GPT-3.5-turbo is prompted to generate C programs for diverse tasks, ranging from complex network management to simple string manipulation. Each output program is then subjected to compilation using the GNU C compiler to ensure compilability. Subsequently, the ESBMC module performs formal verification to detect vulnerabilities within the compiled programs. Detected vulnerabilities, along with their specific details such as line numbers and function names, are recorded in a .csv file, facilitating further analysis. This process ensures that vulnerabilities are conclusively identified, minimizing the risk of false positives and providing a formal counterexample for each vulnerability detected.

7.4 RQ5: What is the best type of data sets to train LLMs for software vulnerability detection?

For detecting and handling software vulnerabilities and cybersecurity threats with LLMs, a combination of text-based and code-based datasets tends to be most effective

Text-Based Datasets: These are crucial for tasks involving bug fixing, code comprehension, and understanding textual content related to vulnerabilities. They aid LLMs in grasping the context around security issues, identifying patterns in security-related texts (such as bug reports or security advisories), and enhancing their ability to generate secure code or identify vulnerabilities in code [1, 7, 12].

Code-Based Datasets: Essential for LLMs to comprehend and analyze code, especially for identifying vulnerabilities within the codebase itself. This type of dataset helps LLMs understand the structure, logic, and potential flaws in software code, enabling them to identify vulnerabilities, suggest fixes, or even generate secure code [1, 61].

By combining these datasets, LLMs can learn to correlate textual information (like security advisories or bug reports) with the actual code, which is crucial in cybersecurity. Understanding the context within which vulnerabilities are reported and how they manifest in code allows LLMs to provide more comprehensive and accurate support in detecting, addressing, and potentially preventing security threats.

7.4.1 Examples of datasets used to train LLMs for software security and cybersecurity purposes.

Data collection for this purpose entails acquiring he training data from various open-source databases and repositories such as CVEfixes, Big-Vul, Draper, SARD, Juliet, Devign, REVEAL, DiverseVul, and many others, encompassing different security aspects [4, 30].

CVE dataset: The Common Vulnerabilities and Exposures (CVE) dataset is a comprehensive vulnerability dataset that is automatically collected and curated from CVE records in the public U.S. National Vulnerability Database (NVD). CVE datasets contain information on publicly disclosed cybersecurity vulnerabilities. These datasets are valuable for researchers, security professionals, and anyone interested in staying up-to-date on the latest threats. As of now, there are 228,713 CVE Records. It lists reported vulnerabilities in software systems.

CVE dataset typically includes:

CVE ID: A unique identifier for the vulnerability assigned by Mitre, the CVE Program authority.
Description: A detailed explanation of the vulnerability, including the affected software, potential impact, and how it can be exploited.
CVSS Score: A scoring system (Common Vulnerability Scoring System) that reflects the severity of the vulnerability based on its exploitability, impact, and scope.
Published Date: The date the vulnerability was publicly disclosed.
References: Links to additional resources such as patches, workarounds, and exploit code.
Affected Products: A list of software programs or systems that are vulnerable to the exploit.

To help researchers, we compiled a list of resources where CVE datasets can be found:

The National Institute of Standards and Technology (NIST)[1]
MITRE CVE (This is the official source for CVE data and provides downloads in various formats).[2]
New CVE official website[3].
Download page in new CVE website[4]. (This page offers downloads of the CVE List in legacy formats until June 30th, 2024, and the new recommended JSON 5.0 format).
CVE Details[5] / (This website offers a searchable CVE database with additional information like exploit details and vendor risk scores).

CWE dataset: The CWE (Common Weakness Enumeration) dataset, sourced from MITRE, focuses on classifying software weaknesses. The CWE provides a provides a standardized way to classify software weaknesses. This is useful to developers, system analysts, software testers, and security researchers. The CWE dataset is organized as a relational database and covers a wide range of vulnerabilities [21].

Each CWE entry provides details about a specific weakness, including:

ID: A unique identifier for the weakness.
Name: A concise description of the weakness.
Description: A more elaborate explanation of the weakness, its potential consequences, and how it can be exploited.
Extended Description: Additional in-depth information about the weakness.
Relationships: Connections to other CWEs and relevant security concepts.
Hierarchies: Organization of CWEs within a structured classification scheme.

CAPEC dataset: The Common Attack Pattern Enumerations and Classifications (CAPEC) dataset, also from MITRE, focuses on classifying cyber-attack patterns. Similar to CWE, CAPEC provides a standardized language for understanding attacker methods. Entries are organized hierarchically and can be linked to show relationships between different attack patterns.

CVEfixes: The CVEfixes dataset was first introduced by Bhandari et al. [63]. This comprehensive vulnerability dataset is automatically collected and curated from Common Vulnerabilities and Exposures (CVE) records in the public U.S. National Vulnerability Database (NVD). It aims to support data-driven security research based on source code and source code metrics related to fixes for CVEs by providing detailed information at different interlinked levels of abstraction, such as the commit-, file-, and method level, as well as the repository- and CVE level. The initial release of the dataset covers all published CVEs up to June 9, 2021, and includes information from 5,495 vulnerability fixing commits in 1,754 open-source projects, covering a total of 5,365 CVEs across 180 different Common Weakness Enumeration (CWE) types. Additionally, the dataset includes the source code before and after fixing for 18,249 files and 50,322 functions1 [13, 15, 63].

DiverseVul: DiverseVul was first introduced by [27], designed for the detection of software vulnerabilities through deep learning techniques. DiverseVul comprises 18,945 vulnerable functions across 155 Common Weakness Enumerations (CWEs) and 330,492 non-vulnerable functions, sourced from 7,514 commits. Notably, this dataset offers greater diversity and is twice the size of the previous largest and most diverse dataset, CVEFixes. Utilizing DiverseVul, the study investigates the efficacy of various deep learning architectures in vulnerability detection. It explores 11 distinct deep learning architectures from four model families: Graph Neural Networks (GNN), RoBERTa, GPT-2, and T5. Findings suggest that the increased diversity and volume of training data contribute to enhanced vulnerability detection, particularly for large language models.

The software assurance reference dataset (SARD): SARD is particularly attractive due to its inclusion of both security vulnerabilities and non-vulnerable alternatives. This feature enables the model to discern between these distinct categories effectively. Subsequently, it is possible to implement a preprocessing stage to eliminate any undesired artifacts that could potentially cause overfitting in the model [7, 17, 22, 24, 27, 28, 30, 39].

Semantics-based Vulnerability Candidate (SeVC): SeVC dataset contains 1,591 C/C++ open-source programs sourced from the National Vulnerability Database (NVD), along with 14,000 open-source programs from SARD. Within this dataset, there are a total of 420,627 SeVCs, among which 56,395 are identified as vulnerable, while 364,232 are deemed non-vulnerable. Additionally, the dataset encompasses four types of SeVCs: Library/API Function Calls, Array Usage, Pointer Usage, and Arithmetic Expression [7, 17, 22, 24].

Devign: The Devign dataset, initially presented in [62], stands as a real-world dataset designed for the identification of software vulnerabilities. It comprises function-level C/C++ source code extracted from two extensively utilized open-source software projects, namely QEMU and FFmpeg. The labeling and verification procedures were conducted manually by a team of security researchers across two distinct rounds [7, 17, 22, 24, 27, 28, 30, 43].

Big-Vul: The Big-Vul dataset is a collection of C/C++ vulnerabilities across several project repositories. It is a comprehensive dataset that includes code changes and Common Vulnerabilities and Exposures (CVE) summaries. Each entry in the dataset covers the period from 2002 to 2019 and consists of 21 features. The dataset is used for various research topics, such as detecting and fixing vulnerabilities, and analyzing the vulnerability-related code changes. It can be particularly useful for training and evaluating models designed for vulnerability detection [34, 73].

RQ5 Answer: For optimal training of Large Language Models (LLMs) in software vulnerability detection, a mix of text-based and code-based datasets is ideal. Text datasets aid in understanding security context, while code datasets help analyze vulnerabilities. Examples include CVE, CWE, CAPEC CVEfixes, DiverseVul, SARD, SeVC, and Devign, enhancing LLMs' effectiveness.

RQ6. In comparison to traditional methods/tools, how do LLMs perform in detecting and handling software vulnerabilities and cyber security threats?

Our research provides compelling evidence of the effectiveness of Large Language Models (LLMs) in detecting and handling software vulnerabilities. Several studies have demonstrated the superiority of LLMs over traditional methods:

SecurityLLM: Integration of SecurityBERT and FalconLLM resulted in the creation of a cyber threat detection model with an overall accuracy of 98%, capable of identifying fourteen different types of attacks.

GPT3.5 for Penetration Testing: Utilizing GPT3.5, researchers enhanced penetration testing by leveraging high-level strategic planning and identifying weak spots in vulnerable computing environments, achieving a closed-feedback loop between model-generated actions and vulnerable virtual machines.

GPT-4 vs. Static Code Analyzers: A study comparing GPT-4 with traditional static code analyzers like Snyk and Fortify found that GPT-4 detected approximately four times more vulnerabilities with a low false positive rate. GPT-4 also provided potential fixes for identified vulnerabilities, resulting in a significant decrease in vulnerabilities with minimal increase in code lines.

BERTBase for Vulnerability Detection: Fine-tuning the BERTBase model for vulnerability detection resulted in surpassing the performance of standard LSTM and BiLSTM models, achieving the highest detection accuracy of 93.49%.

VulDetect with GPT-2: The creation of VulDetect, a classification model based on GPT-2, demonstrated the effectiveness of LLMs when applied to a significantly large dataset, achieving superior performance compared to other model architectures.

GitHub Copilot for Code Generation: Models trained on source code rather than natural language, often referred to as code-based models or Large Language Models of Code (LLMCs), have emerged as powerful tools in software engineering and related fields. These models leverage algorithms and techniques tailored to analyze and understand programming languages and their syntax. By training on vast repositories of code, they acquire an understanding of programming logic, enabling them to generate functional code snippets and even entire programs that meet specified criteria. This capability holds immense promise for automating software development tasks, such as code completion, bug detection, and program synthesis. Furthermore, these code-based models can undergo rigorous testing procedures to ensure the reliability and robustness of the generated code. Researchers are continually advancing these models, exploring new architectures, training methodologies, and applications across various domains within computer science and software engineering.

GitHub Copilot and LaMDA Code are prominent example of large language models highly regarded within the developer community. GitHub Copilot developed in collaboration with OpenAI, leverages the power of OpenAI Codex, a sophisticated AI system trained on a vast dataset of public source code. Furthermore, models primarily trained on human language, exemplified by OpenAI's ChatGPT, demonstrate proficiency in similar tasks [31, 35, 72].

Transformer-Based Language Models (LLMs) vs. Traditional Methods: This study the superior performance of transformer-based language models (LLMs) over code analyzers or traditional static and recurrent neural network (RNN)-based methods was showcases in [22]. software vulnerability detection. Through a systematic evaluation framework, it was demonstrated that LLMs, particularly GPT-2 Large and GPT-2 XL, consistently outperformed BiLSTM and BiGRU models across various metrics, including false positive rate (FPR), false negative rate (FNR), and F1-score, in both binary and multi-class classification tasks. Furthermore, when compared to BERTBase and GPT-2 Base, LLMs exhibited better performance in identifying vulnerabilities across different categories, reinforcing their efficacy in software vulnerability detection tasks. These findings underscore the significance of utilizing LLMs as powerful tools for enhancing the security of software systems by effectively identifying potential vulnerabilities.

However, not all studies demonstrate superior performance from Large Language Models (LLMs). For instance [13] evaluations indicate that Large Language Models (LLMs) struggle in detecting software vulnerabilities. This is primarily due to high numbers of false positives identified by the models. While preprocessing techniques like constructing code gadgets may improve the LLMs' ability to recall actual vulnerabilities (recall rate), the number of false positives remains persistently high. The study also found that LLMs, particularly when fine-tuned, exhibit proficiency in recognizing common patterns associated with vulnerable code. For example, they can effectively identify code patterns using object-relational model libraries that could be susceptible to SQL injection vulnerabilities. Additionally, the authors observed that ChatGPT 4.0 can potentially understand the "intention" behind a given code snippet. They hypothesize that LLMs' strong recall performance might be partly attributed to their ability to identify vulnerable code patterns across multiple lines of code. This is a significant advantage compared to traditional static analysis methods, where manually crafting such rules is a time-consuming and expensive process.

Also, the results of the experiment conducted in [64, 75] showed that while the prompting methods improved the models’ performance, LLMs generally struggled with vulnerability detection. They reported a Balanced Accuracy of 0.5-0.63 and failed to distinguish between buggy and fixed versions of programs in 76% of cases on average.

Over-all, these findings collectively underscore the capability of Large Language Models in detecting and handling software vulnerabilities, outperforming traditional methods such as and recurrent neural network models.

RQ6 Answer: Large Language Models (LLMs) show significant promise in detecting software vulnerabilities compared to traditional methods, offering high accuracy rates and the ability to recognize complex code patterns. Despite some challenges, such as high FPR and FNR, and struggles in distinguishing between buggy and fixed versions, with more advancement in this field, LLMs can be valuable tools for enhancing software security.

7.6 RQ7. What metrics assess large language Models in addressing software vulnerabilities and cyber threats?

Evaluation metrics are crucial for assessing the effectiveness and success of LLMs in software vulnerabilities and cybersecurity applications. These metrics provide a framework for quantifying the performance of these models across various tasks related to software vulnerabilities and cybersecurity threats, such as vulnerability detection, threat prediction, and automated patching. Given the diverse nature of these tasks, employing a range of evaluation metrics tailored to specific problem types is common practice.

Classification Tasks: For tasks involving classification, such as identifying types of vulnerabilities or predicting potential threats, F1-score, Precision, and Recall are commonly used metrics. They gauge the model's ability to classify code snippets accurately or identify specific security properties [1, 4, 17, 18, 22, 27, 36, 37].

Recommendation Tasks: Mean Reciprocal Rank (MRR) is a prevalent metric for recommendation systems related to code completion. Precision@k and F1-score@k are also employed in evaluating the precision and F1-score of recommended code snippets or completions [1]

Generation Tasks: BLEU (including its variants BLEU-4 and BLEU-DC) and Pass@k are widely used metrics for code-to-code translation models and code generation assessment. These metrics evaluate the quality and accuracy of generated code snippets compared to reference solutions. Additionally, other metrics like ROUGE/ROUGE-L, METEOR, EM (Exact Match), and ES (Edit Similarity) are utilized in specific studies to assess the quality of generated code or natural language code descriptions [1, 20, 58].

AUC-ROC: The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is a metric that assesses the model's ability to distinguish between positive and negative cases across different classification thresholds. A higher AUC-ROC value indicates better overall performance.

Code Coverage: In the context of vulnerability detection, code coverage metrics assess the extent to which the LLM has analyzed the source code. Higher code coverage generally implies a more thorough analysis and potentially better vulnerability detection.

False Positive Rate and False Negative Rate: These metrics measure the rates at which the model incorrectly identifies non-vulnerable code as vulnerable (false positive) or fails to detect actual vulnerabilities (false negative). Minimizing both rates is crucial for reliable vulnerability detection.

These approach to evaluating LLM performance recognizes the nuanced nature of software engineering tasks, employing specific metrics tailored to the task at hand, whether it's classification, recommendation, or generation, to comprehensively measure model effectiveness and accuracy.

RQ7 Answer: Evaluation metrics for assessing LLMs in addressing software vulnerabilities and cyber threats include F1-score, Precision, Recall for classification tasks, MRR for recommendation tasks, BLEU and Pass@k for generation tasks, AUC-ROC for discrimination ability, and metrics like code coverage, false positive rate, and false negative rate for vulnerability detection. These metrics offer a comprehensive assessment of LLM performance across various tasks in software security.

7.7. RQ8: What are the challenges of using LLMs in cybersecurity tasks?

What are the limitations or challenges associated with using LLMs for software security, and how can they be mitigated? In this section we will try to shed some light to this question.

Prompt injection attacks: One of the significant security concerns related to LLMs, is prompt injection attacks [68, 69, 70]. Prompt injection attacks involve manipulating the input prompts given to a Large Language Model (LLM) in order to coax it into generating responses that are harmful, unauthorized, or disclose sensitive information. Malicious users may attempt to deceive the LLM into producing outputs that could compromise security, breach privacy, or cause harm in various ways. These attacks exploit vulnerabilities in the LLM's processing of prompts, potentially leading to disruptive outcomes or security breaches. This form of attack, especially potent within LLM-integrated applications, has recently been identified as the primary LLM-related threat by OWASP (Open Web Application Security Project) Foundation. Such manipulation can result in adverse consequences, such as providing incorrect guidance or unauthorized divulgence of confidential data [65, 66, 67, 74].

Prompt injection is important because it highlights the need to address issues related to prompt abuse and prompt leak as we transition further into the era of Large Language Models (LLMs). It is crucial to protect LLM-integrated applications from prompt injection threats, a fact recognized by many developers who have demonstrated increasing vigilance in the implementation of prompt protection systems and the quest for novel solutions [65, 66, 67, 74]. There are two main categories of prompt injections:

Direct Prompt Injections: Commonly referred to as "jailbreaking," direct prompt injections involve altering or exposing the system prompt, often resulting in partial loss of intellectual property. This process may entail creating prompts with the specific goal of bypassing safety and moderation measures implemented by creators of Large Language Models (LLMs) [74].

Indirect Prompt Injections: indirect prompt injections occur when an LLM accepts input from external sources that can be manipulated by an attacker, such as websites or files. In this scenario, attackers can trick the LLM into interpreting its input as "commands" rather than "data" for processing, leading to unexpected behavior in LLM-based applications or compromising the security of the entire system [74].

Vulnerability Detection Challenges: Detecting vulnerabilities with LLMs can be problematic due to their ability to generate various alternative responses for the same issue. This diversity, beneficial in language processing and text generation, may pose challenges in pinpointing the actual vulnerability. The presence of multiple solutions, although advantageous in certain contexts, can complicate the identification of even the simplest software security vulnerabilities [2].

LLM Code Generation Challenges: LLMs may struggle to generate accurate code when faced with multiple valid solutions, leading to functionally correct but contextually inappropriate code. They might perform well on specific tasks they were trained on but struggle with different tasks, languages, or domains outside of their training scope. Their performance can deteriorate significantly when inputs undergo semantic-preserving transformations [1, 20].

The study done in [29] analyzed 2033 programming tasks and 4066 ChatGPT-generated code snippets implemented in two popular programming languages: Java and Python, with 2,556 code with quality issues to comprehensively assess AI-generated code quality and uncover performance-influencing factors. While ChatGPT3.5 could produce functional code for various tasks, their research findings uncovered a range of code quality concerns in ChatGPT3.5-generated code, spanning from compilation and runtime errors to incorrect outputs and issues with maintainability. This discovery underscores the critical need to tackle these issues diligently to safeguard the sustained efficacy of AI-driven code generation and uphold the standards of high-quality software systems.

Hallucinations: LLMs frequently generate false information, known as "hallucinations," which appear statistically plausible. Studies indicate that incorporating external knowledge and automated feedback mechanisms can mitigate these hallucinations [5, 64, 74].

Code Quality Challenges: Code quality issues pose significant concerns due to their potential to incur financial and reputational losses.

LLM Deployment Challenges: Large Language Models (LLMs) have become instrumental in software development, but they come with their own set of challenges. Their enormous size demands significant resources, making deployment challenging in resource-limited scenarios. They rely heavily on large and diverse datasets for training, and limited or biased data can lead to inaccurate predictions. There are also concerns about privacy leaks with Personally Identifiable Information (PII) in training data [1].

LLM Evaluation: Existing evaluation metrics might not capture all aspects of model performance, such as interpretability, robustness, or sensitivity to certain errors. LLMs often lack interpretability and transparency in their decision-making processes, leading to uncertainty among developers. Concerns exist around the ownership of training data, derivative data, and potential adversarial attacks by seeding vulnerabilities into LLMs [1].

When comparing the performance of Large Language Models (LLMs) in software vulnerability and cybersecurity domains, a pertinent question arises regarding the fairness of such comparisons, given the variability in testing conditions across different scenarios. For instance, the robustness of the system could significantly impact LLM performance when both training and testing the model on the same dataset. Notably, not all researchers have access to supercomputers or high-performance workstations, which could affect the scalability and efficiency of model training and evaluation. Furthermore, changes made to the dataset over time or the selective use of specific portions of the dataset during testing may introduce additional variability, potentially influencing the outcomes of comparative analyses.

Insecure output management: Developers need to be careful since LLMs may produce harmful outputs. Insecure output management occurs when LLM outputs are not properly validated or sanitized before use, which can lead to security risks like Cross-Site Scripting and Cross-Site Request Forgery in web browsers. Furthermore, neglecting to validate LLM outputs may lead to downstream security exploits, including code execution that compromises systems and exposes data. Attackers can also exploit these outputs for privilege escalation and remote code execution on backend systems [74].

Poisoning of educational data: Training data poisoning refers to the deliberate manipulation of the data used to train models with malicious intent. Adversaries insert deceptive or biased examples into the training dataset during pre-training or fine-tuning to influence the model's learning process. This can involve introducing backdoors, biases, or other vulnerabilities that compromise the security, performance, and reliability of the model [74]. Manipulated training data can disrupt LLM models, leading to responses that may compromise security, accuracy, or ethical behavior.

Denial of service model: Training and running Large Models (LMs) requires substantial resources. An attacker can engage with LMs in a way that causes them to use resources excessively, thereby reducing the quality of service or even denying service to other users, and increasing compute costs. Attackers can create prompts that are computationally demanding in terms of context length or language patterns [74].

Supply chain vulnerabilities: The supply chain encompasses the complete journey from gathering data and training the model to its deployment. This process includes different elements like the training data, pre-trained models, and deployment infrastructure. Each element is susceptible to vulnerabilities: the crowd-sourced training data might be tainted, the pre-trained model could be compromised, or the third-party packages employed in LLM development could be insecure. Depending on the compromised components, services or datasets, they undermine system integrity, causing data breaches and system crashes [74].

Disclosure of sensitive information: Large Language Models (LLMs) are initially trained on varied datasets containing snippets of real-world information. When generating responses, these models may unintentionally disclose sensitive details. For instance, conversational agents like OpenAI’s ChatGPT and Google’s Gemini gather user prompts during interactions to improve their performance. However, this approach poses a security and privacy risk, as the model might generate outputs inadvertently revealing confidential or private information. Additionally, by employing meticulously constructed prompts, an attacker could deliberately exploit this vulnerability to reveal or expose sensitive details. Failure to protect against disclosure of sensitive information in LLM outputs can lead to legal consequences or loss of competitive advantage [74].

Insecure plugin design: Those LLM plugins lacking proper access control or input validation may lead to vulnerabilities such as SQL injection, and remote code execution. Frequently, these plugins accept user input as unrestricted text, making them susceptible to exploitation by attackers [74].

Too much agency: Large Language Model powered systems base their decisions on user prompts or inputs received from integrated components. Excessive autonomy or authorization granted to LLMs can introduce vulnerabilities susceptible to exploitation by malicious actors, potentially compromising the entire system. However, even without deliberate attacks, unintended user prompts, or wrong actions from connected systems can lead LLMs to generate misleading or unforeseen outputs, causing system malfunctions. As an illustrative example, consider an LLM-based file summarizer that utilizes a third-party plugin for user file access. This plugin, beyond reading capabilities, might also possess functionalities for file modification and deletion. If a user encounters discrepancies in the LLM's summary, their attempt to report the error to the application could inadvertently trigger the LLM to modify or delete the original files, highlighting the potential for unintended consequences. Ultimately, giving LLMs unchecked autonomy to act can lead to unintended consequences that compromise reliability, privacy and trust [74].

Overconfidence: The utilization of these models for source code generation presents a potential avenue for the inadvertent introduction of security vulnerabilities. These vulnerabilities can pose significant threats to the safety and security of applications and their users. The uncritical application of information or code generated by LLMs, without appropriate scrutiny, can lead to a cascade of negative consequences. These consequences include security breaches, the dissemination of misinformation, communication disruptions, legal issues, and reputational damage [74].

Model theft: The unauthorized copying or extraction of weights, parameters, or data from closed-source LLMs constitutes a form of intellectual property theft. This illicit practice can inflict significant economic losses on developers and damage brand reputation, ultimately jeopardizing a company's competitive edge. Perpetrators may exploit the purloined proprietary information for their own gain or utilize the stolen model for malicious purposes [74].

It's important to recognize both the opportunities and threats, presented by artificial intelligence (AI) technologies. We must acknowledge the transformative potential of LLMs, but we must also accept that there are emerging risks. As LLMs continue to advance, there is a simultaneous need to understand and control the associated risks. The key to maximizing automation's benefits and enhancing functionalities through LLMs across various fields, lies in mitigating the inherent risks associated with AI systems.

To effectively reduce risks, increased awareness and preparation are crucial. Software engineers play a vital role by analyzing past methods and incorporating those learnings into secure data and system mechanisms. New technologies are constantly emerging to monitor threats and mitigate risks. As pioneers in AI design and development, software engineers have a responsibility to identify new vulnerabilities and implement safeguards to protect users.

RQ8 Answer: Challenges of using LLMs in cybersecurity tasks includes but not limited to: prompt injection attacks, vulnerability detection difficulties, code generation challenges, hallucinations, deployment hurdles, evaluation limitations, insecure output management, data poisoning risks, denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure plugin design, unchecked autonomy, overconfidence, and model theft. These challenges underscore the importance of understanding and mitigating risks associated with LLMs to maximize their benefits while ensuring security and reliability in various applications.

7.8 RQ9. How to enhance LLMs effectiveness in software vulnerability and cyber threat detection?

Integration with Existing Tools: To enhance software vulnerability and cyber threat detection, Large Language Models (LLMs) can be effectively combined with complementary methods and tools. Static code analyzers excel at detecting known vulnerabilities and coding errors, while dynamic analysis tools like fuzzers uncover runtime vulnerabilities. Integrating LLMs into security testing frameworks and threat intelligence platforms allows for comprehensive vulnerability detection and proactive threat mitigation.

Human expertise remains crucial for domain-specific knowledge and validation of LLM-generated results. LLMs also can be combined with BMC to create automated code repair frameworks.

Continuous monitoring systems with LLM integration enable prompt incident detection and response. Collaborative platforms facilitate knowledge sharing and coordinated efforts in addressing security issues. By combining these approaches, organizations can achieve comprehensive coverage and improve the effectiveness of software vulnerability and cyber threat detection.

Larger and More Diverse Datasets: Training LLMs on a wider range of vulnerable and non-vulnerable code across various programming languages can enhance their ability to generalize and identify vulnerabilities in unseen code. However, given the challenges associated with training LLMs on extensive datasets, there are endeavors aimed at enhancing LLM accuracy by training them on smaller datasets. We believe that achieving this objective could substantially enhance the public usability of LLMs.

Focus on Specific Vulnerabilities: In some scenarios, training LLMs on datasets focused on specific types of vulnerabilities can improve their accuracy in detecting those vulnerabilities, but this method is most suitable for scenarios where a system is only prone to specific attacks.

Preprocessing Techniques: Techniques like constructing code gadgets can help LLMs distinguish actual vulnerabilities from irrelevant patterns.

Prompt Engineering: Craft effective prompts to guide LLMs towards vulnerability detection tasks, improving focus and accuracy [28].

Continuous Learning: Regularly update LLMs with new vulnerability and threat data to enhance detection of emerging security risks.

Efforts toward optimizing model size, enhancing data diversity and quality, improving code generation in ambiguous scenarios, enhancing generalizability, developing better evaluation methodologies, and focusing on interpretability and ethical use of LLMs. By combining these strategies, we can unlock the full potential of LLMs and make them even more powerful tools in the fight against software vulnerabilities and cyber threats.

RQ9Answer: To enhance LLMs' effectiveness in software vulnerability and cyber threat detection, organizations can integrate them with existing tools like static code analyzers and dynamic analysis tools. Additionally, training LLMs on larger and more diverse datasets, focusing on specific vulnerabilities, and employing preprocessing techniques can improve their accuracy. Crafting effective prompts, ensuring continuous learning by updating LLMs with new data, and optimizing model size and data quality are essential. Developing better evaluation methodologies and prioritizing interpretability and ethical use further unlock the full potential of LLMs in addressing software vulnerabilities and cyber threats.

[1] https://www.kaggle.com/datasets/andrewkronser/cve-common-vulnerabilities-and-exposures

[2] https://cve.mitre.org/data/downloads/index.html

[3] https://www.cve.org/

[4] https://www.cve.org/Downloads#current-format

[5] https://www.cvedetails.com/

This article explored the potential of LLMs, examining their various applications across the Software Development Life Cycle (SDLC), particularly in the stages of software quality assurance and maintenance. We reviewed successful applications of LLMs in tasks like test generation, bug localization, code repair, and vulnerability detection.

Our analysis revealed that LLMs, when properly trained and integrated with existing tools, can significantly enhance software security. LLMs can outperform traditional static code analyzers in vulnerability detection and propose potential fixes, reducing the burden on developers. Furthermore, their ability to process and understand natural language empowers them to analyze textual information from logs and reports, aiding in threat identification and incident response. However, it is crucial to acknowledge the limitations of LLMs.

As LLM technology continues to evolve, so too will its applications in software security. By addressing current limitations and implementing best practices for training and integration, LLMs have the potential to revolutionize how we approach software security, enabling the development of more secure and robust systems.

Future research directions include exploring methods to improve the explainability and transparency of LLM decision-making processes. Additionally, continuous learning techniques can be incorporated to ensure LLMs stay up-to-date with the ever-changing threat landscape. By bridging the gap between cutting-edge research and practical implementation, LLMs can become a cornerstone of a comprehensive software security strategy. We hope our work guides and inspires more research, encourages new ideas, and helps collaborations in this growing field.

Conflicts of Interest Statement

The authors certify that they have NO affiliations with or involvement in any organization or entity with any financial interest, or non-financial interest in the subject matter or materials discussed in this manuscript.

Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J.: Haoyu Wang. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv:2308.10620v4 2024
Charalambous, Y., Tihanyi, N., Jain, R., Sun, Y., Ferrag, M., Amine, Cordeiro, L.: A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification. (2023). 10.48550/arXiv.2305.14752
Dipayan Saha, S., Tarek, K., Yahyaei, S.K., Saha, J., Zhou: Mark Tehranipoor, Farimah Farahmandi. LLM for SoC Security A Paradigm Shift. arXiv:2310.06046v1 2023
Ferrag, M.A., Ndhlovu, M., Tihanyi, N., Cordeiro, L.C.: Merouane Debbah, Thierry Lestable, Narinderjit Singh Thandi. Revolutionizing Cyber Threat Detection with Large Language Models. arXiv:2306.14263v2 2024.
Andreas Happe and Jürgen Cito: Getting pwn’d by AI: Penetration Testing with Large Language Models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023). Association for Computing Machinery, New York, NY, USA, 2082–2086. (2023). https://doi.org/10.1145/3611643.3613083
Sakaoglu, S.: ‘KARTAL: Web Application Vulnerability Hunting Using Large Language Models: Novel method for detecting logical vulnerabilities in web applications with finetuned Large Language Models’, Dissertation, (2023)
Ferrag, M.A., Battah, A., Tihanyi, N., Debbah, M., Lestable, T.: Lucas C. Cordeiro. SecureFalcon The Next Cyber Reasoning System for Cyber Security. arXiv:2307.06616v1 2023
Weng, G., Andrzejak, A.: Automatic Bug Fixing via Deliberate Problem Solving with Large Language Models, in 2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW), Florence, Italy, 34–36. (2023). pp 10.1109/ISSREW60843.2023.00040
David Noever: Can Large Language Models Find and Fix Vulnerable Software. arXiv:2308.10345v1 2023
Hammond Pearce, B., Tan, B., Ahmad, R., Karri: Brendan Dolan-Gavitt. Examining Zero-Shot Vulnerability Repair with Large Language Models. arXiv:2112.02125v3 2022.
Jingxuan He and Martin Vechev: Large Language Models for Code: Security Hardening and Adversarial Testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS '23). Association for Computing Machinery, New York, NY, USA, 1865–1879. (2023). https://doi.org/10.1145/3576915.3623175
Xia, C.S., Wei, Y.: Lingming Zhang. Practical Program Repair in the Era of Large Pre-trained Language Models. arXiv:2210.14179v1 2022
Purba, M., Ghosh, A., Radford, B., Chu, B.: Software Vulnerability Detection using Large Language Models, in 2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW), Florence, Italy, 112–119. (2023). pp 10.1109/ISSREW60843.2023.00058
Yuqiang, S., Wu, D., Xue, Y., Liu, H., Wang, H., Xu, Z., Xie, X., Yang Liu:. GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis. arXiv:2308.03314v2 2023.
Huang, K., et al.: An Empirical Study on Fine-Tuning Large Language Models of Code for Automated Program Repair, 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, Luxembourg, pp. 1162–1174, (2023). 10.1109/ASE56229.2023.00181
Katsadouros, E., Patrikakis, C.Z., Hurlburt, G.: Can Large Language Models Better Predict Software Vulnerability? in IT Professional. May-June. 25(3), 4–8 (2023). 10.1109/MITP.2023.3284628
Marwan Omar: Detecting software vulnerabilities using Language Models. arXiv:2302.11773v1 2023
Mamede, C., Pinconschi, E., Abreu, R., Campos, J.: Exploring Transformers for Multi-Label Classification of Java Vulnerabilities, 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), Guangzhou, China, 2022, pp. 43–52, 10.1109/QRS57517.2022.00015
Huang, K., Yang, S., Sun, H., Sun, C., Li, X., Zhang, Y., Repairing Security Vulnerabilities Using Pre-trained Programming Language Models, 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and, Workshops, N.: (DSN-W), Baltimore, MD, USA, pp. 111–116, (2022). 10.1109/DSN-W54100.2022.00027
Owura Asare: Security Evaluations of GitHub's Copilot. UWSpace. (2023). http://hdl.handle.net/10012/19675
Das, S.S., Dutta, A., Purohit, S., Serra, E., Halappanavar, M., Pothen, A.: Towards Automatic Mapping of Vulnerabilities to Attack Patterns using Large Language Models, 2022 IEEE International Symposium on Technologies for Homeland Security (HST), Boston, MA, USA, pp. 1–7, (2022). 10.1109/HST56032.2022.10025459
Chandra Thapa, S.I., Jang, M.E., Ahmed, S., Camtepe, J., Pieprzyk: and Surya Nepal. Transformer-Based Language Models for Software Vulnerability Detection. In Proceedings of the 38th Annual Computer Security Applications Conference (ACSAC '22). Association for Computing Machinery, New York, NY, USA, 481–496. (2022). https://doi.org/10.1145/3564625.356798
Devlin, J., Ming-Wei, C., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv org., arXiv:1810.04805v2 2019.
Omar, M., Shiaeles, S.: VulDetect: A novel technique for detecting software vulnerabilities using Language Models, 2023 IEEE International Conference on Cyber Security and Resilience (CSR), Venice, Italy, 2023, pp. 105–110, 10.1109/CSR57506.2023.10224924
Zhang, X., Zhang, F., Zhao, B.: Bo Zhou, and Boyang Xiao. VulD-Transformer: Source Code Vulnerability Detection via Transformer. In Proceedings of the 14th Asia-Pacific Symposium on Internetware (Internetware '23). Association for Computing Machinery, New York, NY, USA, 185–193. (2023). https://doi.org/10.1145/3609437.3609451
Szabó, Z., Bilicki, V.A.: New Approach to Web Application Security: Utilizing GPT Language Models for Source Code Inspection. Future Internet. 15, 326 (2023). https://doi.org/10.3390/fi15100326
Chen, Y., Ding, Z., Alowain, L., Chen, X., Wagner, D.: DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID '23). Association for Computing Machinery, New York, NY, USA, 654–668. (2023). https://doi.org/10.1145/3607199.3607242
Zhang, C., Liu, H., Zeng, J., Yang, K., Li, Y., Li, H.: Prompt-Enhanced Software Vulnerability Detection Using ChatGPT. arXiv:2308.12697v2 2024
Yue Liu, T., Le-Cong, R., Widyasari, C., Tantithamthavorn, L., Li, Xuan-Bach, D., Le: David Lo. Refining ChatGPT-Generated Code Characterizing and Mitigating Code Quality Issues. arXiv:2307.12596v2 2023
Tihanyi, N., Bisztray, T., Jain, R., Ferrag, M.A., Cordeiro, L.C., Mavroeidis, V.: The FormAI Dataset: Generative AI in Software Security through the Lens of Formal Verification. In Proceedings of the 19th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2023). Association for Computing Machinery, New York, NY, USA, 33–43. (2023). https://doi.org/10.1145/3617555.3617874
Zheng, Z., Ning, K., Chen, J., Wang, Y., Chen, W., Guo, L.: Weicheng Wang. Towards an Understanding of Large Language Models in Software Engineering Tasks. arXiv:2308.11396v1 2023
Song Wang, D., Chollak, D., Movshovitz-Attias: and Lin Tan. Bugram: bug detection with n-gram language models. In ASE. 708–719. (2016)
Lanciano, G., Stein, Manuel, Hilt, V., Cucinotta, Tommaso: Analyzing Declarative Deployment Code with Large Language Models. 289–296. (2023). 10.5220/0011991200003488
Imgrund, E., Ganz, T., Härterich, M., Pirch, L., Risse, N., Rieck, K.: Broken Promises: Measuring Confounding Effects in Learning-based Vulnerability Discovery. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec '23). Association for Computing Machinery, New York, NY, USA, 149–160. (2023). https://doi.org/10.1145/3605764.3623915
Dwight Horne: PwnPilot: Reflections on Trusting Trust in the Age of Large Language Models and AI Code Assistants. Conference: The 2023 International Conference on Security and Management (SAM'23). (2023)
Li, H., Hao, Y., Zhai, Y.: Zhiyun Qian. The Hitchhiker's Guide to Program Analysis A Journey with Large Language Models. arXiv:2308.00245v3 2023
Khan, M.F.A., Ramsdell, M., Falor, E.: Hamid Karimi. Assessing the Promise and Pitfalls of ChatGPT for Automated Code Generation. arXiv:2311.02640v1 2023
Fredrik Heiding, B., Schneier, A., Vishwanath, J., Bernstein, Peter, S.: Park. Devising and Detecting Phishing large language models vs. Smaller Hum. Models arXiv:2308.12287v2 2023
McNulty, S., Patrick, APPLICATIONS OF TRANSFER LEARNING FROM MALICIOUS TO VULNERABLE BINARIES: Graduate Student Theses, Dissertations, & Professional Papers. 12187. (2023). https://scholarworks.umt.edu/etd/12187
Takashi Koide, N., Fukushi, H., Nakano: Daiki Chiba. Detecting Phishing Sites Using ChatGPT. arXiv:2306.05816v2 2024
Boyko, J., Cohen, J., Fox, N., Veiga, M.H., Jennifer, I.-H., Li, J., Liu, B., Modenesi, A.H., Rauch, K.N., Reid: Soumi Tribedi, Anastasia Visheratina, Xin Xie. An Interdisciplinary Outlook on Large Language Models for Scientific Research. https://doi.org/10.48550/arXiv.2311.04929
Shirui Pan, L., Luo, Y., Wang, C., Chen, J., Wang, X.W.: Unifying Large Language Models and Knowledge Graphs: A Roadmap. https://doi.org/10.48550/arXiv.2306.08302
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.: Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. https://doi.org/10.48550/arXiv.1706.03762
Sadowski, C., Yi, J.: How developers use data race detection tools, in Proceedings of the 5th Workshop on Evaluation and Usability of Programming Languages and Tools, pp. 43–51. (2014)
Gadelha, M.Y.R., Steffinlongo, E., Cordeiro, L.C., Fischer, B., Nicole, D.A., 41st International Conference on Software Engineering: Smt-based refutation of spurious bug reports in the clang static analyzer, in Proceedings of the : Companion Proceedings, ICSE 2019, Montreal, QC, Canada, May 25–31, 2019, J. M. Atlee, T. Bultan, and J. Whittle, Eds. IEEE / ACM, pp. 11–14. (2019)
White, M., Tufano, M., Vendome, C., Poshyvanyk, D.: Deep learning code fragments for code clone detection, in Proceedings of the 31st IEEE/ACM international conference on automated software engineering, pp. 87–98. (2016)
Gupta, R., Pal, S., Kanade, A., Shevade, S.: Deepfix: Fixing com- mon c language errors by deep learning, in Proceedings of the aaai conference on artificial intelligence, vol. 31, no. 1, (2017)
Li, Y., Wang, S., Nguyen, T.N.: Dlfix: Context-based code trans- formation learning for automated program repair, in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 602–614. (2020)
White, M., Tufano, M., Martinez, M., Monperrus, M., Poshyvanyk, D.: Sorting and transforming program repair ingredients via deep learning code similarities, in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, pp. 479–490. (2019)
Jiang, N., Lutellier, T., Tan, L.: Cure: Code-aware neural machine translation for automatic program repair, in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 1161–1173. (2021)
Li, Y., Wang, S., Nguyen, T.N.: Dear: A novel deep learning-based approach for automated program repair, in Proceedings of the 44th International Conference on Software Engineering, pp. 511–523. (2022)
Mohammed Alhamed and Tim Storer: Evaluation of Context-Aware Language Models and Experts for Effort Estimation of Software Maintenance Issues. In 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 129–138. (2022)
Yaqin Zhou, S., Liu, J., Siow, X., Du, Liu, Y.: Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems, 32, (2019)
Zhangyin Feng, D., Guo, D., Tang, N., Duan, X., Feng, M., Gong, L., Shou, B., Qin, T., Liu, D., Jiang, et al.: Codebert: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, (2020)
Bommasani, R., et al.: On the opportunities and risks of foundation models, eng, arXiv.org, ISSN: 2331–8422. [Online]. Available: (2021). https://crfm.stanford.edu/assets/report.pdf
Hammond Pearce, B., Ahmad, B., Tan, B.D.-G.: and Ramesh Karri. Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. In IEEE S&P. (2022). https://doi.org/10.1109/SP46214.2022.9833571
KEREOPA-YORKE: Benjamin. Building Resilient SMEs: Harnessing Large Language Models for Cyber Security in Australia. arXiv preprint. arXiv:2306.02612v1, (2023)
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
Xin Zhou, Rakesh, M., Verma: Vulnerability detection via multimodal learning: Datasets and analysis. In Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security, pages 1225– 1227, (2022)
Mohammad Shoeybi, M., Patwary, R., Puri, P., LeGresley, J., Casper, Catanzaro, B.: Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053v4, 2020
Cheshkov, A., Zadorozhny, P.: Rodion Levichev. Evaluation of ChatGPT Model for Vulnerability Detection. arXiv:2304.07232v1
Yaqin Zhou, S., Liu, J., Siow, X., Du, Liu, Y.: Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems, 32, (2019)
Bhandari, G.P., Naseer, A., Moonen, L.: CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-source Software. In PROMISE. arXiv:2107.08760v1 (2021)
Steenhoek, B., Rahman, M.M., Roy, M.K., Alam, M.S., Barr, E.T.: Wei Le. A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection. arXiv:2403.17218v1 2024
Liu, Y., Deng, G., Li, Y., Wang, K., Wang, Z., Wang, X., Zhang, T., Liu, Y., Wang, H., Zheng, Y.: Yang Liu. Prompt Injection attack against LLM-integrated Applications. arXiv:2306.05499v2 2024
Liu, Y., Deng, G., Li, Y., Wang, K., Zhang, T., Liu, Y., Wang, H., Zheng, Y.: Yang Liu. Prompt Injection attack against LLM-integrated Applications. arXiv:2306.05499v1 2023
Pedro, R., Castro, D., Carreira, P.: Nuno Santos. From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application? arXiv:2308.01990v3 2023
Giovanni Apruzzese, H.S., Anderson, S., Dambra, D., Freeman, F., Pierazzi, Kevin, A.: Roundy. Real Attackers Don’t Compute Gradients: Bridging the Gap between Adversarial ML Research and Practice. In SaTML, (2023)
Kai Greshake, S., Abdelnabi, S., Mishra, C., Endres: Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. In arXiv preprint, (2023)
Fábio Perez and Ian Ribeiro: Ignore Previous Prompt: Attack Techniques for Language Models. In NeurIPS ML Safety Workshop, (2022)
Zhang, J., Wang, X., Zhang, H., Sun, H., Xudong Liu, C.H.: and Yang Liu. Detecting Condition-Related Bugs with Control Flow Graph Neural Network. In ISSTA. 1370–1382. (2023)
Zhao, W.X., Zhou, K., Li, J., Tang, T., et al.: A Survey of Large Language Models. (2023). arXiv:2303.18223v13
Jiahao, F., Li, Y., Wang, S., Nguyen, T.N.: A C/C + + Code Vulnerability Dataset with Code Changes and CVE Summaries. In Proceedings of the 17th International Conference on Mining Software Repositories (Seoul, Republic of Korea) (MSR ’20). Association for Computing Machinery, New York, NY, USA, 508–512. (2020). https://doi.org/10.1145/3379597.3387501
Pankajakshan, R., Biswal, S., Govindarajulu, Y.: Gilad Gressel. Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal. arXiv:2403.13309v1 2024
Saad Ullah, M., Han, S., Pujar, H., Pearce, A., Coskun: Gianluca Stringhini. LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks. arXiv:2312.12575v2 2024

No competing interests reported.

Download PDF

Version 1

posted

You are reading this latest preprint version

Using Large Language Models to Better Detect and Handle Software Vulnerabilities and Cyber Security Threats

Status:

Version 1

Abstract

Figures

1 INTRODUCTION

2 METHODOLOGY

3 BACKGROUND

4 COMPARATIVE ANALYSIS OF SOFTWARE SECURITY APPROACHES AND VULNERABILITY DETECTION METHODS

5 SOFTWARE VULNERABILITIES DETECTION, A SUBCATEGORY OF SOFTWARE DEVELOPMENT LIFE CYCLE

5.1 Utilization of LLMs in Requirements Engineering

5.2 Utilization of LLMs in software design

5.3 Utilization of LLMs in Software development

5.4 Utilization of LLMs in software management

6 APPLICATIONS OF LLMS IN DETECTION AND HANDLING SOFTWARE VULNERABILITIES

6.1 Utilization of LLMs in Software quality assurance

6.2 Utilization of LLMs in in software maintenance

7 LLMS FOR SOFTWARE VULNERABILITIES AND CYBER SECURITY

CONCLUSION

Declarations

References

Additional Declarations

Status:

Version 1