Recent substantial advances in generative AI, together with the ability to run models on commodity hardware through quantisation and compression, open the door to broader application domains such as root cause analysis (RCA) and anomaly detection [1].
The applicability of large language models (LLMs) to RCA has been studied extensively in recent years [2], and the evolution of generative artificial intelligence (GenAI) constitutes a turning point in reshaping technology across many areas [3]. Multiple avenues have been explored to obtain optimal results, such as:
- prompting and augmentation for calibrated confidence estimation with GPT-4 on cloud incident root cause analysis (PACE-LM) [4], and
- cloud RCA by autonomous agents with tool-augmented LLMs (RCAgent) [3].
Another major source of issues in the networking field is software vulnerabilities, which can have profound consequences even though traditional methods such as automated software testing, fault localisation, and repair are used to address them; static analysis tools in particular suffer from a high rate of false positives. LLMs such as FalconLLM [5] offer a promising way to tackle these persistent problems. Furthermore, HuntGPT, an intrusion detection dashboard that applies a Random Forest classifier trained on the KDD99 dataset, integrates XAI frameworks such as SHAP and LIME for user-friendly and intuitive model interaction and combines them with GPT-3.5 Turbo to present detected threats in an understandable format [6].
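As an illustration of this kind of detector-plus-explanation pipeline, the sketch below trains a Random Forest on synthetic flow features and turns its decision into text that an LLM could rephrase for the engineer; sklearn's built-in feature importances stand in for the SHAP and LIME explanations of [6], and the synthetic data, feature names, and prompt wording are assumptions made for this example.

```python
# Minimal, hypothetical sketch of a HuntGPT-style pipeline [6]: a Random Forest
# detector whose decision is summarised into text an LLM could rephrase.
# sklearn feature importances replace the SHAP/LIME explanations of the
# original work; data and feature names are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["src_bytes", "dst_bytes", "conn_count", "duration"]

# Toy stand-in for KDD99-style flow features; label 1 marks anomalous flows.
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pick one flow the model flags as anomalous and summarise the most
# important features into text that could be handed to an LLM.
flagged = X[clf.predict(X) == 1][0]
top = np.argsort(-clf.feature_importances_)[:2]
summary = ", ".join(f"{feature_names[i]}={flagged[i]:.2f}" for i in top)
prompt = (
    "Explain to a network engineer why this flow was flagged as an intrusion. "
    f"Most influential features: {summary}."
)
print(prompt)  # in HuntGPT, this explanation step is delegated to GPT-3.5 Turbo
```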
In addition, one study envisions a future in which RCA follows anomaly detection and unravels the underlying triggers of anomalies (LogLAB) [7]. Major contributions to building a foundation model and an evaluation dataset for the telecom domain have been made by the Technology Innovation Institute, which produced the Falcon LLM and TeleQnA [8]. These efforts also address several challenges and improve upon previous generations of open LLMs (GPT-J, OPT [9], and BLOOM [10]).
Such contributions have greatly helped democratise access to powerful models that exhibit human-like performance on tasks such as operating telco or data centre networks.
Although LLMs are trained on gigantic datasets (600B tokens), such as RefinedWeb [11], drawn from sources including books, conversations, research papers, and code, they still generate incorrect recommendations [2]. Such incorrect recommendations can prolong the time an engineer needs to diagnose an issue.
To enhance the relevance and accuracy of LLM results, retrieval-augmented generation (RAG) [12] is used to retrieve past incidents, historical knowledge, and other relevant sources for troubleshooting the current issue [3]. The RAG framework offers different retrieval and augmentation approaches depending on the input information type, such as voice, audio, text, or knowledge graphs [13]. These approaches are commonly categorised as naive, advanced, and modular [12].
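For illustration, the retrieve-then-augment step of such a pipeline can be sketched as follows; a TF-IDF retriever stands in for a dense embedding model, and the incident snippets and prompt template are assumptions made for this example rather than the setup of [3] or [12].

```python
# Minimal sketch of the retrieval and augmentation steps of a naive RAG
# pipeline over past incidents. TF-IDF replaces a dense embedding model;
# the incident texts and prompt template are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_incidents = [
    "BGP session flaps on leaf-12 after MTU mismatch on the uplink.",
    "OSPF adjacency stuck in EXSTART due to duplicate router-id.",
    "High packet loss on spine-3 caused by a failing optic on port 7.",
]
query = "leaf switch keeps losing its BGP neighbour after a link change"

# 1) Indexing: vectorise the historical incident reports.
vectoriser = TfidfVectorizer().fit(past_incidents + [query])
doc_vecs = vectoriser.transform(past_incidents)

# 2) Retrieval: rank incidents by similarity to the current issue.
scores = cosine_similarity(vectoriser.transform([query]), doc_vecs)[0]
top_k = scores.argsort()[::-1][:2]
context = "\n".join(past_incidents[i] for i in top_k)

# 3) Augmentation: build the prompt handed to a frozen LLM for generation.
prompt = (
    "You are assisting with network troubleshooting.\n"
    f"Similar past incidents:\n{context}\n\n"
    f"Current issue: {query}\nSuggest the most likely root cause."
)
print(prompt)
```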
The most basic approach is naive RAG, which consists of indexing, retrieval, and generation from a frozen LLM. Several stages that help enhance the semantic representation are described in [12], namely chunk optimisation, fine-tuning embedding models, query rewriting, fine-tuning retrievers, embedding transformation, information compression, and re-ranking. Each of these independently improves different RAG metrics, covering retrieval quality (hit rate, mean reciprocal rank (MRR), and normalised discounted cumulative gain (NDCG)) and generation quality (faithfulness, relevance, and non-harmfulness) [12]. The final step is to evaluate these metrics with automatic evaluation frameworks such as RAGAS [14] and ARES [15].
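For concreteness, the retrieval-quality metrics listed above can be computed as in the following sketch; the ranked result list and document identifiers are illustrative assumptions, and a full evaluation would rely on frameworks such as RAGAS [14] or ARES [15].

```python
# Hedged sketch of the retrieval-quality metrics mentioned above (hit rate,
# MRR, NDCG@k) for a single query with one relevant document.
import math

def hit_rate(ranked_ids, relevant_id, k):
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, relevant_id):
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k):
    # Binary relevance with a single relevant document per query.
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id == relevant_id)
    ideal = 1.0 / math.log2(2)  # relevant document at rank 1
    return dcg / ideal

# One query: ground-truth incident "inc-7" is retrieved at rank 2.
ranked = ["inc-3", "inc-7", "inc-9", "inc-1"]
print(hit_rate(ranked, "inc-7", k=3),        # 1.0
      reciprocal_rank(ranked, "inc-7"),      # 0.5
      ndcg_at_k(ranked, "inc-7", k=3))       # ~0.63
```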
While RAG improves LLM capabilities by providing relevant and current information, it faces challenges such as the context length, which is bounded by the context window of the LLM [16], and robustness when handling conflicting inputs [17]. In addition, inverse scaling laws show that smaller models can outperform larger ones [18]. Lastly, production-ready RAG systems need to align with engineering requirements and address data security concerns [19].

As state of the art (SOTA) we can point to PACE-LM [4] and RCAgent [3], which apply accurate and efficient frameworks to the problems of RCA and anomaly detection, respectively. Microsoft's PACE-LM is a major contribution to evaluating output reliability and improving confidence estimation. It combines and optimises two factors whose joint use shortens the time to find the root cause: the confidence of evaluation (CoE), which assesses the uncertainty and groundedness of the model output, and the root cause evaluation (RCE), which captures the plausibility and reliability of the predicted root cause. Together, these two factors address the effectiveness and generalisation of the solution. Despite these benefits, the method has caveats: data scarcity (it requires a large dataset of incidents and root causes, which may not be available in every environment), domain adaptation (differing domain vocabularies), and model robustness (vulnerability to attacks and degradation depending on the inputs) [20]. RCAgent, in turn, introduces the observation snapshot key (OBSK) method, which addresses the context-length limitation of RAG frameworks by compressing the retrieved information to fit the context window [3].
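To illustrate the context-length constraint that such compression addresses, the sketch below shows a generic token-budget filter over ranked observations; it is not the OBSK method of [3], and the snippet contents and whitespace-based token count are simplifying assumptions.

```python
# Hedged sketch of fitting retrieved observations into a fixed context window.
# This is NOT the OBSK method of RCAgent [3]; it only illustrates the generic
# idea of keeping the highest-ranked evidence that fits a token budget.
def fit_to_budget(ranked_snippets, max_tokens,
                  count_tokens=lambda s: len(s.split())):
    """Keep the best-ranked snippets whose combined size stays within budget."""
    kept, used = [], 0
    for snippet in ranked_snippets:          # assumed sorted by relevance
        cost = count_tokens(snippet)
        if used + cost > max_tokens:
            continue                         # skip evidence that would overflow
        kept.append(snippet)
        used += cost
    return kept

snippets = [
    "kernel: eth0 link down at 02:13, carrier lost",
    "bgpd: neighbour 10.0.0.2 went from Established to Idle",
    "full interface counters dump (several thousand lines)... " * 50,
]
print(fit_to_budget(snippets, max_tokens=64))  # keeps the two short log lines
```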
Based on the above observations, in this article we introduce the concept of a new intelligent interface between external tools and engineers. Finally, we present experimental results, mainly on time savings such as mean time to repair (MTTR) and the resulting indirect availability gains, which are critical to maintaining SLAs when operating large data centre networks.