Recent substantial advances in generative AI, together with the ability to run models on commodity hardware through quantisation and compression, open the door to broader application domains such as root cause analysis (RCA) and anomaly detection [1].
The applicability of large language models (LLMs) to RCA has been studied extensively in recent years [2], and the evolution of generative artificial intelligence (GenAI) constitutes a turning point in reshaping technology across many areas [3]. Multiple avenues have been explored to obtain optimal results, such as:
- prompting and augmentation for calibrated confidence estimation with GPT-4 on cloud incident root cause analysis (PACE-LM) [4], and
- cloud RCA by autonomous agents with tool-augmented LLMs (RCAgent) [3].
Another major source of issues in the networking field is software vulnerabilities, which can have profound consequences even though traditional methods such as automated software testing, fault localisation, and repair are used to address them; static analysis tools in particular suffer from a high rate of false positives. LLMs such as FalconLLM [5] offer a promising way to tackle these persistent problems. Furthermore, HuntGPT, an intrusion detection dashboard that applies a Random Forest classifier trained on the KDD99 dataset, integrates XAI frameworks such as SHAP and LIME for user-friendly and intuitive model interaction and combines them with GPT-3.5 Turbo to present detected threats in an understandable format [6].
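As an illustration of this kind of detector-plus-explanation pipeline, the sketch below trains a Random Forest on synthetic flow features and turns its decision into text that an LLM could rephrase for the engineer; sklearn's built-in feature importances stand in for the SHAP and LIME explanations of [6], and the synthetic data, feature names, and prompt wording are assumptions made for this example.

```python
# Minimal, hypothetical sketch of a HuntGPT-style pipeline [6]: a Random Forest
# detector whose decision is summarised into text an LLM could rephrase.
# sklearn feature importances replace the SHAP/LIME explanations of the
# original work; data and feature names are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["src_bytes", "dst_bytes", "conn_count", "duration"]

# Toy stand-in for KDD99-style flow features; label 1 marks anomalous flows.
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pick one flow the model flags as anomalous and summarise the most
# important features into text that could be handed to an LLM.
flagged = X[clf.predict(X) == 1][0]
top = np.argsort(-clf.feature_importances_)[:2]
summary = ", ".join(f"{feature_names[i]}={flagged[i]:.2f}" for i in top)
prompt = (
    "Explain to a network engineer why this flow was flagged as an intrusion. "
    f"Most influential features: {summary}."
)
print(prompt)  # in HuntGPT, this explanation step is delegated to GPT-3.5 Turbo
```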
In addition, one study envisions a future in which RCA follows anomaly detection and unravels the underlying triggers of anomalies (LogLAB) [7]. Major contributions to building a foundation model and an evaluation dataset for the telecom domain have been made by the Technology Innovation Institute, which produced the Falcon LLM and TeleQnA [8]. These efforts also address several challenges and improve upon previous generations of open LLMs (GPT-J, OPT [9], and BLOOM [10]).
Such contributions have greatly helped democratise access to powerful models that exhibit human-like performance on tasks such as operating telco or data centre networks.
Although LLMs are trained on gigantic datasets (600B tokens), such as RefinedWeb [11], drawn from sources including books, conversations, research papers, and code, they still generate incorrect recommendations [2]. Such incorrect recommendations can prolong the time an engineer needs to diagnose an issue.
To enhance the relevance and accuracy of LLM results, retrieval-augmented generation (RAG) [12] is used to retrieve past incidents, historical knowledge, and other relevant sources for troubleshooting the current issue [3]. The RAG framework offers different retrieval and augmentation approaches depending on the input information type, such as voice, audio, text, or knowledge graphs [13]. These approaches are commonly categorised as naive, advanced, and modular [12].
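For illustration, the retrieve-then-augment step of such a pipeline can be sketched as follows; a TF-IDF retriever stands in for a dense embedding model, and the incident snippets and prompt template are assumptions made for this example rather than the setup of [3] or [12].

```python
# Minimal sketch of the retrieval and augmentation steps of a naive RAG
# pipeline over past incidents. TF-IDF replaces a dense embedding model;
# the incident texts and prompt template are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_incidents = [
    "BGP session flaps on leaf-12 after MTU mismatch on the uplink.",
    "OSPF adjacency stuck in EXSTART due to duplicate router-id.",
    "High packet loss on spine-3 caused by a failing optic on port 7.",
]
query = "leaf switch keeps losing its BGP neighbour after a link change"

# 1) Indexing: vectorise the historical incident reports.
vectoriser = TfidfVectorizer().fit(past_incidents + [query])
doc_vecs = vectoriser.transform(past_incidents)

# 2) Retrieval: rank incidents by similarity to the current issue.
scores = cosine_similarity(vectoriser.transform([query]), doc_vecs)[0]
top_k = scores.argsort()[::-1][:2]
context = "\n".join(past_incidents[i] for i in top_k)

# 3) Augmentation: build the prompt handed to a frozen LLM for generation.
prompt = (
    "You are assisting with network troubleshooting.\n"
    f"Similar past incidents:\n{context}\n\n"
    f"Current issue: {query}\nSuggest the most likely root cause."
)
print(prompt)
```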
The most basic approach is naive RAG, which consists of indexing, retrieval, and generation from a frozen LLM. Several stages that help enhance the semantic representation are described in [12], namely chunk optimisation, fine-tuning embedding models, query rewriting, fine-tuning retrievers, embedding transformation, information compression, and re-ranking. Each of these independently improves different RAG metrics, covering retrieval quality (hit rate, mean reciprocal rank (MRR), and normalised discounted cumulative gain (NDCG)) and generation quality (faithfulness, relevance, and non-harmfulness) [12]. The final step is to evaluate these metrics with automatic evaluation frameworks such as RAGAS [14] and ARES [15].
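For concreteness, the retrieval-quality metrics listed above can be computed as in the following sketch; the ranked result list and document identifiers are illustrative assumptions, and a full evaluation would rely on frameworks such as RAGAS [14] or ARES [15].

```python
# Hedged sketch of the retrieval-quality metrics mentioned above (hit rate,
# MRR, NDCG@k) for a single query with one relevant document.
import math

def hit_rate(ranked_ids, relevant_id, k):
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, relevant_id):
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k):
    # Binary relevance with a single relevant document per query.
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id == relevant_id)
    ideal = 1.0 / math.log2(2)  # relevant document at rank 1
    return dcg / ideal

# One query: ground-truth incident "inc-7" is retrieved at rank 2.
ranked = ["inc-3", "inc-7", "inc-9", "inc-1"]
print(hit_rate(ranked, "inc-7", k=3),        # 1.0
      reciprocal_rank(ranked, "inc-7"),      # 0.5
      ndcg_at_k(ranked, "inc-7", k=3))       # ~0.63
```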
While RAG improves LLM capabilities by providing relevant and current information, it faces challenges such as the context length, which is bounded by the context window of the LLM [16], and robustness when handling conflicting inputs [17]. In addition, inverse scaling laws show that smaller models can outperform larger ones [18]. Lastly, production-ready RAG systems need to align with engineering requirements and address data security concerns [19].

As state of the art (SOTA) we can point to PACE-LM [4] and RCAgent [3], which apply accurate and efficient frameworks to the problems of RCA and anomaly detection, respectively. Microsoft's PACE-LM is a major contribution to evaluating output reliability and improving confidence estimation. It combines and optimises two factors whose joint use shortens the time to find the root cause: the confidence of evaluation (CoE), which assesses the uncertainty and groundedness of the model output, and the root cause evaluation (RCE), which captures the plausibility and reliability of the predicted root cause. Together, these two factors address the effectiveness and generalisation of the solution. Despite these benefits, the method has caveats: data scarcity (it requires a large dataset of incidents and root causes, which may not be available in every environment), domain adaptation (differing domain vocabularies), and model robustness (vulnerability to attacks and degradation depending on the inputs) [20]. RCAgent, in turn, introduces the observation snapshot key (OBSK) method, which addresses the context-length limitation of RAG frameworks by compressing the retrieved information to fit the context window [3].
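To illustrate the context-length constraint that such compression addresses, the sketch below shows a generic token-budget filter over ranked observations; it is not the OBSK method of [3], and the snippet contents and whitespace-based token count are simplifying assumptions.

```python
# Hedged sketch of fitting retrieved observations into a fixed context window.
# This is NOT the OBSK method of RCAgent [3]; it only illustrates the generic
# idea of keeping the highest-ranked evidence that fits a token budget.
def fit_to_budget(ranked_snippets, max_tokens,
                  count_tokens=lambda s: len(s.split())):
    """Keep the best-ranked snippets whose combined size stays within budget."""
    kept, used = [], 0
    for snippet in ranked_snippets:          # assumed sorted by relevance
        cost = count_tokens(snippet)
        if used + cost > max_tokens:
            continue                         # skip evidence that would overflow
        kept.append(snippet)
        used += cost
    return kept

snippets = [
    "kernel: eth0 link down at 02:13, carrier lost",
    "bgpd: neighbour 10.0.0.2 went from Established to Idle",
    "full interface counters dump (several thousand lines)... " * 50,
]
print(fit_to_budget(snippets, max_tokens=64))  # keeps the two short log lines
```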
Based on the above observations, in this article we introduce the concept of a new intelligent interface between external tools and engineers. Finally, we present experimental results, mainly on time savings such as mean time to repair (MTTR) and the resulting indirect availability gains, which are critical to maintaining SLAs when operating large data centre networks.