The heterogeneous data from multiple components containing throughput metrics were integrated into a single data frame for analysis through clustering. The multiple time series may originate from different processes and therefore exhibit distinct autocorrelation structures. We verified this by analysing their autocorrelation structure: we computed the ACF of each time series up to a lag of 20. By measuring the degree of correlation between an observation and its lagged versions, the ACF reveals the temporal dependencies within a time series; thus, time series with similar temporal structures have similar ACFs. Figure 7 displays the autocorrelation functions (ACFs) of the 80 time series. The horizontal axis represents the lag, while the vertical axis shows the correlation value at each lag. The ACF plot makes it evident that we can determine the optimal number of clusters for our dataset by analysing the underlying structures in the ACFs. By using these ACF feature vectors as the basis for k-means clustering, we can group time series not by their raw values but by the structure and dependencies in the data, which enables us to identify clusters of time series with similar temporal dynamics. The experiment validated the hypothesis that the ACF is a robust feature for clustering time series, capturing the essential dynamics of each series. In our context, each observation is a time series represented by its ACF feature vector.
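A minimal sketch of this feature-extraction step is shown below, assuming the 80 series are columns of a pandas DataFrame; the variable and function names are illustrative rather than taken from our implementation.

```python
# Sketch: build ACF feature vectors for clustering. Assumes the 80 series
# are columns of a pandas DataFrame `df` (names illustrative).
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

N_LAGS = 20  # lag used in this study

def acf_features(df: pd.DataFrame, nlags: int = N_LAGS) -> np.ndarray:
    """Return one row of ACF values (lags 1..nlags) per time series."""
    # acf() returns nlags+1 values including lag 0 (always 1), so drop it
    return np.vstack([acf(df[col].dropna(), nlags=nlags)[1:]
                      for col in df.columns])

# acf_matrix has shape (n_series, 20): one feature vector per series
# acf_matrix = acf_features(df)
```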
It can be seen from Fig. 7 that the number of clusters can be chosen visually as four, since there are four distinct structures. We employed the k-means algorithm with four clusters to segregate them computationally. The algorithm partitions the dataset into four clusters, where each observation belongs to the cluster with the nearest mean. K-means assigns a label to each time series indicating the cluster it belongs to; these labels are then used to group and display the time series data by cluster. For example, Fig. 8 provides a visual representation of the inherent patterns within the first cluster, which contains the ACFs of 51 time series.
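A sketch of this clustering step, under the same assumptions as above, using scikit-learn's KMeans (n_init and random_state are illustrative choices):

```python
# Sketch: k-means on the ACF feature vectors (k = 4, chosen visually
# from Fig. 7). `acf_matrix` is the (80, 20) array from the sketch above.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(acf_matrix)  # one cluster label per series

# Group series names by cluster label for plotting (e.g. Figs. 8-11)
clusters = {k: [name for name, lbl in zip(df.columns, labels) if lbl == k]
            for k in range(4)}
```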
Thus, from Fig. 8 to Fig. 11, it is evident that each cluster predominantly consists of time series from the same generating process, demonstrating the utility of the ACF in time series clustering. By plotting the ACFs of each cluster, we can visually confirm its internal coherence: time series within the same cluster exhibit similar ACF patterns and therefore similar temporal dynamics.
This helps validate the effectiveness of the clustering approach and provides insight into the fundamental characteristics of each cluster, which is valuable when examining the data for further analysis. Clustering these feature vectors with k-means groups time series by the similarity of their temporal structures rather than their raw values, making this a powerful method for understanding the underlying dynamics of the data.
The pattern table successfully grouped KPI data into time segments based on weekday, hour, and 30-minute interval. For each KPI, the mean and standard deviation were calculated within each time segment, providing a summary of typical behaviour during those periods. For example, Fig. 12 shows the normal pattern for a single time series in group 0. The analysis revealed diverse temporal patterns across KPIs. For example, the KPIs in group 3 (TS_58, TS_19, TS_43, TS_78) exhibited slightly different trends across weekdays and hours, with lower mean values during certain periods. These findings suggest the influence of daily and hourly routines on these KPIs.
Other KPIs (TS_18, TS_50) in the second group (g1) showed minimal variation across time segments, indicating relative independence from weekday and hourly fluctuations. This may be due to factors such as external events or specific user activities beyond the scope of the time-based grouping.
Z-scores were calculated for each KPI, allowing individual data points to be compared against the temporal pattern of their corresponding time segment. Outliers identified through significant deviations from the segment mean, measured in units of the segment's standard deviation, could indicate unusual activity or potential anomalies warranting further investigation. We used these time-based z-score features together with the raw data as inputs to the Isolation Forest algorithm.
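A compact sketch of the pattern table and the time-based z-score computation is given below; the grouping keys (weekday, hour, 30-minute half) follow the description above, while the column and function names are our own illustrative choices:

```python
# Sketch: pattern table and time-based z-scores. Assumes `df` has a
# DatetimeIndex and one column per KPI, e.g. 'TS_0' (names illustrative).
import pandas as pd

def add_pattern_zscore(df: pd.DataFrame, kpi: str) -> pd.DataFrame:
    out = df.copy()
    # Time segment: (weekday, hour, which half of the hour)
    key = [out.index.weekday, out.index.hour, out.index.minute // 30]
    grouped = out.groupby(key)[kpi]
    seg_mean = grouped.transform('mean')  # pattern-table mean per segment
    seg_std = grouped.transform('std')    # pattern-table std per segment
    out[f'{kpi}_z_score'] = (out[kpi] - seg_mean) / seg_std
    return out

# df = add_pattern_zscore(df, 'TS_0')  # adds the 'TS_0_z_score' column
```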
As mentioned in Section 6.5.5, three models were constructed using different feature sets to examine the effectiveness of iForest in capturing anomalies within each. The distributions of anomaly scores obtained from each model were also analysed, as illustrated in Fig. 13, where we observed the shape and range of the scores associated with each feature set.
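The following sketch shows how the three model variants can be constructed with scikit-learn's IsolationForest; the feature combinations mirror Table 3, while the hyperparameters are illustrative assumptions:

```python
# Sketch: the three iForest variants compared in Table 3.
from sklearn.ensemble import IsolationForest

feature_sets = {
    'Model 1 (TS_0)': ['TS_0'],
    'Model 2 (TS_0, TS_0_z_score)': ['TS_0', 'TS_0_z_score'],
    'Model 3 (TS_0_z_score)': ['TS_0_z_score'],
}

models, scores = {}, {}
for name, cols in feature_sets.items():
    X = df[cols].dropna().values
    m = IsolationForest(n_estimators=100, contamination='auto',
                        random_state=42)
    m.fit(X)
    models[name] = m
    scores[name] = m.score_samples(X)  # lower = more anomalous
```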
By comparing the score distributions between models, we can evaluate the relative performance and contribution of the different feature sets in detecting anomalies within the dataset. Noticeable differences in the score distributions indicate that the feature sets capture different aspects of the data's anomalies, since each model assigns anomaly scores based on the specific features it uses. These distribution differences provide insight into the effectiveness of each feature set in identifying anomalies: a model with a wider or more skewed score distribution may indicate a greater ability to distinguish between normal and anomalous data points, whereas a model with scores concentrated within a narrow range may suggest a more conservative or less sensitive approach to anomaly detection.
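The per-model statistics reported in Table 3 (detection rate, average score, skewness) can be reproduced along the following lines, reusing the models and scores from the previous sketch; scipy's skew is one plausible choice for the skewness measure:

```python
# Sketch: summary statistics per model, matching Table 3's columns.
import numpy as np
from scipy.stats import skew

for name, m in models.items():
    s = scores[name]
    X = df[feature_sets[name]].dropna().values
    labels = m.predict(X)                   # -1 marks an anomaly
    rate = 100.0 * np.mean(labels == -1)    # anomaly detection rate (%)
    print(f'{name}: rate={rate:.3f}%, '
          f'mean_score={s.mean():.3f}, skew={skew(s):.3f}')
```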
Figure 14 depicts the test time series data with detected anomalies marked in red. Figure 14a represents the results of the first model, which fitted the iForest model on the raw data only. The outcome is relatively simplistic: the model primarily detects anomalies at the significantly higher extremes.
Figure 14b displays the results of a model fitted on two features (the raw value and the pattern-table-based z-score). This model produces outcomes that are more in line with a somewhat subjective interpretation of interesting anomalies. Lastly, Fig. 14c illustrates the model fitted exclusively on TS_0_z_score, derived from the pattern table.
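A plot in the style of Fig. 14 can be produced roughly as follows, overlaying the points each model flags as anomalous in red; the matplotlib calls are standard, and the model/column names reuse the earlier illustrative sketches:

```python
# Sketch: overlay detected anomalies (labels == -1) in red, as in Fig. 14.
import matplotlib.pyplot as plt

X = df[['TS_0', 'TS_0_z_score']].dropna()
labels = models['Model 2 (TS_0, TS_0_z_score)'].predict(X.values)
mask = labels == -1

fig, ax = plt.subplots(figsize=(12, 3))
ax.plot(X.index, X['TS_0'], lw=0.8, label='TS_0')
ax.scatter(X.index[mask], X['TS_0'][mask], color='red', s=12,
           label='detected anomaly')
ax.legend()
plt.show()
```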
Table 3. Comparison of Anomaly Detection Models Using Different Feature Sets for Network Throughput Metrics

| Model | Anomaly Detection Rate (%) | Average Anomaly Score | Score Distribution Skewness | Computational Time (s) |
|---|---|---|---|---|
| Model 1 (TS_0) | 2.522 | -0.091 | 2.282 | 9.046 |
| Model 2 (TS_0, TS_0_z_score) | 1.805 | -0.123 | 1.824 | 8.582 |
| Model 3 (TS_0_z_score) | 2.198 | -0.122 | 3.247 | 9.182 |
Table 3 shows that the percentage of data points classified as anomalies varies across the models. Model 1 (raw data) shows a higher anomaly detection rate than Models 2 and 3, suggesting that raw data alone may be overly sensitive, potentially flagging normal network fluctuations as anomalies. Models 2 and 3, which incorporate z-scores, are more selective in identifying anomalies, which aligns with our goal of detecting subtle yet significant anomalies in network performance.
It is evident from the table that Models 2 and 3, which incorporate z-scores, have lower (more negative) average anomaly scores, indicating that these models are more confident in their anomaly classifications. The similarity between Models 2 and 3 suggests that the z-score feature is driving this improved confidence, aligning with our emphasis on profile pattern extraction for more accurate anomaly detection.
Model 3, using only z-scores, shows the highest positive skewness, suggesting that it is most effective at distinguishing between normal and anomalous network behaviour and provides a clearer separation between regular operations and potential issues. The lower skewness of Model 2 may indicate a more balanced approach, potentially reducing extreme classifications.