The heterogeneous data from multiple components containing throughput metrics were integrated into a single data frame for analysis through clustering. The multiple time series may originate from different processes and therefore exhibit distinct autocorrelation structures. We verified this by analysing their autocorrelation structure: we computed the ACF of each time series up to a lag of 20. By measuring the degree of correlation between an observation and its lagged versions, the ACF reveals the temporal dependencies within a time series; thus, time series with similar temporal structures have similar ACFs. Figure 7 displays the autocorrelation functions (ACFs) of the 80 time series. The horizontal axis represents the lag, while the vertical axis shows the correlation value at each lag. The ACF plot makes it evident that we can determine the optimal number of clusters for our dataset by analysing the underlying structures in the ACFs. By using these ACF feature vectors as the basis for k-means clustering, we can group time series not by their raw values but by the structure and dependencies in the data, which enables us to identify clusters of time series with similar temporal dynamics. The experiment validated the hypothesis that the ACF is a robust feature for clustering time series, capturing the essential dynamics of each series. In our context, each observation is a time series represented by its ACF feature vector.
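A minimal sketch of this feature-extraction step is shown below, assuming the 80 series are columns of a pandas DataFrame; the variable and function names are illustrative rather than taken from our implementation.

```python
# Sketch: build ACF feature vectors for clustering. Assumes the 80 series
# are columns of a pandas DataFrame `df` (names illustrative).
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

N_LAGS = 20  # lag used in this study

def acf_features(df: pd.DataFrame, nlags: int = N_LAGS) -> np.ndarray:
    """Return one row of ACF values (lags 1..nlags) per time series."""
    # acf() returns nlags+1 values including lag 0 (always 1), so drop it
    return np.vstack([acf(df[col].dropna(), nlags=nlags)[1:]
                      for col in df.columns])

# acf_matrix has shape (n_series, 20): one feature vector per series
# acf_matrix = acf_features(df)
```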
It can be seen from Fig. 7 that the number of clusters can be chosen visually as four, since there are four distinct structures. We employed the k-means algorithm with four clusters to segregate them computationally. The algorithm partitions the dataset into four clusters, where each observation belongs to the cluster with the nearest mean. K-means assigns a label to each time series indicating the cluster it belongs to; these labels are then used to group and display the time series data by cluster. For example, Fig. 8 provides a visual representation of the inherent patterns within the first cluster, which contains the ACFs of 51 time series.
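A sketch of this clustering step, under the same assumptions as above, using scikit-learn's KMeans (n_init and random_state are illustrative choices):

```python
# Sketch: k-means on the ACF feature vectors (k = 4, chosen visually
# from Fig. 7). `acf_matrix` is the (80, 20) array from the sketch above.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(acf_matrix)  # one cluster label per series

# Group series names by cluster label for plotting (e.g. Figs. 8-11)
clusters = {k: [name for name, lbl in zip(df.columns, labels) if lbl == k]
            for k in range(4)}
```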
Thus, from Fig. 8 to Fig. 11, it is evident that each cluster predominantly consists of time series from the same generating process, demonstrating the utility of the ACF in time series clustering. By plotting the ACFs of each cluster, we can visually confirm its internal coherence: time series within the same cluster exhibit similar ACF patterns and therefore similar temporal dynamics.
This helps validate the effectiveness of the clustering approach and provides insight into the fundamental characteristics of each cluster, which is valuable when examining the data for further analysis. Clustering these feature vectors with k-means groups time series by the similarity of their temporal structures rather than their raw values, making this a powerful method for understanding the underlying dynamics of the data.
The pattern table successfully grouped KPI data into time segments based on weekday, hour, and 30-minute interval. For each KPI, the mean and standard deviation were calculated within each time segment, providing a summary of typical behaviour during those periods. For example, Fig. 12 shows the normal pattern for a single time series in group 0. The analysis revealed diverse temporal patterns across KPIs. For example, the KPIs in group 3 (TS_58, TS_19, TS_43, TS_78) exhibited slightly different trends across weekdays and hours, with lower mean values during certain periods. These findings suggest the influence of daily and hourly routines on these KPIs.
Other KPIs (TS_18, TS_50) in the second group (g1) showed minimal variation across time segments, indicating relative independence from weekday and hourly fluctuations. This may be due to factors such as external events or specific user activities beyond the scope of the time-based grouping.
Z-scores were calculated for each KPI, allowing individual data points to be compared against the temporal pattern of their corresponding time segment. Outliers identified through significant deviations from the segment mean, measured in units of the segment's standard deviation, could indicate unusual activity or potential anomalies warranting further investigation. We used these time-based z-score features together with the raw data as inputs to the Isolation Forest algorithm.
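A compact sketch of the pattern table and the time-based z-score computation is given below; the grouping keys (weekday, hour, 30-minute half) follow the description above, while the column and function names are our own illustrative choices:

```python
# Sketch: pattern table and time-based z-scores. Assumes `df` has a
# DatetimeIndex and one column per KPI, e.g. 'TS_0' (names illustrative).
import pandas as pd

def add_pattern_zscore(df: pd.DataFrame, kpi: str) -> pd.DataFrame:
    out = df.copy()
    # Time segment: (weekday, hour, which half of the hour)
    key = [out.index.weekday, out.index.hour, out.index.minute // 30]
    grouped = out.groupby(key)[kpi]
    seg_mean = grouped.transform('mean')  # pattern-table mean per segment
    seg_std = grouped.transform('std')    # pattern-table std per segment
    out[f'{kpi}_z_score'] = (out[kpi] - seg_mean) / seg_std
    return out

# df = add_pattern_zscore(df, 'TS_0')  # adds the 'TS_0_z_score' column
```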
As mentioned in Section 6.5.5, three models were constructed using different feature sets to examine the effectiveness of iForest in capturing anomalies within each. The distributions of anomaly scores obtained from each model were also analysed, as illustrated in Fig. 13, where we observed the shape and range of the scores associated with each feature set.
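The following sketch shows how the three model variants can be constructed with scikit-learn's IsolationForest; the feature combinations mirror Table 3, while the hyperparameters are illustrative assumptions:

```python
# Sketch: the three iForest variants compared in Table 3.
from sklearn.ensemble import IsolationForest

feature_sets = {
    'Model 1 (TS_0)': ['TS_0'],
    'Model 2 (TS_0, TS_0_z_score)': ['TS_0', 'TS_0_z_score'],
    'Model 3 (TS_0_z_score)': ['TS_0_z_score'],
}

models, scores = {}, {}
for name, cols in feature_sets.items():
    X = df[cols].dropna().values
    m = IsolationForest(n_estimators=100, contamination='auto',
                        random_state=42)
    m.fit(X)
    models[name] = m
    scores[name] = m.score_samples(X)  # lower = more anomalous
```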
By comparing the score distributions between models, we can evaluate the relative performance and contribution of the different feature sets in detecting anomalies within the dataset. Noticeable differences in the score distributions indicate that the feature sets capture different aspects of the data's anomalies, since each model assigns anomaly scores based on the specific features it uses. These distribution differences provide insight into the effectiveness of each feature set in identifying anomalies: a model with a wider or more skewed score distribution may indicate a greater ability to distinguish between normal and anomalous data points, whereas a model with scores concentrated within a narrow range may suggest a more conservative or less sensitive approach to anomaly detection.
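The per-model statistics reported in Table 3 (detection rate, average score, skewness) can be reproduced along the following lines, reusing the models and scores from the previous sketch; scipy's skew is one plausible choice for the skewness measure:

```python
# Sketch: summary statistics per model, matching Table 3's columns.
import numpy as np
from scipy.stats import skew

for name, m in models.items():
    s = scores[name]
    X = df[feature_sets[name]].dropna().values
    labels = m.predict(X)                   # -1 marks an anomaly
    rate = 100.0 * np.mean(labels == -1)    # anomaly detection rate (%)
    print(f'{name}: rate={rate:.3f}%, '
          f'mean_score={s.mean():.3f}, skew={skew(s):.3f}')
```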
Figure 14 depicts the test time series data with detected anomalies marked in red. Figure 14a represents the results of the first model, which fitted the iForest model on the raw data only. The outcome is relatively simplistic: the model primarily detects anomalies at the significantly higher extremes.
Figure 14b displays the results of a model fitted on two features (the raw value and the pattern-table-based z-score). This model produces outcomes that are more in line with a somewhat subjective interpretation of interesting anomalies. Lastly, Fig. 14c illustrates the model fitted exclusively on TS_0_z_score, derived from the pattern table.
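A plot in the style of Fig. 14 can be produced roughly as follows, overlaying the points each model flags as anomalous in red; the matplotlib calls are standard, and the model/column names reuse the earlier illustrative sketches:

```python
# Sketch: overlay detected anomalies (labels == -1) in red, as in Fig. 14.
import matplotlib.pyplot as plt

X = df[['TS_0', 'TS_0_z_score']].dropna()
labels = models['Model 2 (TS_0, TS_0_z_score)'].predict(X.values)
mask = labels == -1

fig, ax = plt.subplots(figsize=(12, 3))
ax.plot(X.index, X['TS_0'], lw=0.8, label='TS_0')
ax.scatter(X.index[mask], X['TS_0'][mask], color='red', s=12,
           label='detected anomaly')
ax.legend()
plt.show()
```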
Table 3. Comparison of Anomaly Detection Models Using Different Feature Sets for Network Throughput Metrics

| Model | Anomaly Detection Rate (%) | Average Anomaly Score | Score Distribution Skewness | Computational Time (s) |
|---|---|---|---|---|
| Model 1 (TS_0) | 2.522 | -0.091 | 2.282 | 9.046 |
| Model 2 (TS_0, TS_0_z_score) | 1.805 | -0.123 | 1.824 | 8.582 |
| Model 3 (TS_0_z_score) | 2.198 | -0.122 | 3.247 | 9.182 |
Table 3 shows that the percentage of data points classified as anomalies varies across the models. Model 1 (raw data) shows a higher anomaly detection rate than Models 2 and 3, suggesting that raw data alone may be overly sensitive, potentially flagging normal network fluctuations as anomalies. Models 2 and 3, which incorporate z-scores, are more selective in identifying anomalies, which aligns with our goal of detecting subtle yet significant anomalies in network performance.
It is evident from the table that Models 2 and 3, which incorporate z-scores, have lower (more negative) average anomaly scores, indicating that these models are more confident in their anomaly classifications. The similarity between Models 2 and 3 suggests that the z-score feature is driving this improved confidence, aligning with our emphasis on profile pattern extraction for more accurate anomaly detection.
Model 3, using only z-scores, shows the highest positive skewness, suggesting that it is most effective at distinguishing between normal and anomalous network behaviour and provides a clearer separation between regular operations and potential issues. The lower skewness of Model 2 may indicate a more balanced approach, potentially reducing extreme classifications.