The methodology used in this study is illustrated in Fig. 2. It consists of the following four parts.
- Treatment of raw data for missing values
- Data filtration
- Development of a hybrid model for anomaly detection
- Modelling of normal behaviour wind turbine power curve
Treatment for missing values
SCADA data contain missing entries that arise for various reasons during data collection at the wind farm. Before processing the available data, it was necessary to treat these missing values. In this study, the missing SCADA entries were filled in MATLAB using the shape-preserving piecewise cubic Hermite interpolation (PCHIP) method. On a segment [a, b] with grid nodes a = x0 < x1 < x2 < … < xn+1 = b, the data values can be written as (Romadanova, 2023; Volkov et al., 2010):
\(\left(x_i, F_i\right),\quad i = 0, \dots, n+1\) (1)
The following notation is used for the first divided differences:
$$F_i\left[x_i, x_{i+1}\right] = \frac{F_{i+1} - F_i}{h_i},\quad h_i = x_{i+1} - x_i,\quad i = 0, \dots, n$$
If \(F_i\left[x_i, x_{i+1}\right] \ge 0\), \(i = 0, \dots, n\), the data in Eq. 1 are called monotonic. A shape-preserving interpolation function S satisfies S(xi) = Fi, i = 0, …, n+1, and is monotonic wherever the initial data are monotonic. A cubic spline with weights wi > 0, i = 0, …, n, is a function S that satisfies the following conditions:
- S is a cubic polynomial on each subinterval \(\left[x_i, x_{i+1}\right]\), \(i = 0, \dots, n\);
- \(S \in C^k[a, b]\), \(k \ge 1\);
- \(w_{i-1} S''(x_i^-) = w_i S''(x_i^+)\), \(i = 1, \dots, n\).
The interpolation conditions for the shape-preserving cubic spline can be written as:
$$S\left(x_i\right) = F_i,\quad i = 0, \dots, n+1$$
The following boundary conditions and constraints were used for the spline function:
- Boundary values set through the first derivative: \(S'(a) = F'_0\), \(S'(b) = F'_{n+1}\)
- Boundary values set through the second derivative: \(S''(a) = F''_0\), \(S''(b) = F''_{n+1}\)
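The gap-filling step described above can be illustrated with a minimal Python sketch. The study itself used MATLAB; the example below assumes that SciPy's `PchipInterpolator` is an acceptable stand-in for the same shape-preserving scheme. The time index, the wind speed values, and the helper name `fill_missing_pchip` are hypothetical and serve only as an illustration.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical SCADA record: time index and wind speed with gaps (NaN).
t = np.arange(10, dtype=float)
wind_speed = np.array([4.0, 4.5, np.nan, 6.0, 6.2, np.nan, np.nan, 8.0, 8.1, 8.3])


def fill_missing_pchip(x, y):
    """Fill NaN entries of y by shape-preserving cubic (PCHIP) interpolation."""
    known = ~np.isnan(y)
    interpolator = PchipInterpolator(x[known], y[known])
    filled = y.copy()
    # Only the missing positions are replaced; measured values stay untouched.
    filled[~known] = interpolator(x[~known])
    return filled


filled = fill_missing_pchip(t, wind_speed)
print(filled)
```

Because the known values in this example increase monotonically, the filled series is also monotonic, which is the shape-preserving property discussed above. Gaps at the very start or end of a record would require extrapolation and are not handled in this sketch.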
Data filtration
The SCADA data were filtered to remove incorrect entries and obvious anomalies recorded during the measurement campaign owing to faults, as well as data collected during non-operational times. The data obtained during operation were screened against a blade pitch angle of 30° and the cut-in and cut-out wind speeds.
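A rule-based filter of this kind might look like the following sketch. Only the 30° pitch threshold comes from the text; the cut-in and cut-out values, the direction of the pitch comparison, and the function name `operational_mask` are assumptions made for illustration.

```python
import numpy as np

# Assumed thresholds: the cut-in/cut-out speeds depend on the turbine model.
CUT_IN, CUT_OUT = 3.0, 25.0   # m/s (hypothetical values)
PITCH_LIMIT = 30.0            # degrees (threshold named in the text)


def operational_mask(wind_speed, pitch_angle):
    """Boolean mask of records treated as operational in this sketch."""
    in_speed_range = (wind_speed >= CUT_IN) & (wind_speed <= CUT_OUT)
    # Assumption: records with a pitch angle above the limit are non-operational.
    pitch_ok = pitch_angle <= PITCH_LIMIT
    return in_speed_range & pitch_ok


wind_speed = np.array([2.0, 5.0, 12.0, 26.0, 10.0])
pitch_angle = np.array([0.0, 1.0, 5.0, 85.0, 45.0])
mask = operational_mask(wind_speed, pitch_angle)
print(mask)
```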
Development of a hybrid model for anomaly detection
A hybrid data-driven model based on isolation forests and density-based spatial clustering of applications with noise (DBSCAN) was proposed to identify anomalies in the SCADA database. The isolation forest first isolates data points that are few and different as outliers using a structure of random decision trees. DBSCAN then analyzes the data points to identify regions of varying density, shape, and size.
The isolation forest differs from other popular methods in that it recognizes anomalies directly instead of profiling the normal data. It is a relatively new method introduced and developed by Liu et al. (2008). In general, normal data points are far more frequent than abnormal values, which differ from the rest of the data and are called outliers. In the feature space, abnormal data points lie far from regular data values, which is why they require fewer partitions to be isolated. In contrast, normal data points require several splits to be isolated.
The isolation forest method is a tree-based approach consisting of an ensemble of decision trees. For a selected feature, a split value is chosen at random between the minimum and maximum values of that feature (Lin et al., 2020). Isolation forests use anomaly scores for decision-making. For an instance X in a sample of N instances, the anomaly score is defined as
\(AS\left(X, N\right) = 2^{-\frac{E\left(h\left(X\right)\right)}{C\left(N\right)}}\) (2)
where E(h(X)) is the average path length of X across all isolation trees, h(X) is the path length of point X, and C(N) is the average path length of an unsuccessful search in a binary search tree.
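As a worked illustration of Eq. 2, the short sketch below evaluates the anomaly score for a given average path length. The closed form used for C(N) is the one given by Liu et al. (2008) and is not stated in this section, so it should be read as an assumption of the example; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful binary search tree search over n points.

    Uses C(n) = 2 H(n-1) - 2 (n-1)/n with H(i) approximated by ln(i) + Euler's constant.
    """
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Eq. 2: AS(X, N) = 2 ** (-E(h(X)) / C(N))."""
    return 2.0 ** (-mean_path_length / c_factor(n))


n = 256
print(anomaly_score(c_factor(n), n))  # average path length equal to C(N)
print(anomaly_score(2.0, n))          # point isolated after very few splits
```

A point whose average path length equals C(N) receives a score of 0.5, whereas a point isolated after only a few splits receives a score closer to 1.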
In this study, each data point receives a label: a value of 1 denotes a normal point (inlier), whereas a value of -1 denotes an outlier. The isolation forest method is computationally inexpensive and fast. It was implemented using the `sklearn.ensemble.IsolationForest` class of the scikit-learn library in Python (Pedregosa et al., 2011).
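A minimal usage sketch of the scikit-learn isolation forest is given below. The synthetic (wind speed, power) data, the number of trees, and the `contamination` value are assumptions chosen for illustration and are not the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for (wind speed, power) records around one operating point.
normal = rng.normal(loc=[8.0, 1000.0], scale=[0.5, 50.0], size=(200, 2))
# Three hand-made abnormal records placed far from the normal cloud.
abnormal = np.array([[8.0, 0.0], [3.0, 1900.0], [14.0, 100.0]])
X = np.vstack([normal, abnormal])

# contamination is an assumed share of outliers, not a value from the study.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = model.fit_predict(X)     # 1 = inlier, -1 = outlier
scores = model.score_samples(X)   # lower values indicate more abnormal points

print("records flagged as outliers:", int((labels == -1).sum()))
print("labels of the hand-made abnormal records:", labels[-3:])
```

Note that `score_samples` in scikit-learn returns the negative of the score in Eq. 2, so lower values correspond to more abnormal points.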
Clustering is an important unsupervised learning method for anomaly detection. Clustering methods follow two main approaches, based on distance or on density. Distance-based clustering handles data that have a spherical structure but becomes inefficient when the data have a non-spherical structure (Zhao et al., 2018). In contrast, density-based clustering techniques can handle arbitrary dataset shapes. In this approach, high-density regions of a dataset can be easily differentiated from low-density regions (Kusiak et al., 2009; Yesilbudak, 2018).
DBSCAN uses density to cluster the data without requiring the number of clusters to be specified in advance. The method first divides the dataset into regions of different density to identify clusters of arbitrary shapes and sizes. A cluster is a set of data points connected through density. The points are classified as core, boundary, or noise points. A data point is a core point (Pc) if at least a minimum number of points (Pmin) lie within a maximum radius (Eps) of it. Boundary points lie in the neighborhood of a core point but have fewer than Pmin points within their own Eps radius. Noise points are neither core nor boundary points. An illustration of the data points and the DBSCAN methodology framework is shown in Fig. 1.
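The three point types can be illustrated with a small sketch based on `sklearn.cluster.DBSCAN`. The two-dimensional data and the values of Eps and Pmin are hypothetical. In scikit-learn, `min_samples` counts the point itself, which is assumed here to match the definition of Pmin.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups, one point on the edge of the first group, one isolated point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],   # dense group A
    [0.35, 0.1],                                      # edge of group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],   # dense group B
    [10.0, 0.0],                                      # isolated point
])

eps, p_min = 0.3, 4   # assumed Eps and Pmin
db = DBSCAN(eps=eps, min_samples=p_min).fit(X)

labels = db.labels_                      # cluster index, -1 for noise
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True
is_noise = labels == -1
is_boundary = ~is_core & ~is_noise       # in a cluster but not a core point

print("labels:  ", labels)
print("core:    ", np.flatnonzero(is_core))
print("boundary:", np.flatnonzero(is_boundary))
print("noise:   ", np.flatnonzero(is_noise))
```

The edge point has only two other points within Eps, so it is not a core point, but it lies within Eps of a core point of group A and is therefore assigned to that cluster as a boundary point.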
Modelling of normal behaviour wind turbine power curve
After processing the data for missing values and removing most outliers and anomalies, a normal dataset was used for power curve modelling. Normal data were used as inputs to obtain a robust normal-behavior power curve for wind turbines. To obtain normal behavior, a locally weighted regression method was applied. Each smoothed point was obtained using neighboring data points within the selected span. For each data point, the regression weight within a given span was calculated using Eq. 3 (Bilendo et al., 2022).
\(w_i = \left(1 - \left|\frac{x - x_i}{d(x)}\right|^{3}\right)^{3}\) (3)
where xi denotes the nearest neighbors of x within the span, and x represents the predictor value associated with the point to be smoothed. In the above expression, d(x) represents the distance along the abscissa from x to the furthest predictor value within the span. The upper and lower limits of the robust power curve can be found using Eqs. 4 and 5.
\(U_L = w_i + n\left(\frac{\sigma_{w_i}}{k}\right)\) (4)
\(L_L = w_i - n\left(\frac{\sigma_{w_i}}{k}\right)\) (5)
where UL is the upper limit, LL is the lower limit, and σwi is the standard deviation of wi for the normal-behavior power curve. The width of the limits is controlled by n, which represents an integer multiplier, and k represents the data sample size for a given span.
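The tricube weighting of Eq. 3 and a locally weighted linear fit at a single point can be sketched as follows. This is a simplified illustration rather than the implementation used in the study: the span is expressed as a number of nearest points, the synthetic cubic power curve is hypothetical, and the robust limits of Eqs. 4 and 5 are not included.

```python
import numpy as np


def tricube_weights(x0, x_neighbors):
    """Eq. 3: w_i = (1 - |(x - x_i) / d(x)|**3)**3 for the points in the span."""
    d = np.max(np.abs(x_neighbors - x0))   # distance to the furthest point in the span
    return (1.0 - np.abs((x0 - x_neighbors) / d) ** 3) ** 3


def local_linear_fit(x0, x, y, span):
    """Smoothed value at x0 from a weighted linear fit over the `span` nearest points."""
    idx = np.argsort(np.abs(x - x0))[:span]
    w = tricube_weights(x0, x[idx])
    # np.polyfit weights the unsquared residuals, so the square root is passed.
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1, w=np.sqrt(w))
    return slope * x0 + intercept


# Hypothetical power-curve-like data: power grows with the cube of wind speed.
rng = np.random.default_rng(1)
wind_speed = np.linspace(3.0, 12.0, 60)
power = 1.2 * wind_speed ** 3 + rng.normal(0.0, 20.0, wind_speed.size)

smoothed = np.array([local_linear_fit(v, wind_speed, power, span=15) for v in wind_speed])
print(smoothed[:5])
```

The furthest point in each span receives a weight of zero and the point closest to x receives a weight close to one, so each smoothed value is dominated by its nearest neighbors.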