A. Dataset Description & Real Anomaly
Stock market data of listed companies was used to train the models. Each company's history of manipulation was researched through search engines and articles from reputable media houses. Stock regulators such as SEBI (the Securities and Exchange Board of India) publicly disclose the details of such cases, and these disclosures were checked for the periods during which the stocks were under investigation. Those periods were marked as anomalous. Note that the data was limited to Indian markets within the scope of this research, but the methodology can easily be extended to stock data from any market.
Our primary goal is to detect contextual anomalies. These observations stand out only in comparison to nearby data points, rather than being considered anomalies when compared to all other observations. It is essential to provide compelling reasons for categorizing them as anomalous.
The data was downloaded from the official site of BSE (Bombay Stock Exchange) by searching for their Security ID/Name.
Sadhna Broadcast Ltd. The Securities and Exchange Board of India (SEBI) has found that the stock prices of Sadhna Broadcast Ltd were manipulated through misleading videos on some YouTube channels [11]. The videos falsely claimed that the company was going to be acquired by the Adani Group and had signed big contracts with Sony Pictures and Zee. This caused retail investors to buy the stock, driving up the price.
Sharpline Broadcast Ltd. Sharpline’s stock was involved in the same scam as Sadhna Broadcast Ltd [11].
The periods in which these stocks were under review by SEBI are provided in Table 1.
Table 1
Periods under investigation by SEBI

| Name | Start | End |
|---|---|---|
| Sadhna Broadcast Ltd. | April 2022 | September 2022 |
| Sharpline Broadcast Ltd. | April 2022 | August 2022 |
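The labelling step described above can be sketched as follows. The column names, single-stock frame, and exact day-level boundaries are illustrative assumptions, not the exact preprocessing used in this study.

```python
import pandas as pd

# Hypothetical daily OHLCV data for one stock (column names are assumptions).
df = pd.DataFrame({
    "Date": pd.date_range("2022-01-01", periods=365, freq="D"),
    "Close": 100.0,
    "Volume": 10_000,
})

# Mark rows inside the SEBI investigation window as anomalous (cf. Table 1).
start, end = pd.Timestamp("2022-04-01"), pd.Timestamp("2022-09-30")
df["anomaly"] = df["Date"].between(start, end).astype(int)

print(df["anomaly"].sum())  # number of days labelled anomalous
```

The resulting binary column serves as the ground-truth label when evaluating the detectors described below.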
a. Statistical Approach – Benford’s Law
Leveraging Benford's Law enables the identification of potential fraud within datasets conforming to this fundamental statistical principle. The first-digit law inherent in Benford's Law posits that, in numerous naturally occurring datasets, the initial digit of a number tends to be smaller rather than larger. It anticipates that the digit '1' should occur approximately 30% of the time, whereas the digit '9' should appear in less than 5% of instances. The distribution of these occurrences is visually depicted in Fig. 1.
An altered or fabricated dataset may deviate significantly from the expected first-digit distribution. For example, an unusually high proportion of numbers starting with 9 in a company's financial statements could signal fraudulent activity. In our study, we examined the total number of shares (Volume) column, which represents the volume of shares traded each day. Our analysis of the dataset using Benford's Law is consistent with manipulation having taken place.
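As an illustration, the expected Benford distribution and the observed first-digit distribution of a Volume-like column can be computed as follows. This is a minimal sketch; the statistic used to compare the two distributions (e.g. a chi-square test) is left out.

```python
import numpy as np

def benford_expected():
    """Expected first-digit probabilities under Benford's Law: P(d) = log10(1 + 1/d)."""
    digits = np.arange(1, 10)
    return np.log10(1 + 1 / digits)

def first_digit_distribution(values):
    """Observed first-digit frequencies of the positive entries in `values`."""
    values = np.asarray(values, dtype=float)
    values = values[values > 0]
    # Shift each value into [1, 10) and truncate to get its leading digit.
    first = (values / 10 ** np.floor(np.log10(values))).astype(int)
    counts = np.bincount(first, minlength=10)[1:10]
    return counts / counts.sum()

expected = benford_expected()
print(expected[0])  # P(first digit = 1) ≈ 0.301
print(expected[8])  # P(first digit = 9) ≈ 0.046
```

A large discrepancy between `first_digit_distribution(volume)` and `benford_expected()` is what flags a series as suspicious.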
Benford's Law is a simple statistical test for data manipulation, but it yields only a binary answer: manipulation is either detected or not. Since we need the precise periods of manipulation, we further investigated LSTM autoencoders and TadGAN in our research, and we found that these algorithms performed better at detecting both the manipulation and its precise period.
b. LSTM Autoencoder
Srivastava et al. [18] made an early observation that LSTMs could be enhanced by learning embeddings from an encoder-decoder model. They introduced neural networks with multiple layers of Long Short-Term Memory (LSTM) cells to learn representations of sequential data. The encoder-decoder LSTM reads, encodes, decodes, and reproduces the input sequences of a given dataset, and model performance is evaluated by how accurately it reproduces those sequences. Once the required performance level is reached, the decoder portion of the model can be discarded, leaving only the encoder. This design encodes input sequences into a fixed-length vector, enabling effective processing of sequential data, capture of temporal patterns, and generation of the desired outputs. Figure 3 illustrates the model architecture generated using the TensorFlow v2 API, and the training process is outlined in Fig. 2.
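A minimal sketch of such an encoder-decoder LSTM in the TensorFlow v2 Keras API follows. The layer sizes, window length, and the use of mean-squared reconstruction error as the anomaly score are illustrative assumptions, not the exact configuration used in this study.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS, FEATURES = 30, 1  # e.g. 30-day windows of daily volume

model = keras.Sequential([
    keras.Input(shape=(TIMESTEPS, FEATURES)),
    # Encoder: compress the window into a fixed-length vector.
    layers.LSTM(64),
    # Repeat the encoding so the decoder can unroll it over time.
    layers.RepeatVector(TIMESTEPS),
    # Decoder: reconstruct the original sequence step by step.
    layers.LSTM(64, return_sequences=True),
    layers.TimeDistributed(layers.Dense(FEATURES)),
])
model.compile(optimizer="adam", loss="mse")

# After training, the reconstruction error per window serves as the anomaly score.
windows = np.random.rand(8, TIMESTEPS, FEATURES).astype("float32")
recon = model.predict(windows, verbose=0)
errors = np.mean((windows - recon) ** 2, axis=(1, 2))
print(errors.shape)  # one score per window
```

Windows whose reconstruction error exceeds a chosen threshold are flagged as anomalous.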
c. LSTM with Dynamic Thresholding
The approach is an algorithmic technique for detecting anomalies in temporal data sequences. Leveraging Long Short-Term Memory (LSTM) networks, the model captures associations between preceding and current data points, encoding these connections through numerically optimized weights. After predictive outputs are generated, an unsupervised, dynamic, and nonparametric method is used to evaluate the residual values. This circumvents challenges such as heterogeneity, non-stationarity, and stochastic noise that often confound automated threshold determination in data streams with fluctuating behavioural patterns and value distributions. By adjusting responsively to the variance of the prediction errors, the dynamic threshold remains relatively low when errors show minor deviations and escalates when deviations are more substantial. The LSTM with dynamic thresholding model has been validated empirically across multiple domains, from the identification of anomalies in aerospace systems [14] to predictive analytics in healthcare [15] and transportation planning [16].
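The dynamic threshold takes the form eps = mu(e) + z * sigma(e) over the prediction errors e. The sketch below is a deliberately simplified version of the nonparametric criterion of Hundman et al. [14], with the pruning step omitted and the scoring function reduced to its core idea: pick the z whose threshold most reduces the mean and spread of the remaining errors, penalised by the number of points flagged.

```python
import numpy as np

def dynamic_threshold(errors, z_range=np.arange(2.0, 6.0, 0.5)):
    """Simplified nonparametric dynamic threshold on prediction errors."""
    mu, sigma = errors.mean(), errors.std()
    best_eps, best_score = mu + z_range[0] * sigma, -np.inf
    for z in z_range:
        eps = mu + z * sigma
        below = errors[errors <= eps]
        n_above = int((errors > eps).sum())
        if n_above == 0 or len(below) == 0:
            continue
        # Relative drop in mean and std once flagged errors are removed,
        # penalised by how many points had to be flagged to achieve it.
        delta_mu = (mu - below.mean()) / mu
        delta_sigma = (sigma - below.std()) / sigma
        score = (delta_mu + delta_sigma) / n_above
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps

# Mostly small errors with a few large deviations: only the deviations exceed eps.
errs = np.array([0.1] * 50 + [0.12] * 45 + [2.0, 2.5, 3.0, 2.2, 2.8])
eps = dynamic_threshold(errs)
print((errs > eps).sum())  # → 5
```

Because the threshold is recomputed from the current error distribution, it adapts as the stream's behaviour changes, which is the property the paragraph above describes.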
d. TadGAN
As described by Liu et al., TadGAN [12] offers a performance-efficient and generalisable approach to anomaly detection. Using an unsupervised adversarial learning approach, it captures the temporal correlations of the time-series distribution, and the cycle-consistency loss described in the original paper enables efficient reconstruction of the time series. Only the Generator and Encoder are used to reconstruct signals, which can be represented as
$$\hat{s} = G\left(E\left(s\right)\right) \approx s$$
Intuitively, the Generator and Encoder should not be able to reconstruct anomalies; anomalous stock data should therefore deviate from the reconstruction \(\hat s\). The critic \({C}_{x}\) is responsible for identifying which windows of \(\hat s\) are anomalous. The architecture of TadGAN is represented in Fig. 3 & Fig. 4, and the model was trained using the pipeline shown in Fig. 5. Specifically, the TadGAN model was configured with an input sequence length of 100, a latent space of 20 dimensions, a batch size of 64, a single-layer bidirectional LSTM with 100 hidden units for the Encoder, a two-layer bidirectional LSTM with 64 hidden units for the Generator, a one-dimensional convolutional layer for the Critics, and 25 training epochs. The stock series was then segmented into sub-sequences using the default window size of 100.
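TadGAN's final anomaly score fuses the reconstruction errors with the critic outputs. A minimal sketch of that fusion step follows, assuming both signals have already been computed per window and that larger critic scores indicate more anomalous windows (the sign convention varies by implementation).

```python
import numpy as np

def tadgan_anomaly_score(rec_errors, critic_scores, alpha=0.5):
    """Combine reconstruction errors and critic outputs into one anomaly score.

    Both signals are standardised (z-scored) first so they sit on a comparable
    scale; `alpha` weights reconstruction error against the critic signal.
    """
    def zscore(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()

    return alpha * zscore(rec_errors) + (1 - alpha) * zscore(critic_scores)

rec = [0.1, 0.1, 0.9, 0.1]   # window 2 is poorly reconstructed
crit = [0.2, 0.1, 0.8, 0.2]  # the critic also flags window 2
scores = tadgan_anomaly_score(rec, crit)
print(int(np.argmax(scores)))  # → 2
```

Windows whose combined score exceeds a threshold (for instance, the dynamic threshold above) are reported as anomalous periods.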
e. Auto-Encoder with Regression (AER)
Wong et al. [17] introduced the Auto-Encoder with Regression (AER) model, an unsupervised anomaly detection architecture that combines a vanilla autoencoder with an LSTM regressor, joining reconstruction-based and prediction-based detection in a single jointly trained model. The AER model is further distinguished by how it calculates bidirectional prediction errors and strategically fuses them with reconstruction errors to derive a comprehensive anomaly score. Its architecture is adept at processing sequential data, identifying temporal patterns, and executing precise anomaly detection, and its model architecture diagram was generated using the TensorFlow v2 API, as referenced in Wong et al.'s research.
The architectures of the AER encoder and decoder are represented in Fig. 8 and Fig. 10, respectively. The model was trained using the pipeline shown in Fig. 9.
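AER derives its anomaly score by fusing prediction errors (in both time directions) with reconstruction errors. To illustrate the bidirectional-error idea without the full network, the sketch below stands in a toy moving-average forecaster for the learned regressor; it is purely illustrative and not the actual AER architecture.

```python
import numpy as np

def aer_style_scores(series, window=5):
    """Per-point anomaly scores from bidirectional one-step 'prediction' errors.

    A moving-average forecast in each direction plays the role of the learned
    regressor; the two error signals are then fused by averaging.
    """
    s = np.asarray(series, dtype=float)
    n = len(s)
    fwd = np.full(n, np.nan)  # error of predicting s[t] from the past window
    bwd = np.full(n, np.nan)  # error of predicting s[t] from the future window
    for t in range(window, n):
        fwd[t] = abs(s[t] - s[t - window:t].mean())
    for t in range(n - window):
        bwd[t] = abs(s[t] - s[t + 1:t + 1 + window].mean())
    # Fuse the two error signals; where one direction is unavailable
    # (series edges), fall back to the other.
    return np.nanmean(np.vstack([fwd, bwd]), axis=0)

series = [1.0] * 20
series[10] = 8.0  # injected spike
scores = aer_style_scores(series)
print(int(np.argmax(scores)))  # → 10
```

In the actual AER model the forecasts come from the jointly trained LSTM regressor, and the fused errors are thresholded to mark the anomalous periods.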