A. Dataset
A subset of the Kaggle dataset [25] was used for training and testing. The dataset consists of 31 raags with durations ranging from 2 minutes to 60 minutes. The recordings consist of solo vocals and solo instruments; the percussion and the drone typically heard in raag performances have been removed from the recordings. Each raag is also annotated with its tonic frequency.
The fundamental components of the Raga Identification System are described in Figure 1. The system comprises two modules: a training phase and a testing phase. In the training phase, film songs are given as input to the system. For the analysis of the results, we have considered two datasets, each containing 5 ragas.
The audio song is segmented into overlapping frames; allowing neighbouring frames to overlap reduces the discontinuity at frame boundaries. Raga identification performance is evaluated by considering two different systems in our approach. In the first system, thirteen MFCC coefficients are extracted from each frame. In the second system, twelve MFCC coefficients along with one pitch frequency are extracted from each frame and then concatenated.
The extracted features are modelled using the K-means clustering algorithm. With a cluster size of 256, two sets of models, each containing 5 ragas, have been created. To identify the raga of a test song, the song is tested against all 5 models, and the best-matching model is selected.
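The model-matching step described above can be sketched as follows. This is a minimal illustration, assuming each raga model is a K-means codebook of centroids and that the best match is the codebook with the lowest average quantization distortion; the function names and the tiny toy codebooks are hypothetical, not from the paper.

```python
import numpy as np

def distortion(features, codebook):
    # average distance from each feature vector to its nearest codeword
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1)).mean()

def identify_raga(features, codebooks):
    # codebooks: dict mapping raga name -> (cluster_size, dim) centroid array
    # (cluster_size would be 256 in the paper's setup)
    return min(codebooks, key=lambda r: distortion(features, codebooks[r]))
```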
B. Feature Extraction and Normalization
1. Frame Blocking
In audio signal frame blocking, the signal S(n) is divided into frames of N samples each. Adjacent frames are separated by M samples. Frame blocking with M = (1/2)N is illustrated in Figure 4.a. The audio signal samples and input frame samples are shown in Figures 4.b and 4.c.
The first frame contains the first N samples of the audio. The second frame begins M samples after the first, so the two frames overlap by N - M samples. The third frame begins 2M samples after the first frame and overlaps it by N - 2M samples. This process continues until the entire audio signal is covered by one or more frames.
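The frame blocking described above can be sketched as follows; this is a minimal NumPy illustration, and the function name and the toy values of N and M in the usage are illustrative rather than taken from the paper.

```python
import numpy as np

def frame_blocking(signal, N, M):
    # frames of N samples each; successive frames start M samples apart,
    # so adjacent frames overlap by N - M samples
    n_frames = 1 + max(0, (len(signal) - N) // M)
    return np.array([signal[i * M : i * M + N] for i in range(n_frames)])
```

For example, with N = 4 and M = 2 (i.e. M = (1/2)N), each frame shares its last two samples with the first two samples of the next frame.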
2. DWT Coefficient Computation
The speech samples in the database and the input test speech signal are decomposed into approximation and detail coefficients using the Discrete Wavelet Transform (DWT). Among the wavelet family, Daubechies wavelets have been reported to be highly successful in speech applications, and hence they are used in this work; specifically, the Daubechies 4 wavelet with 5-level decomposition (db4, lev5) is used. An end-point detection algorithm is used to detect the beginning and end points of the speech signal and remove the unwanted silence portions.
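As a simplified illustration of the pyramid decomposition into approximation and detail coefficients, the sketch below uses the simple Haar wavelet instead of the db4 wavelet used in this work (in practice a library such as PyWavelets supplies db4 with 5-level decomposition); the recursion over levels is the same.

```python
import numpy as np

def dwt_level(x):
    # one level of the Haar DWT: approximation (low-pass) and detail (high-pass)
    x = x[: len(x) - len(x) % 2]            # ensure even length
    cA = (x[0::2] + x[1::2]) / np.sqrt(2)
    cD = (x[0::2] - x[1::2]) / np.sqrt(2)
    return cA, cD

def wavedec(x, levels):
    # multi-level decomposition: recursively decompose the approximation
    details = []
    cA = np.asarray(x, dtype=float)
    for _ in range(levels):
        cA, cD = dwt_level(cA)
        details.append(cD)
    return cA, details
```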
3. MFCC Computation
Mel Frequency Cepstral Coefficients (MFCC) are widely used features in speech recognition, since the Mel scale closely matches the way the human ear perceives sound. The Mel scale has linear frequency spacing below 1 kHz and logarithmic spacing above 1 kHz, as shown in Figure 5.
MFCC computation using DWT co-efficient consists of the following steps:
- The Discrete Wavelet Transform coefficients are obtained.
- The spectrum of each frame is calculated using the FFT.
- The spectral components are passed through a Mel filter bank.
- The logarithm of the Mel-filtered spectrum is obtained.
- The Discrete Cosine Transform is applied to achieve energy compaction.
The relation between the frequency f in Hz and the Mel-scale frequency is given by equation (1):

mel(f) = 2595 log10(1 + f/700)    (1)
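The steps above, together with the Mel mapping of equation (1), can be sketched as follows. This is a simplified illustration: the 26-filter Mel bank is an assumed parameter (the paper does not specify its filter bank), and for simplicity the sketch operates on a raw frame rather than on the DWT coefficients of the paper's pipeline.

```python
import numpy as np

def hz_to_mel(f):
    # equation (1): near-linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # triangular filters spaced uniformly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, ce, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, ce):
            fb[i - 1, k] = (k - lo) / max(ce - lo, 1)
        for k in range(ce, hi):
            fb[i - 1, k] = (hi - k) / max(hi - ce, 1)
    return fb

def mfcc(frame, sr, n_filters=26, n_ceps=13):
    power = np.abs(np.fft.rfft(frame)) ** 2        # spectrum via FFT
    fb = mel_filterbank(n_filters, len(frame), sr)
    log_e = np.log(fb @ power + 1e-10)             # log Mel filter bank energies
    # DCT-II (written out explicitly) for energy compaction
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return basis @ log_e
```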
c. Cepstrum method of Pitch Estimation
Pitch estimation can be performed by cepstral analysis. For pitch estimation, the excitation source information must be separated from the vocal-tract-related information in the speech signal. The cepstrum is given by equation (2) as:

c(n) = IDFT{ log |DFT{ s(n) }| }    (2)
The input audio signal, its cepstrum, and the pitch tracking waveforms are shown in Figures 4.a-4.c. In the waveforms we can observe that the slowly varying components of the log magnitude spectrum appear in the low-quefrency region, while the fast varying components appear in the high-quefrency region. In the log magnitude spectrum, the slowly varying components represent the vocal tract, whereas the fast varying components represent the excitation source.
The cepstrum represents the pitch lag in terms of "quefrency". The pitch is estimated by locating the quefrency lag at which the cepstrum has the most energy; this lag corresponds to the dominant pitch frequency.
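A minimal sketch of cepstral pitch estimation follows, assuming an illustrative search range of 50-500 Hz for plausible pitch values (the range and the function name are not from the paper).

```python
import numpy as np

def cepstral_pitch(frame, sr, fmin=50.0, fmax=500.0):
    # real cepstrum, equation (2): inverse DFT of the log magnitude spectrum
    spectrum = np.abs(np.fft.fft(frame)) + 1e-10
    cepstrum = np.fft.ifft(np.log(spectrum)).real
    # search the quefrency range corresponding to plausible pitch lags
    qmin, qmax = int(sr / fmax), int(sr / fmin)
    lag = qmin + int(np.argmax(cepstrum[qmin:qmax]))
    return sr / lag          # dominant pitch frequency in Hz
```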
d. Chromagram
In the musical context, the chroma feature or chromagram is closely related to the 12 pitch classes. Chroma-based features, also known as pitch class profiles, are a powerful tool for analysing music whose pitches can be meaningfully categorized (often into 12 categories) and whose tuning approximates the equal-tempered scale. The main property of chroma features is that they capture the harmonic and melodic quality of the track while being robust to changes in timbre and instrumentation. The chromagram is a visual representation of the energy within the 12 semitones (or chroma) of the Western music octave, namely C, C#, D, D#, E, F, F#, G, G#, A, A# and B; it shows the energy distribution over the 12 pitch classes.
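A chroma vector can be sketched by folding spectral energy into the 12 pitch classes; in the sketch below the reference frequency C4 ≈ 261.63 Hz and the function name are illustrative choices, not specified in the paper.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma_vector(frame, sr, f_ref=261.63):
    # fold spectral energy into the 12 pitch classes; f_ref is C4 (assumed)
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, e in zip(freqs[1:], power[1:]):      # skip the DC bin
        pc = int(np.round(12 * np.log2(f / f_ref))) % 12
        chroma[pc] += e
    return chroma
```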
K-Means Clustering
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means partitions the observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. New data can be classified into the existing clusters by applying 1-nearest-neighbour classification to the cluster centres obtained from k-means. Given a set of n observations, each a d-dimensional real vector, k-means clustering partitions the n observations into k sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS), i.e., the sum of squared distances from each point to the centre of its cluster. In other words, its objective is to find:

arg min over S of Σ_{i=1}^{k} Σ_{x ∈ Si} ||x − μi||²

where μi is the mean of the points in Si.
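The objective above can be illustrated with a minimal Lloyd-style K-means sketch (random initialization; the sensitivity of this initialization and the proposed incremental alternative are discussed in the next subsection). The function name and parameters are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # X: (n, d) array of n d-dimensional observations
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point joins the cluster with the nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each centroid becomes the mean mu_i of its cluster S_i
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # final assignment and the WCSS objective being minimized
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = d2.min(axis=1).sum()
    return centroids, labels, wcss
```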
Classification using Clustering Algorithm
K-means is simple and can be used for a wide variety of data types, but it is sensitive to the initial positions of the cluster centres. The final cluster centroids may not be optimal because the algorithm can converge to a locally optimal solution. Empty clusters can also arise if no points are assigned to a cluster during the assignment step. It is therefore important to have good initial cluster centres for K-means to work properly. A new cluster-centre initialization algorithm is proposed to provide the initial cluster centres for K-means. The incremental K-means algorithm is as follows. Input: the number of initial clusters (M) and the target number of clusters (K), where M > K.