Several investigations have delved into music recognition, such as the work by [2], which employed a deep autoencoder to extract salient music features. These extracted features were subsequently used to train a neural network dedicated to chorus recognition. The study also discussed feature preprocessing and the choice of hidden-layer structure for the network. The paper reported an evaluation error of less than 5% in chorus recognition when compared with assessments by professional judges. Taking a distinct approach, [3] conducted a survey among Taiwanese music professionals, employing partial least squares (PLS) regression to examine the impact of deep learning technology on the music production industry. The results underscored the profound influence of deep learning on techniques and capabilities in music production, thereby shaping the overall quality of musical compositions. Nevertheless, the study noted the current scarcity of musical works created through deep learning, and highlighted the disparity in understanding and application of deep learning between music production professionals and their counterparts in information technology. This cognitive gap significantly influenced both the overall accuracy of the study and the responses obtained in the questionnaires.
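For context, a deep autoencoder of the kind used in [2] for feature extraction can be sketched as below; the 128-bin spectrogram-frame input, layer widths, and bottleneck size are illustrative assumptions rather than details from the cited study, and PyTorch is likewise an assumed framework.

```python
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    """Minimal deep autoencoder; the bottleneck output serves as the feature vector."""
    def __init__(self, n_in=128, n_bottleneck=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 64), nn.ReLU(),
            nn.Linear(64, n_bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_bottleneck, 64), nn.ReLU(),
            nn.Linear(64, n_in),
        )

    def forward(self, x):
        z = self.encoder(x)          # salient features for a downstream chorus classifier
        return self.decoder(z), z    # reconstruction is used only for unsupervised training

# Unsupervised pre-training step on a batch of spectrogram frames (random stand-in data).
model = DeepAutoencoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(256, 128)       # hypothetical batch of 128-bin frames
recon, features = model(frames)
loss = nn.functional.mse_loss(recon, frames)
loss.backward()
optim.step()
```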
Introducing a hybrid model that combines multimodal and transfer learning-based approaches for classification, [4] adopts an innovative stance. The model is evaluated on the GTZAN and Ballroom datasets: the Ballroom dataset encompasses 698 music files categorized into 8 genres, each lasting 30 seconds, while the GTZAN dataset comprises 1000 music recordings categorized into 10 genres. The transfer learning-based model achieved 64% accuracy on the Ballroom dataset and 71% accuracy on the GTZAN dataset. In the realm of music recognition, [5] crafted a set of features adept at capturing the linguistic and structural characteristics of chorus sections. By efficiently identifying chorus portions in lyrics, their sequence labeling technique provided a sizable training dataset with chorus sections labeled, and their Bi-LSTM-based strategy outperformed other baseline approaches.
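The Bi-LSTM sequence-labeling idea from [5] can be sketched roughly as follows; the vocabulary size, embedding and hidden dimensions, and the binary chorus/non-chorus tag set are chosen purely for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class BiLSTMChorusTagger(nn.Module):
    """Tags each lyric token as chorus (1) or non-chorus (0)."""
    def __init__(self, vocab_size=10_000, emb_dim=100, hidden=128, n_tags=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)   # forward + backward hidden states

    def forward(self, token_ids):                  # (batch, seq_len)
        h, _ = self.bilstm(self.embed(token_ids))  # (batch, seq_len, 2 * hidden)
        return self.out(h)                         # per-position tag logits

# Toy usage: 4 songs, 50 lyric tokens each, random ids and labels as stand-ins.
model = BiLSTMChorusTagger()
tokens = torch.randint(0, 10_000, (4, 50))
labels = torch.randint(0, 2, (4, 50))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))
loss.backward()
```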
Introducing a Multimodal Emotion Recognition (MER) approach, [6] underscored the emotional significance embedded within choruses. This method harnessed the log-mel spectrogram of the chorus music, integrating it with long short-term memory to extract emotional features. The inclusion of an attention mechanism block yielded a 15% relative improvement in R² and a 40% relative improvement in the valence regressor, enhancing the extraction of relevant emotional elements. In the exploration of choral singing datasets, [7] conducted a thorough analysis of contemporary source separation techniques. Their assessment encompassed the evaluation of monophonic F0 estimators and the proposal of an approximation for the perceived F0. Notably, their results indicated higher overall accuracy on songs used for training in comparison to those excluded from training. In [8], an innovative approach involved the creation of a synthesized choral music dataset, significantly enhancing the performance of source separation models on real choral music datasets; the synthesized data demonstrated sufficient quality to contribute meaningfully to the improvement of model performance. A thorough review by [9] explored AI and machine learning algorithms in creative applications, shedding light on challenges despite recent advances. Meanwhile, [10] trained a model on a MIDI dataset and concluded that increasing the number of epochs led to a decrease in loss.
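A minimal sketch of the attention-augmented LSTM valence regressor described for [6] is given below; it assumes precomputed log-mel frames, and the layer sizes and additive-attention pooling are illustrative choices rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AttentionLSTMValence(nn.Module):
    """LSTM over log-mel frames with attention pooling, regressing a valence score."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)     # scores each time step
        self.head = nn.Linear(hidden, 1)     # valence regressor

    def forward(self, logmel):               # (batch, frames, n_mels)
        h, _ = self.lstm(logmel)             # (batch, frames, hidden)
        w = torch.softmax(self.attn(h), dim=1)
        pooled = (w * h).sum(dim=1)          # attention-weighted summary of the chorus
        return self.head(pooled).squeeze(-1)

# Toy usage: 8 chorus excerpts of 300 log-mel frames with 64 mel bands each.
model = AttentionLSTMValence()
x = torch.randn(8, 300, 64)
valence = torch.randn(8)
loss = nn.functional.mse_loss(model(x), valence)
loss.backward()
```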
In [11], a groundbreaking musical sequence model featuring complex nerve cells in the hidden layer surpassed the performance of traditional RNN and HMM models, achieving an 80.6% accuracy in the training set and an impressive 84.6% in the testing set. Presenting a two-layer bidirectional Transformer method for chord generation in [12], the authors outperformed HMM and LSTM models in terms of coherence, pleasantness, and innovation. Leveraging recurrent neural network models, [13] achieved high accuracy in oktoechos genre classification, outperforming i-vector-DNN frameworks. [14] proposed a sophisticated deep learning architecture using LSTM to generate piano symphonies, achieving an average loss of 54%. In the realm of makam recognition in Turkish folk music, [15] employed various machine learning methods and oversampling techniques, resulting in a notable performance increase of up to 29% with data augmentation.
A comprehensive overview by [16] critically analyzed the current state of music generation using deep learning models, highlighting limitations in automatic content evaluation. [17] directed their focus on musical chords, introducing a chord analyzer and proposing a specific musical distance for chord comparison; the incorporation of global key information significantly enhanced classification scores. Offering a comprehensive review of AI-based music creation techniques, [18] underscored the need for objective metrics to measure the quality, creativity, and diversity of generated music. [19] introduced a user-friendly interface paradigm for interactive regeneration of AI-generated music, effectively reducing the requirement for specific domain knowledge. Considering deep models as creative, [20] cited TimbreNet and StyleGAN Pianorolls as noteworthy examples of AI-generated musical chords and excerpts. In [21], RNN and LSTM were effectively employed to learn polyphonic musical note sequences, achieving an accuracy of 97.23% at the last epoch. The anticipation-RNN introduced in [22] presented a novel architecture proficient in generating melodies that satisfy user-defined unary constraints. Training a neural network on Bach's musical style in [23] achieved F-measures of 0.83 in training and 0.71 in testing; the process involved utilizing MIDI files and applying an augmentation procedure that transposed the files into different keys before integrating them into the neural network for training.
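The key-transposition augmentation reported for [23] can be approximated as in the following sketch; the use of the pretty_midi library, the semitone range, and the file naming are assumed tooling choices, not details from the cited work.

```python
import pretty_midi

def transpose_midi(path, semitones):
    """Return a copy of the MIDI file with every pitched note shifted by `semitones`."""
    pm = pretty_midi.PrettyMIDI(path)
    for instrument in pm.instruments:
        if instrument.is_drum:          # drum tracks are unpitched, leave them alone
            continue
        for note in instrument.notes:
            note.pitch = min(127, max(0, note.pitch + semitones))
    return pm

def augment_to_all_keys(path, out_prefix):
    """Write one transposed copy per key offset from -6 to +5 semitones."""
    for shift in range(-6, 6):
        transposed = transpose_midi(path, shift)
        transposed.write(f"{out_prefix}_shift{shift:+d}.mid")

# augment_to_all_keys("chorale.mid", "chorale")   # hypothetical input file
```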
In [24], in the pursuit of refining the accuracy of sheet music generated by previous methodologies, the focus was on enhancing the source separation and chord estimation modules. Leveraging recurrent neural networks (RNN) with gated recurrent units (GRU) and long short-term memory (LSTM), the improvements were substantial, achieving an accuracy increase of up to 78%. [25] similarly employed long short-term memory (LSTM) and gated recurrent unit (GRU) networks for constructing generator and evaluator models. The process involved converting a MIDI file into a MIDI matrix through a MIDI encoding procedure. Single-layer and double-stacked-layer models of each network were then trained on the encoded MIDI data as generator models. A classification model, rooted in LSTM and GRU, was trained and designated as an objective evaluator, scrutinizing the performance of each generator model and classifying each MIDI by its musical era. The training accuracy for LSTM was 0.8831 and for GRU 0.8993, while the validation accuracies were 0.9143 and 0.8714, respectively. In [26], a collection of existing Guzheng music pieces was gathered and converted into Musical Instrument Digital Interface (MIDI) format. A Long Short-Term Memory (LSTM) network trained on this dataset generated new Guzheng music pieces, and Reinforcement Learning was then applied to optimize the LSTM network by incorporating specific Guzheng playing techniques. The evaluation involved a set of Guzheng music pieces, including those generated by the LSTM, LSTM + RL, and GAN models; interviewees rated the techniques on a scale of 10, with results of 5.4, 8.99, and 6.02, respectively.
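One plausible realization of the MIDI-matrix encoding and recurrent generator described for [25] is sketched below; the binary piano-roll representation, the 16-frames-per-second rate, and the single-layer GRU sizes are assumptions made for illustration rather than the authors' settings.

```python
import numpy as np
import pretty_midi
import torch
import torch.nn as nn

def midi_to_matrix(path, fs=16):
    """Encode a MIDI file as a binary (time, 128) piano-roll matrix."""
    pm = pretty_midi.PrettyMIDI(path)
    roll = pm.get_piano_roll(fs=fs)          # (128 pitches, time frames)
    return (roll.T > 0).astype(np.float32)   # transpose and binarize

class GRUGenerator(nn.Module):
    """Single-layer GRU predicting the next piano-roll frame from the previous ones."""
    def __init__(self, n_pitches=128, hidden=256):
        super().__init__()
        self.gru = nn.GRU(n_pitches, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_pitches)

    def forward(self, frames):                # (batch, time, 128)
        h, _ = self.gru(frames)
        return self.out(h)                    # logits for the next frame at each step

# Training target: each frame predicts the frame that follows it.
model = GRUGenerator()
matrix = torch.rand(1, 200, 128).round()      # stand-in for torch.tensor(midi_to_matrix(...))
logits = model(matrix[:, :-1])
loss = nn.functional.binary_cross_entropy_with_logits(logits, matrix[:, 1:])
loss.backward()
```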
[17] introduced the MT-GPT-2 (music textual GPT-2) model for music melody generation, incorporating transfer learning and the generative pre-training-2 (GPT-2) text generation model. Additionally, the study proposed the Symbolic Music Evaluation Method (MEM) to evaluate music objectively, combining mathematical statistics, music theory knowledge, and signal processing techniques. The evaluation concluded that the music generated by the MT-GPT-2 model exhibited greater variability and closely resembled real music in every aspect. Another study devised the multi-style chord music generation (MSCMG) network, first building a hidden Markov model (HMM) for chord recognition in music. The evaluation of the chord music generation technique involved the application of LSTM neural networks. The HMM achieved a commendable chord recognition rate of 81.8% for piano compositions, while the MSCMG algorithm attained a similarity score of 82.1% in generating classical-style music.
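The HMM chord-recognition stage of such a pipeline can be illustrated with a standard Viterbi decode over per-frame chord likelihoods; the 24-chord vocabulary, transition matrix, and random emission scores below are placeholders rather than values from the cited work.

```python
import numpy as np

def viterbi_chords(log_emission, log_transition, log_prior):
    """Most likely chord sequence given per-frame log-likelihoods.

    log_emission:   (T, K) log P(frame_t | chord_k), e.g. from chroma templates
    log_transition: (K, K) log P(chord_j | chord_i)
    log_prior:      (K,)   log P(chord at t=0)
    """
    T, K = log_emission.shape
    delta = np.zeros((T, K))
    backptr = np.zeros((T, K), dtype=int)
    delta[0] = log_prior + log_emission[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_transition   # (K, K) candidate paths
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emission[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                         # trace back the best path
        path[t] = backptr[t + 1, path[t + 1]]
    return path

# Toy decode over 24 major/minor chord states and 100 frames of random scores.
K, T = 24, 100
rng = np.random.default_rng(0)
emission = np.log(rng.dirichlet(np.ones(K), size=T))
transition = np.log(np.full((K, K), 0.02) + np.eye(K) * 0.52)  # sticky self-transitions
prior = np.log(np.full(K, 1.0 / K))
print(viterbi_chords(emission, transition, prior)[:10])
```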
Another work introduced a subjective approach to evaluating AI-based music composition systems by posing questions related to fundamental music principles to individuals with varying degrees of musical experience and knowledge. This method was employed to compare state-of-the-art deep learning models for music composition. The results indicated MuseGAN scores of 3.12 and 2.73 for non-pro and pro users, respectively, while DeepBach received an average score above 4 from intermediate and pro users. Pro users tended to select pieces generated with the MMM model more frequently as human-composed, whereas beginner and intermediate users favored the DeepBach model as the closest to human compositions. A further study provided a conceptual framework for classifying the various types of deep learning-based music generation systems currently in use; the analysis encompassed different representation types, basic architectures and strategies, and various approaches to constructing compound architectures.