Several investigations have delved into music recognition, such as the work by [2], which employed a deep autoencoder to extract salient music features. These extracted features were subsequently used to train a neural network dedicated to chorus recognition. The study also discussed feature preprocessing and the choice of hidden-layer structure for the network. The paper reported an evaluation error of less than 5% in chorus recognition when compared with assessments by professional judges. Taking a distinct approach, [3] conducted a survey among Taiwanese music professionals, employing partial least squares (PLS) regression to examine the impact of deep learning technology on the music production industry. The results underscored the profound influence of deep learning on techniques and capabilities in music production, thereby shaping the overall quality of musical compositions. Nevertheless, the study noted the current scarcity of musical works created through deep learning, and highlighted the disparity in understanding and application of deep learning between music production professionals and their counterparts in information technology. This cognitive gap significantly influenced both the overall accuracy of the study and the responses obtained in the questionnaires.
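For context, a deep autoencoder of the kind used in [2] for feature extraction can be sketched as below; the 128-bin spectrogram-frame input, layer widths, and bottleneck size are illustrative assumptions rather than details from the cited study, and PyTorch is likewise an assumed framework.

```python
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    """Minimal deep autoencoder; the bottleneck output serves as the feature vector."""
    def __init__(self, n_in=128, n_bottleneck=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 64), nn.ReLU(),
            nn.Linear(64, n_bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_bottleneck, 64), nn.ReLU(),
            nn.Linear(64, n_in),
        )

    def forward(self, x):
        z = self.encoder(x)          # salient features for a downstream chorus classifier
        return self.decoder(z), z    # reconstruction is used only for unsupervised training

# Unsupervised pre-training step on a batch of spectrogram frames (random stand-in data).
model = DeepAutoencoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(256, 128)       # hypothetical batch of 128-bin frames
recon, features = model(frames)
loss = nn.functional.mse_loss(recon, frames)
loss.backward()
optim.step()
```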
Introducing a hybrid model that combines multimodal and transfer learning-based approaches for classification, [4] adopts an innovative stance. The model is evaluated on the GTZAN and Ballroom datasets: the Ballroom dataset encompasses 698 music files categorized into 8 genres, each lasting 30 seconds, while the GTZAN dataset comprises 1000 music recordings categorized into 10 genres. The transfer learning-based model achieved 64% accuracy on the Ballroom dataset and 71% accuracy on the GTZAN dataset. In the realm of music recognition, [5] crafted a set of features adept at capturing the linguistic and structural characteristics of chorus sections. By efficiently identifying chorus portions in lyrics, their sequence labeling technique provided a sizable training dataset with chorus sections labeled, and their Bi-LSTM-based strategy outperformed other baseline approaches.
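The Bi-LSTM sequence-labeling idea from [5] can be sketched roughly as follows; the vocabulary size, embedding and hidden dimensions, and the binary chorus/non-chorus tag set are chosen purely for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class BiLSTMChorusTagger(nn.Module):
    """Tags each lyric token as chorus (1) or non-chorus (0)."""
    def __init__(self, vocab_size=10_000, emb_dim=100, hidden=128, n_tags=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)   # forward + backward hidden states

    def forward(self, token_ids):                  # (batch, seq_len)
        h, _ = self.bilstm(self.embed(token_ids))  # (batch, seq_len, 2 * hidden)
        return self.out(h)                         # per-position tag logits

# Toy usage: 4 songs, 50 lyric tokens each, random ids and labels as stand-ins.
model = BiLSTMChorusTagger()
tokens = torch.randint(0, 10_000, (4, 50))
labels = torch.randint(0, 2, (4, 50))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))
loss.backward()
```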
Introducing a Multimodal Emotion Recognition (MER) approach, [6] underscored the emotional significance embedded within choruses. This method harnessed the log-mel spectrogram of the chorus music, integrating it with long short-term memory to extract emotional features. The inclusion of an attention mechanism block yielded a 15% relative improvement in R² and a 40% relative improvement in the valence regressor, enhancing the extraction of relevant emotional elements. In the exploration of choral singing datasets, [7] conducted a thorough analysis of contemporary source separation techniques. Their assessment encompassed the evaluation of monophonic F0 estimators and the proposal of an approximation for the perceived F0. Notably, their results indicated higher overall accuracy on songs used for training in comparison to those excluded from training. In [8], an innovative approach involved the creation of a synthesized choral music dataset, significantly enhancing the performance of source separation models on real choral music datasets; the synthesized data demonstrated sufficient quality to contribute meaningfully to the improvement of model performance. A thorough review by [9] explored AI and machine learning algorithms in creative applications, shedding light on challenges despite recent advances. Meanwhile, [10] trained a model on a MIDI dataset and concluded that increasing the number of epochs led to a decrease in loss.
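A minimal sketch of the attention-augmented LSTM valence regressor described for [6] is given below; it assumes precomputed log-mel frames, and the layer sizes and additive-attention pooling are illustrative choices rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AttentionLSTMValence(nn.Module):
    """LSTM over log-mel frames with attention pooling, regressing a valence score."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)     # scores each time step
        self.head = nn.Linear(hidden, 1)     # valence regressor

    def forward(self, logmel):               # (batch, frames, n_mels)
        h, _ = self.lstm(logmel)             # (batch, frames, hidden)
        w = torch.softmax(self.attn(h), dim=1)
        pooled = (w * h).sum(dim=1)          # attention-weighted summary of the chorus
        return self.head(pooled).squeeze(-1)

# Toy usage: 8 chorus excerpts of 300 log-mel frames with 64 mel bands each.
model = AttentionLSTMValence()
x = torch.randn(8, 300, 64)
valence = torch.randn(8)
loss = nn.functional.mse_loss(model(x), valence)
loss.backward()
```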
In [11], a groundbreaking musical sequence model featuring complex nerve cells in the hidden layer surpassed the performance of traditional RNN and HMM models, achieving an 80.6% accuracy in the training set and an impressive 84.6% in the testing set. Presenting a two-layer bidirectional Transformer method for chord generation in [12], the authors outperformed HMM and LSTM models in terms of coherence, pleasantness, and innovation. Leveraging recurrent neural network models, [13] achieved high accuracy in oktoechos genre classification, outperforming i-vector-DNN frameworks. [14] proposed a sophisticated deep learning architecture using LSTM to generate piano symphonies, achieving an average loss of 54%. In the realm of makam recognition in Turkish folk music, [15] employed various machine learning methods and oversampling techniques, resulting in a notable performance increase of up to 29% with data augmentation.
A comprehensive overview by [16] critically analyzed the current state of music generation using deep learning models, highlighting limitations in automatic content evaluation. [17] directed their focus on musical chords, introducing a chord analyzer and proposing a specific musical distance for chord comparison; the incorporation of global key information significantly enhanced classification scores. Offering a comprehensive review of AI-based music creation techniques, [18] underscored the need for objective metrics to measure the quality, creativity, and diversity of generated music. [19] introduced a user-friendly interface paradigm for interactive regeneration of AI-generated music, effectively reducing the requirement for specific domain knowledge. Considering deep models as creative, [20] cited TimbreNet and StyleGAN Pianorolls as noteworthy examples of AI-generated musical chords and excerpts. In [21], RNN and LSTM were effectively employed to learn polyphonic musical note sequences, achieving an accuracy of 97.23% at the last epoch. The anticipation-RNN introduced in [22] presented a novel architecture proficient in generating melodies that satisfy user-defined unary constraints. Training a neural network on Bach's musical style in [23] achieved F-measures of 0.83 in training and 0.71 in testing; the process involved utilizing MIDI files and applying an augmentation procedure that transposed the files into different keys before integrating them into the neural network for training.
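The key-transposition augmentation reported for [23] can be approximated as in the following sketch; the use of the pretty_midi library, the semitone range, and the file naming are assumed tooling choices, not details from the cited work.

```python
import pretty_midi

def transpose_midi(path, semitones):
    """Return a copy of the MIDI file with every pitched note shifted by `semitones`."""
    pm = pretty_midi.PrettyMIDI(path)
    for instrument in pm.instruments:
        if instrument.is_drum:          # drum tracks are unpitched, leave them alone
            continue
        for note in instrument.notes:
            note.pitch = min(127, max(0, note.pitch + semitones))
    return pm

def augment_to_all_keys(path, out_prefix):
    """Write one transposed copy per key offset from -6 to +5 semitones."""
    for shift in range(-6, 6):
        transposed = transpose_midi(path, shift)
        transposed.write(f"{out_prefix}_shift{shift:+d}.mid")

# augment_to_all_keys("chorale.mid", "chorale")   # hypothetical input file
```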
In [24], in the pursuit of refining the accuracy of sheet music generated by previous methodologies, the focus was on enhancing the source separation and chord estimation modules. Leveraging recurrent neural networks (RNN) with gated recurrent units (GRU) and long short-term memory (LSTM), the improvements were substantial, achieving an accuracy increase of up to 78%. [25] similarly employed long short-term memory (LSTM) and gated recurrent unit (GRU) networks for constructing generator and evaluator models. The process involved converting a MIDI file into a MIDI matrix through a MIDI encoding procedure. Single-layer and double-stacked-layer models of each network were then trained on the encoded MIDI data as generator models. A classification model, rooted in LSTM and GRU, was trained and designated as an objective evaluator, scrutinizing the performance of each generator model and classifying each MIDI by its musical era. The training accuracy for LSTM was 0.8831 and for GRU 0.8993, while the validation accuracies were 0.9143 and 0.8714, respectively. In [26], a collection of existing Guzheng music pieces was gathered and converted into Musical Instrument Digital Interface (MIDI) format. A Long Short-Term Memory (LSTM) network trained on this dataset generated new Guzheng music pieces, and Reinforcement Learning was then applied to optimize the LSTM network by incorporating specific Guzheng playing techniques. The evaluation involved a set of Guzheng music pieces, including those generated by the LSTM, LSTM + RL, and GAN models; interviewees rated the techniques on a scale of 10, with results of 5.4, 8.99, and 6.02, respectively.
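One plausible realization of the MIDI-matrix encoding and recurrent generator described for [25] is sketched below; the binary piano-roll representation, the 16-frames-per-second rate, and the single-layer GRU sizes are assumptions made for illustration rather than the authors' settings.

```python
import numpy as np
import pretty_midi
import torch
import torch.nn as nn

def midi_to_matrix(path, fs=16):
    """Encode a MIDI file as a binary (time, 128) piano-roll matrix."""
    pm = pretty_midi.PrettyMIDI(path)
    roll = pm.get_piano_roll(fs=fs)          # (128 pitches, time frames)
    return (roll.T > 0).astype(np.float32)   # transpose and binarize

class GRUGenerator(nn.Module):
    """Single-layer GRU predicting the next piano-roll frame from the previous ones."""
    def __init__(self, n_pitches=128, hidden=256):
        super().__init__()
        self.gru = nn.GRU(n_pitches, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_pitches)

    def forward(self, frames):                # (batch, time, 128)
        h, _ = self.gru(frames)
        return self.out(h)                    # logits for the next frame at each step

# Training target: each frame predicts the frame that follows it.
model = GRUGenerator()
matrix = torch.rand(1, 200, 128).round()      # stand-in for torch.tensor(midi_to_matrix(...))
logits = model(matrix[:, :-1])
loss = nn.functional.binary_cross_entropy_with_logits(logits, matrix[:, 1:])
loss.backward()
```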
[17] introduced the MT-GPT-2 (music textual GPT-2) model for music melody generation, incorporating transfer learning and the generative pre-training-2 (GPT-2) text generation model. Additionally, the study proposed the Symbolic Music Evaluation Method (MEM) to evaluate music objectively, combining mathematical statistics, music theory knowledge, and signal processing techniques. The evaluation concluded that the music generated by the MT-GPT-2 model exhibited greater variability and closely resembled real music in every aspect. Another study devised the multi-style chord music generation (MSCMG) network, first building a hidden Markov model (HMM) for chord recognition in music. The evaluation of the chord music generation technique involved the application of LSTM neural networks. The HMM achieved a commendable chord recognition rate of 81.8% for piano compositions, while the MSCMG algorithm attained a similarity score of 82.1% in generating classical-style music.
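The HMM chord-recognition stage of such a pipeline can be illustrated with a standard Viterbi decode over per-frame chord likelihoods; the 24-chord vocabulary, transition matrix, and random emission scores below are placeholders rather than values from the cited work.

```python
import numpy as np

def viterbi_chords(log_emission, log_transition, log_prior):
    """Most likely chord sequence given per-frame log-likelihoods.

    log_emission:   (T, K) log P(frame_t | chord_k), e.g. from chroma templates
    log_transition: (K, K) log P(chord_j | chord_i)
    log_prior:      (K,)   log P(chord at t=0)
    """
    T, K = log_emission.shape
    delta = np.zeros((T, K))
    backptr = np.zeros((T, K), dtype=int)
    delta[0] = log_prior + log_emission[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_transition   # (K, K) candidate paths
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emission[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                         # trace back the best path
        path[t] = backptr[t + 1, path[t + 1]]
    return path

# Toy decode over 24 major/minor chord states and 100 frames of random scores.
K, T = 24, 100
rng = np.random.default_rng(0)
emission = np.log(rng.dirichlet(np.ones(K), size=T))
transition = np.log(np.full((K, K), 0.02) + np.eye(K) * 0.52)  # sticky self-transitions
prior = np.log(np.full(K, 1.0 / K))
print(viterbi_chords(emission, transition, prior)[:10])
```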
Another work introduced a subjective approach to evaluating AI-based music composition systems by posing questions related to fundamental music principles to individuals with varying degrees of musical experience and knowledge. This method was employed to compare state-of-the-art deep learning models for music composition. The results indicated MuseGAN scores of 3.12 and 2.73 for non-pro and pro users, respectively, while DeepBach received an average score above 4 from intermediate and pro users. Pro users tended to select pieces generated with the MMM model more frequently as human-composed, whereas beginner and intermediate users favored the DeepBach model as the closest to human compositions. A further study provided a conceptual framework for classifying the various types of deep learning-based music generation systems currently in use; the analysis encompassed different representation types, basic architectures and strategies, and various approaches to constructing compound architectures.