In recent years, deep generative models have attracted considerable attention from researchers because of their ability to approximate arbitrary functions and to generate realistic data distributions. Most existing approaches adapt the DCGAN architecture proposed by Radford et al. [10] or use VAE variants to perform speaker verification separately. In this study we propose the MTVAEGAN model for speaker verification, which combines the strengths of CNN and VAE-GAN networks. For processing efficiency, 3D-MFCC features are used as input.
2.1 Dataset
To explore the behaviour of the model on two contrasting corpora, the LibriSpeech [14] and VoxCeleb [15] datasets were employed first. LibriSpeech is a corpus of clean English read speech that is well suited for introductory use and for testing the model's ability to extract input features, whereas VoxCeleb contains approximately 100,000 utterances from 1,251 celebrities and is largely gender-balanced. With their different accents, occupations and ages, these celebrities constitute very challenging "in-the-wild" speech that can be used to test the model's robustness to noise.
Furthermore, we use the TIMIT [17] data to verify the performance of the models trained on each of these two corpora.
2.2 Data representation
We extract MFCC features from raw audio slices of 3 s each, using 40 mel filters. Spectral features are computed over overlapping 25 ms windows with a stride of 10 ms, so each MFCC matrix has dimension 299×40. Following the input structure of the 3D-CNN model [16], we cut the 299×40 matrix 20 times at fixed positions, each time extracting an 80×40 patch, to form a 3D feature of size 20×80×40. Shorter speech signals are extended by replicating and splicing their own frames.
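As an illustration of this pipeline, the following Python sketch shows how the 3D-MFCC input could be built with librosa; the 16 kHz sampling rate, the use of librosa, and the evenly spaced window positions are assumptions for illustration, since the paper only fixes the window/stride sizes and the 20×80×40 output shape.

import numpy as np
import librosa

def extract_3d_mfcc(wav_path, sr=16000, n_mfcc=40, win_ms=25, hop_ms=10,
                    n_windows=20, window_frames=80, target_frames=299):
    # Load a 3 s slice and compute 40-dimensional MFCCs over 25 ms windows with a 10 ms stride.
    y, _ = librosa.load(wav_path, sr=sr, duration=3.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_mels=40,
                                n_fft=int(sr * win_ms / 1000),
                                hop_length=int(sr * hop_ms / 1000)).T  # (frames, 40)
    # Shorter signals are extended by replicating (splicing) their own frames.
    if mfcc.shape[0] < target_frames:
        reps = int(np.ceil(target_frames / mfcc.shape[0]))
        mfcc = np.tile(mfcc, (reps, 1))
    mfcc = mfcc[:target_frames]                                        # 299 x 40
    # Cut 20 windows of 80 frames at fixed (here: evenly spaced) start positions.
    starts = np.linspace(0, target_frames - window_frames, n_windows).astype(int)
    return np.stack([mfcc[s:s + window_frames] for s in starts])       # 20 x 80 x 40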
In the enrollment phase, 30% of the test data is used for registration and averaged to form each speaker model; the remaining 70% of the test data is used for evaluation. The cosine distance is used to measure the similarity between embeddings produced by the encoder.
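A minimal sketch of this enrollment and scoring protocol is given below; the function names are illustrative, and it assumes the encoder embeddings are available as NumPy arrays.

import numpy as np

def enroll(embeddings):
    # Average the encoder embeddings of one speaker's enrollment utterances
    # to obtain the speaker model, then length-normalize it.
    model = np.mean(embeddings, axis=0)
    return model / np.linalg.norm(model)

def cosine_score(speaker_model, test_embedding):
    # Cosine similarity between the speaker model and a test embedding;
    # a higher score means the two are more likely from the same speaker.
    test_embedding = test_embedding / np.linalg.norm(test_embedding)
    return float(np.dot(speaker_model, test_embedding))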
2.3 Network Architecture
The architecture of our network is shown in Fig. 1. It consists of four modules, each marked in a different color.
- Encoder: This module extracts feature information from the real samples. Its structure follows the 3D-CNN, but it ends with three fully connected layers. Two of them output the mean and variance used to generate the latent code z, while the third produces the speaker embedding. In this way, the original features are used to verify the speaker.
- Decoder/Generator: This module reconstructs the input from the latent code z extracted by the encoder and forces the reconstructed features to be closer to the input samples by minimizing a hybrid loss: a pixel-wise reconstruction loss (\(\mathcal{L}_{\mathrm{like}}^{\mathrm{pixel}}\)) and a Kullback-Leibler loss (\(\mathcal{L}_{\mathrm{prior}}\)); a minimal sketch of this loss follows the list.
- Discriminator: This module receives two kinds of inputs, the real features extracted from speech and the fake features from the generator. It determines the authenticity of the input data and pushes the generator to produce more realistic features.
- Classifier: The main architecture is again a 3D-CNN, but it does not share parameters with the encoder. This module is used only to extract features from the samples created by the generator. In a comparison experiment on the less challenging LibriSpeech data, we removed this classifier from the model and found that its presence increases accuracy by 39.33% for the same number of training epochs.
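To make the Decoder/Generator objective concrete, the sketch below shows one possible PyTorch form of the hybrid loss, i.e. a pixel-wise reconstruction term plus the KL prior term; the mean-squared-error criterion and the equal weighting of the two terms are assumptions, not taken from the paper.

import torch
import torch.nn.functional as F

def vae_losses(x, x_recon, mu, logvar, kl_weight=1.0):
    # L_like^pixel: element-wise reconstruction error on the 20x80x40 features.
    recon = F.mse_loss(x_recon, x, reduction="mean")
    # L_prior: KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl, recon, kl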
2.4 Experimental results
We first trained the MTVAEGAN and 3DCNN models on the VoxCeleb1 data (1,211 speakers were used for training; the 40 speakers whose names start with "E" were enrolled and evaluated). Second, we fed the train-clean-360 subset of LibriSpeech to the models (921 speakers were used for training; 40 speakers from the test-clean subset were enrolled and evaluated). Before the experiments, all audio samples were processed with voice activity detection (VAD) to remove the silent parts of speech.
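The paper does not specify which VAD tool was used; as a simple stand-in, the sketch below drops frames whose short-time energy falls below a fraction of the utterance mean, with the frame length and threshold chosen purely for illustration.

import numpy as np

def remove_silence(y, sr=16000, frame_ms=30, threshold=0.1):
    # Split the waveform into fixed-length frames and keep only frames whose
    # energy exceeds a fraction of the average energy (i.e. non-silent frames).
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    keep = energy > threshold * energy.mean()
    return frames[keep].reshape(-1)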
In these comparison experiments, the MTVAEGAN model surpassed the 3DCNN model on a variety of metrics. The results are shown in Table 1.
Table 1
Comparison between the MTVAEGAN model and the 3DCNN baseline on clean utterances and in-the-wild utterances, demonstrating the effectiveness and reliability of the proposed method.
Dataset               | Model            | ACC    | EER   | AUC    | F1
----------------------|------------------|--------|-------|--------|-------
Librispeech (epoch 9) | MTVAEGAN         | 97.20% | 1.81% | 99.79% | 98.58%
Librispeech (epoch 9) | 3DCNN (baseline) | 96.49% | 2.35% | 99.64% | 98.20%
VoxCeleb1 (epoch 36)  | MTVAEGAN         | 85.83% | 6.82% | 98.49% | 92.78%
VoxCeleb1 (epoch 36)  | 3DCNN (baseline) | 76.96% | 8.43% | 97.18% | 88.18%
We also used the trained models to evaluate the TIMIT data by randomly dividing its 168 speakers into four groups. As shown in Table 2, accuracy improved by 2.58% for the model trained on LibriSpeech (9 epochs) and by 4.85% for the model trained on VoxCeleb1 (36 epochs), compared with MTGAN [28].
Table 2
Cross-dataset evaluation under different models. The proposed model, whether trained on clean or in-the-wild utterances, always achieves better accuracy.
Train                   | Evaluate | Model            | ACC
------------------------|----------|------------------|-------
Librispeech (epoch 9)   | Timit    | MTVAEGAN         | 95.23%
Librispeech (epoch 100) | Timit    | MTGAN (baseline) | 92.65%
VoxCeleb1 (epoch 36)    | Timit    | MTVAEGAN         | 97.50%
2.5 Deep Experiments
The model contains two speaker verification parts: one resides in the encoder network and processes real features, while the other is the independent classifier, which processes generated features. We ran experiments on the LibriSpeech and VoxCeleb1 data to measure the performance of each part in terms of accuracy, equal error rate and F1 score. The results are shown in Figs. 2-4.
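For reference, the equal error rate used in these comparisons can be computed from verification scores and trial labels as sketched below; the scikit-learn based implementation is an assumption, since the paper does not state how its metrics were computed.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # labels: 1 for target (same-speaker) trials, 0 for impostor trials;
    # scores: cosine similarity scores from the verification system.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FAR == FRR
    return (fpr[idx] + fnr[idx]) / 2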
In summary, when training on clean LibriSpeech data, the classifier network quickly improves model performance by recognizing the reconstructed data. When training on noisy data, the classifier network still drives performance up in the early stages, but as its ability to reconstruct the data is limited and it cannot provide more precise features, the encoder network gradually takes over and continues to improve model performance.
Finally, we performed ablation experiments (Table 3) to examine whether the classifier network is essential by removing it. These experiments show that the presence of the classifier network speeds up convergence and improves performance.
Table 3
Ablation experiments at the same training epoch, showing that the classifier network rapidly contributes to the efficiency of the entire model.
Train                 | Evaluate    | Model                    | ACC    | EER
----------------------|-------------|--------------------------|--------|-------
Librispeech (epoch 9) | Librispeech | MTVAEGAN                 | 95.23% | 2.49%
Librispeech (epoch 9) | Librispeech | MTVAEGAN (no classifier) | 57.87% | 17.66%