In recent years, deep generative models have attracted considerable attention from researchers because of their ability to approximate arbitrary functions and to generate realistic data distributions. Most existing approaches adapt the DCGAN architecture proposed by Radford et al. [10] or use VAE variants to perform speaker verification separately. In this study we propose the MTVAEGAN model for speaker verification, which combines the strengths of CNN and VAE-GAN networks. For processing efficiency, 3D-MFCC features are used as input.
2.1 Dataset
To explore the behaviour of the model on two contrasting corpora, the LibriSpeech [14] and VoxCeleb [15] datasets were employed first. LibriSpeech is a corpus of clean English read speech that is well suited for introductory use and for testing the model's ability to extract input features, whereas VoxCeleb contains approximately 100,000 utterances from 1,251 celebrities and is largely gender-balanced. With their different accents, occupations and ages, these celebrities constitute very challenging "in-the-wild" speech that can be used to test the model's robustness to noise.
Furthermore, we use the TIMIT [17] data to verify the performance of the models trained on each of these two corpora.
2.2 Data representation
We extract MFCC features from raw audio slices of 3 s each, using 40 mel filters. Spectral features are computed over overlapping 25 ms windows with a stride of 10 ms, so each MFCC matrix has dimension 299×40. Following the input structure of the 3D-CNN model [16], we cut the 299×40 matrix 20 times at fixed positions, each time extracting an 80×40 patch, to form a 3D feature of size 20×80×40. Shorter speech signals are extended by replicating and splicing their own frames.
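As an illustration of this pipeline, the following Python sketch shows how the 3D-MFCC input could be built with librosa; the 16 kHz sampling rate, the use of librosa, and the evenly spaced window positions are assumptions for illustration, since the paper only fixes the window/stride sizes and the 20×80×40 output shape.

import numpy as np
import librosa

def extract_3d_mfcc(wav_path, sr=16000, n_mfcc=40, win_ms=25, hop_ms=10,
                    n_windows=20, window_frames=80, target_frames=299):
    # Load a 3 s slice and compute 40-dimensional MFCCs over 25 ms windows with a 10 ms stride.
    y, _ = librosa.load(wav_path, sr=sr, duration=3.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_mels=40,
                                n_fft=int(sr * win_ms / 1000),
                                hop_length=int(sr * hop_ms / 1000)).T  # (frames, 40)
    # Shorter signals are extended by replicating (splicing) their own frames.
    if mfcc.shape[0] < target_frames:
        reps = int(np.ceil(target_frames / mfcc.shape[0]))
        mfcc = np.tile(mfcc, (reps, 1))
    mfcc = mfcc[:target_frames]                                        # 299 x 40
    # Cut 20 windows of 80 frames at fixed (here: evenly spaced) start positions.
    starts = np.linspace(0, target_frames - window_frames, n_windows).astype(int)
    return np.stack([mfcc[s:s + window_frames] for s in starts])       # 20 x 80 x 40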
In the enrollment phase, 30% of the test data is used for registration and averaged to form each speaker model; the remaining 70% of the test data is used for evaluation. The cosine distance is used to measure the similarity between embeddings produced by the encoder.
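A minimal sketch of this enrollment and scoring protocol is given below; the function names are illustrative, and it assumes the encoder embeddings are available as NumPy arrays.

import numpy as np

def enroll(embeddings):
    # Average the encoder embeddings of one speaker's enrollment utterances
    # to obtain the speaker model, then length-normalize it.
    model = np.mean(embeddings, axis=0)
    return model / np.linalg.norm(model)

def cosine_score(speaker_model, test_embedding):
    # Cosine similarity between the speaker model and a test embedding;
    # a higher score means the two are more likely from the same speaker.
    test_embedding = test_embedding / np.linalg.norm(test_embedding)
    return float(np.dot(speaker_model, test_embedding))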
2.3 Network Architecture
The architecture of our network is shown in Fig. 1. It consists of four modules, each marked in a different color.
- Encoder: This module extracts feature information from the real samples. Its structure follows the 3D-CNN, but it ends with three fully connected layers. Two of them output the mean and variance used to generate the latent code z, while the third produces the speaker embedding. In this way, the original features are used to verify the speaker.
- Decoder/Generator: This module reconstructs the input from the latent code z extracted by the encoder and forces the reconstructed features to be closer to the input samples by minimizing a hybrid loss: a pixel-wise reconstruction loss (\(\mathcal{L}_{\mathrm{like}}^{\mathrm{pixel}}\)) and a Kullback-Leibler loss (\(\mathcal{L}_{\mathrm{prior}}\)); a minimal sketch of this loss follows the list.
- Discriminator: This module receives two kinds of inputs, the real features extracted from speech and the fake features from the generator. It determines the authenticity of the input data and pushes the generator to produce more realistic features.
- Classifier: The main architecture is again a 3D-CNN, but it does not share parameters with the encoder. This module is used only to extract features from the samples created by the generator. In a comparison experiment on the less challenging LibriSpeech data, we removed this classifier from the model and found that its presence increases accuracy by 39.33% for the same number of training epochs.
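To make the Decoder/Generator objective concrete, the sketch below shows one possible PyTorch form of the hybrid loss, i.e. a pixel-wise reconstruction term plus the KL prior term; the mean-squared-error criterion and the equal weighting of the two terms are assumptions, not taken from the paper.

import torch
import torch.nn.functional as F

def vae_losses(x, x_recon, mu, logvar, kl_weight=1.0):
    # L_like^pixel: element-wise reconstruction error on the 20x80x40 features.
    recon = F.mse_loss(x_recon, x, reduction="mean")
    # L_prior: KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl, recon, kl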
2.4 Experimental results
We first trained the MTVAEGAN and 3DCNN models on the VoxCeleb1 data (1,211 speakers were used for training; the 40 speakers whose names start with "E" were enrolled and evaluated). Second, we fed the train-clean-360 subset of LibriSpeech to the models (921 speakers were used for training; 40 speakers from the test-clean subset were enrolled and evaluated). Before the experiments, all audio samples were processed with voice activity detection (VAD) to remove the silent parts of speech.
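The paper does not specify which VAD tool was used; as a simple stand-in, the sketch below drops frames whose short-time energy falls below a fraction of the utterance mean, with the frame length and threshold chosen purely for illustration.

import numpy as np

def remove_silence(y, sr=16000, frame_ms=30, threshold=0.1):
    # Split the waveform into fixed-length frames and keep only frames whose
    # energy exceeds a fraction of the average energy (i.e. non-silent frames).
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    keep = energy > threshold * energy.mean()
    return frames[keep].reshape(-1)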
In these comparison experiments, the MTVAEGAN model surpassed the 3DCNN model on a variety of metrics. The results are shown in Table 1.
Table 1
Comparison between the MTVAEGAN model and the 3DCNN baseline on clean utterances and in-the-wild utterances, demonstrating the effectiveness and reliability of the proposed method.
Dataset               | Model            | ACC    | EER   | AUC    | F1
----------------------|------------------|--------|-------|--------|-------
Librispeech (epoch 9) | MTVAEGAN         | 97.20% | 1.81% | 99.79% | 98.58%
Librispeech (epoch 9) | 3DCNN (baseline) | 96.49% | 2.35% | 99.64% | 98.20%
VoxCeleb1 (epoch 36)  | MTVAEGAN         | 85.83% | 6.82% | 98.49% | 92.78%
VoxCeleb1 (epoch 36)  | 3DCNN (baseline) | 76.96% | 8.43% | 97.18% | 88.18%
We also used the trained models to evaluate the TIMIT data by randomly dividing its 168 speakers into four groups. As shown in Table 2, accuracy improved by 2.58% for the model trained on LibriSpeech (9 epochs) and by 4.85% for the model trained on VoxCeleb1 (36 epochs), compared with MTGAN [28].
Table 2
Cross-dataset evaluation under different models. The proposed model, whether trained on clean or in-the-wild utterances, always achieves better accuracy.
Train                   | Evaluate | Model            | ACC
------------------------|----------|------------------|-------
Librispeech (epoch 9)   | Timit    | MTVAEGAN         | 95.23%
Librispeech (epoch 100) | Timit    | MTGAN (baseline) | 92.65%
VoxCeleb1 (epoch 36)    | Timit    | MTVAEGAN         | 97.50%
2.5 Deep Experiments
The model contains two speaker verification parts: one resides in the encoder network and processes real features, while the other is the independent classifier, which processes generated features. We ran experiments on the LibriSpeech and VoxCeleb1 data to measure the performance of each part in terms of accuracy, equal error rate and F1 score. The results are shown in Figs. 2-4.
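For reference, the equal error rate used in these comparisons can be computed from verification scores and trial labels as sketched below; the scikit-learn based implementation is an assumption, since the paper does not state how its metrics were computed.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # labels: 1 for target (same-speaker) trials, 0 for impostor trials;
    # scores: cosine similarity scores from the verification system.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FAR == FRR
    return (fpr[idx] + fnr[idx]) / 2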
In summary, when training on clean LibriSpeech data, the classifier network quickly improves model performance by recognizing the reconstructed data. When training on noisy data, the classifier network still drives performance up in the early stages, but as its ability to reconstruct the data is limited and it cannot provide more precise features, the encoder network gradually takes over and continues to improve model performance.
Finally, we performed ablation experiments (Table 3) to examine whether the classifier network is essential by removing it. These experiments show that the presence of the classifier network speeds up convergence and improves performance.
Table 3
Ablation experiments at the same training epoch, showing that the classifier network rapidly contributes to the efficiency of the entire model.
Train                 | Evaluate    | Model                    | ACC    | EER
----------------------|-------------|--------------------------|--------|-------
Librispeech (epoch 9) | Librispeech | MTVAEGAN                 | 95.23% | 2.49%
Librispeech (epoch 9) | Librispeech | MTVAEGAN (no classifier) | 57.87% | 17.66%