weidixie / vgg-speaker-recognition

Utterance-level Aggregation For Speaker Recognition In The Wild

Python 100.00%

vgg-speaker-recognition's Issues

Core dump when training with the GPU option

Hi Weidi,
Everything works when I train a new model on the CPU. But when I choose the GPU, the training script crashes at the first iteration with the message "Segmentation fault (core dumped)".
This is the log I got:
Epoch 1/128
Learning rate for epoch 1 is 0.0001.
Segmentation fault (core dumped)
Do you have any comments or suggestions?

No TensorFlow convolution layers

The code uses the Keras module rather than TensorFlow convolution layers. Why?

ValueError: Layer #125 (named "gvlad_center_assignment"), weight <tf.Variable 'gvlad_center_assignment/kernel:0' shape=(7, 1, 512, 12) dtype=float32_ref> has shape (7, 1, 512, 12), but the saved weight has shape (10, 512, 7, 1).

Thanks for sharing this great work!
The pre-trained model encodes speaker identity effectively.
But when I switched to my own data to fine-tune the pre-trained model, I got an error.
My command line is:
python main.py --net resnet34s --batch_size 3 --gpu 0 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 20 --multiprocess 4 --loss softmax --data_path ''
The error is:
Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main()
  File "main.py", line 84, in main
    network.load_weights(os.path.join(args.resume), by_name=True)
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/network.py", line 1163, in load_weights
    reshape=reshape)
  File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/saving.py", line 1149, in load_weights_from_hdf5_group_by_name
    str(weight_values[i].shape) + '.')
ValueError: Layer #125 (named "gvlad_center_assignment"), weight <tf.Variable 'gvlad_center_assignment/kernel:0' shape=(7, 1, 512, 12) dtype=float32_ref> has shape (7, 1, 512, 12), but the saved weight has shape (10, 512, 7, 1).
I've tried reshape, but it doesn't work.
I have no idea what causes this. How can I fix it?

There are some problems with voxlb2_train.txt and voxlb2_val.txt

Some audio files appear in both voxlb2_train.txt and voxlb2_val.txt.

The initialization method

Hi,

I saw that you used 'orthogonal' initialization for the whole network. Is there a special reason behind that?

Max pool layer strides parameter differs from the paper

https://github.com/WeidiXie/VGG-Speaker-Recognition/blob/master/src/backbone.py#L221

y = MaxPooling2D((3, 1), strides=(2, 1), name='mpool2')(x5)

What accuracy do you get when training the model?

I used the default training code to train the model. During training, the accuracy reached 90.09%.
I also used the default testing code to test the model. The EER is 0.0357370095445.
What results did you get during training?

Rely only on TensorFlow, not Keras

Could your code use only TensorFlow, not Keras, for the CNN forward pass?

Question about the downloaded pre-trained model

Hi, I downloaded your pre-trained weights and loaded them into a model, then trained the model on VoxCeleb2. But the loss is 9 and the accuracy is 0.001, as if the weights had not been loaded at all. The loss should be low and the accuracy about 0.92. Do you know why?
Looking forward to your reply.

AttributeError: module 'dask.dataframe' has no attribute 'Series' ?

Traceback (most recent call last):
  File "src/main.py", line 201, in <module>
    main()
  File "src/main.py", line 99, in main
    update_freq=args.batch_size * 16)
  File "/data/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 745, in __init__
    from tensorflow.contrib.tensorboard.plugins import projector
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/__init__.py", line 37, in <module>
    from tensorflow.contrib import distributions
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distributions/__init__.py", line 39, in <module>
    from tensorflow.contrib.distributions.python.ops.estimator import *
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distributions/python/ops/estimator.py", line 21, in <module>
    from tensorflow.contrib.learn.python.learn.estimators.head import _compute_weighted_loss
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/__init__.py", line 95, in <module>
    from tensorflow.contrib.learn.python.learn import *
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/__init__.py", line 28, in <module>
    from tensorflow.contrib.learn.python.learn import *
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/__init__.py", line 30, in <module>
    from tensorflow.contrib.learn.python.learn import estimators
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/__init__.py", line 302, in <module>
    from tensorflow.contrib.learn.python.learn.estimators.dnn import DNNClassifier
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn.py", line 35, in <module>
    from tensorflow.contrib.learn.python.learn.estimators import dnn_linear_combined
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py", line 36, in <module>
    from tensorflow.contrib.learn.python.learn.estimators import estimator
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 52, in <module>
    from tensorflow.contrib.learn.python.learn.learn_io import data_feeder
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/learn_io/__init__.py", line 26, in <module>
    from tensorflow.contrib.learn.python.learn.learn_io.dask_io import extract_dask_data
  File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py", line 34, in <module>
    allowed_classes = (dd.Series, dd.DataFrame)
AttributeError: module 'dask.dataframe' has no attribute 'Series'

The training set contains the validation set

The dataset split file meta/voxlb2_train.txt contains audio files that are also listed in meta/voxlb2_val.txt.
The number of training examples decreases from 1,198,728 to 985,290 when the examples in the validation set are removed.

I guess people using this repository are suffering from overfitting because of this split error.
Please remove the duplicated examples and re-upload the two split files!

This is the code I used to remove the duplicates with pandas:

import pandas as pd

# Each line of a split file is read as "path label", separated by a space.
df_valid = pd.read_csv('meta/voxlb2_val.txt', sep=' ', names=['path', 'label'])
df_train = pd.read_csv('meta/voxlb2_train.txt', sep=' ', names=['path', 'label'])
# Keep only the training rows whose path is not in the validation split.
df_train = df_train[~df_train.path.isin(df_valid.path)]
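The same filtering can be tried on a toy example before touching the real split files. The sketch below is only an illustration: the paths and labels are made up, and it assumes each line of a split file is `path label` separated by a single space, as in the snippet above. It also shows one way to list the overlapping paths and to write the cleaned split back out in the same format.

```python
import io
import pandas as pd

# Made-up stand-ins for the two split files (format assumed: "path label").
train_txt = io.StringIO("id1/a.wav 0\nid1/b.wav 0\nid2/c.wav 1\n")
val_txt = io.StringIO("id1/b.wav 0\n")

df_train = pd.read_csv(train_txt, sep=" ", names=["path", "label"])
df_val = pd.read_csv(val_txt, sep=" ", names=["path", "label"])

# Paths that occur in both splits.
overlap = set(df_train.path) & set(df_val.path)

# Keep only training rows whose path is absent from the validation split.
df_clean = df_train[~df_train.path.isin(df_val.path)]

# Write the cleaned split in the same space-separated, header-less format.
out = io.StringIO()
df_clean.to_csv(out, sep=" ", header=False, index=False)
```

With real files, the `io.StringIO` objects would be replaced by the paths of the split files and of the output file.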

There is something wrong with my model

@WeidiXie First, I converted the m4a files to wav files, and I changed nothing in the code.
Learning rate for epoch 1 is 0.001.
9365/9365 [==============================] - 6966s 744ms/step - loss: 8.4926 - acc: 3.9876e-04
Epoch 2/64
Learning rate for epoch 2 is 0.001.
9365/9365 [==============================] - 6759s 722ms/step - loss: 8.4370 - acc: 4.4548e-04
Epoch 3/64
Learning rate for epoch 3 is 0.001.
9365/9365 [==============================] - 6732s 719ms/step - loss: 8.4284 - acc: 4.2629e-04
Epoch 4/64
Learning rate for epoch 4 is 0.001.
9365/9365 [==============================] - 6789s 725ms/step - loss: 8.4265 - acc: 4.6633e-04
Epoch 5/64
Learning rate for epoch 5 is 0.001.
9365/9365 [==============================] - 6795s 726ms/step - loss: 8.4266 - acc: 4.2295e-04
Epoch 6/64
Learning rate for epoch 6 is 0.001.
9365/9365 [==============================] - 6742s 720ms/step - loss: 8.4259 - acc: 4.2629e-04
Epoch 7/64
Learning rate for epoch 7 is 0.001.
9365/9365 [==============================] - 6726s 718ms/step - loss: 8.4257 - acc: 4.4381e-04
Epoch 8/64
Learning rate for epoch 8 is 0.001.
9365/9365 [==============================] - 6729s 718ms/step - loss: 8.4256 - acc: 4.3630e-04
Epoch 9/64
Learning rate for epoch 9 is 0.001.
9365/9365 [==============================] - 6733s 719ms/step - loss: 8.4255 - acc: 4.3046e-04
Epoch 10/64
Learning rate for epoch 10 is 0.001.
9365/9365 [==============================] - 6768s 723ms/step - loss: 8.4254 - acc: 4.4548e-04
Epoch 11/64
Learning rate for epoch 11 is 0.001.
9365/9365 [==============================] - 6743s 720ms/step - loss: 8.4253 - acc: 4.2462e-04
Epoch 12/64
Learning rate for epoch 12 is 0.001.
9365/9365 [==============================] - 6757s 722ms/step - loss: 8.4252 - acc: 4.4631e-04
Epoch 13/64
Learning rate for epoch 13 is 0.001.
9365/9365 [==============================] - 6751s 721ms/step - loss: 8.4253 - acc: 4.2379e-04
Epoch 14/64
Learning rate for epoch 14 is 0.001.
9365/9365 [==============================] - 6754s 721ms/step - loss: 8.4253 - acc: 4.4214e-04
Epoch 15/64
Learning rate for epoch 15 is 0.001.
9365/9365 [==============================] - 6796s 726ms/step - loss: 8.4253 - acc: 4.0960e-04
Epoch 16/64
Learning rate for epoch 16 is 0.001.
9365/9365 [==============================] - 6755s 721ms/step - loss: 8.4252 - acc: 4.1628e-04
Epoch 17/64
Learning rate for epoch 17 is 0.0001.
9365/9365 [==============================] - 6743s 720ms/step - loss: 8.4183 - acc: 4.2796e-04
Epoch 18/64
Learning rate for epoch 18 is 0.0001.
9365/9365 [==============================] - 6741s 720ms/step - loss: 8.4151 - acc: 4.2629e-04

Is there any qualitative analysis showing that the network learns a good embedding feature? Any visualization tools for this?

In computer vision, there are tools to visualize what the network has learned for the final classification, such as Grad-CAM/CAM.
In speaker recognition, how can I analyze which parts of the input activate the output, so that I can say directly that the network has learned a good feature? What do the input spectrograms have in common across different contexts of the same speaker?

An example of Grad-CAM on audio

Why do we need to normalize outputs in eval mode?

I'm rewriting this project in PyTorch and got confused by the code below.
if mode == 'eval':
    y = keras.layers.Lambda(lambda x: keras.backend.l2_normalize(x, 1))(x)
Is there a special reason behind that?
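One possible reason, offered here as a general observation rather than the author's stated motivation, is that the dot product of two unit-length vectors equals their cosine similarity, so verification scores can be computed with a plain dot product once the embeddings are L2-normalized. A minimal NumPy sketch with made-up vectors (not the repository's code):

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    norm = np.sqrt(np.sum(np.square(x), axis=axis, keepdims=True))
    return x / np.maximum(norm, eps)

# Two made-up 4-dimensional "embeddings", one per row.
emb = np.array([[3.0, 4.0, 0.0, 0.0],
                [1.0, 2.0, 2.0, 0.0]])
unit = l2_normalize(emb)

# Cosine similarity computed from the raw vectors.
cosine = emb[0] @ emb[1] / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))

# After normalization the plain dot product gives the same value.
dot_of_units = unit[0] @ unit[1]
```

In PyTorch, `torch.nn.functional.normalize(x, p=2, dim=1)` performs the same row-wise scaling.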

Training is quite slow.

I am also facing this issue: the model trains very slowly. Other code and projects run fine on the same GPU, and the GPU is being used, but VGG-Speaker runs slowly.
I tried it on two NVIDIA GTX 1060 cards installed in my computer and on a P100 on Google Cloud as well.

I tried everything to resolve this issue but did not succeed.

Epoch 1/10
Learning rate for epoch 1 is 0.0001.
17/305810 [..............................] - ETA: 3354:00:12 - loss: 0.8716 - acc: 0.9531

Please help.
Thanks.

TypeError: unsupported operand type(s) for *: 'int' and 'Dimension'

In src/model.py, line 70:

outputs = K.reshape(cluster_l2, [-1, self.k_centers * num_features])
TypeError: unsupported operand type(s) for *: 'int' and 'Dimension'

Modified to:

outputs = K.reshape(cluster_l2, [-1, int(self.k_centers) * int(num_features)])

With this change it runs without the error.

Are there any slides to share from the VoxSRC workshop at Interspeech 2019?

I see that you are one of the organisers of VoxSRC, and that the winners gave workshop talks at Interspeech 2019. Could you share the slides or videos? Thank you very much.

error in data generation MP 2

I tried to train on VoxCeleb2

5272/7492 [====================>.........] - ETA: 49:13 - loss: 7.6936 - acc: 0.0133

and got this message while training:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/data/ghostvlad-speaker-original/src/utils.py", line 28, in load_data
    linear_spect = lin_spectogram_from_wav(wav, hop_length, win_length, n_fft)
  File "/data/ghostvlad-speaker-original/src/utils.py", line 22, in lin_spectogram_from_wav
    linear = librosa.stft(wav, n_fft=n_fft, win_length=win_length, hop_length=hop_length) # linear spectrogram
  File "/usr/local/lib/python3.5/dist-packages/librosa/core/spectrum.py", line 165, in stft
    y = np.pad(y, int(n_fft // 2), mode=pad_mode)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraypad.py", line 1290, in pad
    " in axis {} of `array`".format(axis))
ValueError: There aren't any elements to reflect in axis 0 of `array`
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "main.py", line 196, in <module>
    main()
  File "main.py", line 136, in main
    verbose=1)
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training_generator.py", line 181, in fit_generator
    generator_output = next(output_generator)
  File "/usr/local/lib/python3.5/dist-packages/keras/utils/data_utils.py", line 601, in get
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/keras/utils/data_utils.py", line 595, in get
    inputs = self.queue.get(block=True).get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.5/dist-packages/keras/utils/data_utils.py", line 401, in get_index
    return _SHARED_SEQUENCES[uid][i]
  File "/data/ghostvlad-speaker-original/src/generator.py", line 42, in __getitem__
    X, y = self.__data_generation_mp(list_IDs_temp, indexes)
  File "/data/ghostvlad-speaker-original/src/generator.py", line 58, in __data_generation_mp
    X = np.expand_dims(np.array([p.get() for p in X]), -1)
  File "/data/ghostvlad-speaker-original/src/generator.py", line 58, in <listcomp>
    X = np.expand_dims(np.array([p.get() for p in X]), -1)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
ValueError: There aren't any elements to reflect in axis 0 of `array`
I0528 18:57:06.038911 22079 executor.cpp:675] Container exited with status 1
W0528 18:57:06.038911 22072 logging.cpp:93] RAW: Received signal SIGTERM from process 16635 of user 0; exiting

And I think the accuracy is too low. How about in your case?
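The innermost error comes from `np.pad` being asked to reflect-pad an array with no elements, which suggests that at least one waveform reached `librosa.stft` empty. One way to guard against this is to check the waveform length before computing the spectrogram. The sketch below is an assumption-laden illustration: `has_enough_samples` is a hypothetical helper, and the threshold of 512 samples is a placeholder, not a value taken from the repository.

```python
import numpy as np

def has_enough_samples(wav, min_samples=512):
    """True if the 1-D waveform holds at least min_samples samples."""
    wav = np.asarray(wav)
    return wav.ndim == 1 and wav.size >= min_samples

empty = np.array([], dtype=np.float32)
one_second = np.zeros(16000, dtype=np.float32)

# The failing call from the traceback, reproduced on an empty array.
try:
    np.pad(empty, 256, mode="reflect")
    pad_failed = False
except ValueError:
    pad_failed = True
```

Running such a check over the whole file list once, and dropping the files that fail it, avoids hitting the error hours into an epoch.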

Will you share the pretrained model trained with AM-Softmax? I trained with AM-Softmax myself, but it is very hard to get it to converge

I trained the model with the same backbone and ghost pooling, using softmax and AM-Softmax, but the AM-Softmax accuracy is very low. Are there any tricks for training with AM-Softmax?

ValueError: Range cannot be empty (low >= high) unless no samples are taken

18/3521 [..............................] - ETA: 5:33:17 - loss: 9.4843 - acc: 0.0017
Traceback (most recent call last):
File "train.py", line 270, in <module>
main()
File "train.py", line 207, in main
verbose=1)
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/engine/training_generator.py", line 181, in fit_generator
generator_output = next(output_generator)
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/utils/data_utils.py", line 601, in get
six.reraise(*sys.exc_info())
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/utils/data_utils.py", line 595, in get
inputs = self.queue.get(block=True).get()
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
ValueError: Range cannot be empty (low >= high) unless no samples are taken

I got this error while training the model on my data. Have you ever encountered it?
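NumPy raises this message from `np.random.randint` when `low >= high`. The traceback does not show which line made the call, so the following is a guess: a random crop along the time axis would fail this way when a spectrogram has no more frames than the crop length. The sketch shows one way to make such a crop safe. `random_crop` is a hypothetical helper and the crop length of 250 frames is an assumed value. Tiling short inputs is only one option; zero padding or dropping the file are others.

```python
import numpy as np

def random_crop(spec, spec_len=250, rng=None):
    """Return a (freq, spec_len) window from a (freq, time) spectrogram.

    Spectrograms shorter than spec_len are tiled along the time axis first,
    so the random start index is always drawn from a non-empty range.
    """
    rng = np.random.default_rng() if rng is None else rng
    freq, time = spec.shape
    if time == 0:
        raise ValueError("spectrogram has no frames")
    if time < spec_len:
        repeats = int(np.ceil(spec_len / time))
        spec = np.tile(spec, (1, repeats))
        time = spec.shape[1]
    # integers(low, high) excludes high, so time - spec_len + 1 is valid
    # even when time == spec_len.
    start = rng.integers(0, time - spec_len + 1)
    return spec[:, start:start + spec_len]
```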

Some questions about the thin-ResNet

The paper describes each stage of the thin-ResNet as a block repeated X times, but in the code a stage is conv2d + identity_block_2d * times. Why does it differ from the original ResNet architecture in this way?

Also, did you try other architectures such as SE-ResNet or SE-ResNeXt? An attention mechanism may help feature extraction.

Training accuracy oscillates between 51-52% early while loss decreases slowly.

I have 4.5k speakers and 88k utterances in total (150k pairs of text), each 4-10 seconds long, in mixed Hindi-English (98% Hindi).

I tried to run VGG-Speaker-Recognition on my dataset, and it gives me the following result:

Epoch 1/50
Learning rate for epoch 1 is 0.0001.
1130/1130 [==============================] - 11718s 10s/step - loss: 1.6074 - acc: 0.5193
Epoch 2/50
Learning rate for epoch 2 is 0.0001.
1130/1130 [==============================] - 11616s 10s/step - loss: 1.2483 - acc: 0.5243
Epoch 3/50
Learning rate for epoch 3 is 0.0001.
1130/1130 [==============================] - 11591s 10s/step - loss: 1.2341 - acc: 0.5260
Epoch 4/50
Learning rate for epoch 4 is 0.0001.
1130/1130 [==============================] - 11512s 10s/step - loss: 1.2289 - acc: 0.5239
Epoch 5/50
Learning rate for epoch 5 is 0.0001.
1130/1130 [==============================] - 11470s 10s/step - loss: 1.2255 - acc: 0.5281
Epoch 6/50
Learning rate for epoch 6 is 0.0001.
1130/1130 [==============================] - 11548s 10s/step - loss: 1.2246 - acc: 0.5264
Epoch 7/50
Learning rate for epoch 7 is 0.0001.
1130/1130 [==============================] - 11550s 10s/step - loss: 1.2228 - acc: 0.5278
Epoch 8/50
Learning rate for epoch 8 is 0.0001.
1130/1130 [==============================] - 11602s 10s/step - loss: 1.2223 - acc: 0.5273
Epoch 9/50
Learning rate for epoch 9 is 0.0001.
1130/1130 [==============================] - 11620s 10s/step - loss: 1.2211 - acc: 0.5292
Epoch 10/50
Learning rate for epoch 10 is 0.0001.
1130/1130 [==============================] - 11581s 10s/step - loss: 1.2206 - acc: 0.5284
Epoch 11/50
Learning rate for epoch 11 is 0.0001.
1130/1130 [==============================] - 11544s 10s/step - loss: 1.2203 - acc: 0.5272
Epoch 12/50
Learning rate for epoch 12 is 0.0001.
1130/1130 [==============================] - 11467s 10s/step - loss: 1.2196 - acc: 0.5286
Epoch 13/50
Learning rate for epoch 13 is 0.0001.
1130/1130 [==============================] - 11399s 10s/step - loss: 1.2191 - acc: 0.5294

As shown above, the training accuracy oscillates between 51% and 52%, and it seems like the model has overfitted at 51-52%.
Earlier I had 500 speakers (a subset of the 4.5k speakers), and I got the same result then as well.

What could be the reason for this result? Please help @WeidiXie.

Error in loading the pretrained weights

When loading the pretrained model, I am getting the following error:

Traceback (most recent call last):
  File "main.py", line 201, in <module>
    main()
  File "main.py", line 82, in main
    if mgpu == 1: network.load_weights(os.path.join(args.resume))
  File "/home/ultron/miniconda3/envs/tf/lib/python3.7/site-packages/keras/engine/network.py", line 1166, in load_weights
    f, self.layers, reshape=reshape)
  File "/home/ultron/miniconda3/envs/tf/lib/python3.7/site-packages/keras/engine/saving.py", line 1030, in load_weights_from_hdf5_group
    str(len(filtered_layers)) + ' layers.')
ValueError: You are trying to load a weight file containing 80 layers into a model with 81 layers.

Using a triplet loss function

Hi Weidixie:
Thanks for sharing. I have a question: have you ever tried combining gVlad with a triplet loss or Google's GE2E loss to train the model, and how well did it work?
I look forward to your reply. Thank you very much again!

During training, the GPU does not seem to be used?

src/main.py
It does not run on the GPU, does it?

Why extend audio?

In the function load_wav_Predict, why do you need to extend the audio (see code below)?

I cannot think of a reason why one would need to extend the time signal.

def load_wav_Predict(vid_path, sr):
    wav, sr_ret = librosa.load(vid_path, sr=sr )
    assert sr_ret == sr
    extended_wav = np.append(wav, wav[::-1])
    return extended_wav
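To see what the function does to the signal, here is the same `np.append` call applied to a toy array instead of a loaded file. It only illustrates the effect (the signal is followed by its time-reversed copy, so the length doubles); it does not explain the author's motivation.

```python
import numpy as np

wav = np.array([0.1, 0.2, 0.3])

# Same operation as in load_wav_Predict: append the time-reversed signal.
extended_wav = np.append(wav, wav[::-1])
```

Because the reversed copy starts with the last sample of the original, there is no jump in amplitude at the junction.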

How good are the GhostVLAD weights of the model?

Hi,

What is the accuracy and EER of the model (https://drive.google.com/open?id=1M_SXoW1ceKm3LghItY2ENKKUn3cWYfZm) you provide?

How to implement a transfer learning solution for classification problems?

Hi, I want to train some layers and leave the others frozen (I don't have to train the entire model).
I don't know how to adjust this command for that:
"python main.py --net resnet34s --batch_size 160 --gpu 2,3 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 128 --multiprocess 8 --loss softmax --data_path ../path_to_voxceleb2:"
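In Keras, freezing is usually done in code by setting `layer.trainable = False` on the layers to keep fixed, after the network is built and before it is compiled. The sketch below illustrates the idea under assumptions: `freeze_before` is a hypothetical helper, the layer name `gvlad_center_assignment` is borrowed from an error message elsewhere on this page, and the other layer names and the stand-in model are made up so that the example runs without TensorFlow.

```python
from types import SimpleNamespace

def freeze_before(model, first_trainable_name):
    """Mark layers before `first_trainable_name` as frozen, the rest as trainable.

    Works on any object with a `.layers` list whose items have `.name`
    and `.trainable`, which is the interface Keras models expose.
    Returns the names of the frozen layers.
    """
    names = [layer.name for layer in model.layers]
    if first_trainable_name not in names:
        raise ValueError("no layer named %r" % first_trainable_name)
    cut = names.index(first_trainable_name)
    for i, layer in enumerate(model.layers):
        layer.trainable = i >= cut
    return names[:cut]

# Stand-in for a Keras model so the sketch runs without TensorFlow.
toy = SimpleNamespace(layers=[
    SimpleNamespace(name="conv1", trainable=True),
    SimpleNamespace(name="gvlad_center_assignment", trainable=True),
    SimpleNamespace(name="prediction", trainable=True),
])
frozen = freeze_before(toy, "gvlad_center_assignment")
```

With a real Keras model, `model.compile(...)` has to be called again after changing `trainable` for the change to take effect.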

Audio feature is not a tensor

src/predict.py

specs = ut.load_data(...),
Here specs is the feature computed via STFT, with the mean subtracted and divided by the time-wise variance, but specs is an array, not a tensor.

Training under Windows

Hello @WeidiXie ,

Thanks for this awesome work, and sharing it with the open-source community ! I am trying to adapt the code for training on VoxCeleb1 (just because it is a smaller dataset, I decided to play with it first).
I have prepared the file lists, plugged in your weights, frozen the first layers up to the bottleneck in the code, and tried to run main.py. However, for some reason I get an annoying error prior to training, which is likely related to the fact that you're using multiprocessing to speed up data generation:

  File "D:\Repos\VGG-Speaker-Recognition\tool\toolkits.py", line 45, in set_mp
    pool = mp.Pool(processes=processes, initializer=init_worker)
  File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\pool.py", line 175, in __init__
    self._repopulate_pool()
  File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\pool.py", line 236, in _repopulate_pool
    self._wrap_exception)
  File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\pool.py", line 255, in _repopulate_pool_static
    w.start()
  File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'set_mp.<locals>.init_worker'

I also tried to set the number of processes to 1, and that did not help either.
I wonder if you have any suggestions on how to alleviate this.

Thanks again,

Anton.
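On Windows, multiprocessing starts workers with the spawn method, which pickles the pool initializer, and a function defined inside another function cannot be pickled. That matches the `set_mp.<locals>.init_worker` name in the error. One possible fix is to move the initializer to module level, as sketched below. The body of `init_worker` is not visible in the traceback, so ignoring SIGINT here is an assumption, and the rest of `set_mp` is simplified.

```python
import multiprocessing as mp
import signal

def init_worker():
    """Pool initializer defined at module level so it can be pickled.

    Ignoring SIGINT in workers is an assumed body; adapt it to whatever
    the original nested function did.
    """
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def set_mp(processes=8):
    """Create a worker pool, or return None when processes is falsy."""
    if not processes:
        return None
    return mp.Pool(processes=processes, initializer=init_worker)
```

With the spawn start method, the script that ends up calling `set_mp` also needs its entry point wrapped in `if __name__ == "__main__":`, otherwise each worker re-executes the top-level code on import.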

How to train the model on the VoxCeleb1 dataset?

How to train the model on the VoxCeleb1 dataset?
Thanks!

error in data generation MP

I tried to add my own dataset and build a model, but I get a data shape error:

4/14671 [..............................] - ETA: 29:30:39 - loss: 11.9053 - acc: 0.0000e+00
Traceback (most recent call last):
File "src/main.py", line 195, in <module>
main()
File "src/main.py", line 135, in main
verbose=1)
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/engine/training_generator.py", line 181, in fit_generator
generator_output = next(output_generator)
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/utils/data_utils.py", line 601, in get
six.reraise(*sys.exc_info())
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/utils/data_utils.py", line 595, in get
inputs = self.queue.get(block=True).get()
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/utils/data_utils.py", line 401, in get_index
return _SHARED_SEQUENCES[uid][i]
File "/media/rohit/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/VGG-Speaker-Recognition/src/generator.py", line 43, in __getitem__
X, y = self.__data_generation_mp(list_IDs_temp, indexes)
File "/media/rohit/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/VGG-Speaker-Recognition/src/generator.py", line 59, in __data_generation_mp
X = np.expand_dims(np.array([p.get() for p in X]), -1)
ValueError: could not broadcast input array from shape (257,250) into shape (257)

Is it an issue with the size or duration of some audio file?
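This message typically appears when `np.array` is handed a list of arrays whose shapes differ, for example most spectrograms being (257, 250) and one being shorter along the time axis. Whether a short audio file is the cause here is an assumption. A small check before stacking makes the offending items visible. `stack_or_report` is a hypothetical helper, and the expected shape is taken from the error message above.

```python
import numpy as np

def stack_or_report(specs, expected_shape=(257, 250)):
    """Stack spectrograms into one array, or raise naming the odd ones out."""
    bad = [(i, s.shape) for i, s in enumerate(specs) if s.shape != expected_shape]
    if bad:
        raise ValueError("unexpected spectrogram shapes (index, shape): %s" % bad)
    return np.stack(specs)
```

Mapping the reported indexes back to file paths shows which recordings need to be padded, cropped differently, or removed from the list.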

Duration of audio file for the trained model?

Hello Weidi

Thank you for the excellent paper and making your work opensource.

I have a couple of questions:

What is the duration of the audio files you used for the pre-trained model? In the paper you mention that audio files have a duration of 2.5 seconds during the training phase. But you also mention that for 'in the wild' sequences, longer utterances (4 seconds or more) are a significant improvement over shorter segments. I would imagine that for best results I have to use audio chunks of the same size as those used for training, so I want to know the duration used for the trained model.
Did you try Mel STFTs instead of linear STFTs? Typically, Mel STFTs are known to provide far better results than linear STFTs.

Many thanks.

Questions about the vladpooling implementation compared to the NetVLAD paper

This is the NetVLAD authors' presentation. As shown by the yellow circle, the feature map x is the same for both branches. But your code differs in a few ways:
1: from the feature map, x --> x_fc and x --> x_k_center, and these two are passed to vladpooling, which does the softmax and normalization. Compared to NetVLAD, x --> fc looks unnecessary.
2: before computing the softmax, why subtract the max first? This does not seem very common.
3: NetVLAD first does intra-normalization and then L2-normalization (one paper reports that this improves the accuracy), but here there is only one L2-normalization.

What benefit do these differences bring?
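On point 2, subtracting the maximum before the softmax is a standard numerical-stability measure: the factor exp(-max) cancels between numerator and denominator, so the result is unchanged, while large inputs no longer overflow. The NumPy sketch below illustrates this with made-up inputs; it is not the repository's implementation.

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(x):
    # Shifting by the row maximum leaves the result unchanged
    # because the factor exp(-max) cancels in the ratio.
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])

same_on_small = np.allclose(softmax_naive(small), softmax_stable(small))

with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)   # exp overflows to inf, giving nan
stable_large = softmax_stable(large)
```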

I trained with netvlad and vladpooling on my dataset with the same optimizer parameters and used Grad-CAM to compare the feature-map activations. Sometimes vladpooling activates on the background noise, but netvlad does not:
Original code:

Original code with self-attention added after the feature map and before vladpooling:

Self-attention after the feature map and before netvlad:

RuntimeWarning: divide by zero encountered in true_divide

Environment:

Ubuntu 18.04
Python 2.7/3.6
TensorFlow 1.13
keras 2.2

Before testing, I truncated voxceleb1_veri_test.txt to only 50 lines to speed up the test.

Command:

python predict.py --gpu 0 --net resnet34s --ghost_cluster 2 --vlad_cluster 8 --loss softmax --resume ../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5

output:

Instructions for updating:
Colocations handled automatically by placer.
==> successfully loading model ../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5.
==> start testing.
Finish extracting features for 0/100th wav.
2019-05-29 17:26:34.142088: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Finish extracting features for 50/100th wav.
scores : 0.848808407784, gt : 1
scores : 0.635771036148, gt : 0
scores : 0.896652877331, gt : 1
scores : 0.666865587234, gt : 0
scores : 0.86113858223, gt : 1
scores : 0.649969100952, gt : 0
scores : 0.882004976273, gt : 1
scores : 0.675234436989, gt : 0
scores : 0.814641714096, gt : 1
scores : 0.612952053547, gt : 0
scores : 0.841726779938, gt : 1
scores : 0.691002070904, gt : 0
scores : 0.875537037849, gt : 1
scores : 0.760980308056, gt : 0
scores : 0.862766265869, gt : 1
scores : 0.595528423786, gt : 0
scores : 0.872184753418, gt : 1
scores : 0.580520808697, gt : 0
scores : 0.866317629814, gt : 1
scores : 0.861166357994, gt : 1
scores : 0.735198259354, gt : 0
scores : 0.846519947052, gt : 1
scores : 0.634202837944, gt : 0
scores : 0.879867553711, gt : 1
scores : 0.617964744568, gt : 0
scores : 0.866540849209, gt : 1
scores : 0.502503097057, gt : 0
scores : 0.884967088699, gt : 1
scores : 0.568573653698, gt : 0
scores : 0.926931381226, gt : 1
scores : 0.637345910072, gt : 0
scores : 0.834380090237, gt : 1
scores : 0.620291650295, gt : 0
scores : 0.912857890129, gt : 1
scores : 0.626294493675, gt : 0
scores : 0.952058196068, gt : 1
scores : 0.640718281269, gt : 0
scores : 0.933943748474, gt : 1
scores : 0.51838862896, gt : 0
scores : 0.861519873142, gt : 1
scores : 0.771008253098, gt : 0
scores : 0.881197452545, gt : 1
scores : 0.641325950623, gt : 0
scores : 0.885362446308, gt : 1
scores : 0.748977065086, gt : 0
scores : 0.839608311653, gt : 1
scores : 0.611160635948, gt : 0
/home/ubuntu/anaconda3/envs/python27/lib/python2.7/site-packages/scipy/interpolate/interpolate.py:610: RuntimeWarning: divide by zero encountered in true_divide
slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]
/home/ubuntu/anaconda3/envs/python27/lib/python2.7/site-packages/scipy/interpolate/interpolate.py:613: RuntimeWarning: invalid value encountered in multiply
y_new = slope*(x_new - x_lo)[:, None] + y_lo
==> model : ../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5, EER: 0.0
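In the scores printed above, the lowest-scoring gt 1 pair (0.8146) still scores higher than the highest-scoring gt 0 pair (0.7710), so on this truncated list the two groups are perfectly separated and an EER of 0.0 is the expected outcome. A plausible reading of the warnings, stated here as an assumption, is that the interpolation used for the EER then sees repeated x-values on the ROC curve. The sketch below checks the separation directly on a few pairs copied from the log. `error_rates` is a hypothetical helper, not the repository's EER code.

```python
import numpy as np

# A few (score, gt) pairs copied from the log above, including the
# lowest-scoring gt=1 pair and the highest-scoring gt=0 pair.
scores = np.array([0.848808407784, 0.635771036148, 0.814641714096,
                   0.771008253098, 0.952058196068, 0.502503097057])
labels = np.array([1, 0, 1, 0, 1, 0])

def error_rates(scores, labels, threshold):
    """False-accept and false-reject rates when accepting score >= threshold."""
    accept = scores >= threshold
    far = np.mean(accept[labels == 0])    # impostors accepted
    frr = np.mean(~accept[labels == 1])   # targets rejected
    return far, frr

lowest_target = scores[labels == 1].min()
highest_impostor = scores[labels == 0].max()
separated = lowest_target > highest_impostor

# Any threshold between the two groups yields zero errors of both kinds.
far, frr = error_rates(scores, labels, (lowest_target + highest_impostor) / 2)
```

Evaluating on the full trial list, where the two score distributions overlap, gives a non-trivial EER.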

How to increase GPU usage?

I use 6 GPUs to train the model with batch size 200, but the utilization of each GPU is only 0-22% and an epoch takes 2 hours, although memory usage is high. Training is no faster than with 2 GPUs, where GPU utilization is about 80%.

Then I changed the multi-thread setting to 128. Utilization increased to 30-50% and an epoch takes 1 hour. Is there any way to increase GPU usage further?

Output of the array "feats"

What does this array (feats) contain? It is a 512-D vector; can I use it as the extracted feature vector?

Question about hardware

I have not found any information about the hardware. Can you tell me which GPU model you used? In my implementation (approximately 3.6 million parameters, as in the paper), 35 elements take ~11 GB of memory (1080 Ti). Maybe I made a mistake.

Question on train loss

Hi, sorry for bothering you again.

What was the final train loss of the model? Actually, I'm worried about our model being overfitted to VoxCeleb2 data, so I'm not sure whether I should use early-stopping here, or just wait until the convergence.

Why doesn't the wav preprocessing directly use librosa.feature.melspectrogram? What's the difference?

problem with testing my own trained model

Hi.
I found your code really complete and tried to use it. When I use your model referenced in the README for prediction, it works. But when I train on my own data with "python main.py .... " and then use the resulting model for prediction, I get an error about mismatched shapes.
Would you please help me with that?

toolkits.initialize_GPU(args) error

I'm getting error

Using TensorFlow backend.
Traceback (most recent call last):
  File "src/main.py", line 195, in <module>
    main()
  File "src/main.py", line 41, in main
    toolkits.initialize_GPU(args)
AttributeError: 'module' object has no attribute 'initialize_GPU'

It makes sense, since it has only import toolkits

Run command:
python src/main.py --net resnet34s --batch_size 160 --gpu 2,3 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 128 --loss softmax --data_path ../../data/voxceleb1

My libraries:

tensorflow          1.8.0
toolkits            0.1.28
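
One possible cause, given that a pip package named toolkits appears in the library list above, is that `import toolkits` resolves to that installed package instead of a local file in the repository. The sketch below shows a generic way to see which file a module name resolves to. The helper name `module_location` is hypothetical, and `json` is used only because it exists in every Python installation.

```python
import importlib


def module_location(name):
    """Return the file a module name resolves to, or None for built-ins."""
    module = importlib.import_module(name)
    return getattr(module, "__file__", None)


# Demonstration with a standard-library module.
print(module_location("json"))

# In the environment from the report one would call:
#   module_location("toolkits")
# A path inside site-packages would mean the pip package is the one imported.
```

If the pip package turns out to be the one imported, uninstalling it or putting the repository's own directory first on `sys.path` are the usual ways to resolve such a name clash.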

Question on last ReLU layer for evaluating 512-dim vector

Hi, thanks again for open-sourcing this Speaker Recognition system and kindly replying to every issue.

I have a question about the model shown here. When computing the final 512-dimensional output (the embedding vector), a ReLU activation is applied last, as shown in https://github.com/WeidiXie/VGG-Speaker-Recognition/blob/master/src/model.py#L139-L147. Hence, the output looks like:

[[1.15972664e-02 5.04462933e-03 2.85871420e-02 0.00000000e+00
  3.08723319e-02 0.00000000e+00 3.42872031e-02 4.36003655e-02
  0.00000000e+00 1.12573527e-01 5.46368458e-31 5.64192347e-02
  2.56476291e-02 0.00000000e+00 2.51553692e-02 4.77801599e-02
  0.00000000e+00 3.06680351e-02 2.24540825e-03 0.00000000e+00
  1.33734914e-02 2.91635211e-31 2.31502447e-02 5.39273359e-02
  9.22401696e-02 0.00000000e+00 3.31045166e-02 5.57319149e-02
  1.24792336e-02 4.04326282e-02 6.75894767e-02 0.00000000e+00
  6.08060285e-02 4.47864346e-02 2.85473187e-02 0.00000000e+00
... (truncated)

Here, we can observe that some values are 0.

In my opinion, the last ReLU layer discards some information by erasing all negative values. Moreover, it restricts the embeddings to a fraction 1/2^512 of the hypersphere (the all-non-negative orthant). So, my question is: was the last ReLU layer necessary?

I strongly believe that it was necessary (since it's currently SotA on Speaker Recognition in the wild!), but I couldn't guess why the last ReLU layer is needed. I would like to kindly ask you about that. Thanks in advance.
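
The geometric point in the question can be illustrated with a small sketch: after a ReLU every coordinate is non-negative, so the cosine similarity between two embeddings cannot be negative, and the non-negative orthant is one of 2^512 orthants. The random Gaussian vectors below are stand-ins for pre-activation values, not model outputs, and the helper names are hypothetical.

```python
import math
import random


def relu(vector):
    """Clamp negative entries to zero."""
    return [max(0.0, x) for x in vector]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


random.seed(0)
dim = 512
raw_a = [random.gauss(0.0, 1.0) for _ in range(dim)]
raw_b = [random.gauss(0.0, 1.0) for _ in range(dim)]

cos_raw = cosine(raw_a, raw_b)               # may be positive or negative
cos_relu = cosine(relu(raw_a), relu(raw_b))  # never negative

# Share of the sphere covered by the all-non-negative orthant.
orthant_fraction = 2.0 ** -dim
print(cos_raw, cos_relu, orthant_fraction)
```

This only shows the constraint the ReLU imposes; whether that constraint helps or hurts verification accuracy is an empirical question the sketch does not answer.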

TypeError: 'int' object is not callable

Hi, Weidi .
Thank you for your prompt reply !!!
And sorry to interrupt again.
As you said yesterday, I can successfully load the pre-training model.
But I encountered a new mistake as the title shows.
This time I used the following command line:

python main.py --net resnet34s --batch_size 3 --gpu 0 --lr 0.001 --ghost_cluster 2 --vlad_cluster 8 --warmup_ratio 0.1 --optimizer adam --epochs 20 --multiprocess 1 --loss softmax --data_path ''

I know that the model loaded successfully through the information displayed in the terminal.
But after printing Learning rate for epoch 1 is 0.0001., the new error occurred.

Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/utils/data_utils.py", line 559, in _run
sequence = list(range(len(self.sequence)))
TypeError: 'int' object is not c

I think it's a multi-process or multi-threading problem.
So I tried to set --multiprocess 0, or comment out areas of code that involve multiple processes.
But nothing changed.
I found an issue where someone set --workers 0, so I changed the fit_generator call in main.py to use workers=0.
A "new" error occurred:

Traceback (most recent call last):
File "main.py", line 223, in <module>
main()
File "main.py", line 162, in main
verbose=0)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/training_generator.py", line 181, in fit_generator
generator_output = next(output_generator)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/training_utils.py", line 590, in iter_sequence_infinite
for item in seq:
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/utils/data_utils.py", line 372, in __iter__
for item in (self[i] for i in range(len(self))):
TypeError: 'int' object is not callable

The same error, TypeError: 'int' object is not callable, occurred. Someone said this problem is caused when a custom variable name duplicates the name of a built-in function or class.
I can't solve the problem, so I would like to ask whether you have come across it or have any ideas for solving it.

Sorry again for the interruption, and thanks in advance !!!

trained slowly

In tool/toolkits.py, I changed
os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu
to
os.environ['TF_CPP_MIN_LOG_LEVEL'] = args.gpu

Epoch 1/128
Learning rate for epoch 1 is 0.001.
1/7492 [..............................] - ETA: 294:55:57 - loss: 9.3743 - acc: 0.0000e+00

But training is very slow; the GPU isn't being used.
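
The two environment variables in this report do different jobs: CUDA_VISIBLE_DEVICES selects which GPUs CUDA exposes to the process, while TF_CPP_MIN_LOG_LEVEL only controls how verbose TensorFlow's C++ logging is (values "0" to "3"). The sketch below sets both with the values each one expects. The helper name `select_gpus` is hypothetical, and whether this alone explains the slow training above is an assumption.

```python
import os


def select_gpus(gpu_ids, log_level="2"):
    """Set GPU visibility and TensorFlow log verbosity as separate settings.

    Call this before TensorFlow initializes CUDA, otherwise the
    visibility setting has no effect on the running process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids     # e.g. "0" or "2,3"
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = log_level   # "0" (all) .. "3" (fatal only)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = select_gpus("0")
print(visible)
```

If training is still slow with the visibility variable set correctly, checking that the installed TensorFlow build has GPU support is the next thing to rule out.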

Please specify license for this code/model

Thanks for open sourcing the code and pretrained weights!

I'd like to be able to use/modify this awesome code in our company. However, I can't find any license specified here.
Specifically, we need to know whether the code and model can be used commercially.

Again, thank you for open sourcing the code of your paper.

Some doubts about pre-trained model weights

Your model was trained on multiple GPUs. Were single-GPU weights saved, so that we can run prediction on a machine with a single GPU?

What's the meaning of the variable self.annealing

Hi, thank you for your work. What confuses me is the meaning of self.annealing. I didn't see a definition of it. Is this a Keras variable (I am not familiar with Keras)?

VGG-Speaker-Recognition/src/new_layers.py

Line 87 in 8024093

anneal_cluster_score = (cluster_score - max_cluster_score)/self.annealing
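
Read in isolation, the quoted line has the shape of a temperature-scaled score: subtract the maximum for numerical stability, then divide by a scalar. The sketch below shows what that would do if a softmax followed. Both the softmax and the idea that `annealing` is a positive scalar are assumptions, since the definition is not visible in the quoted snippet, and the function name is hypothetical.

```python
import math


def annealed_softmax(scores, annealing):
    """Softmax over (score - max) / annealing; smaller annealing gives a sharper output."""
    max_score = max(scores)
    scaled = [(s - max_score) / annealing for s in scores]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


cluster_score = [2.0, 1.0, 0.0]
soft = annealed_softmax(cluster_score, annealing=10.0)   # flatter assignment
sharp = annealed_softmax(cluster_score, annealing=0.1)   # close to a hard assignment
print(soft, sharp)
```

Under these assumptions the scalar acts as a temperature that controls how hard the assignment of a feature to the clusters is.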

Difference between the average and VLAD pooling

Hi,

This paper and the idea are pretty interesting! May I ask two questions about the details, please?

I found that the idea of LDE (learnable dictionary encoding, cited in the paper as [Cai et al.]) is very similar to NetVLAD (if not the same). What is your opinion on the difference between LDE and the NetVLAD used in this paper?
After going through the code, I found that the forward propagation of VLAD and average pooling seems to differ.
For average pooling, the output of resnet_2D_v1/v2 is directly used, which makes the shape to be
[batch, 7, 16, D] -> [batch, 84, D] (after pooling, no additional layer)

For VLAD, the output is processed by an additional Conv2D layer, making the shape:
[batch, 7, 16, D] -> [batch, 1, 16, D] (feat, use Conv2D) / [batch, 1, 16, n_clusters] (cluster_score) -> [batch, D * n_clusters] (after VLAD)

The additional layer may lead to better performance. Maybe this is part of the reason why TAP performs poorly in the paper?

Last, the performance comparison in the paper is really useful. Good work :-)

Section "Probing verification based on length" -- about verification pairs 25,020

Hi Weidi, I'm puzzled by one thing. There are 1251 speakers in the VoxCeleb1 dataset. If 100 positive pairs and 100 negative pairs are sampled for each speaker, wouldn't it be more reasonable to have 250,200 verification pairs in total?
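
The arithmetic behind the question can be checked directly. The sketch below uses only the numbers quoted above; the variable names are hypothetical.

```python
num_speakers = 1251
positive_pairs_per_speaker = 100
negative_pairs_per_speaker = 100

total_pairs = num_speakers * (positive_pairs_per_speaker + negative_pairs_per_speaker)
quoted_pairs = 25020  # the figure from the section title

print(total_pairs)                        # pairs implied by 100 + 100 per speaker
print(total_pairs / float(quoted_pairs))  # ratio between the two figures
```

The two figures differ by exactly a factor of ten, so the quoted 25,020 would match 20 pairs per speaker rather than 200. Which of the two numbers reflects the actual experiment is for the author to confirm.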