weidixie / vgg-speaker-recognition
Utterance-level Aggregation For Speaker Recognition In The Wild
Hi Weidi,
Everything works if I train a new model on the CPU. But when I choose the GPU for training, the training script crashes at the first iteration with the message "Segmentation fault (core dumped)".
This is the log I got:
Epoch 1/128
Learning rate for epoch 1 is 0.0001.
Segmentation fault (core dumped)
Do you have any comment or suggestion for this?
Why does the code use Keras modules instead of TensorFlow convolution layers?
Thanks for the great work!
The pre-trained model handles identity coding effectively.
But when I switch to my own data to fine-tune this pre-trained model, I get an error.
My Python command line is:
python main.py --net resnet34s --batch_size 3 --gpu 0 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 20 --multiprocess 4 --loss softmax --data_path ''
The error is:
Traceback (most recent call last):
File "main.py", line 212, in <module>
main()
File "main.py", line 84, in main
network.load_weights(os.path.join(args.resume), by_name=True)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/network.py", line 1163, in load_weights
reshape=reshape)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/saving.py", line 1149, in load_weights_from_hdf5_group_by_name
str(weight_values[i].shape) + '.')
ValueError: Layer #125 (named "gvlad_center_assignment"), weight <tf.Variable 'gvlad_center_assignment/kernel:0' shape=(7, 1, 512, 12) dtype=float32_ref> has shape (7, 1, 512, 12), but the saved weight has shape (10, 512, 7, 1).
I've tried reshaping, but it doesn't work.
I have no idea what causes this. How can I fix it?
Hi,
I saw that you used 'orthogonal' initialization for the whole network. Is there a special reason behind that?
https://github.com/WeidiXie/VGG-Speaker-Recognition/blob/master/src/backbone.py#L221
y = MaxPooling2D((3, 1), strides=(2, 1), name='mpool2')(x5)
I used the default training code to train the model. During training, the accuracy reaches 90.09%.
I also used the default testing code to test the model. The EER is 0.0357370095445.
What results did you get during training?
Could you use only TensorFlow, not Keras, for the CNN forward pass in your code?
Hi, I downloaded your pre-trained weights and loaded them into a model, then trained the model on VoxCeleb2, but the loss is 9 and the accuracy is 0.001, as if the weights had not been loaded at all. I expected a low loss and an accuracy of about 0.92. Do you know why?
Looking forward to your reply.
Traceback (most recent call last):
File "src/main.py", line 201, in <module>
main()
File "src/main.py", line 99, in main
update_freq=args.batch_size * 16)
File "/data/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 745, in __init__
from tensorflow.contrib.tensorboard.plugins import projector
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/__init__.py", line 37, in <module>
from tensorflow.contrib import distributions
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distributions/__init__.py", line 39, in <module>
from tensorflow.contrib.distributions.python.ops.estimator import *
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/distributions/python/ops/estimator.py", line 21, in <module>
from tensorflow.contrib.learn.python.learn.estimators.head import _compute_weighted_loss
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/__init__.py", line 95, in <module>
from tensorflow.contrib.learn.python.learn import *
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/__init__.py", line 28, in <module>
from tensorflow.contrib.learn.python.learn import *
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/__init__.py", line 30, in <module>
from tensorflow.contrib.learn.python.learn import estimators
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/__init__.py", line 302, in <module>
from tensorflow.contrib.learn.python.learn.estimators.dnn import DNNClassifier
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn.py", line 35, in <module>
from tensorflow.contrib.learn.python.learn.estimators import dnn_linear_combined
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py", line 36, in <module>
from tensorflow.contrib.learn.python.learn.estimators import estimator
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 52, in <module>
from tensorflow.contrib.learn.python.learn.learn_io import data_feeder
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/learn_io/__init__.py", line 26, in <module>
from tensorflow.contrib.learn.python.learn.learn_io.dask_io import extract_dask_data
File "/data/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py", line 34, in <module>
allowed_classes = (dd.Series, dd.DataFrame)
AttributeError: module 'dask.dataframe' has no attribute 'Series'
The dataset split file meta/voxlb2_train.txt contains audio files that are also in meta/voxlb2_val.txt.
The number of training examples decreases from 1,198,728 to 985,290 when the examples in the validation set are removed.
I guess people using this repository are suffering from overfitting because of the split error.
Please remove the duplicated examples and re-upload the two split files!
The code below is the one that I used to remove the duplicates using Pandas:
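import pandas as pd

# Read each split file as space-separated (path, label) rows.
df_valid = pd.read_csv('meta/voxlb2_valid.txt', sep=' ', names=['path', 'label'])
df_train = pd.read_csv('meta/voxlb2_train.txt', sep=' ', names=['path', 'label'])
# Keep only the training rows whose path is not in the validation split.
df_train = df_train[~df_train.path.isin(df_valid.path)]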
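The same filtering can be sketched without Pandas. This is a minimal sketch that assumes each line of a split file is a path and a label separated by a single space; the function name `remove_overlap` and the sample paths are made up for illustration.

```python
def remove_overlap(train_lines, valid_lines):
    """Drop training lines whose path also appears in the validation lines."""
    valid_paths = {line.split(' ')[0] for line in valid_lines if line.strip()}
    return [line for line in train_lines
            if line.strip() and line.split(' ')[0] not in valid_paths]


# Toy data standing in for the real split files (hypothetical paths).
train = ['id001/a.wav 0', 'id001/b.wav 0', 'id002/c.wav 1']
valid = ['id001/b.wav 0']
kept = remove_overlap(train, valid)
print(len(train), '->', len(kept))
```

Comparing the two counts before and after filtering is a quick way to see how many training examples overlapped with the validation split.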
@WeidiXie First, I converted the m4a files to wav files, and changed nothing in the code.
Learning rate for epoch 1 is 0.001.
9365/9365 [==============================] - 6966s 744ms/step - loss: 8.4926 - acc: 3.9876e-04
Epoch 2/64
Learning rate for epoch 2 is 0.001.
9365/9365 [==============================] - 6759s 722ms/step - loss: 8.4370 - acc: 4.4548e-04
Epoch 3/64
Learning rate for epoch 3 is 0.001.
9365/9365 [==============================] - 6732s 719ms/step - loss: 8.4284 - acc: 4.2629e-04
Epoch 4/64
Learning rate for epoch 4 is 0.001.
9365/9365 [==============================] - 6789s 725ms/step - loss: 8.4265 - acc: 4.6633e-04
Epoch 5/64
Learning rate for epoch 5 is 0.001.
9365/9365 [==============================] - 6795s 726ms/step - loss: 8.4266 - acc: 4.2295e-04
Epoch 6/64
Learning rate for epoch 6 is 0.001.
9365/9365 [==============================] - 6742s 720ms/step - loss: 8.4259 - acc: 4.2629e-04
Epoch 7/64
Learning rate for epoch 7 is 0.001.
9365/9365 [==============================] - 6726s 718ms/step - loss: 8.4257 - acc: 4.4381e-04
Epoch 8/64
Learning rate for epoch 8 is 0.001.
9365/9365 [==============================] - 6729s 718ms/step - loss: 8.4256 - acc: 4.3630e-04
Epoch 9/64
Learning rate for epoch 9 is 0.001.
9365/9365 [==============================] - 6733s 719ms/step - loss: 8.4255 - acc: 4.3046e-04
Epoch 10/64
Learning rate for epoch 10 is 0.001.
9365/9365 [==============================] - 6768s 723ms/step - loss: 8.4254 - acc: 4.4548e-04
Epoch 11/64
Learning rate for epoch 11 is 0.001.
9365/9365 [==============================] - 6743s 720ms/step - loss: 8.4253 - acc: 4.2462e-04
Epoch 12/64
Learning rate for epoch 12 is 0.001.
9365/9365 [==============================] - 6757s 722ms/step - loss: 8.4252 - acc: 4.4631e-04
Epoch 13/64
Learning rate for epoch 13 is 0.001.
9365/9365 [==============================] - 6751s 721ms/step - loss: 8.4253 - acc: 4.2379e-04
Epoch 14/64
Learning rate for epoch 14 is 0.001.
9365/9365 [==============================] - 6754s 721ms/step - loss: 8.4253 - acc: 4.4214e-04
Epoch 15/64
Learning rate for epoch 15 is 0.001.
9365/9365 [==============================] - 6796s 726ms/step - loss: 8.4253 - acc: 4.0960e-04
Epoch 16/64
Learning rate for epoch 16 is 0.001.
9365/9365 [==============================] - 6755s 721ms/step - loss: 8.4252 - acc: 4.1628e-04
Epoch 17/64
Learning rate for epoch 17 is 0.0001.
9365/9365 [==============================] - 6743s 720ms/step - loss: 8.4183 - acc: 4.2796e-04
Epoch 18/64
Learning rate for epoch 18 is 0.0001.
9365/9365 [==============================] - 6741s 720ms/step - loss: 8.4151 - acc: 4.2629e-04
In computer vision, there are tools such as CAM and Grad-CAM for visualizing what the network has learned for the final classification.
In speaker recognition, how can I analyze which parts of the input activate the output, so that I can tell directly whether the network has learned good features? What do the input spectrograms of the same speaker have in common across different contexts?
I'm rewriting this project on PyTorch and got confused with the code below.
if mode == 'eval': y = keras.layers.Lambda(lambda x: keras.backend.l2_normalize(x, 1))(x)
Is there some special reason behind that?
I am also facing this issue: the model trains very slowly. Other code and projects run fine on the same GPU, and the GPU is being used, but VGG-Speaker runs slowly.
I tried it on two NVIDIA GTX 1060 cards installed in my computer and on a P100 on Google Cloud as well.
I tried everything I could to resolve this issue, without success.
Epoch 1/10
Learning rate for epoch 1 is 0.0001.
17/305810 [..............................] - ETA: 3354:00:12 - loss: 0.8716 - acc: 0.9531
Please help.
Thanks.
src/model.py, line 70:
outputs = K.reshape(cluster_l2, [-1, self.k_centers * num_features])
TypeError: unsupported operand type(s) for *: 'int' and 'Dimension'
Modified to:
outputs = K.reshape(cluster_l2, [-1, int(self.k_centers) * int(num_features)])
After this change it runs successfully.
I tried to train on VoxCeleb2:
5272/7492 [====================>.........] - ETA: 49:13 - loss: 7.6936 - acc: 0.0133
and got this message during training:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/data/ghostvlad-speaker-original/src/utils.py", line 28, in load_data
linear_spect = lin_spectogram_from_wav(wav, hop_length, win_length, n_fft)
File "/data/ghostvlad-speaker-original/src/utils.py", line 22, in lin_spectogram_from_wav
linear = librosa.stft(wav, n_fft=n_fft, win_length=win_length, hop_length=hop_length) # linear spectrogram
File "/usr/local/lib/python3.5/dist-packages/librosa/core/spectrum.py", line 165, in stft
y = np.pad(y, int(n_fft // 2), mode=pad_mode)
File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraypad.py", line 1290, in pad
" in axis {} of `array`".format(axis))
ValueError: There aren't any elements to reflect in axis 0 of `array`
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "main.py", line 196, in <module>
main()
File "main.py", line 136, in main
verbose=1)
File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/training_generator.py", line 181, in fit_generator
generator_output = next(output_generator)
File "/usr/local/lib/python3.5/dist-packages/keras/utils/data_utils.py", line 601, in get
six.reraise(*sys.exc_info())
File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/keras/utils/data_utils.py", line 595, in get
inputs = self.queue.get(block=True).get()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.5/dist-packages/keras/utils/data_utils.py", line 401, in get_index
return _SHARED_SEQUENCES[uid][i]
File "/data/ghostvlad-speaker-original/src/generator.py", line 42, in __getitem__
X, y = self.__data_generation_mp(list_IDs_temp, indexes)
File "/data/ghostvlad-speaker-original/src/generator.py", line 58, in __data_generation_mp
X = np.expand_dims(np.array([p.get() for p in X]), -1)
File "/data/ghostvlad-speaker-original/src/generator.py", line 58, in <listcomp>
X = np.expand_dims(np.array([p.get() for p in X]), -1)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
ValueError: There aren't any elements to reflect in axis 0 of `array`
I0528 18:57:06.038911 22079 executor.cpp:675] Container exited with status 1
W0528 18:57:06.038911 22072 logging.cpp:93] RAW: Received signal SIGTERM from process 16635 of user 0; exiting
And I think the accuracy is too low. How about in your case?
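The `ValueError` in the worker traceback is raised when the array handed to `np.pad` is empty, which suggests that at least one audio file in the training list decodes to zero samples. Below is a minimal sketch for finding such files before training. It assumes the files are PCM WAV files that Python's standard `wave` module can read; `count_frames`, `find_short_wavs`, and `min_frames` are illustrative names, not part of the repository.

```python
import wave


def count_frames(path):
    """Return the number of audio frames stored in a PCM WAV file."""
    with wave.open(path, 'rb') as wav_file:
        return wav_file.getnframes()


def find_short_wavs(paths, min_frames=1):
    """Return the paths whose frame count is below min_frames."""
    return [p for p in paths if count_frames(p) < min_frames]
```

Running `find_short_wavs` over the paths in the training list and removing whatever it returns is one way to keep empty files out of the generator.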
I trained the model with the same backbone and ghost pooling, once with softmax and once with amsoftmax, but the accuracy with amsoftmax is very low. Are there any tricks for training with amsoftmax?
18/3521 [..............................] - ETA: 5:33:17 - loss: 9.4843 - acc: 0.0017
Traceback (most recent call last):
File "train.py", line 270, in <module>
main()
File "train.py", line 207, in main
verbose=1)
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/engine/training_generator.py", line 181, in fit_generator
generator_output = next(output_generator)
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/utils/data_utils.py", line 601, in get
six.reraise(*sys.exc_info())
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/site-packages/keras/utils/data_utils.py", line 595, in get
inputs = self.queue.get(block=True).get()
File "/home/gongke/anaconda3/envs/py27/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
ValueError: Range cannot be empty (low >= high) unless no samples are taken
I got this error while training the model on my data. Have you ever run into it?
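This message is what NumPy's `randint` reports when `low >= high`, which would happen if a random crop offset is drawn for a spectrogram that is not longer than the crop length. That reading is an assumption, since the traceback does not show the generator code. Below is a minimal sketch of a crop that tolerates short inputs by repeating them along the time axis; `random_crop` and `spec_len` are illustrative names.

```python
import numpy as np


def random_crop(spec, spec_len):
    """Return a (freq, spec_len) window, repeating short inputs along time."""
    freq, time = spec.shape
    if time == 0:
        raise ValueError('spectrogram has no time frames')
    if time < spec_len:
        reps = -(-spec_len // time)  # ceiling division
        spec = np.tile(spec, (1, reps))
        time = spec.shape[1]
    # randint's upper bound is exclusive, so +1 keeps the range non-empty
    # even when time == spec_len.
    start = np.random.randint(0, time - spec_len + 1)
    return spec[:, start:start + spec_len]
```

Repeating the signal is only one possible policy; skipping utterances that are too short is an equally reasonable choice.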
The paper defines each stage of the thin ResNet as a block repeated X times, but in the code each stage is conv2d plus identity_block_2d repeated a number of times. Why is it done this way, which differs from the original ResNet architecture?
Also, did you try other architectures such as SE-ResNet or SE-ResNeXt? An attention mechanism may help feature extraction.
I have 4.5k speakers and 88k utterances (150k pairs of text in total), each between 4 and 10 seconds long, in mixed Hindi-English (98% Hindi).
I tried to run VGG-Speaker-Recognition on my dataset, and it gives me the following result:
Epoch 1/50
Learning rate for epoch 1 is 0.0001.
1130/1130 [==============================] - 11718s 10s/step - loss: 1.6074 - acc: 0.5193
Epoch 2/50
Learning rate for epoch 2 is 0.0001.
1130/1130 [==============================] - 11616s 10s/step - loss: 1.2483 - acc: 0.5243
Epoch 3/50
Learning rate for epoch 3 is 0.0001.
1130/1130 [==============================] - 11591s 10s/step - loss: 1.2341 - acc: 0.5260
Epoch 4/50
Learning rate for epoch 4 is 0.0001.
1130/1130 [==============================] - 11512s 10s/step - loss: 1.2289 - acc: 0.5239
Epoch 5/50
Learning rate for epoch 5 is 0.0001.
1130/1130 [==============================] - 11470s 10s/step - loss: 1.2255 - acc: 0.5281
Epoch 6/50
Learning rate for epoch 6 is 0.0001.
1130/1130 [==============================] - 11548s 10s/step - loss: 1.2246 - acc: 0.5264
Epoch 7/50
Learning rate for epoch 7 is 0.0001.
1130/1130 [==============================] - 11550s 10s/step - loss: 1.2228 - acc: 0.5278
Epoch 8/50
Learning rate for epoch 8 is 0.0001.
1130/1130 [==============================] - 11602s 10s/step - loss: 1.2223 - acc: 0.5273
Epoch 9/50
Learning rate for epoch 9 is 0.0001.
1130/1130 [==============================] - 11620s 10s/step - loss: 1.2211 - acc: 0.5292
Epoch 10/50
Learning rate for epoch 10 is 0.0001.
1130/1130 [==============================] - 11581s 10s/step - loss: 1.2206 - acc: 0.5284
Epoch 11/50
Learning rate for epoch 11 is 0.0001.
1130/1130 [==============================] - 11544s 10s/step - loss: 1.2203 - acc: 0.5272
Epoch 12/50
Learning rate for epoch 12 is 0.0001.
1130/1130 [==============================] - 11467s 10s/step - loss: 1.2196 - acc: 0.5286
Epoch 13/50
Learning rate for epoch 13 is 0.0001.
1130/1130 [==============================] - 11399s 10s/step - loss: 1.2191 - acc: 0.5294
As the log above shows, training accuracy stays at about 52-53% and the model seems to be stuck there.
Earlier I had 500 speakers (a subset of the 4.5k speakers), and I got the same result then as well.
What could be the reason for this result? Please help @WeidiXie.
When loading the pretrained model, I am getting the following error:
Traceback (most recent call last):
File "main.py", line 201, in <module>
main()
File "main.py", line 82, in main
if mgpu == 1: network.load_weights(os.path.join(args.resume))
File "/home/ultron/miniconda3/envs/tf/lib/python3.7/site-packages/keras/engine/network.py", line 1166, in load_weights
f, self.layers, reshape=reshape)
File "/home/ultron/miniconda3/envs/tf/lib/python3.7/site-packages/keras/engine/saving.py", line 1030, in load_weights_from_hdf5_group
str(len(filtered_layers)) + ' layers.')
ValueError: You are trying to load a weight file containing 80 layers into a model with 81 layers.
Hi Weidixie:
Thanks for sharing. I have a question: have you ever tried training the model with gVLAD combined with a triplet loss or Google's GE2E loss, and if so, how well did it work?
I look forward to your reply, and thank you very much again!
src/main.py
It didn't run on the GPU, did it?
In the function load_wav_Predict, why do you extend the audio signal (see the code below)?
I cannot think of a reason why one would need to extend the time signal.
import librosa
import numpy as np

def load_wav_Predict(vid_path, sr):
    wav, sr_ret = librosa.load(vid_path, sr=sr)
    assert sr_ret == sr
    # Append the time-reversed signal to the original signal.
    extended_wav = np.append(wav, wav[::-1])
    return extended_wav
Hi,
What is the accuracy and EER of the model (https://drive.google.com/open?id=1M_SXoW1ceKm3LghItY2ENKKUn3cWYfZm) you provide?
Hi, I want to train some layers and leave the others frozen (I don't have to train the entire model).
I don't know how I can adjust it.
"python main.py --net resnet34s --batch_size 160 --gpu 2,3 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 128 --multiprocess 8 --loss softmax --data_path ../path_to_voxceleb2:" ??
src/predict.py
specs = ut.load_data(...),
specs is the feature obtained through the STFT, mean subtraction, and division by the time-wise variance, yet it is an array, not a tensor.
Hello @WeidiXie ,
Thanks for this awesome work, and sharing it with the open-source community ! I am trying to adapt the code for training on VoxCeleb1 (just because it is a smaller dataset, I decided to play with it first).
I have prepared the file lists, plugged in your weights, frozen the first layers up to the bottleneck in the code, and tried to run main.py. However, for some reason I get an annoying error before training starts, which is likely related to the fact that you're using multiprocessing to speed up data generation:
File "D:\Repos\VGG-Speaker-Recognition\tool\toolkits.py", line 45, in set_mp
pool = mp.Pool(processes=processes, initializer=init_worker)
File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\pool.py", line 175, in __init__
self._repopulate_pool()
File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\pool.py", line 236, in _repopulate_pool
self._wrap_exception)
File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\pool.py", line 255, in _repopulate_pool_static
w.start()
File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\abiryukov\AppData\Local\Continuum\anaconda3\envs\pyBK\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'set_mp.<locals>.init_worker'
I also tried to set the number of processes to 1, and that did not help either.
I wonder if you have any suggestions on how to alleviate this.
Thanks again,
Anton.
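On Windows, `multiprocessing` starts workers with the spawn method and has to pickle the pool initializer, and a function defined inside another function cannot be pickled. Below is a minimal sketch of one common workaround: define the initializer at module level. The body of the original `init_worker` is not shown in the traceback, so the `SIGINT` handling here is a guess, and the default of 8 processes is only a placeholder.

```python
import multiprocessing as mp
import signal


def init_worker():
    """Module-level initializer, so spawn-based pools can pickle it."""
    # Assumption: workers ignore Ctrl+C and leave shutdown to the parent.
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def set_mp(processes=8):
    """Create a worker pool that uses the module-level initializer."""
    return mp.Pool(processes=processes, initializer=init_worker)
```

With the spawn start method, the call to `set_mp` should also sit under an `if __name__ == '__main__':` guard in the entry script, so that child processes do not re-run it on import.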
How to train the model on the Voxceleb1 dataset?
Thanks!
I tried to add my own dataset and build a model, but I get a data shape error:
4/14671 [..............................] - ETA: 29:30:39 - loss: 11.9053 - acc: 0.0000e+00
Traceback (most recent call last):
File "src/main.py", line 195, in <module>
main()
File "src/main.py", line 135, in main
verbose=1)
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/engine/training_generator.py", line 181, in fit_generator
generator_output = next(output_generator)
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/utils/data_utils.py", line 601, in get
six.reraise(*sys.exc_info())
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/utils/data_utils.py", line 595, in get
inputs = self.queue.get(block=True).get()
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/rohit/anaconda3/envs/condaenv/lib/python3.7/site-packages/keras/utils/data_utils.py", line 401, in get_index
return _SHARED_SEQUENCES[uid][i]
File "/media/rohit/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/VGG-Speaker-Recognition/src/generator.py", line 43, in __getitem__
X, y = self.__data_generation_mp(list_IDs_temp, indexes)
File "/media/rohit/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/VGG-Speaker-Recognition/src/generator.py", line 59, in __data_generation_mp
X = np.expand_dims(np.array([p.get() for p in X]), -1)
ValueError: could not broadcast input array from shape (257,250) into shape (257)
Is it an issue with the size or duration of an audio file?
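A broadcast error like this typically appears when the spectrograms collected for one batch do not all have the same shape, so NumPy cannot stack them into one array. Whether that is the cause here is an assumption. Below is a minimal sketch that checks shapes before stacking and reports the offending items; `stack_checked` and `expected_shape` are illustrative names.

```python
import numpy as np


def stack_checked(specs, expected_shape):
    """Stack spectrograms into (batch, freq, time, 1) after a shape check."""
    bad = [(i, s.shape) for i, s in enumerate(specs)
           if s.shape != expected_shape]
    if bad:
        # Report (index, shape) pairs so the source files can be traced.
        raise ValueError('unexpected spectrogram shapes: %s' % bad)
    return np.expand_dims(np.stack(specs), -1)
```

Mapping the reported indices back to file names would show which audio files produce spectrograms of a different length.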
Hello Weidi
Thank you for the excellent paper and making your work opensource.
I have a couple of questions:
What is the duration of the audio files you used for the pre-trained model? In the paper you mention that audio files have a duration of 2.5 seconds during the training phase. But you also mention that for ‘in the wild’ sequences, longer utterances (4 seconds or more) are a significant improvement over shorter segments. I would imagine that for best results I have to use audio chunks of the same size as those used for training, so I want to know the duration used for the trained model.
Did you try Mel STFTs instead of linear STFTs? Typically, Mel STFTs are known to provide far better results than linear STFTs.
Many thanks.
This is from the NetVLAD author's presentation. As the yellow circle shows, the feature map x is the same for both branches. Your code differs in a few ways:
1: From the feature map, x --> x_fc and x --> x_k_center, and these two are passed to vladpooling, which does the softmax and normalization. Compared with NetVLAD, x --> fc looks unnecessary.
2: Before computing the softmax, why subtract the max first? This does not seem very common.
3: NetVLAD first does intra-normalization and then L2 normalization (one paper reports that this improves accuracy), but here there is only one L2 normalization.
What are the benefits of these differences?
I trained with NetVLAD and with vladpooling on my dataset, using the same optimizer parameters, and used Grad-CAM to compare the feature-map activations. Sometimes vladpooling activates on the background noise, but NetVLAD does not:
Original code:
Original code with self-attention added after the feature map and before vladpooling:
Self-attention after the feature map and before NetVLAD:
Environment:
Before testing, I truncated voxceleb1_veri_test.txt to only 50 lines to speed up the test.
Command:
python predict.py --gpu 0 --net resnet34s --ghost_cluster 2 --vlad_cluster 8 --loss softmax --resume ../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5
output:
Instructions for updating:
Colocations handled automatically by placer.
==> successfully loading model ../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5.
==> start testing.
Finish extracting features for 0/100th wav.
2019-05-29 17:26:34.142088: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Finish extracting features for 50/100th wav.
scores : 0.848808407784, gt : 1
scores : 0.635771036148, gt : 0
scores : 0.896652877331, gt : 1
scores : 0.666865587234, gt : 0
scores : 0.86113858223, gt : 1
scores : 0.649969100952, gt : 0
scores : 0.882004976273, gt : 1
scores : 0.675234436989, gt : 0
scores : 0.814641714096, gt : 1
scores : 0.612952053547, gt : 0
scores : 0.841726779938, gt : 1
scores : 0.691002070904, gt : 0
scores : 0.875537037849, gt : 1
scores : 0.760980308056, gt : 0
scores : 0.862766265869, gt : 1
scores : 0.595528423786, gt : 0
scores : 0.872184753418, gt : 1
scores : 0.580520808697, gt : 0
scores : 0.866317629814, gt : 1
scores : 0.861166357994, gt : 1
scores : 0.735198259354, gt : 0
scores : 0.846519947052, gt : 1
scores : 0.634202837944, gt : 0
scores : 0.879867553711, gt : 1
scores : 0.617964744568, gt : 0
scores : 0.866540849209, gt : 1
scores : 0.502503097057, gt : 0
scores : 0.884967088699, gt : 1
scores : 0.568573653698, gt : 0
scores : 0.926931381226, gt : 1
scores : 0.637345910072, gt : 0
scores : 0.834380090237, gt : 1
scores : 0.620291650295, gt : 0
scores : 0.912857890129, gt : 1
scores : 0.626294493675, gt : 0
scores : 0.952058196068, gt : 1
scores : 0.640718281269, gt : 0
scores : 0.933943748474, gt : 1
scores : 0.51838862896, gt : 0
scores : 0.861519873142, gt : 1
scores : 0.771008253098, gt : 0
scores : 0.881197452545, gt : 1
scores : 0.641325950623, gt : 0
scores : 0.885362446308, gt : 1
scores : 0.748977065086, gt : 0
scores : 0.839608311653, gt : 1
scores : 0.611160635948, gt : 0
/home/ubuntu/anaconda3/envs/python27/lib/python2.7/site-packages/scipy/interpolate/interpolate.py:610: RuntimeWarning: divide by zero encountered in true_divide
slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]
/home/ubuntu/anaconda3/envs/python27/lib/python2.7/site-packages/scipy/interpolate/interpolate.py:613: RuntimeWarning: invalid value encountered in multiply
y_new = slope*(x_new - x_lo)[:, None] + y_lo
==> model : ../model/gvlad_softmax/resnet34_vlad8_ghost2_bdim512_deploy/weights.h5, EER: 0.0
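In the truncated list above, every pair with gt 1 scores at least 0.81 and every pair with gt 0 scores at most 0.78, so a single threshold separates the two classes and an EER of 0.0 is what one would expect for this subset. Below is a minimal threshold-sweep sketch of the EER. It is not the SciPy interpolation routine that produced the warnings above, and `compute_eer` is an illustrative name.

```python
def compute_eer(scores, labels):
    """Approximate the EER by trying every observed score as a threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = None, None
    for t in sorted(set(scores)) + [float('inf')]:
        far = sum(s >= t for s in neg) / len(neg)  # false accept rate
        frr = sum(s < t for s in pos) / len(pos)   # false reject rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# The first six scores from the log above; they are perfectly separable.
scores = [0.848808407784, 0.635771036148, 0.896652877331,
          0.666865587234, 0.86113858223, 0.649969100952]
labels = [1, 0, 1, 0, 1, 0]
print(compute_eer(scores, labels))
```

On a set this small and this cleanly separated, the number says little about the model; the full trial list gives a more meaningful figure.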
I use 6 GPUs to train the model with batch size 200, but the utilization of each GPU is only 0-22% and an epoch takes 2 hours, although memory usage is high. Training is no faster than with 2 GPUs, where utilization is about 80%.
I then changed the multi-thread setting to 128; utilization rose to 30-50% and an epoch takes 1 hour. Is there any way to increase GPU utilization further?
What is the output array (feat)? It is a 512-D vector; can I use it as the extracted feature vector?
I haven't found any information about the hardware. Can you tell me which GPU model you used? In my implementation (approximately 3.6 million parameters, as in the paper), 35 elements take about 11 GB of memory (1080 Ti). Maybe I made a mistake.
Hi, sorry for bothering you again.
What was the final train loss of the model? Actually, I'm worried about our model being overfitted to VoxCeleb2 data, so I'm not sure whether I should use early-stopping here, or just wait until the convergence.
Hi.
I found your code really complete and tried to use it.
When I use the model referenced in the README for prediction, it works. But when I train on my own data with "python main.py ...." and then use the result for prediction, I get an error about mismatched shapes.
Would you please help me with that?
I'm getting an error:
Using TensorFlow backend.
Traceback (most recent call last):
File "src/main.py", line 195, in <module>
main()
File "src/main.py", line 41, in main
toolkits.initialize_GPU(args)
AttributeError: 'module' object has no attribute 'initialize_GPU'
It makes sense, since it has only import toolkits
Run command:
python src/main.py --net resnet34s --batch_size 160 --gpu 2,3 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 128 --loss softmax --data_path ../../data/voxceleb1
My libraries:
tensorflow 1.8.0
toolkits 0.1.28
Hi, thanks again for open-sourcing this Speaker Recognition system and kindly replying to every issue.
I have a question about the model shown here. When evaluating final 512-dimension output(embedding vector), ReLU activation is applied at last, as shown in https://github.com/WeidiXie/VGG-Speaker-Recognition/blob/master/src/model.py#L139-L147. Hence, the output looks like:
[[1.15972664e-02 5.04462933e-03 2.85871420e-02 0.00000000e+00
3.08723319e-02 0.00000000e+00 3.42872031e-02 4.36003655e-02
0.00000000e+00 1.12573527e-01 5.46368458e-31 5.64192347e-02
2.56476291e-02 0.00000000e+00 2.51553692e-02 4.77801599e-02
0.00000000e+00 3.06680351e-02 2.24540825e-03 0.00000000e+00
1.33734914e-02 2.91635211e-31 2.31502447e-02 5.39273359e-02
9.22401696e-02 0.00000000e+00 3.31045166e-02 5.57319149e-02
1.24792336e-02 4.04326282e-02 6.75894767e-02 0.00000000e+00
6.08060285e-02 4.47864346e-02 2.85473187e-02 0.00000000e+00
... (truncated)
Here, we can observe that some values are 0.
In my opinion, the last ReLU layer is eliminating some information by erasing all negative values. Moreover, it limits the area of the hypersphere where embeddings can exist, by a factor of 1/2^512. So, my question is: was the last ReLU layer necessary?
I strongly believe that it was necessary(since it's currently SotA on Speaker Recognition in the wild!), but I couldn't guess the necessity of the last ReLU layer. I would like to kindly ask you about that. Thanks in advance.
Hi, Weidi .
Thank you for your prompt reply !!!
And sorry to interrupt again.
As you said yesterday, I can successfully load the pre-training model.
But I encountered a new mistake as the title shows.
This time I used the following command line:
python main.py --net resnet34s --batch_size 3 --gpu 0 --lr 0.001 --ghost_cluster 2 --vlad_cluster 8 --warmup_ratio 0.1 --optimizer adam --epochs 20 --multiprocess 1 --loss softmax --data_path ''
I know that the model loaded successfully through the information displayed in the terminal.
But after it prints "Learning rate for epoch 1 is 0.0001.", the new error occurred.
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/utils/data_utils.py", line 559, in _run
sequence = list(range(len(self.sequence)))
TypeError: 'int' object is not c
I think it's a multi-process or multi-threading problem.
So I tried setting --multiprocess 0, or commenting out the parts of the code that involve multiple processes.
But nothing changed.
I found an issue where someone set --workers 0, so I changed the fit_generator call in main.py to use workers=0.
A "new" error occurred:
Traceback (most recent call last):
File "main.py", line 223, in <module>
main()
File "main.py", line 162, in main
verbose=0)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/training_generator.py", line 181, in fit_generator
generator_output = next(output_generator)
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/engine/training_utils.py", line 590, in iter_sequence_infinite
for item in seq:
File "/usr/local/anaconda3/envs/py3.6/lib/python3.6/site-packages/keras/utils/data_utils.py", line 372, in __iter__
for item in (self[i] for i in range(len(self))):
TypeError: 'int' object is not callable
The same error, TypeError: 'int' object is not callable, occurred. Someone said this problem can be caused by a custom variable whose name duplicates that of a built-in function or class.
I can't solve the problem, so I would like to ask whether you have come across it or have any ideas for solving it.
Sorry again for the interruption, and thanks in advance!
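Both tracebacks fail inside len(...) on the data generator, so one thing worth checking is how __len__ is defined on it. The sketch below is a hypothetical reproduction in plain Python, not the repository's generator: BrokenSequence and FixedSequence are made-up names. It shows one way to get exactly this message, namely a __len__ that resolves to an int instead of a method (here via @property; a class attribute such as __len__ = 10 behaves the same way).

```python
class BrokenSequence:
    """Hypothetical generator whose __len__ is not a plain method."""

    def __init__(self, n_batches):
        self.n_batches = n_batches

    @property
    def __len__(self):
        # Bug: the property yields an int, and len() then tries to call that int.
        return self.n_batches


class FixedSequence:
    """Same class with __len__ defined as an ordinary method."""

    def __init__(self, n_batches):
        self.n_batches = n_batches

    def __len__(self):
        return self.n_batches


message = None
try:
    len(BrokenSequence(10))
except TypeError as exc:
    message = str(exc)

print("broken:", message)
print("fixed:", len(FixedSequence(10)))
```

If the generator's __len__ looks correct, the same check applies to anything else that len(...) is called on in the data-loading path.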
In tool/toolkits.py, I changed
os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu
to
os.environ['TF_CPP_MIN_LOG_LEVEL'] = args.gpu
After this change, training starts:
Epoch 1/128
Learning rate for epoch 1 is 0.001.
1/7492 [..............................] - ETA: 294:55:57 - loss: 9.3743 - acc: 0.0000e+00
but it trains slowly, and the GPU is not being used.
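The two environment variables in this post do different jobs: CUDA_VISIBLE_DEVICES selects which GPUs the process may see, while TF_CPP_MIN_LOG_LEVEL only controls how much TensorFlow logs. A minimal sketch of setting them independently is given below; configure_devices is a hypothetical helper, not a function from the repository, and this alone may not explain the slow training (a CPU-only TensorFlow build would also leave the GPU idle).

```python
import os


def configure_devices(gpu_ids, log_level="2"):
    """Set GPU visibility and TensorFlow log verbosity as two separate settings."""
    # Comma-separated device indices, e.g. "0" or "0,1"; an empty string hides all GPUs.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids
    # "0" shows all logs; "1", "2", "3" progressively filter INFO, WARNING, ERROR.
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = log_level


# Call this before importing TensorFlow or Keras so the values are read at start-up.
configure_devices("0")
print(os.environ["CUDA_VISIBLE_DEVICES"], os.environ["TF_CPP_MIN_LOG_LEVEL"])
```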
Thanks for open sourcing the code and pretrained weights!
I'd like to be able to use/modify this awesome code in our company. However, I can't find any license specified here.
Specifically, we need to know whether the code and model can be used commercially.
Again, thank you for open sourcing the code of your paper.
Your model was trained on multiple GPUs. Were single-GPU weights saved, so that we can run prediction on a machine with a single GPU?
Hi, thank you for your work. What confuses me is the meaning of self.annealing. I didn't see a definition of it. Is this a Keras variable (I am not familiar with Keras)?
VGG-Speaker-Recognition/src/new_layers.py
Line 87 in 8024093
Hi,
This paper and its idea are pretty interesting! May I ask two questions about the details, please?
I found that the idea of LDE (learnable dictionary encoding, cited in the paper as [Cai et al.]) is very similar to NetVLAD (if not the same). I'm wondering what your opinion is on the difference between LDE and the NetVLAD used in this paper?
After going through the code, I found that the forward propagation of VLAD and of average pooling seems to differ.
For average pooling, the output of resnet_2D_v1/v2 is directly used, which makes the shape to be
[batch, 7, 16, D] -> [batch, 84, D] (after pooling, no additional layer)
For VLAD, the output is processed by an additional Conv2D layer, making the shape:
[batch, 7, 16, D] -> [batch, 1, 16, D] (feat, use Conv2D) / [batch, 1, 16, n_clusters] (cluster_score) -> [batch, D * n_clusters] (after VLAD)
The additional layer may lead to better performance. Maybe this is part of the reason why TAP performs poorly in the paper?
Last, the performance comparison in the paper is really useful. Good work :-)
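The shape bookkeeping in the second question can be checked with a small NumPy sketch. This is an illustration under stated assumptions, not the repository's layers: the pooling window of width 5 with stride 1 is an assumption that happens to give 84 = 7 × 12, random arrays stand in for the Conv2D outputs, and average_pool_branch and vlad_branch are hypothetical names. The VLAD part follows the usual NetVLAD recipe (soft assignment, residual sum per cluster, L2 normalisation) and omits ghost clusters.

```python
import numpy as np


def average_pool_branch(x, window=5):
    """Average over a sliding window along the last spatial axis, then flatten space."""
    b, h, w, d = x.shape
    steps = w - window + 1
    pooled = np.stack(
        [x[:, :, t:t + window, :].mean(axis=2) for t in range(steps)], axis=2
    )  # [b, h, steps, d]
    return pooled.reshape(b, h * steps, d)


def vlad_branch(feat, cluster_score, centers):
    """NetVLAD-style aggregation: soft-assign each frame, sum residuals per cluster."""
    b, _, _, d = feat.shape
    k = centers.shape[0]
    # Numerically stable softmax over the cluster axis.
    score = cluster_score - cluster_score.max(axis=-1, keepdims=True)
    assign = np.exp(score)
    assign /= assign.sum(axis=-1, keepdims=True)            # [b, 1, w, k]
    residual = feat[:, :, :, None, :] - centers              # [b, 1, w, k, d]
    vlad = (assign[..., None] * residual).sum(axis=(1, 2))   # [b, k, d]
    vlad /= np.maximum(np.linalg.norm(vlad, axis=-1, keepdims=True), 1e-12)
    return vlad.reshape(b, k * d)


batch, dim, n_clusters = 2, 32, 8
rng = np.random.default_rng(0)

backbone_out = rng.standard_normal((batch, 7, 16, dim))
avg_out = average_pool_branch(backbone_out)

# Stand-ins for the two Conv2D outputs described in the post.
feat = rng.standard_normal((batch, 1, 16, dim))
cluster_score = rng.standard_normal((batch, 1, 16, n_clusters))
centers = rng.standard_normal((n_clusters, dim))
vlad_out = vlad_branch(feat, cluster_score, centers)

print(avg_out.shape, vlad_out.shape)
```

The sketch only confirms the shapes; it says nothing about whether the extra convolution is what drives the accuracy gap.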
Hi Weidi, I'm puzzled by one problem. There are 1,251 speakers in the VoxCeleb1 dataset. If 100 positive pairs and 100 negative pairs are sampled for each speaker, shouldn't there be 250,200 verification pairs in total?