
deep_speaker-speaker_recognition_system's Introduction

Deep Speaker: speaker recognition system

Data Set: LibriSpeech
Reference paper: Deep Speaker: an End-to-End Neural Speaker Embedding System
Reference code: https://github.com/philipperemy/deep-speaker (Thanks to Philippe Rémy)


About the Code

train.py
This is the main file; it contains the training, evaluation, and model-saving functions.
models.py
The neural networks used for the experiment. This file contains three models: a CNN model (the same as the paper's CNN), a GRU model (the same as the paper's GRU), and a simple_cnn model. The simple_cnn model performs similarly to the original CNN model, but the number of trained parameters drops from 24M to 7M.
select_batch.py
Chooses the optimal batch to feed to the network. This is one of the core components of this experiment.
triplet_loss.py
Calculates the triplet loss for network training. The implementation follows the paper.
test_model.py
Evaluates (tests) the model in terms of EER.
eval_matrics.py
Calculates the equal error rate, f-measure, accuracy, and other metrics.
pretraining.py
Pre-trains the network with a softmax classification loss.
pre_process.py
Loads the utterances, filters out silence, extracts fbank features, and saves them in .npy format; a sketch of this pipeline follows this list.
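
A rough sketch of what that pipeline can look like, using the python_speech_features package the repo imports; the 64 filterbanks, 16 kHz sample rate, and normalization choice are illustrative assumptions, not the repo's exact code:

    import numpy as np
    import librosa
    from python_speech_features import fbank

    def extract_fbank(wav_path, sr=16000, nfilt=64):
        # Load (and, if needed, resample) the utterance.
        signal, _ = librosa.load(wav_path, sr=sr)
        # Filterbank energies, shape (num_frames, nfilt); 25 ms window, 10 ms hop by default.
        feats, _ = fbank(signal, samplerate=sr, nfilt=nfilt)
        feats = np.log(feats + 1e-6)  # log compression
        # Per-coefficient mean/variance normalization (one common choice).
        feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-6)
        return feats.astype(np.float32)

    # np.save(out_path, extract_fbank(in_path))  # stored as .npy, as pre_process.py does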

Experimental Results

This code was trained on the librispeech-train-clean dataset and tested on the librispeech-test-clean dataset. The CNN model reaches about 5% EER on LibriSpeech.

More Details

If you want to know more details, please read deep_speaker_report.pdf (English) or deep_speaker实验报告.pdf (Chinese).

Simple Use

  1. Prepare data.
    Sample data is provided in audio/LibriSpeechSamples/, or you can download the full LibriSpeech data or prepare your own data.

  2. Preprocessing.
    Extract features and preprocess: python pre_process.py.

  3. Training.
    To train the model with triplet loss: python train.py.
    To pretrain with softmax loss first: python pretraining.py, then python train.py.
    Note: set the PRE_TRAIN flag in constants.py to True or False depending on whether you want to pretrain.

  4. Evaluation.
    Evaluate the model in terms of EER: python test_model.py.
    Note: train.py also evaluates the model during training.

  5. Plot loss curve.
    Plot the loss and EER curves with utils.py, for example:

import constants as c
from utils import plot_loss
loss_file = c.CHECKPOINT_FOLDER + '/losses.txt'  # loss file path
plot_loss(loss_file)

deep_speaker-speaker_recognition_system's People

Contributors

walleclipse


deep_speaker-speaker_recognition_system's Issues

Details in English

Hi, I'm studying speech recognition and your code looks really valuable.
Could you please share the details in English?
Thanks :)

A question about selecting the optimal samples fed to the network

Hello, I recently read your deep speaker experiment and learned a lot, but there is one thing I would like to ask about. The deep_speaker report says that the network outputs for the current samples are combined with the network outputs of historical samples to pick the optimal samples; is my understanding correct? If so, the historical outputs and the current outputs correspond to different network parameters, so are the samples selected this way still meaningful and reliable?

ask reason2

Hello! I am the student who asked a question earlier; thank you very much for your answer!
This time I would like to ask: compared with the implementation in the original Baidu paper, what are your improvements, or what differs from the paper? Why can it achieve even better results than the paper, and what is the theoretical basis for that?

The test results change on every run after training

Hello! I selected the data of 5 speakers from train-clean-100 and split it into training and test data at a 4:1 ratio. I trained on these 5 speakers and then tested on their test data; both training and testing run through.
But I found a problem: on the same test data for these 5 speakers, the output changes on every test run; the f-measure, true positive rate, accuracy and EER keep varying.
I had assumed that once training is done the model weights are fixed, so testing with the fixed weights should always give the same results, yet every test run differs.
How can I make the test results deterministic?
Thank you! I have just entered the field of speaker verification; sorry for the trouble.

Testing

Some questions

Hello, I am writing a paper on speaker recognition and have been studying your code. I have two questions: (1) The paper seems to use a ResNet-based CNN that ends in a 512-dimensional fully connected layer. I compared your convolutional_model_simple and convolutional_model methods in model.py and found that in the simplified convolutional_model_simple you commented out a convolutional layer with 512 filters, so it is not actually enabled, right? What is the purpose of commenting out that 512-filter convolutional layer? Do you have any other actual or possible directions for improvement? (2) In the main section of silence_detector.py there is an absolute path (wav_fn); what is it for?

Question about the checkpoints/model_17200_0.54980.h5 model

Hello, under what settings was the checkpoints/model_17200_0.54980.h5 model produced?

Training directly on train-clean-100 with select_batch's triplet loss, the model I obtain sits at roughly 7% EER. Is that because training has not fully converged? Below is the training summary:

[training summary screenshots]

During training I obtained a best model, best_model53600_0.03806.h5.
Testing this model on test-clean gives EER = 7%~10%.

Did I miss something in training or testing, or are there other tricks? Looking forward to your reply, thanks!

Thanks again for uploading such a high-quality project!

question about run and dataset

Hello, I am currently researching speaker recognition. I think your code is very valuable and I want to run it for learning. How can I run this code, and how can I get EER and accuracy plots like the ones shown below? Could you provide a tutorial like this.... Also, could you share a link to the speech corpus you used for this code? Thank you~

Asking for clarification

Hi,
Thank you for the code, which is extremely helpful. I have some questions, please:

  1. In the paper, it is noted that the reported results cover both speaker identification and verification. I presume the accuracy stands for the identification result, but I'm still confused, since the speakers in the training set are different from those in the testing set.

  2. what is the difference between select_batch.create_data_producer and stochastic_mini_batch?

  3. Which variable indicates the number of epochs used?

  4. In the paper, it is indicated that 64 utterances are used per mini-batch. Does this correspond to the candidates_per_batch variable that is set to 640?

  5. For running the code (training and testing), is running train.py sufficient? How much time does it take to get results?

Thank you in advance

select_batch is very slow

Hello, thanks for sharing the code. When training with train.py, I use your select_batch function, but selecting the optimal batch before each training step takes a lot of time. How did you optimize this?

About data preprocessing and softmax pretraining

Hello, I have recently been working on a similar project and have a few questions:

After extracting the speech features, do the features need standardization or normalization? How many fbank dimensions work best?

Does triplet loss always need softmax pretraining? For pretraining, is it the ordinary softmax cross-entropy loss? Are there any special tricks or caveats?

Thanks a lot!

voice sample length

Hi,

It seems that 1.6 seconds is quite short; papers often use 3 or 5 seconds or even longer. But increasing it, say, two times to 3.2 seconds results in 320 frames. With the convolutional model, that means averaging 20 embeddings at the end instead of 10, and this averaging does not feel like the best thing. To avoid it, extending the network would double the embedding from 512 to 1024, as far as I can see.

Please let me know your views.
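
For concreteness, the averaging discussed above is temporal average pooling of per-window embeddings; a plain numpy sketch (the 16x time reduction is inferred from the 160-frames-to-10-embeddings figure in this issue, not taken from the repo):

    import numpy as np

    frame_embeddings = np.random.randn(20, 512)   # e.g. 320 frames after 16x downsampling
    utterance = frame_embeddings.mean(axis=0)     # temporal average pooling -> (512,)
    utterance /= np.linalg.norm(utterance)        # L2-normalize the final embedding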

About model.train_on_batch

Hello, your code has been very helpful to me. I have a few questions I hope you can answer:
1. In train.py, after selecting the optimal batch:
  x, _ = select_batch.best_batch(model, batch_size=c.BATCH_SIZE)
print("select_batch_time:", time() - orig_time)
y = np.random.uniform(size=(x.shape[0], 1))
logging.info('== Presenting step #{0}'.format(grad_steps))
orig_time = time()
loss = model.train_on_batch(x, y)
Why is y randomly generated during training? Shouldn't it be the labels?
2. When I want to pretrain with softmax, do I just need to set PRE_TRAIN to True?
Thanks
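
A note on question 1: Keras hands every loss function a (y_true, y_pred) pair, and a triplet loss computed entirely from the stacked embeddings in y_pred never reads y_true, so any placeholder (such as random values) works as y. A minimal sketch of this pattern, with an assumed batch layout and margin (not the repo's exact code):

    from keras import backend as K

    BATCH = 32   # anchors per batch; illustrative value
    ALPHA = 0.1  # triplet margin; illustrative value

    def triplet_loss_sketch(y_true, y_pred):
        # y_pred stacks anchor, positive and negative embeddings along axis 0;
        # y_true is never used, so train.py can pass random values for y.
        anchor   = y_pred[0 * BATCH:1 * BATCH]
        positive = y_pred[1 * BATCH:2 * BATCH]
        negative = y_pred[2 * BATCH:3 * BATCH]
        sap = K.sum(anchor * positive, axis=1)  # cosine similarity for L2-normalized embeddings
        san = K.sum(anchor * negative, axis=1)
        return K.mean(K.maximum(san - sap + ALPHA, 0.0))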

Some questions about the full dataset

Hello, the dataset I downloaded contains only flac files after extraction, but the samples you provide include both flac files and the corresponding wav files. I process the data with pre_process.py but never get any .npy files; nothing changes at all. I suspect I am doing something wrong: should I first convert the flac files to wav? In pre_process.py, should I modify line 129, filename = '/home/dcase/mawen/SW/Deep_Speaker-speaker_recognition_system-master/audio/LibriSpeechSamples/train-clean-100/19/227/19-227-0036.wav', and line 137, preprocess_and_save("/home/dcase/mawen/SW/Deep_Speaker-speaker_recognition_system-master/audio/LibriSpeechSamples/train-clean-100")?
Why does line 129 point to one specific wav file rather than the train-clean-100 folder? And shouldn't the files saved after preprocessing on line 137 go into the train-clean-100-npy folder?
I may have misunderstood something; please answer when you have time. Thank you!

About sample selection

Hello, sorry to bother you again. I noticed that later in training, selecting the samples for one batch can take 1-2 minutes. Is that normal? It feels very slow.

How to do inference with the pre-trained model

Hi there,
I'm looking for how to embed data of one speaker with the pre-trained model.
In the training process, the input data has anchor, positive, and negative speakers. Each speaker has 32 sentences with 160 frames per sentence.
I wonder how we use the model to embed one speaker. Do we need to prepare anchor, positive and negative speakers as in training?
Thanks for spending your time on my question. ^^
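
For reference, no triplets are needed at inference time: the trained network is just an embedding function, and verification compares two embeddings. A hedged sketch, assuming the input shape used elsewhere on this page (160 frames of 64-dim fbank, one channel):

    import numpy as np

    def embed_utterance(model, feats, num_frames=160):
        # feats: (frames, 64) fbank features; clip to the fixed input length
        # and add batch and channel dimensions -> (1, num_frames, 64, 1).
        x = feats[:num_frames][np.newaxis, :, :, np.newaxis]
        return model.predict(x)[0]

    def similarity(model, feats_a, feats_b):
        a = embed_utterance(model, feats_a)
        b = embed_utterance(model, feats_b)
        # Cosine similarity; declare "same speaker" above a tuned threshold.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))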

About how to do voiceprint verification

Hello! I am writing a paper on voiceprint recognition and speech-forgery forensics. Your model mainly does speaker recognition, but I think it can also be used for voiceprint verification. If I want to train a model for a single speaker, and during verification feed the network a forged utterance so that it judges whether the speech is genuine, i.e. the model outputs true or false, how should the code be changed for this scenario? Thanks a lot!

A few questions

Hello, I am a student researching speaker identification, and this project has been very helpful to me. I have some questions I hope you can answer:

  1. I see that pretraining holds out some samples as a test set, but the test data used during triplet-loss training also comes from the training data, without a split. Why is that?
  2. Running your project as-is, train.py reports very low EERs: after training for 20000 steps the EER is roughly in the 0.4%-2% range, not the 6%-7% mentioned in your pdf. But when I split off 30% of the data as a test set for triplet-loss training, I get an EER around 6%-7%. Were the 6%-7% and 5%-6% results in your report obtained with a held-out test split?
  3. Does eval_model() in test_model() evaluate on a random subset of the test set rather than all of it? I would like to run one evaluation on the full test set; how should I adjust the code?

Thanks in advance, and thank you for uploading such a high-quality project.

Error when running

Traceback (most recent call last):
File "C:/Users/xiongmeiyan/Desktop/Deep_Speaker-speaker_recognition_system-master/Deep_Speaker-speaker_recognition_system-master/train.py", line 180, in <module>
main()
File "C:/Users/xiongmeiyan/Desktop/Deep_Speaker-speaker_recognition_system-master/Deep_Speaker-speaker_recognition_system-master/train.py", line 155, in main
map(lambda f: os.path.join(c.BEST_CHECKPOINT_FOLDER, f), os.listdir(c.BEST_CHECKPOINT_FOLDER))),
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'best_checkpoint'

Any advice would be appreciated.

Runtime error

Hello, I am reproducing your code and ran into an error that I haven't been able to solve. Could you help me take a look?
Using TensorFlow backend.
Traceback (most recent call last):
File "train.py", line 22, in <module>
import select_batch
File "/home/dcase/mawen/SW/Deep_Speaker-speaker_recognition_system-master/select_batch.py", line 18, in <module>
from pre_process import data_catalog
File "/home/dcase/mawen/SW/Deep_Speaker-speaker_recognition_system-master/pre_process.py", line 18, in <module>
np.set_printoptions(threshold=np.nan)
File "/home/dcase/miniconda3/lib/python3.7/site-packages/numpy/core/arrayprint.py", line 246, in set_printoptions
floatmode, legacy)
File "/home/dcase/miniconda3/lib/python3.7/site-packages/numpy/core/arrayprint.py", line 93, in _make_options_dict
raise ValueError("threshold must be numeric and non-NAN, try "
ValueError: threshold must be numeric and non-NAN, try sys.maxsize for untruncated representation
I downloaded the code and ran it on Ubuntu 16.04 with Python 3.7.3. I first ran python train.py and then this error appeared. Sincerely hoping for your reply; thank you!

Details in English

Hi, many thanks for your code.

Could you please share the details in English?

Thanks in advance

The normalisation of Fbank?

The fbank features are normalised along the feature dimension rather than the frame dimension, which differs from the usual definition. I'm confused by this; could you tell me why?
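
To make the two options concrete, a plain numpy illustration (not the repo's code):

    import numpy as np

    feats = np.random.randn(160, 64)  # (frames, coefficients), dummy features

    # Standard CMVN: statistics per coefficient, computed across frames (axis=0).
    cmvn = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-6)

    # Normalizing along the feature dimension instead: statistics per frame,
    # computed across the 64 coefficients (axis=1).
    per_frame = (feats - feats.mean(axis=1, keepdims=True)) / \
                (feats.std(axis=1, keepdims=True) + 1e-6)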

Hello!

Hello, I have recently been studying your deep speaker program and plan to train it on a Chinese corpus. After finishing pre_process, training with train.py fails with the following error:
forward process time 9.14s
beginning to select..........
select best batch time 0.0675s
select_batch_time: 9.400545120239258
2019-02-21 20:00:51,647 [INFO] train.py/main | == Presenting step #0
Traceback (most recent call last):
File "train.py", line 182, in
main()
File "train.py", line 126, in main
loss = model.train_on_batch(x, y)
File "/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1808, in train_on_batch
check_batch_axis=True)
File "/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1411, in _standardize_user_data
exception_prefix='target')
File "/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 153, in _standardize_input_data
str(array.shape))
ValueError: Error when checking target: expected ln to have shape (None, 512) but got array with shape (96, 1)
I searched for a long time without finding the cause. How should I handle this? Looking forward to your reply~

The problem when running pretraining.py

In constants.py, I set batch_size = 16. When running pretraining.py, I always get the following error: "tensorflow.python.framework.errors_impl.InvalidArgumentError: Inputs to operation training/Adam/gradients/AddN_2 of type _MklAddN must have the same size and shape. Input 0: [983040] != input 1: [48,10,4,512]". Does anyone know this issue and can help?

The meaning of the results

Hello, in the test results produced by test_model.py, why are the results different on every run? What do true positive rate = 1.0, accuracy, and equal error rate mean, respectively? Does true positive rate = 1.0 mean that all test speakers are included in the training speakers? And what is accuracy the accuracy of?
Sincerely hoping for your reply; thank you!
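
For reference, the equal error rate is the operating point where the false positive rate equals the false negative rate; run-to-run variation like that described above is what randomly sampled trial pairs would produce. A common way to compute EER from trial scores (a sketch, not the repo's eval code):

    import numpy as np
    from scipy.interpolate import interp1d
    from scipy.optimize import brentq
    from sklearn.metrics import roc_curve

    def compute_eer(y_true, y_score):
        # y_true: 1 for same-speaker pairs, 0 for different-speaker pairs;
        # y_score: similarity scores. EER is where FPR == FNR == 1 - TPR.
        fpr, tpr, _ = roc_curve(y_true, y_score)
        return brentq(lambda t: 1.0 - t - interp1d(fpr, tpr)(t), 0.0, 1.0)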

A question about pretraining

Hi!

I'd like to ask: pretraining adds a softmax layer after convolutional_model, so the whole network effectively extracts features of a speaker's voice and then performs classification, i.e. speaker identification: given the features of an utterance, it outputs which speaker the utterance belongs to, right?

Also, what does this pretraining contribute to the later speech verification step (verifying whether utterance A was spoken by speaker B)?

Looking forward to your reply~

GPU version

Hello, is there a GPU version of this code? Running it on a CPU (a weak processor) is very slow to produce results~

A problem during testing

Hello! I downloaded train-clean-100 and selected only the data of three speakers: 254, 289 and 298. Each speaker has roughly 120 utterances, so I split the data into training and test sets at a 4:1 ratio.
Then I ran train.py to train on these three speakers. Your program originally uses a while loop; I made it break once the train loss drops below 1. Training finished quickly without errors. But when I run test_model.py on these three speakers' test data, I suddenly get the error below. Do you know what is going on?

Found checkpoint [checkpoints/model_28000_4.25404.h5]. Resume from here...
Found 0000064 files with 00003 different speakers.
Traceback (most recent call last):
File "test_model.py", line 177, in
fm, tpr, acc, eer = eval_model(model, check_partial=True,gru_model=gru_model)
File "test_model.py", line 115, in eval_model
x, y_true = create_test_data(test_dir,check_partial)
File "test_model.py", line 75, in create_test_data
negative_files = libri[libri['speaker_id'] != unique_speakers[ii]].sample(n=num_neg, replace=False)
File "/home/hdc/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 4865, in sample
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
File "mtrand.pyx", line 1168, in mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'
Thank you!

help!!

Hello! Thanks for your earlier help. I wanted to ask whether you have any annotated notes on the code framework; could you send them to me by email? Thank you very much!!!

About the NUM_FRAMES and other parameter settings

Happy New Year!

Hello, I have recently been studying your deep speaker program, running it on my laptop's CPU, and train.py runs out of resources:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[96,128,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc..

I then found that reducing the value of NUM_FRAMES, for example to 32, lets train.py run.

Looking into your code, I see that during minibatching the clipped_audio function truncates the previously generated features (see the sketch after this issue). How much does reducing NUM_FRAMES affect training? Will too few features lower the trained model's accuracy?

I would also like to know roughly how long retraining this model on a CPU takes. Your report mentions about 60 hours for 6000 steps; was that measured on a GPU?

Looking forward to your reply~
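
For reference, clipping to NUM_FRAMES typically just keeps a window of frames, so lowering it trains on shorter snippets of each utterance. A sketch of the general idea (not the repo's clipped_audio itself):

    import numpy as np

    def clip_frames(feats, num_frames=160):
        # feats: (total_frames, 64); keep a random contiguous window of num_frames.
        if feats.shape[0] > num_frames:
            start = np.random.randint(0, feats.shape[0] - num_frames + 1)
            feats = feats[start:start + num_frames]
        return feats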

Hard negative mining

Based on your report, training with random batches was better than using hard negative mining, right?

Network

Hello! How many layers does your network have, and how does it differ from the original Baidu paper?

Problem running train.py

Hello, and thanks for providing this high-quality code. When I run train.py, the console reports: FileNotFoundError: [Errno 2] No such file or directory: 'best_checkpoint'. How can I solve this? Thanks.

About the threshold

Hello, sorry to bother you again. I see that you search for the threshold between 0 and 1; what is the reasoning? Shouldn't the similarity range from -1 to 1?

Program error

When running train.py, it reports No module named numpy; every import ... as ... or from ... import ... statement raises this kind of error.

batchTrainingImageLoader in pretraining.py

The implementation is impressive.

However, I wonder about the necessity of the following code:

def batchTrainingImageLoader(train_data, labels_to_id, no_of_speakers, batch_size=c.BATCH_SIZE * c.TRIPLET_PER_BATCH):
    paths = train_data
    L = len(paths)
    while True:
        np.random.shuffle(paths)
        batch_start = 0
        batch_end = batch_size

        while batch_end < L:
            x_train_t, y_train_t = loadFromList(paths, batch_start, batch_end, labels_to_id, no_of_speakers)
            randnum = random.randint(0, 100)
            random.seed(randnum)
            random.shuffle(x_train_t)
            random.seed(randnum)
            random.shuffle(y_train_t)
            yield (x_train_t, y_train_t)
            batch_start += batch_size
            batch_end += batch_size

So the purpose of

            random.seed(randnum)
            random.shuffle(x_train_t)
            random.seed(randnum)
            random.shuffle(y_train_t)

is to randomize the yielded batch (x_train_t, y_train_t). But, in my opinion, the order of samples within
a batch does not affect training, does it?

Looking forward to your reply!
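
A side note: re-seeding the RNG before each shuffle makes both shuffles apply the same permutation, keeping x_train_t aligned with y_train_t. With numpy arrays, the same unison shuffle is usually written as one shared index permutation, for example:

    import numpy as np

    x_train_t = np.random.randn(96, 160, 64, 1)    # dummy feature batch
    y_train_t = np.random.randint(0, 10, size=96)  # dummy labels

    perm = np.random.permutation(len(x_train_t))   # one shared permutation
    x_train_t, y_train_t = x_train_t[perm], y_train_t[perm]  # pairs stay aligned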

Problems encountered collecting my own training data

Hello! Sorry to bother you again.
I can now run your code, and next I want to record data from the 5 people in our lab so that the network learns to identify them.
1. Do we need any particular equipment to record the speech?
2. I recorded with my phone's voice recorder and got .m4a files; should I convert them from .m4a to .wav with ffmpeg (see the sketch below)?
3. The speech in train-clean-100 is all English; when building our own dataset, do we need to speak English, or is Chinese also fine?
4. The recordings in train-clean-100 have very stable volume and little noise; how can we best achieve that when recording our own speech?
5. Each speaker in train-clean-100 has roughly 110-130 clips of about 10 seconds each (some are 3 seconds). Should we record clip by clip, or have each person speak continuously for 20 minutes and then cut the long recording into clips, and with what tool?
Thank you very much!
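
For question 2, a conversion along these lines is typical (a sketch that assumes ffmpeg is installed; 16 kHz mono matches LibriSpeech):

    import subprocess

    # Convert a phone recording to 16 kHz mono wav with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-i", "input.m4a", "-ar", "16000", "-ac", "1", "output.wav"],
        check=True,
    )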

ask reason

Hello! I am a student studying this experiment. Running it on the train_clean_100 dataset, the results can reach above 98%. May I ask where the strengths of this network lie? CNNs were originally used for images and RNNs for speech; I would like to know why this program can achieve such good results.

Please help

Hello, when loading the pretrained model and training with the triplet loss, I tried the learning rates from your experiment report; only 0.0001 converges, none of the others do. But with 0.0001 I ran 200,000 steps and the loss kept decreasing, without the decrease-then-increase behaviour described in the paper. Have you run into this?

triplet loss formula

Hi,
In triplet_loss, why is it
loss = K.maximum(san - sap + alpha, 0.0)?
Shouldn't it be sap - san + alpha?
Am I missing something?
Thanks!
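
For reference, the paper defines the loss on cosine similarities rather than distances, so a positive pair should score high and the margin term reads:

    L = \sum_i \max\left(0,\; s_i^{an} - s_i^{ap} + \alpha\right), \qquad
    s_i^{ap} = \cos(a_i, p_i), \quad s_i^{an} = \cos(a_i, n_i)

With distances instead of similarities the inequality flips, which recovers the familiar form max(0, d_ap - d_an + alpha).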

How to run

Hello, do I need to activate the TensorFlow environment before running python train.py? After activating the environment and running the file, I get the following error:
Using TensorFlow backend.
Traceback (most recent call last):
File "train.py", line 22, in <module>
import select_batch
File "/home/dcase/mawen/SW/Deep_Speaker-speaker_recognition_system-master/select_batch.py", line 18, in <module>
from pre_process import data_catalog
File "/home/dcase/mawen/SW/Deep_Speaker-speaker_recognition_system-master/pre_process.py", line 6, in <module>
from python_speech_features import fbank, delta
ModuleNotFoundError: No module named 'python_speech_features'
After conda install python_speech_features, I then get this error:
Collecting package metadata: done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  • python_speech_features

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.
How can I solve this?

Environment problem

Hello, I would like to try your code, but my environment keeps reporting errors. Could you tell me the environment requirements? Thank you.

best_batch selection from sims

Hi,
sap = sims[ii][pinds] line 176 in select_batch.py
ii is an index into speakers, and the rows of sims are embeddings of the selected speakers.
It seems to me that the row selected from sims does not correspond to anchor_index (the specific embedding for that speaker).
Could you clarify?
Thanks!

Hello, about running the code

Hello, I am an undergraduate and new to ML. I ran into some problems while running your code and would like your advice.
Question 1: line 44 of silence_detector.py contains wav_fn='/Users/walle/PycharmProjects/Speech/coding/my_deep_speaker/audio/spk_ver_20180401_20180630_70_3_reseg_test/wav'
'/spk_ver_20180401_20180630_70_3_reseg_testZEBRA_KIDS00000_110411652-ZEBRA_KIDS00000_110411652_ff3875f4fb3e5ef4.wav'. What is this wav_fn for? These two .wav files are not in your project files; what should I do?
Question 2: I substituted an arbitrary 5-second audio file; train.py then starts normally but prints "Found 0000368 files with 00001 different speakers." Only 1 speaker is recognized; is something wrong here?
Question 3: after train.py starts normally, it raises an exception at line 187 of select_batch.py, neg0_index = ninds[np.argwhere(san == max_sans[0]).flatten()[0]], reporting "list index out of range". Debugging line by line shows the cause is line 184, ninds = np.argwhere(hist_labels != speaker).flatten(), which finds no index. I think the problem is again that there is only 1 speaker, but I do not know how to solve it. Please advise.

A question about running

Hello, thanks for your high-quality code; I have also been learning about speaker verification recently. The loop condition you use in train.py is while True, so once I start it, it just keeps running. Why did you choose this condition? Thanks for the guidance!

Problems running on the AISHELL dataset

Sorry to bother you again! After experimenting with the full train-clean-100 dataset, I wanted to run the program on the AISHELL Chinese corpus and look at the results. The data was preprocessed following the audio samples in the code, but python train.py fails with the error below. Do you know the cause?
Found 0120418 files with 120418 different speakers.
Traceback (most recent call last):
File "train.py", line 189, in <module>
main()
File "train.py", line 71, in main
batch = stochastic_mini_batch(libri, batch_size=c.BATCH_SIZE, unique_speakers=unique_speakers)
File "/home/dcase/mawen/SW/aishell/Deep_Speaker-speaker_recognition_system-master/random_batch.py", line 89, in stochastic_mini_batch
mini_batch = MiniBatch(libri, batch_size,unique_speakers)
File "/home/dcase/mawen/SW/aishell/Deep_Speaker-speaker_recognition_system-master/random_batch.py", line 45, in __init__
two_different_speakers = np.random.choice(unique_speakers, size=2, replace=False)
File "mtrand.pyx", line 1125, in mtrand.RandomState.choice
ValueError: 'a' cannot be empty unless no samples are taken

Runtime error, please help!

Hello, I am reproducing the experiment on the train-clean-360 dataset. Running either pretraining.py or train.py produces the following error. How can I solve it? Looking forward to your reply~
Traceback (most recent call last):

File "", line 1, in
runfile('/Users/tongmeng/Desktop/speaker_recognition/code/TensorFlow-based_Deep_Speaker/Deep_Speaker-speaker_recognition_system-master-1/pretraining.py')

File "/anaconda3/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 827, in runfile
execfile(filename, namespace)

File "/anaconda3/lib/python3.6/site-packages/spyder_kernels/customize/spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "/Users/tongmeng/Desktop/speaker_recognition/code/TensorFlow-based_Deep_Speaker/Deep_Speaker-speaker_recognition_system-master-1/pretraining.py", line 162, in
main()

File "/Users/tongmeng/Desktop/speaker_recognition/code/TensorFlow-based_Deep_Speaker/Deep_Speaker-speaker_recognition_system-master-1/pretraining.py", line 134, in main
x_train, y_train = batchloader.next()

File "/Users/tongmeng/Desktop/speaker_recognition/code/TensorFlow-based_Deep_Speaker/Deep_Speaker-speaker_recognition_system-master-1/pretraining.py", line 49, in batchTrainingImageLoader
x_train_t, y_train_t = loadFromList(paths, batch_start, batch_end, labels_to_id, no_of_speakers)

File "/Users/tongmeng/Desktop/speaker_recognition/code/TensorFlow-based_Deep_Speaker/Deep_Speaker-speaker_recognition_system-master-1/pretraining.py", line 28, in loadFromList
x_ = np.load(x_paths[i])

File "/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 447, in load
pickle_kwargs=pickle_kwargs)

File "/anaconda3/lib/python3.6/site-packages/numpy/lib/format.py", line 742, in read_array
array.shape = shape

ValueError: cannot reshape array of size 29680 into shape (1269,64,1)

Runtime error! Please help

Hello, I am reproducing your experiment on Ubuntu 16.04 with Python 3.7. I first ran the train.py file and got the following error. How should I solve it? Looking forward to your reply.
model_build_time 5.370615005493164
get batch time 1.98e-05s
forward process time 7.57s
beginning to select..........
select best batch time 0.188s
select_batch_time: 7.82932448387146
Traceback (most recent call last):
File "train.py", line 181, in
main()
File "train.py", line 125, in main
loss = model.train_on_batch(x, y)
File "/home/dcase/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 1808, in train_on_batch
check_batch_axis=True)
File "/home/dcase/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 1411, in _standardize_user_data
exception_prefix='target')
File "/home/dcase/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 153, in _standardize_input_data
str(array.shape))
ValueError: Error when checking target: expected ln to have shape (None, 512) but got array with shape (96, 1)
