
deep-clustering's Introduction

A TensorFlow implementation of deep clustering for speech separation

This is a TensorFlow implementation of the deep clustering paper (https://arxiv.org/abs/1508.04306). A few examples from the test set can be viewed in visualization_samples/ and speech_samples/.

Requirements

Python 2 and the following packages:

  • tensorflow r0.11
  • numpy
  • scikit-learn
  • matplotlib
  • librosa

File documentation

  • GlobalConstant.py: Global constants.
  • datagenerator.py: Transform the individual speech files in a directory into a .pkl data set.
  • datagenerator2.py: A class to read the .pkl data set and generate batches of data for training the net.
  • model.py: A class defining the net structure.
  • train_net.py: Train the DC model.
  • mix_samples.py: Mix two speech signals into a two-speaker test mixture (a rough sketch of this step follows this list).
  • AudioSampleReader.py: Transform a .wav file into chunks of frames to be fed to the model at test time.
  • visualization_of_samples.py: Visualize the active embedding points using PCA.
  • audio_test.py: Take in a two-speaker mixture sample and separate it.
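
As a rough illustration of the mixing step handled by mix_samples.py, here is a minimal sketch that loads two mono speech files, sums them, and writes a test mixture. The file names, the 8 kHz sampling rate, and the use of the soundfile package for writing are assumptions, not details taken from the repository.

    # Hedged sketch of mixing two speech files into a two-speaker test mixture.
    # File names, sampling rate, and the soundfile dependency are assumptions.
    import librosa
    import numpy as np
    import soundfile as sf

    SAMPLING_RATE = 8000  # assumed; match the value in GlobalConstant.py in practice

    s1, _ = librosa.load('speaker1.wav', sr=SAMPLING_RATE)
    s2, _ = librosa.load('speaker2.wav', sr=SAMPLING_RATE)

    # Trim to the shorter signal so the two can be added sample by sample.
    length = min(len(s1), len(s2))
    mix = s1[:length] + s2[:length]

    # Scale down to avoid clipping when the mixture is written to disk.
    mix = mix / np.max(np.abs(mix))

    sf.write('mix.wav', mix, SAMPLING_RATE)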

Training procedure

  1. Organize your speech data files in the following layout: root_dir/speaker_id/speech_files.wav
  2. Update the directory paths in datagenerator.py and run it; you may want to rename the resulting .pkl files appropriately.
  3. Make directories for summaries and checkpoints, and update those paths in train_net.py. The lists of .pkl files used for training and validation also need to be updated there (see the sketch after this list).
  4. Train the model.
  5. Generate some mixtures using mix_samples.py, and point the checkpoint paths in audio_test.py at your trained model.
  6. Enjoy yourself!
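
For orientation on steps 2 and 3, train_net.py expects a list of training .pkl files plus a validation file produced by datagenerator.py. The snippet below shows the kind of lists involved; the pkl_list line is the repository default quoted in the issues further down, while the val_list name is a hypothetical placeholder you should adapt to your own setup.

    # Training chunks 1.pkl ... 11.pkl under ../dcdata/ (repository default).
    pkl_list = ['../dcdata/' + str(i) + '.pkl' for i in range(1, 12)]
    # Hypothetical validation list; point it at the val.pkl written by datagenerator.py.
    val_list = ['../dcdata/val.pkl']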

Some other things

The optimizer is not the same as the one in the original paper, and no 3-speaker mixture generator is provided; we have moved on to the next stage of our work and will not get around to it. If you are interested and implement it, we would be glad to merge your branch.

References

https://arxiv.org/abs/1508.04306

deep-clustering's People

Contributors

zhr1201

deep-clustering's Issues

about the result

Hello, thank you for your code. I used your code on my dataset and ran audio_test.py to get the separation results (WAV_out). But when I try to play the separated .wav files, the player reports an error and the files will not play. Can you tell me how I can solve this problem?
My email address : [email protected]

How can I fix the following errors? - audio_test.py

C:\Users\chilk\AppData\Local\Programs\Python\Python35\python.exe D:/data/python/deep-clustering-master/audio_test.py
Traceback (most recent call last):
File "D:/data/python/deep-clustering-master/audio_test.py", line 237, in <module>
out_put(4)
File "D:/data/python/deep-clustering-master/audio_test.py", line 77, in out_put
data_generator = AudioSampleReader(data_dir)
File "D:\data\python\deep-clustering-master\AudioSampleReader.py", line 44, in __init__
speech_mix, _ = librosa.load(data_dir, SAMPLING_RATE)
File "C:\Users\chilk\AppData\Local\Programs\Python\Python35\lib\site-packages\librosa\core\audio.py", line 107, in load
with audioread.audio_open(os.path.realpath(path)) as input_file:
File "C:\Users\chilk\AppData\Local\Programs\Python\Python35\lib\site-packages\audioread\__init__.py", line 80, in audio_open
return rawread.RawAudioFile(path)
File "C:\Users\chilk\AppData\Local\Programs\Python\Python35\lib\site-packages\audioread\rawread.py", line 61, in __init__
self._fh = open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'D:\data\python\deep-clustering-master\mix.wav'

Folder 'dcdata'

Hi there,
In the file train_net.py, you try to load 1.pkl, 2.pkl, ..., 12.pkl from a folder called dcdata, but we haven't generated any such files. Of course, we have generated val.pkl using datagenerator.py.

Please help me understand what these are for and how to generate them.

Memory requirements

Running train_net.py on an Amazon g2.2xlarge instance, I get an out-of-memory error.

Stats:

Limit: 3868721152
InUse: 3824706816
MaxInUse: 3825321984
NumAllocs: 35475
MaxAllocSize: 370720768

What are the memory requirements for training the system with the 40-dimensional (default) embedding?
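
Not an answer from the maintainers, but two common mitigations in TensorFlow 1.x are to reduce the batch size used by train_net.py and to stop the session from grabbing all GPU memory up front. A minimal sketch of the latter, assuming you can pass a config into the session that the training script creates:

    # Hedged sketch: ask TensorFlow 1.x to allocate GPU memory incrementally.
    # This does not shrink the model; if it still does not fit, lower the batch size.
    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    # config.gpu_options.per_process_gpu_memory_fraction = 0.9  # or cap it explicitly

    with tf.Session(config=config) as sess:
        pass  # build the graph and run training here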

Where is the pkl_list defined in train_net.py?

In train_net.py there is:
pkl_list = ['../dcdata/' + str(i) + '.pkl' for i in range(1, 12)]

But I only got one val.pkl, generated by datagenerator.py:
cPickle.dump(self.samples, open('val.pkl', 'wb'))

training process is too slow

hi,
thanks very much for your project on deep clustering. I use your program to train my model on the WSJ0 data set from the paper, but training is very slow: after about 7 days I am only at roughly epoch 35, and I am using 4 GPUs. Have you run into this problem? Thanks very much!

the separated speech can't play

I ran audio_test.py in PyCharm and got the following warnings:
ComplexWarning: Casting complex values to real discards the imaginary part
frame_out1 = np.fft.ifft(out1).astype(np.float64)
D:/Python/process/speechseperation/infer.py:192: ComplexWarning: Casting complex values to real discards the imaginary part
frame_out2 = np.fft.ifft(out2).astype(np.float64)
D:/Python/process/speechseperation/infer.py:193: ComplexWarning: Casting complex values to real discards the imaginary part
frame_mix = np.fft.ifft(out_mix).astype(np.float64)

and the separated speech will not play. I don't know whether this is due to my model or to the code; your prompt reply would be highly appreciated!
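
For what it is worth, the ComplexWarning itself is harmless: np.fft.ifft always returns complex values, and the cast silently drops the near-zero imaginary residue. Taking the real part explicitly silences the warning without changing the audio. A small self-contained sketch (the placeholder spectrum stands in for out1 in the snippet above):

    import numpy as np

    # Placeholder complex spectrum standing in for out1 in the snippet above.
    out1 = np.fft.fft(np.random.randn(256))

    # Take the real part explicitly before the float cast; only the
    # (near-zero) imaginary residue is discarded, and no warning is raised.
    frame_out1 = np.real(np.fft.ifft(out1)).astype(np.float64)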

Test audio sample clipped.

I'm using the WSJ0 dataset, and the model converges. When I use the utility audio_test.py, however, I get only 3 seconds of output when I feed it a 5-second mixed audio file. Is there any issue in this Python script causing the audio to be clipped?
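
One plausible cause, offered here only as an assumption rather than a confirmed diagnosis, is that the input is processed in fixed chunks of FRAMES_PER_SAMPLE frames and the trailing remainder is dropped, so the output ends at the last full chunk. A sketch of zero-padding the spectrogram to a whole number of chunks so the tail survives (the 100-frame chunk length and the 129-bin spectrogram are illustrative values):

    import numpy as np

    FRAMES_PER_SAMPLE = 100  # illustrative; confirm against GlobalConstant.py

    def pad_to_full_chunks(spec):
        """Zero-pad a (frames, bins) spectrogram so frames is a multiple of FRAMES_PER_SAMPLE."""
        remainder = spec.shape[0] % FRAMES_PER_SAMPLE
        if remainder == 0:
            return spec
        return np.pad(spec, ((0, FRAMES_PER_SAMPLE - remainder), (0, 0)), mode='constant')

    # Example: 523 frames are padded to 600 frames (6 full chunks of 100).
    print(pad_to_full_chunks(np.random.randn(523, 129)).shape)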

Mistake in visualization_of_samples.py

In line 50 of visualization_of_samples.py, the model is initialized with 3 parameters, but according to model.py it needs 4.

Please give it a check.

I get a very high loss

Thanks for your great work. I rewrote your program to run with TensorFlow 1.12 and Python 3, and I only use 2 BLSTM layers instead of 4. At the beginning of training the loss is about 3000k; after 66k steps it is about 1000k. I want to know whether that is correct or not.

dc model
    with tf.variable_scope('BLSTM1') as scope:
        layer1_fw = tf.nn.rnn_cell.LSTMCell(self.n_hidden)
        layer1_bw = tf.nn.rnn_cell.LSTMCell(self.n_hidden)

        # dropout
        layer1_fw_dropout = tf.nn.rnn_cell.DropoutWrapper(layer1_fw, self.p_keep_fw)
        layer1_bw_dropout = tf.nn.rnn_cell.DropoutWrapper(layer1_bw, self.p_keep_fw)

        # layer 1 outputs
        layer1_outputs, _ = tf.nn.bidirectional_dynamic_rnn(
            layer1_fw_dropout, layer1_bw_dropout, x,
            sequence_length=[FRAMES_PER_SAMPLE] * self.batch_size,
            dtype=tf.float32)

        # concatenate the forward and backward outputs
        layer1_output = tf.concat(layer1_outputs, 2)
        # end of the first layer

    # second BLSTM layer
    with tf.variable_scope('BLSTM2') as scope:
        layer2_fw = tf.nn.rnn_cell.LSTMCell(self.n_hidden)
        layer2_bw = tf.nn.rnn_cell.LSTMCell(self.n_hidden)

        # dropout
        layer2_fw_dropout = tf.nn.rnn_cell.DropoutWrapper(layer2_fw, self.p_keep_fw)
        layer2_bw_dropout = tf.nn.rnn_cell.DropoutWrapper(layer2_bw, self.p_keep_fw)

        # layer 2 outputs
        layer2_outputs, _ = tf.nn.bidirectional_dynamic_rnn(
            layer2_fw_dropout, layer2_bw_dropout, layer1_output,
            sequence_length=[FRAMES_PER_SAMPLE] * self.batch_size,
            dtype=tf.float32)

        # concatenate the forward and backward outputs
        layer2_output = tf.concat(layer2_outputs, 2)
        # end of the second layer

    # feedforward layer
    with tf.variable_scope('feedfoward') as scope:
        blstm_output = tf.reshape(layer2_output, [-1, self.n_hidden * 2])
        emb_out = tf.matmul(blstm_output, self.weights['out']) + self.biases['out']
        # tanh activation
        emb_out = tf.nn.tanh(emb_out)
        reshaped_emb = tf.reshape(emb_out, [-1, NEFF, EMBBEDDING_D])
        # L2-normalize the embeddings along the last dimension
        normalized_emb = tf.nn.l2_normalize(reshaped_emb, 2)
    return normalized_emb
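
For readers sanity-checking loss magnitudes like the ones above: the objective in the referenced paper is ||VV^T - YY^T||_F^2, usually expanded so that only small D x D, D x C, and C x C matrices are ever formed. Its value scales with the number of time-frequency bins in a batch, so large raw values are not alarming by themselves. Below is a minimal NumPy sketch of that expansion; it follows the paper's formulation and is not claimed to match model.py line for line.

    import numpy as np

    def dc_loss(V, Y):
        """Deep clustering loss ||V V^T - Y Y^T||_F^2 without forming the N x N affinity matrices.

        V: (N, D) embeddings, one row per time-frequency bin.
        Y: (N, C) one-hot source-membership indicators.
        """
        return (np.sum(np.square(V.T.dot(V)))
                - 2.0 * np.sum(np.square(V.T.dot(Y)))
                + np.sum(np.square(Y.T.dot(Y))))

    # Toy check: 1000 bins, 40-dimensional embeddings, 2 sources.
    V = np.random.randn(1000, 40)
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    Y = np.eye(2)[np.random.randint(0, 2, size=1000)]
    print(dc_loss(V, Y))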

Model correctly separates sources at any given time but mixes up speakers over utterance

As an experiment, I built a training set with two speakers, one male and the other female, ~2 hrs of speech each, split into 3-second utterances. My goal was to fit the training set, without concern for generalization yet.

After a few hours of training, I mixed two samples from the training set and ran audio_test.py.

The result cleanly separates the sources (no voice overlap) but both speakers can be sequentially heard in each output file.

For instance, if speaker 1 says "the coming elections will be a battle" and speaker 2 says "the weather is glorious in Casablanca", then I get something like:

output file 1: "the coming elections ... is glorious in Casablanca"
output file 2: "the weather ... will be a battle".

Any idea how to get only one speaker per output file?

Model fails to separate sources

I used your code to train the model for 358000 iterations on the TIMIT dataset, but the model fails to separate the sources in audio_test: the two speakers' voices are still mixed in each output file. I have tried increasing FRAMES_PER_SAMPLE, but that does not work either.
Any suggestions are appreciated. And if possible, could you share your pretrained weights file for me to try? My email is [email protected]
Thanks a lot

Preprocessing of Dataset to feed into BLSTM

I have been trying to implement the paper "Deep clustering: Discriminative embeddings for segmentation and separation", but I am not able to create batches because each audio file has a different number of frames. I came across one sentence in the experimental setup section: "To ensure the local coherency, the mixture speech was segmented with the length of 100 frames". What I understand is that the authors divide each sample into 100-frame chunks and use each chunk as an input. Is that how the authors handle variable-length input to the LSTM?
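
That reading is consistent with the quoted sentence and with the FRAMES_PER_SAMPLE chunking used elsewhere in this repository: each mixture spectrogram is cut into fixed 100-frame segments, and every segment becomes one training sample, which sidesteps variable-length batching entirely. A minimal sketch of that segmentation (the 129-bin spectrogram is an illustrative value, and dropping the tail is only one of several reasonable choices):

    import numpy as np

    FRAMES_PER_SAMPLE = 100  # the 100-frame segments described in the paper

    def segment(spec):
        """Split a (frames, bins) spectrogram into 100-frame chunks, dropping the incomplete tail."""
        n_chunks = spec.shape[0] // FRAMES_PER_SAMPLE
        trimmed = spec[:n_chunks * FRAMES_PER_SAMPLE]
        return trimmed.reshape(n_chunks, FRAMES_PER_SAMPLE, spec.shape[1])

    chunks = segment(np.random.randn(777, 129))
    print(chunks.shape)  # (7, 100, 129)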

Convergence issue

Hi Haoran Zhou,

did you fix the convergence issue that you mentioned in this discussion, jcsilva/deep-clustering#1? I have tested your implementation on different amounts of samples from the wsj0 dataset, and it gets stuck in a very bad local minimum even for small training sets (1000 utterances in my case). I did not change your code; I just tried different optimizers, but that did not bring any improvement.

DataSet

What dataset did you use? Can you please mention it?

How many epochs would it take for the TIMIT dataset, approximately?

Hi, Haoran Zhou:
I've noticed that you mentioned it may take about 10 hours to train the model on the TIMIT dataset. However, that does not seem to be the case on my machine, and I'm wondering whether I'm underfitting or overfitting.

Would you mind giving me an approximate number of epochs it takes to train a good model on the TIMIT dataset? (Mine has been trained for over 2000 epochs and is at step 144000.)

Thank you in advance~

Question about cluster permutation

Hi! @zhr1201
Thank you for your great work! I ran your code successfully and the separation performance was good, but I still have a question and would appreciate it a lot if you could take a look:
In audio_test.py, it seems that you use a boolean (cor[1] > cor[0]) to decide whether to swap the order of cluster 1 and cluster 2, but the definition of the variable "cor" really confused me. I wonder why you chose the inner product of the clusters to represent the "rate of persistence" (or something like that). I didn't find it in the original paper; did I miss something?
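
For readers with the same question: as we understand it (the exact definition of cor in audio_test.py may differ), k-means labels are arbitrary in each chunk, so consecutive chunks are aligned by checking whether the straight or the crossed pairing of the new cluster centers with the previous ones has the larger inner product, and swapping when the crossed pairing wins. A hedged NumPy sketch of that heuristic:

    import numpy as np

    def align_clusters(prev_centers, centers):
        """Swap the current chunk's two cluster centers if the crossed pairing
        with the previous chunk's centers matches better.

        prev_centers, centers: (2, D) arrays of cluster centers (or mean
        embeddings) from two consecutive chunks.
        """
        same = prev_centers[0].dot(centers[0]) + prev_centers[1].dot(centers[1])
        crossed = prev_centers[0].dot(centers[1]) + prev_centers[1].dot(centers[0])
        if crossed > same:            # analogous to the cor[1] > cor[0] test
            return centers[::-1]      # swap cluster 1 and cluster 2
        return centers

    prev = np.random.randn(2, 40)
    cur = prev[::-1] + 0.01 * np.random.randn(2, 40)  # same speakers, swapped labels
    print(np.allclose(align_clusters(prev, cur), cur[::-1]))  # True: the swap is detected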
