
chimeranet's Introduction

ChimeraNet

An implementation of the music separation model by Luo et al.

Getting started

Sample separation task with a pretrained model
  1. Prepare .wav files to separate.

  2. Install the library: pip install git+https://github.com/leichtrhino/ChimeraNet

  3. Download the pretrained model.

  4. Download the sample script.

  5. Run the script:

python chimeranet-separate.py -i ${input_dir}/*.wav \
    -m model.hdf5 \
    --replace-top-directory ${output_dir}

Output in a nutshell

  • Output filenames follow the pattern ${input_file}_{embd,mask}_ch[12].wav.
  • embd and mask indicate that the channels were inferred from the deep-clustering head and the mask-inference head, respectively.
  • ch1 and ch2 are the voice and music channels, respectively.
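
A concrete example may help. For a hypothetical input songs/song.wav, the pattern above expands to the four files below (the output directory also depends on --replace-top-directory):

input_path = 'songs/song.wav'                # hypothetical input file
stem = input_path[:-len('.wav')]
expected = ['{}_{}_ch{}.wav'.format(stem, head, ch)
            for head in ('embd', 'mask')     # deep-clustering head vs. mask-inference head
            for ch in (1, 2)]                # ch1 = voice, ch2 = music
print(expected)
# ['songs/song_embd_ch1.wav', 'songs/song_embd_ch2.wav',
#  'songs/song_mask_ch1.wav', 'songs/song_mask_ch2.wav']
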
Train and separation examples

See the Example section of the ChimeraNet documentation.

Install

Requirements
  • keras
  • one of Keras' backends (e.g. TensorFlow, CNTK, or Theano)
  • sklearn
  • librosa
  • soundfile
Instructions
  1. Run pip install git+https://github.com/leichtrhino/ChimeraNet or use any other Python package installer. (Currently, ChimeraNet is not on PyPI.)
  2. Install a Keras backend if the environment does not have one. Install TensorFlow if unsure.


chimeranet's Issues

sample_rate argument

Hello again! I have tried my own wav files with the characteristics shown below by the soxi command. With these wav files, chimeranet-train.py threw an error when I set the --sr argument to either 44100 or 22050 (the error message: Error when checking input: expected input_1 to have shape (64, 259) but got array with shape (64, 188)), whereas the default sampling rate of 16,000 works fine for training. I am wondering whether a sampling rate higher than 16,000 is accepted for training (or whether a higher rate even makes sense in your system). Thanks.

Input File : 'XXX.wav'
Channels : 1
Sample Rate : 44100
Precision : 16-bit
Duration : 00:00:02.00 = 88200 samples = 150 CDDA sectors
File Size : 184k
Bit Rate : 734k
Sample Encoding: 16-bit Signed Integer PCM
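
For what it's worth, here is a minimal sketch of why the time dimension of the input changes with the sampling rate. The segment length, mel-band count and hop size are assumptions for illustration, not necessarily ChimeraNet's defaults:

import numpy as np
import librosa

segment_seconds = 1.5   # assumed analysis segment length
n_mels = 64             # assumed number of mel bands
hop_length = 128        # assumed STFT hop size

for sr in (16000, 22050, 44100):
    y = np.zeros(int(sr * segment_seconds))   # silent clip of fixed duration
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                          hop_length=hop_length)
    print(sr, spec.shape)                     # (n_mels, n_frames)

With a fixed hop length, a clip of fixed duration yields roughly 1 + sr * segment_seconds / hop_length frames, so raising the sampling rate changes the spectrogram shape the network sees unless the hop size (or the segment length in samples) is scaled along with it; this kind of dependency is the likely source of the "expected (64, 259) but got (64, 188)" mismatch.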

folder format

Hello!
It is easy to use and seems to work. Before going further with my own data, please advise on the data placement. I placed the .wav files as shown below. Is this the correct way to have melody1 and vocal1, etc. processed as pairs? Adding more to the README would be appreciated. Thanks!

root
|-- melody
|   |-- melody1.wav
|   `-- melody2.wav
`-- vocal
    |-- vocal1.wav
    `-- vocal2.wav

Total training samples

Hi,

I have a question about the generate_test_data function in data_generator.py.

How do you arrive at the number of steps (samples) = 7200 in your training script when using this function? The Keras generator interface requires knowing the number of steps per epoch.

How can I use this to calculate the number of steps for a different dataset?
Also, how can I calculate the same for a validation set?

Any help for the understanding would be appreciated.
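
A minimal, self-contained sketch of the bookkeeping Keras expects (the generator, model and sample counts below are stand-ins for illustration, not ChimeraNet's actual data_generator code): steps_per_epoch is the number of samples you intend to draw per epoch divided by the batch size, and validation_steps is the same calculation for the validation generator.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

n_train_samples, n_val_samples, batch_size = 7200, 800, 32   # assumed sizes

def dummy_generator(n_samples):
    # Stand-in for the project's generator functions: yields (input, target) batches forever.
    while True:
        for _ in range(n_samples // batch_size):
            yield np.random.rand(batch_size, 10), np.random.rand(batch_size, 1)

model = Sequential([Dense(1, input_shape=(10,))])
model.compile(optimizer='adam', loss='mse')

model.fit_generator(
    dummy_generator(n_train_samples),
    steps_per_epoch=n_train_samples // batch_size,    # 7200 // 32 = 225
    validation_data=dummy_generator(n_val_samples),
    validation_steps=n_val_samples // batch_size,     # 800 // 32 = 25
    epochs=1,
)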

Speech Separation

Hi,

Did you try using this network to separate two speaker sources from a mixture of them?

Details about the pretrained net

Hi,
thanks for implementing the Chimera network! Could you explain what data the pretrained model was trained on? What are the sources it separates? What does the 120 in the name stand for? Do you have more pretrained models that you could share?

Thanks and best regards
Verena

Joining coding forces

Hi leichtrhino,

I'd like to ask whether you'd like to join the dev team of Asteroid. This would be really great, as we would be able to develop things faster and cover even more architectures and datasets to experiment with quickly. 🚀

Don't hesitate to join our Slack to discuss a potential collaboration! 😃

Mask-Inference layers

https://github.com/arity-r/ChimeraNet/blob/6341383c61f238a83a0be8c7d4972aac4e7d958a/chimeranet/model.py#L57-L69

Correct me if I am wrong, but one could simply replace this block of code with Dense and Reshape layers like this:

mask_linear = Dense(self.F*self.C, activation='softmax', name='mask_linear')(body_linear)
mask = Reshape((self.T, self.F, self.C), name='mask')(mask_linear)

I think the API can handle the gradient updates correctly because of the Reshape layer. It also does not require much memory, because the masks are not extracted into a list. But I wonder whether this would be a correct definition for the mask-inference head of the model.

The reference I use is the Chimera++ network from the paper "Alternative Objective Functions for Deep Clustering", which redefines the architecture for speaker separation.
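
Below is a self-contained sketch of the suggested Dense + Reshape head on top of a placeholder BLSTM body. T, F, C and the layer sizes are assumed for illustration and are not ChimeraNet's actual configuration; note that, as in the snippet above, the softmax is taken over all F*C outputs of the Dense layer at each time step.

from keras.models import Model
from keras.layers import Input, Bidirectional, LSTM, Dense, Reshape

T, F, C = 64, 64, 2   # time frames, frequency bins, sources (assumed values)

inp = Input(shape=(T, F), name='input')
body_linear = Bidirectional(LSTM(300, return_sequences=True), name='body')(inp)

# Dense acts on the last axis of the (batch, T, 600) tensor, giving (batch, T, F*C);
# Reshape then splits the last axis into (F, C) without adding parameters.
mask_linear = Dense(F * C, activation='softmax', name='mask_linear')(body_linear)
mask = Reshape((T, F, C), name='mask')(mask_linear)

model = Model(inputs=inp, outputs=mask)
model.summary()   # the 'mask' output should have shape (None, 64, 64, 2)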


Topological sorting of the Graph

When running the training, before the initial epoch I get a log like:

2019-08-02 08:32:55.251530: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-02 08:32:55.305719: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-02 08:32:56.161844: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-08-02 08:32:56.213568: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:666] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.

That said, I don't have any problem with the final model accuracy. I think this is related to the fact that there are multiple heads in the functional definition of the model. Do you see the same messages when you run the script? Which Keras/TensorFlow versions do you use?

The versions I am using are:

  • tensorflow-gpu: 1.12.0
  • Keras: 2.2.4
  • CUDA Version: 9.0.176
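
For reference, the "multiple heads" refers to a functional model with more than one output, along the lines of this minimal sketch (shapes, layer sizes and losses are illustrative stand-ins, not ChimeraNet's actual definition):

from keras.models import Model
from keras.layers import Input, Bidirectional, LSTM, Dense, Reshape

T, F, C, D = 64, 64, 2, 20   # assumed time frames, frequency bins, sources, embedding size

inp = Input(shape=(T, F))
body = Bidirectional(LSTM(300, return_sequences=True))(inp)

# Head 1: deep-clustering embeddings, shape (T, F, D).
embd = Reshape((T, F, D), name='embedding')(Dense(F * D, activation='tanh')(body))
# Head 2: time-frequency masks, shape (T, F, C).
mask = Reshape((T, F, C), name='mask')(Dense(F * C, activation='sigmoid')(body))

model = Model(inputs=inp, outputs=[embd, mask])
model.compile(optimizer='rmsprop', loss={'embedding': 'mse', 'mask': 'mse'})
model.summary()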
