
DNN-based source separation

A PyTorch implementation of DNN-based source separation.

New information

  • v0.7.2
    • Updated Jupyter notebooks.

Model

Model | Reference
WaveNet | WaveNet: A Generative Model for Raw Audio
Wave-U-Net | Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation
Deep Clustering | Deep Clustering: Discriminative Embeddings for Segmentation and Separation
Deep Clustering++ | Single-Channel Multi-Speaker Separation using Deep Clustering
Chimera | Alternative Objective Functions for Deep Clustering
DANet | Deep Attractor Network for Single-Microphone Speaker Separation
ADANet | Speaker-independent Speech Separation with Deep Attractor Network
TasNet | TasNet: Time-domain Audio Separation Network for Real-time, Single-channel Speech Separation
Conv-TasNet | Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
DPRNN-TasNet | Dual-path RNN: Efficient Long Sequence Modeling for Time-domain Single-channel Speech Separation
Gated DPRNN-TasNet | Voice Separation with an Unknown Number of Multiple Speakers
FurcaNet | FurcaNet: An End-to-End Deep Gated Convolutional, Long Short-term Memory, Deep Neural Networks for Single Channel Speech Separation
FurcaNeXt | FurcaNeXt: End-to-End Monaural Speech Separation with Dynamic Gated Dilated Temporal Convolutional Networks
DeepCASA | Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation
Conditioned-U-Net | Conditioned-U-Net: Introducing a Control Mechanism in the U-Net for Multiple Source Separations
MMDenseNet | Multi-scale Multi-band DenseNets for Audio Source Separation
MMDenseLSTM | MMDenseLSTM: An Efficient Combination of Convolutional and Recurrent Neural Networks for Audio Source Separation
Open-Unmix (UMX) | Open-Unmix - A Reference Implementation for Music Source Separation
Wavesplit | Wavesplit: End-to-End Speech Separation by Speaker Clustering
Hydranet | Hydranet: A Real-Time Waveform Separation Network
Dual-Path Transformer Network (DPTNet) | Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation
CrossNet-Open-Unmix (X-UMX) | All for One and One for All: Improving Music Separation by Bridging Networks
D3Net | D3Net: Densely Connected Multidilated DenseNet for Music Source Separation
LaSAFT | LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation
SepFormer | Attention is All You Need in Speech Separation
GALR | Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent Networks
HRNet | Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation
MRX | The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

Modules

Module | Reference
Depthwise-separable convolution | Xception: Deep Learning with Depthwise Separable Convolutions
Gated Linear Units (GLU) | Language Modeling with Gated Convolutional Networks
Sigmoid Linear Units (SiLU) | Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning
Feature-wise Linear Modulation (FiLM) | FiLM: Visual Reasoning with a General Conditioning Layer
Point-wise Convolutional Modulation (PoCM) | LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation
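
As an illustration of the first module, here is a minimal PyTorch sketch of a depthwise-separable 1-D convolution (the class name and defaults are illustrative, not this repository's API): a depthwise convolution filters each channel independently, then a pointwise (1x1) convolution mixes channels.

import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv1d(
            in_channels, in_channels, kernel_size,
            padding=(kernel_size - 1) // 2 * dilation,
            dilation=dilation, groups=in_channels,
        )
        # Pointwise: mixes channels with a kernel of size 1.
        self.pointwise = nn.Conv1d(in_channels, out_channels, 1)

    def forward(self, x):  # x: (batch, in_channels, time)
        return self.pointwise(self.depthwise(x))

x = torch.randn(4, 64, 1000)
y = DepthwiseSeparableConv1d(64, 128, kernel_size=3)(x)
print(y.shape)  # torch.Size([4, 128, 1000])

This factorization is what keeps models such as Conv-TasNet lightweight: the cost of a full convolution is replaced by a per-channel filter plus a channel mixer.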

Methods related to training

Method | Reference
Permutation invariant training (PIT) | Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
One-and-rest PIT | Recursive Speech Separation for Unknown Number of Speakers
Probabilistic PIT | Probabilistic Permutation Invariant Training for Speech Separation
Sinkhorn PIT | Towards Listening to 10 People Simultaneously: An Efficient Permutation Invariant Training of Audio Source Separation Using Sinkhorn's Algorithm
Combination Loss | All for One and One for All: Improving Music Separation by Bridging Networks
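
To make PIT concrete, here is a minimal sketch of utterance-level PIT with an MSE loss: the loss is evaluated under every source permutation and the best permutation is kept per utterance. The repository's actual criterion (criterion/pit.py) may differ.

import itertools

import torch

def pit_mse_loss(estimates, targets):
    # estimates, targets: (batch_size, n_sources, T)
    n_sources = estimates.size(1)
    losses = []
    for perm in itertools.permutations(range(n_sources)):
        permuted = targets[:, list(perm), :]  # reorder target sources
        losses.append(((estimates - permuted) ** 2).mean(dim=(1, 2)))
    losses = torch.stack(losses, dim=1)  # (batch_size, n_sources!)
    best_loss, _ = losses.min(dim=1)     # best permutation per utterance
    return best_loss.mean()

estimates = torch.randn(4, 2, 16000)
targets = torch.randn(4, 2, 16000)
print(pit_mse_loss(estimates, targets))

Note that enumerating all permutations scales factorially with the number of sources, which is exactly what Sinkhorn PIT (above) is designed to avoid.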

Example

Open In Colab

LibriSpeech example using Conv-TasNet

You can check other tutorials in <REPOSITORY_ROOT>/egs/tutorials/.

0. Preparation

cd <REPOSITORY_ROOT>/egs/tutorials/common/
. ./prepare_librispeech.sh \
--librispeech_root <LIBRISPEECH_ROOT> \
--n_sources <#SPEAKERS>

1. Training

cd <REPOSITORY_ROOT>/egs/tutorials/conv-tasnet/
. ./train.sh \
--exp_dir <OUTPUT_DIR>

If you want to resume training,

. ./train.sh \
--exp_dir <OUTPUT_DIR> \
--continue_from <MODEL_PATH>

2. Evaluation

cd <REPOSITORY_ROOT>/egs/tutorials/conv-tasnet/
. ./test.sh \
--exp_dir <OUTPUT_DIR>

3. Demo

cd <REPOSITORY_ROOT>/egs/tutorials/conv-tasnet/
. ./demo.sh

Pretrained Models

You need gdown to download pretrained models.

pip install gdown

You can load a pretrained model as follows:

from models.conv_tasnet import ConvTasNet

model = ConvTasNet.build_from_pretrained(task="musdb18", sample_rate=44100, target="vocals")
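
Once loaded, the model should behave like a regular nn.Module. A hedged usage sketch, assuming the network accepts a (batch, channel, time) waveform tensor (check PRETRAINED.md for the exact interface):

import torch

model.eval()

# Dummy 4-second input at 44.1 kHz; the exact shape convention may differ.
mixture = torch.randn(1, 1, 4 * 44100)  # (batch, channel, time)

with torch.no_grad():
    estimates = model(mixture)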

See PRETRAINED.md, egs/tutorials/hub/pretrained.ipynb or click Open In Colab for details.

Time Domain Wrappers for Time-Frequency Domain Models

See egs/tutorials/hub/time-domain_wrapper.ipynb or click Open In Colab.

Speech Separation by Pretrained Models

See egs/tutorials/hub/speech-separation.ipynb or click Open In Colab.

Music Source Separation by Pretrained Models

See egs/tutorials/hub/music-source-separation.ipynb or click Open In Colab.

If you want to separate your own music file, see below:

  • MMDenseLSTM: See egs/tutorials/mm-dense-lstm/separate_music.ipynb or click Open In Colab.
  • Conv-TasNet: See egs/tutorials/conv-tasnet/separate_music.ipynb or click Open In Colab.
  • UMX: See egs/tutorials/umx/separate_music.ipynb or click Open In Colab.
  • X-UMX: See egs/tutorials/x-umx/separate_music.ipynb or click Open In Colab.
  • D3Net: See egs/tutorials/d3net/separate_music.ipynb or click Open In Colab.


dnn-based_source_separation's Issues

Linear encoder.

Currently, the encoder of TasNet requires a nonlinear activation (enc_nonlinear); in the paper, however, a linear encoder is used.
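
One way to support both variants is to make the nonlinearity optional. A hedged sketch (build_encoder and its arguments are illustrative, not the repository's API):

import torch.nn as nn

def build_encoder(in_channels, hidden_channels, kernel_size, stride, enc_nonlinear=None):
    # Basis transform as a strided 1-D convolution.
    conv = nn.Conv1d(in_channels, hidden_channels, kernel_size, stride=stride, bias=False)
    if enc_nonlinear is None:
        return conv  # linear encoder, as in the paper
    nonlinear = {"relu": nn.ReLU(), "prelu": nn.PReLU()}[enc_nonlinear]
    return nn.Sequential(conv, nonlinear)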

Separating a specific drum

Hi, great work! I am interested in separating a specific drum from tracks, and we could provide training data. Would you be interested in a gig? It could range from pointing out what needs to be done to adapt your code to train on our music files, all the way to writing the code yourself.

Thanks!

Unstable training of DPRNN-TasNet

The network fails to train when

  • layer normalization is applied before the bottleneck convolution, or
  • the per-GPU batch size is small (e.g., 1).
    • The same phenomenon is observed with Conv-TasNet.

Omitting the bottleneck convolution seems to work better.

Evaluation metrics

Source separation is evaluated in terms of

  • SDR (improvement)
  • SIR (improvement)
  • SAR

These are computed with mir_eval, as sketched below.
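
A minimal sketch using mir_eval; the "improvement" variants are computed here against the unprocessed mixture as a baseline (a common convention, though the repository's scripts may do this differently):

import numpy as np
import mir_eval

# references, estimates: (n_sources, n_samples) arrays
references = np.random.randn(2, 16000)
estimates = references + 0.1 * np.random.randn(2, 16000)

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(references, estimates)

# Improvement relative to using the raw mixture as the estimate for every source.
mixture = np.tile(references.sum(axis=0), (2, 1))
sdr_mix, sir_mix, _, _ = mir_eval.separation.bss_eval_sources(references, mixture)
print("SDRi:", sdr - sdr_mix)
print("SIRi:", sir - sir_mix)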

DPRNN-TasNet architecture

Questions:

  • Does DPRNN-TasNet require the bottleneck convolution?
  • The separable and dilated options are unnecessary.

PESQ error

PESQ sometimes raises a processing error. For example, 443c020x_0.18686_447c0205_-0.18686_22go010j_0.wav may trigger it because the target 443c020x contains too little speech.
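
A defensive wrapper can skip such files instead of aborting the evaluation. A sketch assuming the pesq PyPI package (the repository may compute PESQ differently):

from pesq import pesq

def safe_pesq(sample_rate, reference, estimate, mode="wb"):
    # Return the PESQ score, or None for utterances PESQ cannot process.
    try:
        return pesq(sample_rate, reference, estimate, mode)
    except Exception as error:  # e.g., too little speech in the target
        print(f"Skipped: {error}")
        return None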

Finetuning

A fine-tuning script for source separation with an unknown number of sources is needed.

missing SPEAKERS.TXT for example

When running the 0. Preparation section of the Example in the Readme.md file, I get an error:
FileNotFoundError: [Errno 2] No such file or directory: '../../../dataset/SPEAKERS.TXT'
If I just create an empty file, ./prepare_librispeech.sh runs without error, but then ./train.sh gives an error ValueError: num_samples should be a positive integer value, but got num_samples=0
Please give some advice on the required content and format of SPEAKERS.TXT.
Thank you.

Just a question

Feel free to close this afterwards, but is there such a thing as targeted keyword separation? VoiceFilter does targeted speech separation for a specific voice; I wondered whether something similar has been applied to a known word (or words) rather than a specific voice spectrum. Do you know of anything, or could you point out possibilities?

Training for ORPIT.

For the training of ORPIT, the number of ground-truth sources varies within one batch, e.g.:

sources_A # tensor with the shape (2, T)
sources_B # tensor with the shape (3, T)

We cannot concatenate them.

# in collate_fn
minibatch = torch.cat([sources_A.unsqueeze(dim=0), sources_B.unsqueeze(dim=0)], dim=0) # Error: (1, 2, T) and (1, 3, T) do not match along dim=1

How should these be handled with the ORPIT class in criterion/pit.py? One possible workaround is sketched below.
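
A hedged sketch (not the repository's official answer): a collate_fn that zero-pads every item to the largest source count in the batch and returns the true counts, so the loss can ignore padded rows.

import torch

def collate_variable_sources(batch):
    # batch: list of (n_sources_i, T) tensors with varying n_sources_i
    max_sources = max(sources.size(0) for sources in batch)
    padded, n_sources = [], []
    for sources in batch:
        n, T = sources.size()
        pad = torch.zeros(max_sources - n, T, dtype=sources.dtype)
        padded.append(torch.cat([sources, pad], dim=0))  # (max_sources, T)
        n_sources.append(n)
    return torch.stack(padded, dim=0), torch.tensor(n_sources)

sources_A = torch.randn(2, 16000)
sources_B = torch.randn(3, 16000)
minibatch, counts = collate_variable_sources([sources_A, sources_B])
print(minibatch.shape, counts)  # torch.Size([2, 3, 16000]) tensor([2, 3])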

Join efforts?

Hi @tky823, nice repo!

We'd welcome most of this code in Asteroid if you'd like to contribute 😃 Would you?

Cheers,

problem in google colab

I'm getting an error when running cell 2 of every Google Colab notebook. How can I solve it? Thanks.

|████████████████████████████████| 596 kB 8.8 MB/s 
     |████████████████████████████████| 963 kB 49.4 MB/s 
     |████████████████████████████████| 130 kB 74.6 MB/s 
Download CrossNet-Open-Unmix. (Dataset: MUSDB18, sampling frequency 44.1kHz)
Access denied with the following error:

 	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

	 https://drive.google.com/uc?id=1yQC00DFvHgs4U012Wzcg69lvRxw5K9Jj 

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/content/DNN-based_source_separation/src/utils/utils.py", line 43, in download_pretrained_model_from_google_drive
    with zipfile.ZipFile(zip_path) as f:
  File "/usr/lib/python3.7/zipfile.py", line 1240, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/53e252ff-8063-4c95-87ac-01fdaff0341b.zip'
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
[<ipython-input-3-b990f481835a>](https://localhost:8080/#) in <module>()
----> 1 get_ipython().run_cell_magic('shell', '', 'cd "/content/DNN-based_source_separation/egs/tutorials/x-umx"\n\n# Build environment\npip install -r requirements.txt -q\n\n# Download pretrained model\nmodel_name="musdb18"\n\n. ./prepare.sh --model_name "${model_name}"')

2 frames
[/usr/local/lib/python3.7/dist-packages/google/colab/_system_commands.py](https://localhost:8080/#) in check_returncode(self)
    137     if self.returncode:
    138       raise subprocess.CalledProcessError(
--> 139           returncode=self.returncode, cmd=self.args, output=self.output)
    140 
    141   def _repr_pretty_(self, p, cycle):  # pylint:disable=unused-argument

CalledProcessError: Command 'cd "/content/DNN-based_source_separation/egs/tutorials/x-umx"

# Build environment
pip install -r requirements.txt -q

# Download pretrained model
model_name="musdb18"

. ./prepare.sh --model_name "${model_name}"' returned non-zero exit status 1.


`hidden_channels` parameter in LSTM

Currently, hidden_channels is used as follows in dprnn.py:

import torch.nn as nn

if causal:
    # unidirectional: the LSTM output size is hidden_channels
    lstm = nn.LSTM(num_features, hidden_channels, bidirectional=False)
else:
    # bidirectional: hidden_channels//2 per direction, so the concatenated
    # output size is again hidden_channels
    lstm = nn.LSTM(num_features, hidden_channels//2, bidirectional=True)

This configuration may be confusing.

Conv-TasNet Cumulative Layer Norm Bug?

Shouldn't lines 78-92 be

step_sum = input.sum(dim=1) # -> (batch_size, T)
cum_sum = torch.cumsum(step_sum, dim=1) # -> (batch_size, T)

cum_num = torch.arange(C, C*(T+1), C, dtype=torch.float) # -> (T,): [C, 2C, ..., TC]
cum_mean = cum_sum / cum_num # (batch_size, T)
cum_var = (cum_sum - cum_mean)**2/cum_num

cum_mean = cum_mean.unsqueeze(dim=1)
cum_var = cum_var.unsqueeze(dim=1)

output = (input - cum_mean) / (torch.sqrt(cum_var) + eps) * self.gamma + self.beta

according to the Conv-TasNet paper?
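
For reference, a self-contained sketch of cumulative layer normalization as defined in the Conv-TasNet paper, with the variance computed from cumulative sums of x and x² (this is a sketch, not the repository's code; gamma and beta are assumed to be learnable (1, C, 1) parameters):

import torch

def cumulative_layer_norm(input, gamma, beta, eps=1e-8):
    # input: (batch_size, C, T)
    _, C, T = input.size()
    step_sum = input.sum(dim=1)                  # (batch_size, T)
    step_pow_sum = (input ** 2).sum(dim=1)       # (batch_size, T)
    cum_sum = torch.cumsum(step_sum, dim=1)      # sums over channels and time <= t
    cum_pow_sum = torch.cumsum(step_pow_sum, dim=1)

    cum_num = torch.arange(C, C * (T + 1), C, dtype=input.dtype)  # [C, 2C, ..., TC]
    cum_mean = cum_sum / cum_num                     # (batch_size, T)
    cum_var = cum_pow_sum / cum_num - cum_mean ** 2  # E[x^2] - E[x]^2

    cum_mean = cum_mean.unsqueeze(dim=1)
    cum_var = cum_var.unsqueeze(dim=1)
    return (input - cum_mean) / torch.sqrt(cum_var + eps) * gamma + beta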

Model common usage

Hi tky823, thank you so much for your work; this is a great framework. I have a quick question about the usage of the models. I see you provide Conv-TasNet and D3Net for training on MUSDB, but not TasNet or, more specifically, DPRNN-TasNet. Do these models also work on signals other than pure speech? And, in your opinion, what is the best current model for musical signals? Thank you very much.
