jusperlee / conv-tasnet

A PyTorch implementation of Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

Python 100.00%

cnn-architecture deep-learning pytorch speech-separation

conv-tasnet's Introduction

Hey 👋🏽, I'm Kai Li!

My name is Kai Li (Chinese name: 李凯). I'm a second-year master's student in the Department of Computer Science and Technology, Tsinghua University, supervised by Prof. Xiaolin Hu (胡晓林). I am also a member of the TSAIL Group directed by Prof. Bo Zhang (张钹) and Prof. Jun Zhu (朱军). I am an intern at Tencent AI Lab, mainly doing research on causal speech separation, supervised by Yi Luo (罗艺).

🤗 These works are open source to the best of my ability.

🤗 I am currently doing research on multimodal speech separation, and am interested in other speech tasks (e.g., pre-training models and neuroscience). If you would like to collaborate, please contact me. Many thanks.

🔖 Homepages

Kai Li, Jusper Lee, cslikai.cn

📅 News

2023.07: 🎲 One paper is accepted by ECAI 2023.
2023.05: 🧩 Two papers are accepted by Interspeech 2023.
2023.05: 🎉 We won the first prize 🥇 of the Cinematic Sound Demixing Track 23 in the Leaderboard A and B.
2023.05: 🎉 We won the first prize 🥇 of the ASC23 and Best Application Award.
2023.04: 🎲 One paper appeared on arXiv.
2023.02: 🧩 One paper is accepted by ICASSP 2023.
2023.01: 🧩 One paper is accepted by ICLR 2023.

📰 Selected Publications:

See Google Scholar for a full list of publications.

Speech Separation

An efficient encoder-decoder architecture with top-down attention for speech separation. Kai Li, Runxuan Yang, Xiaolin Hu. ICLR 2023.
An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits. Kai Li, Fenghua Xie, Hang Chen, Kexin Yuan, Xiaolin Hu. arXiv 2022.
Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network. Xiaolin Hu, Kai Li, Weiyi Zhang, Yi Luo, Jean-Marie Lemercier, Timo Gerkmann. NeurIPS 2021.

Neuroscience

Inferring mechanisms of auditory attentional modulation with deep neural networks. Ting-Yu Kuo, Yuanda Liao, Kai Li, Bo Hong, Xiaolin Hu. Neural Computation 2022.

Cloud Removal

PMAA: A Progressive Multi-scale Attention Autoencoder Model for High-Performance Cloud Removal from Multi-temporal Satellite Imagery. Xuechao Zou, Kai Li, Junliang Xing, Pin Tao#, Yachao Cui. ECAI 2023.

Super Resolution

A Survey of Single Image Super Resolution Reconstruction. Kai Li, Shenghao Yang, Runting Dong, Jianqiang Huang, Xiaoying Wang. IET Image Processing 2020.
Single Image Super-resolution Reconstruction of Enhanced Loss Function with Multi-GPU Training. Jianqiang Huang, Kai Li, Xiaoying Wang. ISPA 2019.

conv-tasnet's Issues

Pre-trained model test has no effect. Can you upload a new one?

The current pre-trained model has no separation effect. Can you re-upload one? Thank you!

About the conv block architecture

Hi Jusper, I noticed that you removed the second PReLU2 and Norm2 in the Conv1D Block in the latest version.
May I know why you did this?
'Cause in the original paper they are there.
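
For reference, the block ordering that the question attributes to the paper can be sketched as below. This is a minimal, non-causal sketch and not this repository's code: the class name, argument names, and default sizes are assumptions chosen for illustration, and `nn.GroupNorm(1, C)` stands in for global layer normalization.

```python
import torch
import torch.nn as nn


class Conv1DBlock(nn.Module):
    """Illustrative 1-D conv block: 1x1 conv, PReLU, norm, D-conv, PReLU, norm."""

    def __init__(self, in_channels=128, hidden_channels=512,
                 skip_channels=128, kernel_size=3, dilation=1):
        super().__init__()
        # "Same" padding for an odd kernel size (non-causal variant).
        padding = dilation * (kernel_size - 1) // 2
        self.conv1x1 = nn.Conv1d(in_channels, hidden_channels, 1)
        self.prelu1 = nn.PReLU()
        self.norm1 = nn.GroupNorm(1, hidden_channels, eps=1e-8)
        # groups=hidden_channels makes this a depthwise convolution.
        self.dconv = nn.Conv1d(hidden_channels, hidden_channels, kernel_size,
                               padding=padding, dilation=dilation,
                               groups=hidden_channels)
        self.prelu2 = nn.PReLU()  # the second PReLU the question refers to
        self.norm2 = nn.GroupNorm(1, hidden_channels, eps=1e-8)  # the second norm
        self.res_out = nn.Conv1d(hidden_channels, in_channels, 1)
        self.skip_out = nn.Conv1d(hidden_channels, skip_channels, 1)

    def forward(self, x):
        y = self.norm1(self.prelu1(self.conv1x1(x)))
        y = self.norm2(self.prelu2(self.dconv(y)))
        # Residual output feeds the next block; skip output is summed over blocks.
        return x + self.res_out(y), self.skip_out(y)


block = Conv1DBlock(dilation=4)
res, skip = block(torch.randn(2, 128, 100))
print(res.shape, skip.shape)
```

Dropping `prelu2` and `norm2` changes the parameter count only slightly, so any effect on separation quality would have to be checked empirically.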

pretrained causal model

Great job! Could you share the pre-trained causal model?

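Error when using the pre-trained model

Hello, I used the pre-trained model best.pt provided in the README to test a mixed speech clip, and the following error occurred when saving the audio files. How can I fix it?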

train loss

model_size issue

I have tried the parameter configuration from the paper, but the model size is still only about 3.5M, short of the 5.1M baseline. What else can I do to increase the model size besides modifying the yml file?
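
A back-of-the-envelope parameter count may help narrow this down. The sketch below is an assumption-laden estimate, not a description of this repository: it assumes the paper's layout (bias-free encoder and decoder, a normalized bottleneck, X*R conv blocks, a PReLU plus 1x1 mask convolution), two learnable values per normalized channel, and the hyperparameters N=512, L=16, B=128, H=512, Sc=128, P=3, X=8, R=3, C=2. The function name is hypothetical.

```python
def conv_tasnet_params(N=512, L=16, B=128, H=512, Sc=128, P=3, X=8, R=3,
                       C=2, skip=True):
    """Estimate the parameter count of a Conv-TasNet-style model."""

    def conv1x1(c_in, c_out):
        return c_in * c_out + c_out  # weights + bias

    def norm(channels):
        return 2 * channels  # gain + bias per channel

    block = conv1x1(B, H) + 1 + norm(H)   # 1x1 conv, PReLU, norm
    block += (H * P + H) + 1 + norm(H)    # depthwise conv, PReLU, norm
    block += conv1x1(H, B)                # residual path
    if skip:
        block += conv1x1(H, Sc)           # skip-connection path

    total = 2 * N * L                     # encoder + decoder, no bias
    total += norm(N) + conv1x1(N, B)      # bottleneck norm + 1x1 conv
    total += X * R * block                # stacked conv blocks
    total += 1 + conv1x1(Sc if skip else B, C * N)  # PReLU + mask conv
    return total


print(conv_tasnet_params(skip=True))   # 5050545, about 5.05M
print(conv_tasnet_params(skip=False))  # 3474609, about 3.47M
```

Under these assumptions the skip-connection 1x1 convolutions account for most of the gap between roughly 3.5M and 5.1M, so a model built without the skip path would stay near 3.5M regardless of the yml settings.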

What exactly does DataLoaders do?

What is the point of writing a new Dataloader with that __iter__ method?

Training problem

Hello, I have just started running the code. During training I get an error in the validation part of trainer.py:
total_loss_avg = sum(losses)/len(losses)
ZeroDivisionError: division by zero. Why is losses empty?
Thank you!
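
The exception means `losses` is an empty list, so the validation loop produced no batches. A guard like the one below makes that explicit. It is a sketch: `average_loss` is a hypothetical helper, and the causes named in the message (wrong paths in the .scp files, or every utterance being shorter than the chunk size) are assumptions that the thread does not confirm.

```python
def average_loss(losses, split="validation"):
    """Average per-batch losses, failing loudly when no batch was produced."""
    if not losses:
        raise RuntimeError(
            f"No {split} batches were produced. Check that the .scp files "
            "point to existing audio and that the utterances are not all "
            "shorter than the configured chunk size."
        )
    return sum(losses) / len(losses)


print(average_loss([1.0, 3.0]))  # 2.0
```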

Hello, could you share the wsj0 dataset?

I would like to ask for your wsj0 dataset.
Many thanks.

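torchaudio problem

The meaning of key on line 50 is unclear. I replaced it with an arbitrary file name, which caused a format error. Could you point out what is wrong with key and how it should be used?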

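Looking for a solution

When testing a single audio file, what is key on line 46? I could not find its definition. Could you explain how to resolve this? Many thanks.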

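Noise problem

Hello, and thank you very much for providing this code. I ran it, but at test time there is almost no separation, and noise is actually added. I have a few small questions that are not directly about the code. I trained on my own dataset: I cut my recordings into 3 s segments and mixed them randomly, giving 16,800 training and 4,200 validation mixtures. When testing (that is, when running Seperation.py), do I need a dedicated test set? Can I simply reuse the validation set, or do I have to build a new one? Also, does the audio length affect the results? Is 3 s too short?

Purpose of Conv-TasNet-Skip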

Noise Issue

I trained the model on clean two-speaker mixture datasets.
It separates well, but its output contains too much noise.
Is this a common phenomenon?

Separation on noisy two-speaker mixture datasets is poor, and the loss does not decrease.

Model inference

Where can I find the ./config/train/train.yml file?

Configuration

Hello, thank you for providing the code. Can it run on a single GTX 1080 Ti?

Question about Conv1D_Block

Thanks for your excellent work! I have a question: in the paper, the D_conv in the 1-D Conv block is followed by PReLU and normalization, but the Conv1D_Block class does not seem to include them. Is there a particular reason for this?

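Problem reproducing the results

Hello! Thank you very much for providing this code. I have a few questions. I cannot reproduce the results of the original paper; does the data need some kind of preprocessing? Did you create the mixtures directly with the MATLAB script you mentioned? I found that some of the utterances listed in its .txt files do not exist in my dataset. Could that be because my copy of WSJ0 is incomplete? And are the utterances you used for mixing those under "chime2_wsj0/data/chime2_wsj0/isolated/*"?
Thank you very much!

About reproducing the encoder and mask

How can I reproduce the encoder and decoder process figures shown in the paper?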

Error in Loss Function

This line

Conv-TasNet/Conv-TasNet_lightning/Loss.py

Line 48 in 9eac70d

N = ests[0].size(0)

could have two different outputs given your test:

if we continue with

if __name__ == "__main__":
    ests = torch.randn(4,320)
    egs = torch.randn(4,320)
    loss = Loss()
    print(loss.compute_loss(ests, egs))

N will be length of audio 320

however if we have a list of outputs (two speakers)


if __name__ == "__main__":
    ests = [torch.randn(4,320),torch.randn(4,320)]
    egs = [torch.randn(4,320),torch.randn(4,320)]
    loss = Loss()
    print(loss.compute_loss(ests, egs))

N will be batch size.

Could you tell me please which one is the correct approach?

I guess your test is wrong and it should be

if __name__ == "__main__":
    ests = [torch.randn(4,320)]
    egs = [torch.randn(4,320)]
    loss = Loss()
    print(loss.compute_loss(ests, egs))

Could you validate this?
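
The two behaviors described above can be checked directly. The sketch below assumes that N is meant to be the batch size, which the thread does not confirm; `batch_size_of` is a hypothetical helper and not part of Loss.py.

```python
import torch


def batch_size_of(ests):
    """Return the batch size for a single tensor or a list of per-speaker tensors."""
    if torch.is_tensor(ests):
        ests = [ests]  # treat a bare (batch, time) tensor as one speaker
    return ests[0].size(0)


single = torch.randn(4, 320)
pair = [torch.randn(4, 320), torch.randn(4, 320)]

print(single[0].size(0))      # 320: indexing a tensor selects the first row
print(pair[0].size(0))        # 4: indexing a list selects the first speaker
print(batch_size_of(single))  # 4
print(batch_size_of(pair))    # 4
```

Wrapping a bare tensor in a list, as in the last test proposed above, gives the same N as the two-speaker case.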

using TIMIT data

Thank you for your useful code.
I can not find WSJ data and want to use "create-speaker-mixtures.zip" to make mixture with TIMIT.
How can I use this with TIMIT configuration files?

Error when importing parse from the option package

How can this be resolved?

Hardware specifications

Hello, may I ask which GPU you use? I am using an RTX 3090 and get CUDA out of memory, so I wanted to ask. Thank you.

Error during training

Hello, I encountered the following error when training with train.py in Conv_TasNet_Pytorch:

/home/layers/miniconda3/envs/asteroid/lib/python3.7/site-packages/torchaudio/backend/utils.py:54: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
  '"sox" backend is being deprecated. '
2023-04-14 17:28:59 [./options/option.py:13 - INFO ] Reading .yml file .......
2023-04-14 17:28:59 [./options/option.py:19 - INFO ] Export CUDA_VISIBLE_DEVICES = 0
2023-04-14 17:28:59 [train.py:20 - INFO ] Building the model of Conv-TasNet
2023-04-14 17:28:59 [train.py:23 - INFO ] Building the trainer of Conv-TasNet
2023-04-14 17:29:00 [/home/layers/audio/speech_enhancement/Conv-TasNet/Conv_TasNet_Pytorch/trainer.py:136 - INFO ] Create optimizer adam: {'lr': 0.001, 'weight_decay': 1e-05}
2023-04-14 17:29:00 [/home/layers/audio/speech_enhancement/Conv-TasNet/Conv_TasNet_Pytorch/trainer.py:105 - INFO ] Starting preparing model ............
2023-04-14 17:29:00 [/home/layers/audio/speech_enhancement/Conv-TasNet/Conv_TasNet_Pytorch/trainer.py:107 - INFO ] Loading model to GPUs:(0,), #param: 3.48M
2023-04-14 17:29:00 [/home/layers/audio/speech_enhancement/Conv-TasNet/Conv_TasNet_Pytorch/trainer.py:112 - INFO ] Gradient clipping by 200, default L2
2023-04-14 17:29:00 [train.py:28 - INFO ] Making the train and test data loader
Traceback (most recent call last):
  File "train.py", line 38, in <module>
    main()
  File "train.py", line 33, in main
    logger.info('Train data loader: {}, Test data loader: {}'.format(len(train_loader), len(val_loader)))
TypeError: object of type 'DataLoaders' has no len()

What causes this? Also, I noticed there are two files, dataloader.py and dataloader_new.py, and the current train.py uses dataloader.py. Should I use dataloader_new.py instead?
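
The TypeError is what Python raises when `len()` is called on an object whose class defines `__iter__` but not `__len__`. The sketch below reproduces that with hypothetical stand-in classes (they are not the repository's `DataLoaders`) and shows two possible workarounds: adding `__len__`, or logging defensively.

```python
class IterOnlyLoader:
    """Hypothetical stand-in: iterable, but has no __len__."""

    def __init__(self, batches):
        self.batches = batches

    def __iter__(self):
        return iter(self.batches)


class SizedLoader(IterOnlyLoader):
    """Same loader with __len__ added, so len(loader) works."""

    def __len__(self):
        return len(self.batches)


def safe_len(loader):
    """Return len(loader), or 'unknown' if the loader is not sized."""
    try:
        return len(loader)
    except TypeError:
        return "unknown"


print(safe_len(IterOnlyLoader([1, 2])))  # unknown
print(safe_len(SizedLoader([1, 2, 3])))  # 3
```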

About pre-trained model

Hi,

In the README file, it says the pre-trained model has been provided.
But I cannot find the model path in this repo.
Can you help to provide it again, thank you!

The dimension problem

CPU version

Is it possible to adapt the code to run on a CPU?

Currently if I try I get

# python Separation_wav.py -mix_scp /root/audio/test.wav -yaml options/train/train.yml -model /root/convtasnet/best.pt -save_path ./checkpoint
2020-12-21 11:44:09 [/app/option.py:13 - INFO ] Reading .yml file .......
2020-12-21 11:44:09 [/app/option.py:19 - INFO ] Export CUDA_VISIBLE_DEVICES = 1,2,3,4,5,6,7
2020-12-21 11:44:09 [Separation_wav.py:23 - INFO ] Load checkpoint from /root/convtasnet/best.pt, epoch  70
Traceback (most recent call last):
  File "Separation_wav.py", line 70, in <module>
    main()
  File "Separation_wav.py", line 65, in main
    separation=Separation(args.mix_scp, args.yaml, args.model, gpuid)
  File "Separation_wav.py", line 24, in __init__
    self.net=net.cuda()
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 305, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 202, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 224, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 305, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 192, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 102, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

Thank you
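
The traceback shows an unconditional `net.cuda()` call. A common way to make inference work without a GPU is to pick the device at run time and pass `map_location` to `torch.load`. The sketch below illustrates that pattern only: the helper names and the `"model_state_dict"` checkpoint key are assumptions, not taken from Separation_wav.py.

```python
import torch


def pick_device(gpuid=None):
    """Use the first requested GPU if CUDA is available, otherwise the CPU."""
    if gpuid and torch.cuda.is_available():
        return torch.device(f"cuda:{gpuid[0]}")
    return torch.device("cpu")


def load_for_inference(net, checkpoint_path, device):
    """Load weights onto `device` and switch the network to eval mode."""
    # map_location remaps tensors that were saved from a GPU process.
    checkpoint = torch.load(checkpoint_path, map_location=device)
    # "model_state_dict" is an assumed key; adjust it to the checkpoint layout.
    net.load_state_dict(checkpoint["model_state_dict"])
    return net.to(device).eval()
```

With this pattern the input tensors also need to be moved with `.to(device)` before the forward pass.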

About the network architecture

What's the motivation of using a one-layer-encoder instead of a deeper one?

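Loss does not match when reproducing the results

Hello, thank you for sharing your code.

The code runs fine on my side with the WSJ0 data. The only configuration changes are that I set the batch size to 8 and used a single GPU; everything else is the same.
However, my losses for the first three epochs are far from those in conv_tasnet_loss.png in your code:

epoch / train loss / eval loss:
0 / no train loss / 27.94
1 / -0.191 / -0.231
2 / -0.239 / -0.232

I do not have many resources, so training is slow, but at this rate it seems I will not reach the loss curve shown in your figure.
Do the number of GPUs and the batch size really matter this much? Or might there be other causes?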

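What is the purpose of the print function at the end of DataLoaders_new?

When training the model, it reports index out of range. Looking into it, running Dataloaders_new gives a result of 0, and I do not know what the print function at the end is for. Could you help explain?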

About model_size

Hello, the paper reports model sizes below 10M, so why is your model 30M?

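Problem testing a single audio file with the pre-trained model

Hello, I ran into a problem when testing audio with the single-audio test command given in the README. The example command in the README is:
python Separation_wav.py -mix_wav 1.wav -yaml ./config/train/train.yml -model best.pt -gpuid [0,1,2,3,4,5,6,7] -save_path ./checkpoint
Before running it, I made the following changes:
1. Changed the input audio path, the pre-trained model path, and the save path
2. Removed the -gpuid argument to use the default value
3. Changed the yaml file path to ./options/train/train.yml
I then ran the modified command and got the following error:

I hope you can point out the cause of the error. Thank you.

Also, Separation_wav.py takes the argument -mix_scp, but the example command uses -mix_wav, so there may be a mistake here.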

Where is the scp file?

The generate_scp file uses the path /home/likai/data1/Dataset/wsj0-mix/2speakers/wav8k/min/tr/mix. Can you show me where these files are?

Could you tell me your own results of Conv-TasNet?

Hi JusperLee,
Your code does not seem to be the latest version of Conv-TasNet. I think it may be the second version of TasNet, because the summation of the outputs of all 1-D conv blocks is absent, and the 1-D conv block also lacks the two output paths.
My code adds them, but the results are not ideal, even worse than the second version: only 15.3 dB.

Looking forward to your reply!

Train for Music Source Separation

Hi,
I would like to know if you would suggest using this architecture for the task of Music Source Separation and also what changes with respect to this other approach called demucs.
Thanks a lot!

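About the dataset

Hello, how can I generate the scp files if I only have wav files? My wsj0 dataset was sent to me already mixed, but without the csv or scp files. What do I need to do to generate the scp? A screenshot of the files is below. Thanks!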
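
One way to build such a file from a folder of wav files is sketched below. It assumes the Kaldi-style layout of one `key path` pair per line, with the file name as the key; that layout is an assumption about what the data loader expects, and `write_scp` is a hypothetical helper rather than the repository's own script.

```python
from pathlib import Path


def write_scp(wav_dir, scp_path):
    """Write one '<file name> <absolute path>' line per .wav file under wav_dir."""
    wavs = sorted(Path(wav_dir).rglob("*.wav"))
    with open(scp_path, "w", encoding="utf-8") as scp:
        for wav in wavs:
            scp.write(f"{wav.name} {wav.resolve()}\n")
    return len(wavs)
```

Calling it once per folder (for example the mixture folder and each speaker folder of the training, validation, and test splits) yields one scp file per folder. Paths containing spaces would break this simple format.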

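Problem with music source separation

Hello, this is my first time using Conv-TasNet. I am using the musdb dataset at a 44100 Hz sampling rate to separate vocals from accompaniment, and the loss stays above 10 and will not go down. Have you tried anything like this? I do not know what I am doing wrong. Because I do not have enough GPU memory, I changed chunk_size to 44100; the other training parameters are unchanged.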

1

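Thank you very much!

Can this code be used to train on and separate audio containing three speakers?

About the skip-connection in the original paper

Hello! The network in the paper uses both residual and skip-connection paths, but your code seems to have only the residual path and no skip-connection?

Interruption

Hello, I have been training conv_tasnet for a week and reached epoch 53 when training was suddenly interrupted. Can I resume from where it stopped? If so, what should I change? Is it the resume option in the yml?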

Normalization implementation

The normalization implementation is interesting here compared to the paper version and Yi Luo's implementation.

GroupNorm: while Yi Luo used nn.GroupNorm directly, this repo implements a new version. What is the difference between them?

cLN: it seems this one is not cumulative. Both the cumulative version and the simple layer-wise norm perform terribly.
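
For comparison, a cumulative layer norm in the sense used for causal models normalizes each frame with statistics computed over all channels and all frames up to that frame. The sketch below is one way to write that with cumulative sums; the class name is hypothetical and it is not this repository's implementation.

```python
import torch
import torch.nn as nn


class CumulativeLayerNorm(nn.Module):
    """Normalize frame k using the mean/variance of frames 0..k over all channels."""

    def __init__(self, channels, eps=1e-8):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(1, channels, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1))

    def forward(self, x):  # x: (batch, channels, time)
        channels, steps = x.size(1), x.size(2)
        cum_sum = x.sum(dim=1).cumsum(dim=1)        # (batch, time)
        cum_sq = x.pow(2).sum(dim=1).cumsum(dim=1)  # (batch, time)
        # Number of entries seen up to and including each frame.
        count = torch.arange(1, steps + 1, device=x.device, dtype=x.dtype) * channels
        mean = cum_sum / count
        var = (cum_sq / count - mean.pow(2)).clamp_min(0.0)
        x = (x - mean.unsqueeze(1)) / torch.sqrt(var.unsqueeze(1) + self.eps)
        return x * self.gain + self.bias


norm = CumulativeLayerNorm(4)
print(norm(torch.randn(2, 4, 10)).shape)
```

Because frame k never uses statistics from later frames, changing future samples leaves earlier outputs unchanged, which is the property a causal model needs.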

key is not defined

During inference, key is not defined. What should the value of key be?
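
A possible workaround, assuming `key` is only meant to be the utterance name used for the output files (the thread does not confirm this), is to derive it from the input path. Both helper names and the `spk1`, `spk2` folder layout below are hypothetical.

```python
import os


def key_from_path(wav_path):
    """Use the input file name without its extension as the utterance key."""
    return os.path.splitext(os.path.basename(wav_path))[0]


def output_path(save_dir, spk_index, wav_path):
    """Build an output path such as <save_dir>/spk1/<key>.wav (assumed layout)."""
    return os.path.join(save_dir, f"spk{spk_index + 1}", key_from_path(wav_path) + ".wav")


print(key_from_path("/root/audio/test.wav"))  # test
```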

integrate into onssen?

Hi Jusper,
Thanks for sharing the Conv-TasNet implementation! I'm writing an open-source library to contain speech-separation algorithms, especially the deep learning ones. Would you like me to integrate your Conv-TasNet code into the onssen library? I will add the copyright header before I push it to the repo. Thanks!

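Training on the LibriSpeech dataset

Hello, I mixed 20,000 utterances from LibriSpeech for training, but the loss does not decrease much, and training early-stopped at around epoch 70 with the loss only down to about -7. What might be the reasons?

Do you have any contact information?

Pre-trained model

Are the network configuration parameters of the pre-trained model the same as those in options/train/train.yml?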

Example of output

Hi @JusperLee

I used your pre-trained weights to run inference on one mixed wav file found here: https://www.merl.com/demos/deep-clustering/media/male-male-mixture.wav.

The output contains a very high pitched tone and hard to hear audio.

Have you experienced this as well when inferring?

Thanks

Error while running Conv-TasNet

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Hi, Jusper,

I just downloaded the model best.pt and changed the gpuid option to [1,2,3]. I then got the following error when running inference.

root:/JusperLee-Conv-TasNet/Conv_TasNet_Pytorch# python Separation.py -mix_scp sample/test.scp -yaml ./options/train/train.yml -model best.pt -gpuid 0,1,2,3 -save_path ./checkpoint
2020-08-25 06:48:21 [./options/option.py:13 - INFO ] Reading .yml file .......
2020-08-25 06:48:21 [./options/option.py:19 - INFO ] Export CUDA_VISIBLE_DEVICES = 1,2,3
2020-08-25 06:48:21 [Separation.py:23 - INFO ] Load checkpoint from best.pt, epoch 70
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "Separation.py", line 72, in
main()
File "Separation.py", line 68, in main
separation.inference(args.save_path)
File "Separation.py", line 45, in inference
s = s*norm/torch.max(torch.abs(s))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
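
The failing line mixes a tensor on cuda:0 with one on the CPU. One way to avoid that is to bring both operands to the same device before rescaling. The sketch below assumes `norm` is the peak amplitude of the input mixture, which the traceback does not show, and uses a hypothetical helper name.

```python
import torch


def rescale_to_mixture_peak(s, mix):
    """Scale a separated source so its peak matches the mixture's peak."""
    s = s.detach().cpu()                      # move the network output to the CPU
    norm = mix.detach().cpu().abs().max()     # assumed meaning of `norm`
    peak = s.abs().max().clamp_min(1e-8)      # avoid dividing by zero on silence
    return s * norm / peak


out = rescale_to_mixture_peak(torch.tensor([0.5, -2.0]), torch.tensor([0.1, -0.4]))
print(out.abs().max())
```

Moving everything to the CPU is convenient here because the result is written to disk straight afterwards; moving `norm` to `s.device` instead would work equally well.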