sanyuan-chen / css_with_conformer

Code for the ICASSP-2021 paper: Continuous Speech Separation with Conformer.

Python 100.00%

css_with_conformer's Introduction

Hi, I'm Sanyuan Chen 👋

🎓 I’m currently a Ph.D. student at Harbin Institute of Technology and a research intern at Microsoft Research Asia.
🌱 My research interests include self-supervised learning, speech and audio processing and spoken language processing.
📄 My research highlights:
- [Nov 2023] VALL-E produced the AI Audiobook of Impromptu: Amplifying Our Humanity Through AI with an “AI Reid” voice.
- [Apr 2023] VALL-E wins the UNESCO Netexplo Innovation Award 2023 (top 10 out of over 3000 innovations of the year).
- [Apr 2023] BEATs is accepted by ICML 2023 as an oral paper.
- [Mar 2023] VALL-E X, a cross-lingual version of VALL-E that can help anyone speak a foreign language in their own voice without an accent. See https://aka.ms/vallex for demos.
- [Jan 2023] VALL-E, a language modeling approach for text to speech synthesis, achieves state-of-the-art zero-shot TTS performance and shows emergent in-context learning capabilities. See https://aka.ms/valle for demos.
- [Dec 2022] BEATs, a discrete label prediction based audio pre-training framework, ranks 1st in the AudioSet, Balanced AudioSet and ESC-50 leaderboards. We released the code and pre-trained models.
- [Nov 2022] WavLM is now available on TorchAudio.
- [Sep 2022] SpeechLM, a text-enhanced speech pre-training model, achieves a 16% relative WER reduction over data2vec with only 10K text sentences on the LibriSpeech speech recognition benchmark. We released the code and pre-trained models.
- [Sep 2022] WavLM is published in IEEE Journal of Selected Topics in Signal Processing.
- [Jan 2022] WavLM ranks 1st in the VoxSRC 2021 speaker verification permanent leaderboard.
- [Dec 2021] WavLM demo of speaker verification is on Huggingface.
- [Nov 2021] WavLM code and pre-trained models are released.
- [Oct 2021] WavLM ranks 1st in the SUPERB leaderboard.
- [Oct 2021] WavLM, a large-scale self-supervised pre-training framework for full-stack speech processing, achieves state-of-the-art performance on 19 tasks, including all the 15 tasks on SUPERB benchmark, VoxCeleb1 speaker verification benchmark, LibriCSS speech separation benchmark, CALLHOME speech diarization benchmark and LibriSpeech speech recognition benchmark.
- [Oct 2021] Ultra fast continuous speech separation model is shipped in the Microsoft Conversation Transcription Service.
- [Dec 2020] Our continuous speech separation model is shipped in the Microsoft Conversation Transcription Service.
- [Oct 2020] Microsoft speaker diarization system with conformer-based continuous speech separation ranks 1st in the VoxCeleb Speaker Recognition Challenge 2020.
- [Aug 2020] Continuous speech separation with conformer achieves state-of-the-art performance on the LibriCSS speech separation benchmark. We released the code and pre-trained models.
- [Apr 2020] RecAdam, my first first-author paper, achieves state-of-the-art performance on the GLUE benchmark. We released the code.

css_with_conformer's Issues

Why didn't you compare SI-SNR and SDR?

I used the Conformer structure for single-channel separation, but the results were not ideal.
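
For readers unfamiliar with the metric named in this issue, here is a minimal sketch of the commonly used scale-invariant SNR (SI-SNR) definition: both signals are made zero-mean, the estimate is projected onto the reference, and the ratio of projected energy to residual energy is reported in dB. This is not code from the repository. The function name `si_snr` and the `eps` constant are assumptions made for illustration, and plain Python lists stand in for audio arrays.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals.

    Hypothetical helper for illustration only; not part of css_with_conformer.
    """
    n = len(reference)
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    # Remove the DC offset from both signals.
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    # Everything that is not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10 * math.log10((target_energy + eps) / (noise_energy + eps))


reference = [0.0, 1.0, 0.0, -1.0]
estimate = [0.1, 0.9, -0.1, -1.1]
score = si_snr(estimate, reference)
# Rescaling the estimate leaves the score (almost) unchanged.
rescaled_score = si_snr([3 * x for x in estimate], reference)
```

Because of the projection step, a separation output that is correct up to a gain factor is not penalized, which is why this metric is often reported alongside or instead of plain SDR.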

How to train without pretrained model?

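How can I perform separation without using the pre-trained model? And how can I add evaluation metrics to your code? Looking forward to your reply, thank you!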

Problem when testing the model on audio

Your model raises an error on my test audio:
raise RuntimeError("Got 2D (single channel) input and can "
RuntimeError: Got 2D (single channel) input and can not extract spatial features
How should I resolve this?
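
The message suggests that a code path expecting multi-channel input received single-channel audio, since spatial features such as inter-channel phase differences (IPD) need at least two microphones. The sketch below only illustrates that constraint and is not the repository's feature extractor. The names `ndim` and `inter_channel_phase_difference` and the `[channel][frame][bin]` layout are assumptions, with nested Python lists of complex numbers standing in for STFT tensors.

```python
import cmath


def ndim(x):
    """Nesting depth of a rectangular nested list (hypothetical helper)."""
    depth = 0
    while isinstance(x, list) and x:
        depth += 1
        x = x[0]
    return depth


def inter_channel_phase_difference(stft, ref_channel=0):
    """IPD of every non-reference channel relative to `ref_channel`.

    `stft` is assumed to be laid out as [channel][frame][bin] with complex
    entries. A 2D input ([frame][bin]) is a single channel, so there is
    nothing to compare against and the function refuses to continue.
    """
    dims = ndim(stft)
    if dims == 2:
        raise ValueError("single-channel input: no spatial features available")
    if dims != 3:
        raise ValueError("expected [channel][frame][bin], got %d dimensions" % dims)
    ref = stft[ref_channel]
    ipd = []
    for c, channel in enumerate(stft):
        if c == ref_channel:
            continue
        # phase(v * conj(r)) equals phase(v) - phase(r), wrapped to (-pi, pi].
        ipd.append([
            [cmath.phase(v * r.conjugate()) for v, r in zip(frame, ref_frame)]
            for frame, ref_frame in zip(channel, ref)
        ])
    return ipd


# Two channels, one frame, two frequency bins.
two_channel = [[[1 + 0j, 1j]], [[1j, -1 + 0j]]]
ipd = inter_channel_phase_difference(two_channel)
```

Under these assumptions, the practical options are to supply a multi-channel recording or to use a model and configuration intended for single-channel audio.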

403 forbidden when downloading the pre-trained model & dataset

Hi @Sanyuan-Chen,
I get a 403 response when downloading the pre-trained model & dataset. What's wrong with the cloud server?

Model Training (dropout, batchsize, STFT?)

Thanks for sharing the code. I have some questions about model training.

What is the batch size during training? Is it 1, with gradients accumulated over every 4 samples?
Is dropout disabled during training? In "Investigation of Practical Aspects of Single Channel Speech Separation for ASR", dropout is not used.
How long does it take to train the model?
What is the STFT configuration? I think the pre-trained model uses a 512-point STFT with half overlap, which differs slightly from the setup quoted below, and "log" is not applied to the spectrogram. Is that correct?

"The 25 ms frame size with the frame shift of 10 ms is used for feature generation. A 512-point FFT size and hamming window are used in (i)STFT, forming the 257-dimensional masks and spectrum. The log spectrogram with utterance-wise mean variance normalization is extracted as the input feature for all the separation models."

Best regards, Ludvig J.
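
To make the difference between the two STFT setups in this issue concrete, here is a small arithmetic sketch. It assumes 16 kHz audio, which the quoted passage does not state, and an STFT without padding. The helper names `ms_to_samples`, `stft_layout` and `log_mvn` are hypothetical, and normalizing per frequency bin over the whole utterance is one possible reading of "utterance-wise mean variance normalization", not a confirmed detail of the released model.

```python
import math

SAMPLE_RATE = 16000  # assumption: 16 kHz audio, not stated in the quoted passage


def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a sample count."""
    return int(round(ms * sample_rate / 1000))


def stft_layout(num_samples, win_len, hop_len, n_fft=512):
    """Return (num_frames, num_bins) for an STFT without padding."""
    if win_len > n_fft:
        raise ValueError("window must fit inside the FFT size")
    num_bins = n_fft // 2 + 1  # one-sided spectrum
    if num_samples < win_len:
        return 0, num_bins
    return 1 + (num_samples - win_len) // hop_len, num_bins


def log_mvn(magnitudes, eps=1e-8):
    """Log-magnitude spectrogram ([frame][bin]) normalized per bin.

    Assumption: mean and variance are taken over all frames of the utterance,
    separately for each frequency bin.
    """
    log_spec = [[math.log(m + eps) for m in frame] for frame in magnitudes]
    num_frames, num_bins = len(log_spec), len(log_spec[0])
    out = [[0.0] * num_bins for _ in range(num_frames)]
    for b in range(num_bins):
        column = [frame[b] for frame in log_spec]
        mean = sum(column) / num_frames
        var = sum((v - mean) ** 2 for v in column) / num_frames
        std = math.sqrt(var) + eps
        for t in range(num_frames):
            out[t][b] = (column[t] - mean) / std
    return out


one_second = SAMPLE_RATE
# Setup quoted from the paper: 25 ms window, 10 ms shift, 512-point FFT.
paper_setup = stft_layout(one_second, ms_to_samples(25), ms_to_samples(10))
# Setup guessed in the issue: 512-sample window with half overlap.
half_overlap_setup = stft_layout(one_second, 512, 256)
normalized = log_mvn([[1.0, 2.0], [math.e, 2.0]])
```

Both setups produce 257 frequency bins because the FFT size is the same, but the frame rates differ, so features computed with one setup would not line up frame for frame with a model trained on the other.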

Loss function (MSE or RMSE) & the scale of the loss

When training the conformer, did you use PIT MSE or RMSE?

To compute the MSE, is it correct to use nn.MSELoss in PyTorch? By default it divides the loss by the total number of elements. Should I set the reduction to sum?

I think the scale of the loss may influence the training process (https://stats.stackexchange.com/questions/346299/whats-the-effect-of-scaling-a-loss-function-in-deep-learning), so could you please provide details of how MSE is computed?

Looking forward to your reply!
Thanks for your help @Sanyuan-Chen
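
As a reference point for the question above, here is a minimal sketch of permutation invariant training (PIT) with an MSE criterion, showing how the "mean" and "sum" reductions differ only by a constant factor. It does not claim to reproduce how the authors computed their loss. The functions `mse` and `pit_mse` are hypothetical pure-Python stand-ins, and the `reduction` argument mirrors the `'mean'` and `'sum'` options of PyTorch's `nn.MSELoss`.

```python
from itertools import permutations


def mse(estimate, reference, reduction="mean"):
    """Squared error between two equal-length sequences."""
    squared = [(e - r) ** 2 for e, r in zip(estimate, reference)]
    total = sum(squared)
    if reduction == "sum":
        return total
    return total / len(squared)


def pit_mse(estimates, references, reduction="mean"):
    """Return (loss, permutation) for the best source-to-reference assignment.

    permutation[i] is the index of the reference matched to estimates[i].
    The loss is averaged over sources.
    """
    best = None
    for perm in permutations(range(len(references))):
        loss = sum(
            mse(estimates[i], references[p], reduction)
            for i, p in enumerate(perm)
        ) / len(perm)
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best


# Two estimated sources whose order is swapped relative to the references.
estimates = [[1.0, 2.0], [3.0, 4.0]]
references = [[3.0, 4.0], [1.0, 2.5]]

mean_loss, mean_perm = pit_mse(estimates, references, reduction="mean")
sum_loss, sum_perm = pit_mse(estimates, references, reduction="sum")
```

The two reductions select the same permutation, and the "sum" loss is the "mean" loss multiplied by the number of elements per source. With plain SGD, such a constant factor has the same effect as rescaling the learning rate, so the reduction and the learning rate need to be chosen together.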