Giter VIP home page Giter VIP logo

comospeech's Introduction

COMOSPEECH

Implementation of the CoMospeech. For all details check out our paper accepted to ACM MM 2023: CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model.

Authors: Zhen Ye, Wei Xue, Xu Tan, Jie Chen, Qifeng Liu, Yike Guo.

Update

2023-12-01

  • We also propose a well-designed Singing Voice Conversion (SVC) version based on consistency model (Code).

2023-11-30

  • We find that zero-mean Gaussian noise instead of the prior in grad-tts can also achieve similar performance. We alse release the new code and checkpoints.

2023-10-21

  • We add Heun’s 2nd order method support for teacher model (can be used for teacher model sampling and better ODE trajectory for consistency distillation).

Abstract

Demo page: link.

Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a Consistency Model-based Speech synthesis method, CoMoSpeech, which achieve speech synthesis through a single diffusion sampling step while achieving high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performances in the distilled CoMoSpeech. Our experiments show that by generating audio recordings by a single sampling step, the CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step sampling based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines.

Prepare

Build monotonic_align code (Cython):

cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..

Inference

Run script inference.py by providing path to the text file, path to the checkpoint, number of sampling :

    python inference.py -f <text file> -c <checkpoint> -t <sampling steps> 

Check out folder called out for generated audios. Note that in params file. Teacher = True is for our teacher model, False is for our ComoSpeech. In addition, we use the same vocoder in Grad-TTS. You can download it and put into checkpts folder.

Training

We use LJSpeech datasets and follow the train/test/val split in fastspeech2, you can change the split in fs2_txt folder. Then run script train.py ,

    python train.py 

Note that in params file. Teacher = True is for our teacher model, False is for our ComoSpeech. While training Comospeech, teacher checkpoint directory should be provide.

Checkpoints trained on LJSpeech can be download from here.

Acknowledgement

I would like to extend a special thanks to authors of Grad-TTS, since our code base is mainly borrowed from Grad-TTS.

Contact

You are welcome to send pull requests or share some ideas with me. Contact information: Zhen YE ( [email protected] )

comospeech's People

Contributors

zhenye234 avatar eschmidbauer avatar grace9994 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.