
Grad-TTS

This is an end-to-end implementation of the Grad-TTS text-to-speech model and a voice conversion model, both based on diffusion probabilistic modelling.

Abstract

Demo page with voiced abstract: link.

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions while stochastic calculus has provided a unified point of view on these techniques allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with score-based decoder producing mel-spectrograms by gradually transforming noise predicted by encoder and aligned with text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters and allows to make this reconstruction flexible by explicitly controlling trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score.

Installation

Firstly, install all Python package requirements:

pip install -r requirements.txt

Secondly, build monotonic_align code (Cython):

cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..

Note: the code is tested on Python 3.6.9.

Voice Conversion

Official implementation of the paper "Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme" (ICLR 2022, Oral). Link.

Abstract

Demo page with voiced abstract: link.

Voice conversion is a common speech synthesis task which can be solved in different ways depending on a particular real-world scenario. The most challenging one often referred to as one-shot many-to-many voice conversion consists in copying the target voice from only one reference utterance in the most general case when both source and target speakers do not belong to the training dataset. We present a scalable high-quality solution based on diffusion probabilistic modeling and demonstrate its superior quality compared to state-of-the-art one-shot voice conversion approaches. Moreover, focusing on real-time applications, we investigate general principles which can make diffusion models faster while keeping synthesis quality at a high level. As a result, we develop a novel Stochastic Differential Equations solver suitable for various diffusion model types and generative tasks as shown through empirical studies and justify it by theoretical analysis.

Inference with the end-to-end model:

GradTTS setup:

You should create the Bahnar-TTS\logs\bahnar_exp and Bahnar-TTS\checkpts\ directories:

  1. Download Grad-TTS trained on the Bahnar dataset (22kHz) from here and put it under Bahnar-TTS\logs\bahnar_exp
  2. Download HiFi-GAN trained on the Bahnar dataset (22kHz) from here and put it under Bahnar-TTS\checkpts\
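The two steps above can be sketched as follows. This assumes a POSIX shell and uses forward-slash paths (which Python also accepts on Windows); the checkpoint filenames are the ones shown in the directory tree.

```shell
# Create the directories that hold the Grad-TTS and HiFi-GAN checkpoints.
mkdir -p Bahnar-TTS/logs/bahnar_exp   # put grad_1344.pt here
mkdir -p Bahnar-TTS/checkpts          # put hifigan.pt here
```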

After the setup phase, the repo should look like this:

── Bahnar-TTS
    ├── checkpts
    │     └── hifigan.pt
    │
    └── logs
          └── bahnar_exp
                └── grad_1344.pt

Voice Conversion setup:

You should create the Bahnar-TTS\checkpts_vc directory and, under it, three sub-directories:

  • Bahnar-TTS\checkpts_vc\spk_encoder
  • Bahnar-TTS\checkpts_vc\vc
  • Bahnar-TTS\checkpts_vc\vocoder

  1. Download the pretrained voice conversion model (22kHz) from here and put it under Bahnar-TTS\checkpts_vc\vc
  2. Download the pretrained vocoder (22kHz) and its config from here and put them under Bahnar-TTS\checkpts_vc\vocoder
  3. Download the pretrained speaker encoder (22kHz) from here and put it under Bahnar-TTS\checkpts_vc\spk_encoder
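The voice conversion layout above can be sketched the same way, again assuming a POSIX shell with forward-slash paths; the comments note which downloaded files belong in each sub-directory.

```shell
# Create the three checkpoint sub-directories for voice conversion.
mkdir -p Bahnar-TTS/checkpts_vc/spk_encoder   # put pretrained.pt here
mkdir -p Bahnar-TTS/checkpts_vc/vc            # put vc_vctk_wodyn.pt here
mkdir -p Bahnar-TTS/checkpts_vc/vocoder       # put generator and config.json here
```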

After the setup phase, the repo should look like this:

── Bahnar-TTS
    └── checkpts_vc
          ├── spk_encoder
          │     └── pretrained.pt
          │
          ├── vc
          │     └── vc_vctk_wodyn.pt
          │
          └── vocoder
                ├── generator
                └── config.json

Data setup:

After putting all the necessary model checkpoints into the checkpts and checkpts_vc folders, you should create your own data source. Create a Bahnar-TTS\document directory to store it.

  1. Create a text file with the sentences you want to synthesize, e.g. Bahnar-TTS\document\text\text.txt
  2. Create a target audio file whose voice you want to convert to, e.g. Bahnar-TTS\document\target_sound\0001.wav

You can download a sample audio file and text file here
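The data source layout can be sketched as follows, assuming a POSIX shell. The sentence written into text.txt is a placeholder; replace it with the text you actually want synthesized, and supply your own reference recording.

```shell
# Create the data-source directories used by the inference examples.
mkdir -p Bahnar-TTS/document/text Bahnar-TTS/document/target_sound

# Placeholder sentence -- replace with the text you want to synthesize.
printf '%s\n' 'Replace this line with a sentence to synthesize.' \
  > Bahnar-TTS/document/text/text.txt

# Copy your reference recording to Bahnar-TTS/document/target_sound/0001.wav
```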

After creating the data source, the repo should look like this:

── Bahnar-TTS
    └── document
          ├── target_sound
          │     └── 0001.wav
          │
          └── text
                └── text.txt

Inference command:

  1. To synthesize text-to-speech only, without voice conversion, use this command:
    python inference.py -f <your-text-file> -c <grad-tts-checkpoint> -t <number-of-timesteps> -s <speaker-id-if-multispeaker>
    For example, you can run:
    python inference.py -f document/text/text.txt -c logs/bahnar_exp/grad_1344.pt
  2. To synthesize text-to-speech with voice conversion, use this command:
    python inference.py -f <your-text-file> -c <grad-tts-checkpoint> -t <number-of-timesteps> -s <speaker-id-if-multispeaker> -vc <target-speaker-for-voice-conversion>
    For example, you can run:
    python inference.py -f document/text/text.txt -c logs/bahnar_exp/grad_1344.pt -vc document/target_sound/0001.wav
  3. After inference, check the folder called out for the generated audio files
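As a small sketch of how the two invocations differ, the helper below assembles the command line and appends -vc only when a target recording is given. build_infer_cmd is a hypothetical wrapper written for illustration; it is not part of the repo.

```shell
# Hypothetical helper (not part of the repo): build the inference command,
# appending -vc only when a voice-conversion target recording is supplied.
build_infer_cmd() {
  text_file=$1; ckpt=$2; vc_target=$3
  cmd="python inference.py -f $text_file -c $ckpt"
  if [ -n "$vc_target" ]; then
    cmd="$cmd -vc $vc_target"
  fi
  printf '%s\n' "$cmd"
}

# Plain TTS, then TTS followed by voice conversion:
build_infer_cmd document/text/text.txt logs/bahnar_exp/grad_1344.pt
build_infer_cmd document/text/text.txt logs/bahnar_exp/grad_1344.pt document/target_sound/0001.wav
```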

Demo result:

You can check the results of the end-to-end Bahnar text-to-speech model and the voice conversion model at this link

The male voice is generated by Grad-TTS, and the female voice is generated by the voice conversion model.

References

  • The HiFi-GAN model is used as the vocoder; official GitHub repository: link
  • The Monotonic Alignment Search algorithm is used for unsupervised duration modelling; official GitHub repository: link
  • Phonemization utilizes CMUdict; official GitHub repository: link
  • Voice conversion model; official GitHub repository: link
