indic-tts's Introduction

AI4Bharat Indic-TTS

Towards Building Text-To-Speech Systems for the Next Billion Users

🎉 Accepted at ICASSP 2023

Deep learning based text-to-speech (TTS) systems have been evolving rapidly with advances in model architectures, training methodologies, and generalization across speakers and languages. However, these advances have not been thoroughly investigated for Indian language speech synthesis. Such investigation is computationally expensive given the number and diversity of Indian languages, relatively lower resource availability, and the diverse set of advances in neural TTS that remain untested. In this paper, we evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages. Based on this, we identify monolingual models with FastPitch and HiFi-GAN V1, trained jointly on male and female speakers to perform the best. With this setup, we train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores. We open-source all models on the Bhashini platform.

TL;DR: We open-source SOTA Text-To-Speech models for 13 Indian languages: Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Odia, Rajasthani, Tamil and Telugu.

Authors: Gokul Karthik Kumar*, Praveen S V*, Pratyush Kumar, Mitesh M. Khapra, Karthik Nandakumar

[ArXiv Preprint] [Audio Samples] [Try It Live] [Video]

Unified architecture of our TTS system

Results

Setup:

Environment Setup:

# 1. Create environment
sudo apt-get install libsndfile1-dev ffmpeg enchant
conda create -n tts-env
conda activate tts-env

# 2. Setup PyTorch
pip3 install -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

# 3. Setup Trainer
git clone https://github.com/gokulkarthik/Trainer 

cd Trainer
pip3 install -e .[all]
cd ..
# [or] patch an existing local Trainer installation instead:
#   - copy Trainer/trainer/logging/wandb_logger.py into the local Trainer installation (fixed wandb logger)
#   - copy Trainer/trainer/trainer.py into the local Trainer installation (fixed model.module.test_log and added code to log the epoch)
#   - add `gpus = [str(gpu) for gpu in gpus]` at line 53 of trainer/distribute.py

# 4. Setup TTS
git clone https://github.com/gokulkarthik/TTS 

cd TTS
pip3 install -e .[all]
cd ..
# [or] patch an existing local TTS installation instead:
#   - copy TTS/TTS/bin/synthesize.py into the local TTS installation (added multiple-output support to TTS.bin.synthesize)

# 5. Install other requirements
pip3 install -r requirements.txt

Data Setup:

  1. Format IndicTTS dataset in LJSpeech format using preprocessing/FormatDatasets.ipynb
  2. Analyze IndicTTS dataset to check TTS suitability using preprocessing/AnalyzeDataset.ipynb
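
For orientation, the LJSpeech layout is a `wavs/` directory plus a `metadata.csv` file whose lines have the form `file_id|transcript|normalized transcript`. The sketch below writes that layout from a list of (wav path, transcript) pairs. The function name and the input format are assumptions made for illustration; the actual logic in preprocessing/FormatDatasets.ipynb is not shown here and may differ.

```python
import shutil
from pathlib import Path


def write_ljspeech_dataset(samples, out_dir):
    """Copy audio into out_dir/wavs and write out_dir/metadata.csv.

    `samples` is a hypothetical list of (wav_path, transcript) pairs.
    """
    out_dir = Path(out_dir)
    wav_dir = out_dir / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)

    with open(out_dir / "metadata.csv", "w", encoding="utf-8") as f:
        for wav_path, transcript in samples:
            wav_path = Path(wav_path)
            # Collapse newlines and runs of spaces so each sample stays on one line.
            text = " ".join(transcript.split())
            if "|" in text:
                raise ValueError(f"transcript for {wav_path.name} contains the '|' delimiter")
            shutil.copy(wav_path, wav_dir / wav_path.name)
            # Columns: file id (no extension), raw text, normalized text.
            f.write(f"{wav_path.stem}|{text}|{text}\n")
```

The raw transcript is reused as the normalized column here because no normalization rules are given in this README.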

Training Steps:

  1. Set the configuration with main.py, vocoder.py, configs and run.sh. Make sure to update the CUDA_VISIBLE_DEVICES in all these files.
  2. Train and test by executing sh run.sh
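
`CUDA_VISIBLE_DEVICES` is a comma-separated string of GPU indices, which is presumably why the setup notes convert the GPU ids to strings before they are joined. A minimal sketch of that idea follows; `visible_devices_env` is a hypothetical helper, not part of this repository or of Trainer.

```python
import os


def visible_devices_env(gpus):
    """Return a copy of the environment restricted to the given GPU ids."""
    # str.join only accepts strings, so integer ids must be converted first.
    gpus = [str(gpu) for gpu in gpus]
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(gpus)
    return env


env = visible_devices_env([0, 1])
print(env["CUDA_VISIBLE_DEVICES"])  # prints "0,1"
```

The returned mapping can be passed as `env=` to `subprocess.run(["sh", "run.sh"], env=env)` if the training script is launched from Python.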

Inference:

Trained model weights and config files can be downloaded at this link.

python3 -m TTS.bin.synthesize --text <TEXT> \
    --model_path <LANG>/fastpitch/best_model.pth \
    --config_path <LANG>/config.json \
    --vocoder_path <LANG>/hifigan/best_model.pth \
    --vocoder_config_path <LANG>/hifigan/config.json \
    --out_path <OUT_PATH>
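
When synthesizing for several languages from a script, the same command can be assembled in Python. This is only a sketch: the helper name is hypothetical, the `hi` directory is an assumed example of a downloaded `<LANG>` folder, and the paths mirror the command above (one issue report below uses `<LANG>/fastpitch/config.json` instead, so adjust to match your download).

```python
import sys


def synthesize_command(text, lang_dir, out_path):
    """Build the argv list for TTS.bin.synthesize for one language directory."""
    return [
        sys.executable, "-m", "TTS.bin.synthesize",
        "--text", text,
        "--model_path", f"{lang_dir}/fastpitch/best_model.pth",
        "--config_path", f"{lang_dir}/config.json",
        "--vocoder_path", f"{lang_dir}/hifigan/best_model.pth",
        "--vocoder_config_path", f"{lang_dir}/hifigan/config.json",
        "--out_path", out_path,
    ]


# "hi" is an assumed directory name for the Hindi model files.
cmd = synthesize_command("नमस्ते", "hi", "out/hi.wav")
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with non-ASCII or multi-word text.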

Code Reference: https://github.com/coqui-ai/TTS

indic-tts's People

Contributors

ashwin-014, gokulnc, svp19, vibhanshujainiiitr

indic-tts's Issues

cannot find str2bool module used in main.py

I installed this project with Python 3.10, but when I try to run it I cannot find the str2bool library that main.py uses.
Please help me find where it comes from.
Thanks!
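
There appears to be a small third-party package named `str2bool` on PyPI (`pip install str2bool`). If installing it is not an option, a stand-in is easy to write. The sketch below assumes main.py uses `str2bool` as an argparse type converter, which is not confirmed here; the flag name is hypothetical.

```python
import argparse


def str2bool(value):
    """Convert common true/false strings to bool for use as an argparse type."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


parser = argparse.ArgumentParser()
# --use_speaker_embedding is a hypothetical flag used only for illustration.
parser.add_argument("--use_speaker_embedding", type=str2bool, default=False)
args = parser.parse_args(["--use_speaker_embedding", "true"])
print(args.use_speaker_embedding)  # prints True
```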

Voice cloning

Hi team. I would like to understand if this repo supports voice cloning. The difference between reference_wav/ speaker_wav variables is unclear. Similarly for reference_speaker_idx and many other variables. Can you please update the README and also let me know how to do voice cloning by providing a wav file?

Fine-Tuning Guide to Add a New Speaker

Hi,
I really love the work done in this repo; it has been very helpful. One request: could you please add more documentation on fine-tuning the models for a new voice using the available model checkpoints? It is not clear how to fine-tune a model on a new dataset.

Thanks in advance!

Regards,
Harsh

SSML is not honored

Tested with : https://models.ai4bharat.org/#/tts
Test sample : డైనమిక్ ప్రోగ్రామింగ్ అంటే, ఏదైనా కార్యాచరణలో ఒక సమస్యను పరిష్కరించడానికి అద్భుత సమాధానాలు అందిస్తే, అది సాధారణంగా ప్రతి ప్రకారంలో ఉన్న సమస్యలను చేర్చడానికి ఉపయోగించబడుతుంది. ఇది వివరించిన చిన్న ప్రమాణంలో పరిగణించబడిన సమస్యల మీద మనకు సమాధానం అందిస్తుంది. డైనమిక్ ప్రోగ్రామింగ్ ఒక అన్ని సమస్యల పరిష్కరణ పద్ధతిగా చేసే ఒక ఆలోచనా పద్ధతి కాదు, ఇది సమస్యల పరిష్కరణకు విశేషంగా అనుకూలంగా ఉంటుంది.
Need SSML support to improve speech synthesis.

Installation guide and ResolutionImpossible issues

The libraries currently specified in the requirements files produce ResolutionImpossible errors. Please provide the versions of all libraries, and the Python version, on which this project works. Please also provide an installation guide in case there are any key steps.

Here are few error logs, all libraries are installed:
TTS/tts/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/private/var/folders/jh/c851j_516js14fxxtsxk83440000gr/T/pip-build-env-op_ca15e/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/private/var/folders/jh/c851j_516js14fxxtsxk83440000gr/T/pip-build-env-op_ca15e/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
self.run_setup()
File "/private/var/folders/jh/c851j_516js14fxxtsxk83440000gr/T/pip-build-env-op_ca15e/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 487, in run_setup
super(_BuildMetaLegacyBackend,
File "/private/var/folders/jh/c851j_516js14fxxtsxk83440000gr/T/pip-build-env-op_ca15e/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "<string>", line 22, in <module>
ModuleNotFoundError: No module named 'Cython'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Error when loading en+hi model

I am getting this error when trying to load the newly uploaded en+hi model

Traceback (most recent call last):
File "main.py", line 35, in <module>
models[lang] = Synthesizer(
File "Indic-TTS/env/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 91, in __init__
self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
File "Indic-TTS/env/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 190, in _load_tts
self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
File "Indic-TTS/env/lib/python3.8/site-packages/TTS/tts/models/forward_tts.py", line 828, in load_checkpoint
self.load_state_dict(state["model"])
File "Indic-TTS/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ForwardTTS:
size mismatch for emb_g.weight: copying a param with shape torch.Size([4, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).

Any help would be appreciated. @GokulNC @ashwin-014
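
The error reports that `emb_g.weight` has 4 rows in the checkpoint but 2 rows in the model that was just built. If `emb_g` is the speaker-embedding table (an assumption based on its name), the config or speakers file in use describes 2 speakers while the checkpoint was trained with 4. The sketch below lists such mismatches from two name-to-shape mappings; the helper is hypothetical, and the shapes are the ones from the traceback. With PyTorch available, the checkpoint side could be collected as `{k: tuple(v.shape) for k, v in state["model"].items()}` after loading the file with `torch.load(path, map_location="cpu")`.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters whose shapes differ."""
    return {
        name: (ckpt_shape, model_shapes[name])
        for name, ckpt_shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != ckpt_shape
    }


# Shapes taken from the error message above.
checkpoint_shapes = {"emb_g.weight": (4, 512)}
model_shapes = {"emb_g.weight": (2, 512)}

for name, (ckpt, model) in shape_mismatches(checkpoint_shapes, model_shapes).items():
    print(f"{name}: checkpoint {ckpt} vs model {model}")
# prints "emb_g.weight: checkpoint (4, 512) vs model (2, 512)"
```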

GPU error

When I run sample.py on Google Colab with a T4 GPU, the model loads onto the GPU correctly, but inference with inference_from_text fails with the error below:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

I have tried many ways to put both the model and the input text tensor on the same device, but I get the same error every time. It worked fine the first few times I used it. Please help, @GokulNC.

speaker id required

Hi,

I am trying to run en model

python -m TTS.bin.synthesize --text "Hi, how are you?" --model_path en/fastpitch/best_model.pth --config_path en/fastpitch/config.json --vocoder_path en/hifigan/best_model.pth --vocoder_config_path en/hifigan/config.json --out_path speech.wav

However, I am getting an error

[!] Look like you use a multi-speaker model. You need to define either a speaker_name or a speaker_wav to use a multi-speaker model.
I tried using --speaker_id with 0 and 1.
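
Upstream Coqui TTS exposes `--speaker_idx` (plus `--list_speaker_idxs` to print the available speaker names) on `TTS.bin.synthesize`, not `--speaker_id`. The sketch below assumes the fork used here keeps those flags; the speaker name `female` is a placeholder, so list the real names first.

```python
import shlex

# The command from the report above, unchanged.
command = (
    'python -m TTS.bin.synthesize --text "Hi, how are you?" '
    "--model_path en/fastpitch/best_model.pth --config_path en/fastpitch/config.json "
    "--vocoder_path en/hifigan/best_model.pth --vocoder_config_path en/hifigan/config.json "
    "--out_path speech.wav"
)


def add_speaker(command, speaker):
    """Split a shell command into argv and append a speaker selection."""
    return shlex.split(command) + ["--speaker_idx", speaker]


# Step 1: ask the CLI which speaker names the model knows.
list_argv = [
    "python", "-m", "TTS.bin.synthesize",
    "--model_path", "en/fastpitch/best_model.pth",
    "--config_path", "en/fastpitch/config.json",
    "--list_speaker_idxs",
]

# Step 2: synthesize with one of the listed names ("female" is a placeholder).
synth_argv = add_speaker(command, "female")
# To run either step: import subprocess; subprocess.run(argv, check=True)
```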

Regarding Speakers.pth file

What should I do if I want to use a different speakers.pth file with FastPitch (fine-tuned on Hindi)? Can someone explain the speakers.pth file? I want to use a celebrity voice for synthesis. I assumed I should fine-tune Coqui TTS on the celebrity voice and then use the resulting speakers.pth in Indic-TTS, but I ran into an error. Any help would be appreciated.

Uploaded checkpoints and demos

Hi @GokulNC , thanks for the repo, really useful work.

I made the indian english model work locally but it sounds very different (lower quality) compared to the demo API.

Are the uploaded models different from the ones used for demo?

GPU usage during synthesis

Hi,

Is a GPU required for inference if we run your models on our local system?

Model Loading & Inference Time

Hello Team,

Recently, I was able to set up IndicTTS on our A100 GPU instance. I observed that the model loading time is ~12 minutes when I use the flag use_cuda=True, which is very long.

When I disable the GPU with use_cuda=False, the model loads quickly (1592.0169 ms), but inference is very slow.

It looks like I am missing something. Can anyone help me find out what I am doing wrong so I can fix this timing issue?

Thanks

Old Libraries

The libraries listed are old and not compatible with each other. Please update the library list and make the code work with the latest versions, as this is causing a lot of difficulty.

Adding a new Voice

How can I add another voice of my own, rather than using the one that is already provided? Do I need to change the Speakers.pth files?

lack code

When I train with this code, the data-loading step is:

    # load data
    train_samples, eval_samples = load_tts_samples(
        dataset_config,
        eval_split=True,
        #eval_split_size=config.eval_split_size,
        formatter=formatter_indictts
    )
    train_samples = filter_speaker(train_samples, args.speaker)
    eval_samples = filter_speaker(eval_samples, args.speaker)

I cannot find the declaration of load_tts_samples.
How should I deal with this problem? Thanks!
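
In upstream Coqui TTS, `load_tts_samples` is importable from `TTS.tts.datasets`, and its `formatter` argument is a callable that receives the dataset root and metadata file and returns the list of samples. The sketch below is a hypothetical LJSpeech-style formatter, not this repository's `formatter_indictts`. The dictionary keys follow the upstream convention as far as I know (older releases returned lists instead), and the speaker name is a placeholder.

```python
import os

# Placeholder speaker label; a real formatter would derive it from the data.
DEFAULT_SPEAKER = "speaker0"


def formatter_ljspeech_like(root_path, meta_file, **kwargs):
    """Parse `id|text|normalized text` lines into a list of sample dicts."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("|")
            if len(cols) < 2:
                continue  # skip blank or malformed lines
            items.append({
                "text": cols[-1],  # last column holds the normalized text
                "audio_file": os.path.join(root_path, "wavs", cols[0] + ".wav"),
                "speaker_name": DEFAULT_SPEAKER,
                "root_path": root_path,
            })
    return items
```

If the import still cannot be resolved, it is worth checking that the TTS fork from the setup steps is the package installed in the active environment.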

base64 embed into audio element

In front_end/index.html you are using this code:

    let arrayString = 'data:audio/wav;base64,' + response["audio"][0]["audioContent"]
    console.log(arrayString)
    // let arrayBuffer = atob(arrayString)
    // console.log(arrayBuffer)
    // const blob = new Blob([arrayBuffer], { type: "audio/wav; codecs=MS_PCM" });
    // const url = window.URL.createObjectURL(blob);
    audioElement = document.getElementById("audio-output")
    audioElement.src = arrayString;
    audioElement.style.display = 'block'

Why do you assign arrayString directly to the src of audioElement? I tried the same in vanilla JS and it does not work for me:

    generateButton.addEventListener("click", async () => {
        const source = sourceRef.value
        const base64 = await fetchAudio(source)
        // console.log(base64)
        let arrayString = "data:audio/wav;base64," + base64
        audioElement.src = arrayString

        // let arrayBuffer = atob(arrayString)
        // console.log(arrayBuffer + "This is buffer")
        // const blob = new Blob([arrayBuffer], { type: "audio/wav; codecs=MS_PCM" });
        // const url = window.URL.createObjectURL(blob)
        // audioElement.src = url
    })
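
A `data:audio/wav;base64,...` URI can only play if the part after the comma is the plain base64 encoding of a complete WAV file, so one way to narrow this down is to check the payload outside the browser. The Python sketch below does that; the helper names are hypothetical, and it assumes the API returns PCM WAV audio (the standard `wave` module rejects other encodings). Note also that the commented-out `atob(arrayString)` call would receive the `data:` prefix, which is not valid base64.

```python
import base64
import io
import wave


def describe_wav_base64(audio_b64):
    """Decode a base64 payload and report basic WAV properties."""
    # validate=True rejects stray characters such as a leftover "data:" prefix.
    raw = base64.b64decode(audio_b64, validate=True)
    if raw[:4] != b"RIFF" or raw[8:12] != b"WAVE":
        raise ValueError("payload is not a RIFF/WAVE file")
    with wave.open(io.BytesIO(raw), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "seconds": wav.getnframes() / wav.getframerate(),
        }


def to_data_uri(audio_b64):
    """Build the string that the page assigns to the audio element's src."""
    return "data:audio/wav;base64," + audio_b64
```

If `describe_wav_base64` raises, the problem is in what `fetchAudio` returns (for example a JSON object or an already-prefixed string) rather than in the audio element.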
