ai4bharat / indic-tts

Text-to-Speech for languages of India

License: MIT License

AI4Bharat Indic-TTS

Towards Building Text-To-Speech Systems for the Next Billion Users

🎉 Accepted at ICASSP 2023

Deep learning based text-to-speech (TTS) systems have been evolving rapidly with advances in model architectures, training methodologies, and generalization across speakers and languages. However, these advances have not been thoroughly investigated for Indian language speech synthesis. Such investigation is computationally expensive given the number and diversity of Indian languages, relatively lower resource availability, and the diverse set of advances in neural TTS that remain untested. In this paper, we evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages. Based on this, we identify monolingual models with FastPitch and HiFi-GAN V1, trained jointly on male and female speakers to perform the best. With this setup, we train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores. We open-source all models on the Bhashini platform.

TL;DR: We open-source SOTA Text-To-Speech models for 13 Indian languages: Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Odia, Rajasthani, Tamil and Telugu.

Authors: Gokul Karthik Kumar*, Praveen S V*, Pratyush Kumar, Mitesh M. Khapra, Karthik Nandakumar

[ArXiv Preprint] [Audio Samples] [Try It Live] [Video]

Unified architecture of our TTS system

Results

Setup:

Environment Setup:

# 1. Create environment
sudo apt-get install libsndfile1-dev ffmpeg enchant
conda create -n tts-env
conda activate tts-env

# 2. Setup PyTorch
pip3 install -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

# 3. Setup Trainer
git clone https://github.com/gokulkarthik/Trainer 

cd Trainer
pip3 install -e .[all]
cd ..
# [or] patch an existing local Trainer installation instead:
#   copy Trainer/trainer/logging/wandb_logger.py into the local Trainer installation (fixed wandb logger)
#   copy Trainer/trainer/trainer.py into the local Trainer installation (fixed model.module.test_log and added code to log the epoch)
#   add `gpus = [str(gpu) for gpu in gpus]` at line 53 of trainer/distribute.py

# 4. Setup TTS
git clone https://github.com/gokulkarthik/TTS 

cd TTS
pip3 install -e .[all]
cd ..
# [or] patch an existing local TTS installation instead:
#   copy TTS/TTS/bin/synthesize.py into the local TTS installation (added multiple output support for TTS.bin.synthesize)

# 5. Install other requirements
pip3 install -r requirements.txt

Data Setup:

Format IndicTTS dataset in LJSpeech format using preprocessing/FormatDatasets.ipynb
Analyze IndicTTS dataset to check TTS suitability using preprocessing/AnalyzeDataset.ipynb
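
For readers unfamiliar with the target layout, the LJSpeech format is a `wavs/` directory of audio clips plus a pipe-separated `metadata.csv` with three columns: clip ID, transcription, and normalized transcription. The sketch below is illustrative only and is not the code in preprocessing/FormatDatasets.ipynb; the function name `write_ljspeech_metadata` and the choice to reuse the raw text as the normalized text are assumptions.

```python
from pathlib import Path


def write_ljspeech_metadata(utterances, dataset_dir):
    """Write an LJSpeech-style metadata.csv (hypothetical helper).

    `utterances` is an iterable of (wav_id, text) pairs. The matching audio
    is expected at <dataset_dir>/wavs/<wav_id>.wav.
    """
    dataset_dir = Path(dataset_dir)
    (dataset_dir / "wavs").mkdir(parents=True, exist_ok=True)
    metadata_path = dataset_dir / "metadata.csv"

    with metadata_path.open("w", encoding="utf-8") as f:
        for wav_id, text in utterances:
            # Collapse runs of whitespace so each record stays on one line.
            clean = " ".join(text.split())
            if "|" in clean:
                raise ValueError(f"'|' is the column separator: {wav_id}")
            # Assumption: no separate normalization step, so the text is
            # written to both the raw and the normalized column.
            f.write(f"{wav_id}|{clean}|{clean}\n")
    return metadata_path
```

Calling `write_ljspeech_metadata([("utt_0001", "नमस्ते दुनिया")], "data/hi")` would produce a one-line `data/hi/metadata.csv`; the audio files themselves still have to be copied into `data/hi/wavs/`.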

Training Steps:

Set the configuration with main.py, vocoder.py, configs and run.sh. Make sure to update the CUDA_VISIBLE_DEVICES in all these files.
Train and test by executing sh run.sh

Inference:

Trained model weight and config files can be downloaded at this link.

python3 -m TTS.bin.synthesize --text <TEXT> \
    --model_path <LANG>/fastpitch/best_model.pth \
    --config_path <LANG>/config.json \
    --vocoder_path <LANG>/hifigan/best_model.pth \
    --vocoder_config_path <LANG>/hifigan/config.json \
    --out_path <OUT_PATH>
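
When synthesizing for several languages or texts, it can help to assemble the command above programmatically. The following is a minimal sketch that only builds the argument list; the helper name `build_synthesize_command` is hypothetical, and the file layout is taken from the placeholders in the command above, so adjust the paths if your download is organized differently.

```python
import shlex


def build_synthesize_command(lang_dir, text, out_path):
    """Return the TTS.bin.synthesize invocation as an argument list.

    `lang_dir` is the directory holding the downloaded model for one
    language (the <LANG> placeholder in the command above).
    """
    return [
        "python3", "-m", "TTS.bin.synthesize",
        "--text", text,
        "--model_path", f"{lang_dir}/fastpitch/best_model.pth",
        "--config_path", f"{lang_dir}/config.json",
        "--vocoder_path", f"{lang_dir}/hifigan/best_model.pth",
        "--vocoder_config_path", f"{lang_dir}/hifigan/config.json",
        "--out_path", out_path,
    ]


if __name__ == "__main__":
    cmd = build_synthesize_command("hi", "नमस्ते", "speech.wav")
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` runs the synthesis without going through a shell, which avoids quoting problems with non-ASCII or multi-word text.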

Code Reference: https://github.com/coqui-ai/TTS

indic-tts's Issues

cannot find str2bool module used in main.py

I installed this project with Python 3.10, but when I try to run it I cannot find the str2bool library that main.py uses.
Please help me find the reference.
Thanks!

Voice cloning

Hi team. I would like to understand if this repo supports voice cloning. The difference between reference_wav/ speaker_wav variables is unclear. Similarly for reference_speaker_idx and many other variables. Can you please update the README and also let me know how to do voice cloning by providing a wav file?

The installation guide is really hard to follow

How to change pitch and how to list all the available speaker IDs

For the "hi" model, can you please give an example of how to change the pitch and how to list all the available speakers?

Fine-Tuning Guide to Add a New Speaker

Hi,
Really love the work done in this repo, it has been really helpful. Just a request, could you please add more documentation regarding fine-tuning the models for a new voice, using available model checkpoints. It is not very clear about how to fine-tune the model on a new dataset.

Thanks in advance!

Regards,
Harsh

SSML is not honored

Tested with : https://models.ai4bharat.org/#/tts
Test sample : డైనమిక్ ప్రోగ్రామింగ్ అంటే, ఏదైనా కార్యాచరణలో ఒక సమస్యను పరిష్కరించడానికి అద్భుత సమాధానాలు అందిస్తే, అది సాధారణంగా ప్రతి ప్రకారంలో ఉన్న సమస్యలను చేర్చడానికి ఉపయోగించబడుతుంది. ఇది వివరించిన చిన్న ప్రమాణంలో పరిగణించబడిన సమస్యల మీద మనకు సమాధానం అందిస్తుంది. డైనమిక్ ప్రోగ్రామింగ్ ఒక అన్ని సమస్యల పరిష్కరణ పద్ధతిగా చేసే ఒక ఆలోచనా పద్ధతి కాదు, ఇది సమస్యల పరిష్కరణకు విశేషంగా అనుకూలంగా ఉంటుంది.
Need SSML support to improve speech synthesis.

Installation guide and ResolutionImpossible issues

The libraries currently specified in the requirements files give ResolutionImpossible errors. Please provide the versions of all libraries, and the Python version, with which this project works. Please also provide an installation guide in case there are any key steps.

Here are a few error logs; all libraries are installed:
TTS/tts/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/private/var/folders/jh/c851j_516js14fxxtsxk83440000gr/T/pip-build-env-op_ca15e/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/private/var/folders/jh/c851j_516js14fxxtsxk83440000gr/T/pip-build-env-op_ca15e/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
self.run_setup()
File "/private/var/folders/jh/c851j_516js14fxxtsxk83440000gr/T/pip-build-env-op_ca15e/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 487, in run_setup
super(_BuildMetaLegacyBackend,
File "/private/var/folders/jh/c851j_516js14fxxtsxk83440000gr/T/pip-build-env-op_ca15e/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "<string>", line 22, in <module>
ModuleNotFoundError: No module named 'Cython'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Error when loading en+hi model

I am getting this error when trying to load the newly uploaded en+hi model

Traceback (most recent call last):
File "main.py", line 35, in <module>
models[lang] = Synthesizer(
File "Indic-TTS/env/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 91, in __init__
self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
File "Indic-TTS/env/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 190, in _load_tts
self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
File "Indic-TTS/env/lib/python3.8/site-packages/TTS/tts/models/forward_tts.py", line 828, in load_checkpoint
self.load_state_dict(state["model"])
File "Indic-TTS/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ForwardTTS:
size mismatch for emb_g.weight: copying a param with shape torch.Size([4, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).

Any help would be appreciated. @GokulNC @ashwin-014

Add Hugging Face & Colab links

Please add an inference demo on Hugging Face.
Please also provide a Colab template for training and adding a new speaker.

GPU error

When I run sample.py on Google Colab with a T4 GPU, the model loads onto the GPU correctly, but inference with inference_from_text fails with the error below:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

I have tried many ways to put both the model and the input text tensor on the same device, but I keep getting the same error. It worked fine the first few times I used it. Please help, @GokulNC.

Cannot use it for inference

Can you provide python code which can take in model and config paths and perform inference?

speaker id required

Hi,

I am trying to run en model

python -m TTS.bin.synthesize --text "Hi, how are you?" --model_path en/fastpitch/best_model.pth --config_path en/fastpitch/config.json --vocoder_path en/hifigan/best_model.pth --vocoder_config_path en/hifigan/config.json --out_path speech.wav

However, I am getting an error

[!] Look like you use a multi-speaker model. You need to define either a speaker_name or a speaker_wav to use a multi-speaker model.
I tried using --speaker_id with 0 and 1.

Regarding Speakers.pth file

What should I do if I want to use a different speakers.pth file with FastPitch (fine-tuned on Hindi)? Can someone explain the speakers.pth file? I want to use a celebrity voice for synthesis. My idea was to fine-tune Coqui TTS on the celebrity voice and then use the resulting speakers.pth in Indic-TTS, but I ran into an error. I need help.

Uploaded checkpoints and demos

Hi @GokulNC , thanks for the repo, really useful work.

I got the Indian English model working locally, but it sounds very different (lower quality) compared to the demo API.

Are the uploaded models different from the ones used for demo?

GPU usage during synthesis

Hi,

Is a GPU required for inference if we run your models on our local system?

Model Loading & Inference Time

Hello Team,

Recently, I was able to set up IndicTTS on our A100 GPU instance. I observed that model loading takes ~12 minutes with the flag use_cuda=True, which is very long.

When I disable the GPU with use_cuda=False, the model loads very fast (1592.0169 ms), but inference time is very high.

It looks like I am missing something. Can anyone help me find out what I am doing wrong and how to fix this timing issue?

Thanks

Hi,
The below two links are throwing 404. please fix. Is there any other documentation for training new speakers? Tx in advance..
Data Setup:
Format IndicTTS dataset in LJSpeech format using preprocessing/FormatDatasets.ipynb
Analyze IndicTTS dataset to check TTS suitability using preprocessing/AnalyzeDataset.ipynb

Discarding Numerical values

The model discards numerical values, whether they are written in the model's language or as digits.

lack code

When I train with this code, the data-loading step is:

    # load data
    train_samples, eval_samples = load_tts_samples(
        dataset_config,
        eval_split=True,
        # eval_split_size=config.eval_split_size,
        formatter=formatter_indictts
    )
    train_samples = filter_speaker(train_samples, args.speaker)
    eval_samples = filter_speaker(eval_samples, args.speaker)

I can't find the declaration of load_tts_samples.
How do I deal with this problem? Thanks!

base64 embed into audio element

Inside index.html in (front_end/index.html)

You are using this code:

    let arrayString = 'data:audio/wav;base64,' + response["audio"][0]["audioContent"]
    console.log(arrayString)
    // let arrayBuffer = atob(arrayString)
    // console.log(arrayBuffer)
    // const blob = new Blob([arrayBuffer], { type: "audio/wav; codecs=MS_PCM" });
    // const url = window.URL.createObjectURL(blob);
    audioElement = document.getElementById("audio-output")
    audioElement.src = arrayString;
    audioElement.style.display = 'block'

Why do you assign arrayString directly to the src of audioElement? I tried the same in vanilla JS and it does not work for me:

    generateButton.addEventListener("click", async () => {
        const source = sourceRef.value
        const base64 = await fetchAudio(source)
        // console.log(base64)
        let arrayString = "data:audio/wav;base64," + base64
        audioElement.src = arrayString

        // let arrayBuffer = atob(arrayString)
        // console.log(arrayBuffer + "This is buffer")
        // const blob = new Blob([arrayBuffer], { type: "audio/wav; codecs=MS_PCM" });
        // const url = window.URL.createObjectURL(blob)
        // audioElement.src = url
    })