Giter VIP home page Giter VIP logo

vits-tts-vietnamese's Introduction

Finetuining VITS Text-to-Speech Model in Vietnamese

We use Piper library to finetuning VITS for TTS tasks with different voice in Vietnamese.

We also built a Tornado server to deploy TTS model on microservice with Docker. The server uses ONNX model type to infer lightweight and excellent performance.

Video demo: https://youtu.be/1mAhaP26aQE

Read Project Docs: Paper

How to run this project?

With Docker (highly recommend):

On your terminal, type these commands to build a Docker Image:

docker build  ./ -f .Dockerfile -t vits-tts-vi:v1.0

Then run it with port 5004:

docker run -d -p 5004:8888 vits-tts-vi:v1.0

While the Docker Image was running, you now make a request to use the TTS task via this API on your browser.

http://localhost:5004/?text=Xin chào bạn&speed=normal

The result seems like this:

{
    "hash": "e6bc1768c82ae63ed8ee61ca2349efa4ef9f166e",
    "text": "xin chào bạn",
    "speed": "normal",
    "audio_url": "http://localhost:5004/audio/e6bc1768c82ae63ed8ee61ca2349efa4ef9f166e.wav"
}

The speed has 5 options: normal, fast, low, very_fast, very_slow

Or you can use the Web UI via this URL:

http://localhost:5004/

The repo of this React Front-end: vits-tts-webapp

With normal way:

In the repo folder, type in Terminal:

pip install -r requirements.txt

Then run the server file:

python server.py

Now, you can access the TTS API with port 8888:

http://localhost:8888/?text=Xin chào bạn&speed=normal

The result also seems like this:

{
    "hash": "e6bc1768c82ae63ed8ee61ca2349efa4ef9f166e",
    "text": "xin chào bạn",
    "speed": "normal",
    "audio_url": "http://localhost:5004/audio/e6bc1768c82ae63ed8ee61ca2349efa4ef9f166e.wav"
}

Result

Audio before finetuning voice (unmute to hear):

test_original.mov

Audio AFTER finetuining voice (unmute to hear):

test_finetuning.mov

Evaluation:

### Metrics in test data BEFORE finetuning:
Mean Square Error: (lower is better) 0.044785238588228825
Root Mean Square Error (lower is better): 2.0110066944004297
=============================================
### Metrics in test data AFTER finetuning:
Mean Square Error: (lower is better) 0.043612250527366996
Root Mean Square Error (lower is better): 1.97773962268741

In TTS tasks, the output spectrogram for a given text can be represented in many different ways. So, loss functions like MSE and MAE are just used to encourage the model to minimize the difference between the predicted and target spectrograms. The right way to Evaluate TTS model is to use MOS(mean opinion scores) BUT it is a subjective scoring system and we need human resources to do it. Reference: https://huggingface.co/learn/audio-course/chapter6/evaluation

How do we preprocess data and fine-tuning?

See train_vits.ipynb file in the repo or via this Google Colab:

https://colab.research.google.com/drive/1UK6t_AQUw9YJ_RDFvXUJmWMu-oS23XQs?usp=sharing

Author: Nguyen Thanh Phat - aka phatjk

vits-tts-vietnamese's People

Contributors

phatjkk avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.