Korean TTS Model: what is the best Hangul processing strategy for Korean speech synthesis?

Hangul is a unique script designed primarily for Korean. It is phonetic in principle, like the Latin alphabet, but pronouncing it correctly requires knowing many more pronunciation rules than German or Spanish does. Hangul is syllable-based like Kana, the Japanese script, but unlike Kana, Hangul syllables can be decomposed into their constituent consonants and vowels. Taken together, these properties make Hangul quite readable in practice, but they often puzzle Korean computational linguists. Do I have to convert graphemes into phonemes first? Is it better to decompose Hangul syllables for TTS? Or should I take syllables as they are, without decomposition?

If you know what goes on behind the scenes in Hangul Unicode, you will find that things are even more complicated. There are two Unicode blocks for contemporary Hangul consonants and vowels (called jamo in Korean): Hangul Jamo (0x01100-0x011FF) and Hangul Compatibility Jamo (0x03130-0x0318F). In Hangul Compatibility Jamo, the initial consonant (onset) and the final consonant (coda) share the same code point, whereas in Hangul Jamo they are treated as independent letters. (Figuratively, if you applied the Hangul Jamo system to English, you would have to distinguish the two l's in law and cool.) On the other hand, both blocks treat consonant clusters such as ㄲ and ㄳ as a single letter. Some claim that such clusters should be understood as a sequence of single consonants. Are they right in computational practice? These questions motivate this project.
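To make the difference between the two blocks concrete, here is a minimal sketch, assuming Python 3. It decomposes a precomposed syllable into Hangul Jamo with the standard library's NFD normalization, and into Hangul Compatibility Jamo with the standard syllable index arithmetic. The function names and lookup tables are illustrative and are not taken from this repo's prepro.py.

```python
import unicodedata

# Compatibility Jamo inventories, listed in the order Unicode uses
# when composing precomposed syllables (19 onsets, 21 vowels, 27 codas).
COMPAT_ONSETS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
COMPAT_VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
COMPAT_CODAS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

SYLLABLE_BASE = 0xAC00   # first precomposed syllable, 가
SYLLABLE_COUNT = 11172   # 19 * 21 * 28


def to_hangul_jamo(text):
    """Decompose syllables into Hangul Jamo (onset and coda stay distinct)."""
    return unicodedata.normalize("NFD", text)


def to_compat_jamo(text):
    """Decompose syllables into Compatibility Jamo (onset and coda coincide)."""
    out = []
    for ch in text:
        idx = ord(ch) - SYLLABLE_BASE
        if 0 <= idx < SYLLABLE_COUNT:
            onset, rest = divmod(idx, 21 * 28)
            vowel, coda = divmod(rest, 28)
            out.append(COMPAT_ONSETS[onset] + COMPAT_VOWELS[vowel] + COMPAT_CODAS[coda])
        else:
            out.append(ch)  # leave non-syllable characters untouched
    return "".join(out)


def code_points(s):
    return ["U+%04X" % ord(ch) for ch in s]


print(code_points(to_hangul_jamo("각")))  # ['U+1100', 'U+1161', 'U+11A8']
print(code_points(to_compat_jamo("각")))  # ['U+3131', 'U+314F', 'U+3131']
```

In the first output the onset and coda ㄱ of 각 get different code points; in the second they collapse to the same one.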

I run five different experiments, one for each of the Hangul processing strategies below.

  • Exp.0: Hangul Jamo (0x01100-0x011FF) with consonant clusters. Graphemes are converted into phonemes.
  • Exp.1: Hangul Jamo (0x01100-0x011FF) with consonant clusters.
  • Exp.2: Hangul Compatibility Jamo (0x03130-0x0318F) with consonant clusters.
  • Exp.3: Hangul Jamo (0x01100-0x011FF). Single consonants only.
  • Exp.4: Hangul Compatibility Jamo (0x03130-0x0318F). Single consonants only.
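The "single consonants only" conditions can be pictured with a small sketch, assuming Python 3 and input that has already been decomposed into Hangul Compatibility Jamo. The mapping below is a hypothetical illustration of splitting each cluster letter into its component consonants; the exact table used in this repo is not shown here and may differ.

```python
# Hypothetical table: each Compatibility Jamo cluster letter is mapped to
# the sequence of single consonants it is written with.
CLUSTER_TO_SINGLES = {
    "ㄲ": "ㄱㄱ", "ㄸ": "ㄷㄷ", "ㅃ": "ㅂㅂ", "ㅆ": "ㅅㅅ", "ㅉ": "ㅈㅈ",
    "ㄳ": "ㄱㅅ", "ㄵ": "ㄴㅈ", "ㄶ": "ㄴㅎ", "ㄺ": "ㄹㄱ", "ㄻ": "ㄹㅁ",
    "ㄼ": "ㄹㅂ", "ㄽ": "ㄹㅅ", "ㄾ": "ㄹㅌ", "ㄿ": "ㄹㅍ", "ㅀ": "ㄹㅎ",
    "ㅄ": "ㅂㅅ",
}


def split_clusters(jamo_text):
    """Replace every cluster letter with its single-consonant sequence."""
    return "".join(CLUSTER_TO_SINGLES.get(ch, ch) for ch in jamo_text)


# 값 decomposed into Compatibility Jamo is ㄱ ㅏ ㅄ; the cluster ㅄ is split.
print(split_clusters("ㄱㅏㅄ"))  # ㄱㅏㅂㅅ
```

Splitting shrinks the symbol inventory the model has to learn, at the cost of longer input sequences.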

Requirements

  • python >= 2.7
  • NumPy >= 1.11.1
  • TensorFlow >= 1.3
  • librosa
  • tqdm
  • matplotlib
  • scipy

Data

KSS Dataset, a Korean single speaker speech dataset, is used.

Model

DCTTS, introduced in Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention, is implemented for this project. You can refer to my other repo to see the original implementation. This repo focuses on the comparison among the five experiment conditions.

Training

  • STEP 0. Download KSS Dataset.
  • STEP 1. Adjust num_exp in hyperparams.py.
  • STEP 2. Run python prepro.py for model inputs and targets.
  • STEP 3. Run python train.py 1 for training Text2Mel.
  • STEP 4. Run python train.py 2 for training SSRN.

You can run STEP 3 and STEP 4 at the same time if you have more than one GPU.
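One way to run the two training stages in parallel is sketched below, assuming Python 3 and two GPUs. It pins each stage to its own device through the CUDA_VISIBLE_DEVICES environment variable. The helper names build_jobs and launch are hypothetical and not part of this repo; the commands themselves are the ones from STEP 3 and STEP 4.

```python
import os
import subprocess


def build_jobs(gpu_ids=("0", "1")):
    """Pair each training stage (1: Text2Mel, 2: SSRN) with one GPU id."""
    jobs = []
    for stage, gpu in zip(("1", "2"), gpu_ids):
        # Copy the current environment and restrict the process to one GPU.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
        jobs.append((["python", "train.py", stage], env))
    return jobs


def launch(jobs):
    """Start all jobs concurrently and wait for them; returns exit codes."""
    procs = [subprocess.Popen(cmd, env=env) for cmd, env in jobs]
    return [p.wait() for p in procs]


# Inspect the commands without starting anything.
for cmd, env in build_jobs():
    print(" ".join(cmd), "on GPU", env["CUDA_VISIBLE_DEVICES"])
```

Calling launch(build_jobs()) from the repo root would start both stages; adjust gpu_ids to match the devices on your machine.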

Sample Synthesis

  • Run python synthesize.py and check the files in samples.

Generated Samples

Experiment    Samples
0             400k
1             400k
2             400k
3             400k
4             400k

Pretrained Models

Experiment    Models
0             400k
1             400k
2             400k
3             400k
4             400k

Notes

  • Refer to this, which is provided by Hyungjun So.

Contributors

jaymini-kakaobrain, kyubyong
