
german-tts's Introduction

👋 Hey, I'm Yusuf!

I'm an AI research engineer from Turkey. 📊 My work usually involves NLProc, automatic speech recognition, and neural text-to-speech. I'm passionate about efficient implementations and green AI as an abolitionist vegan. 🌱

🗞️ Timeline

The timeline below is dynamically updated with the messages I post to a Telegram bot. 🤖


monatis/clip.cpp ggerganov/llama.cpp monatis/stable-diffusion-tf-docker
unum-cloud/usearch damian0815/llama.cpp abetlen/llama-cpp-python

🤙 Some more places where you can find me


german-tts's Issues

Several typos in inference_tflite.py code and missing audio generation

diff inference_tflite.py inference_tflite_corrected.py
89a90
> processor = Processor()
136c137
< mbmelgan_interpreter = tf.lite.Interpreter(model_path=path_tombmelgan[:-6] + "tflite")
---
> mbmelgan_interpreter = tf.lite.Interpreter(model_path=path_to_melgan[:-6] + "tflite")
142c143,148
< audio = inference_tflite("Möchtest du das meiner Frau erklären? Nein? Ich auch nicht.", interpreter, mbmelgan_interpreter)
\ No newline at end of file
---
> start = time.time()
> audio = infer_tflite("Möchtest du das meiner Frau erklären? Nein? Ich auch nicht.", interpreter, mbmelgan_interpreter)
>
> duration = time.time() - start
> print(f"it took {duration} secs")
> wavfile.write("sample.wav", 22050, audio)
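
For reference, the tail of the corrected script after applying the diff above would look roughly like the sketch below; Processor, infer_tflite, interpreter, and path_to_melgan are defined earlier in inference_tflite.py and are only assumed here:

import time

import tensorflow as tf
from scipy.io import wavfile

# instantiate the text processor that the original script never created
processor = Processor()

# load the MB-MelGAN TFLite model (the path variable was misspelled as path_tombmelgan before)
mbmelgan_interpreter = tf.lite.Interpreter(model_path=path_to_melgan[:-6] + "tflite")

# synthesize, measure the elapsed time, and write the waveform to disk at 22.05 kHz
start = time.time()
audio = infer_tflite("Möchtest du das meiner Frau erklären? Nein? Ich auch nicht.",
                     interpreter, mbmelgan_interpreter)
duration = time.time() - start
print(f"it took {duration} secs")
wavfile.write("sample.wav", 22050, audio)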

Background noise in Multi-Band MelGAN

Hi, in your third note you mention that Multi-Band MelGAN causes some background noise. How can I avoid that background noise, and which optimizations do you mean?

Some guidance on training?

Thanks for sharing your training results. I'm trying to reproduce the same process and perhaps do some fine-tuning afterwards. I have now trained Tacotron2 for 94k steps but am still getting poor results. I'm listing what I have done below; would you mind pointing out what I did wrong?

  1. Cloned the TensorFlowTTS repo (after your PR was merged)
  2. Updated one line in tensorflow_tts/processor/thorsten.py to include the special characters: _letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzäüößÄÜÖ" (a quick check for this is sketched after the list)
  3. Updated tensorflow_tts/configs/tacotron2.py to add support for the 'thorsten' dataset: from tensorflow_tts.processor.thorsten import THORSTEN_SYMBOLS as thorsten_symbols ... elif dataset == "thorsten": self.vocab_size = len(thorsten_symbols)
  4. Updated examples/tacotron2/conf/tacotron2.v1.yaml to use the thorsten dataset and set batch_size to 9 for my 3070 GPU: dataset: thorsten ... batch_size: 9
  5. Set up TensorFlow GPU, ran pip install, and ran the commands to preprocess the dataset: tensorflow-tts-normalize --rootdir ./dump_thorsten --outdir ./dump_thorsten --config preprocess/thorsten_preprocess.yaml --dataset thorsten & tensorflow-tts-normalize --rootdir ./dump_thorsten --outdir ./dump_thorsten --config preprocess/thorsten_preprocess.yaml --dataset thorsten
  6. Trained Tacotron2 with command: CUDA_VISIBLE_DEVICES=0 python examples/tacotron2/train_tacotron2.py --train-dir ./dump_thorsten/train/ --dev-dir ./dump_thorsten/valid/ --outdir ./output/tacotron2/v1.thorsten/ --config ./examples/tacotron2/conf/tacotron2.v1.thorsten.yaml --use-norm 1 --mixed_precision 0
  7. Verified result with the 'german-tts-inference' notebook you shared, using my 94k checkpoint h5 and your MB-MELGAN vocoder, the new vocab_size 70, and new mapper file. Here is the modified notebook with my 94k h5: https://colab.research.google.com/drive/1VLjk3PouMlhAl0Tizko3OqH9CrNcvlzk?usp=sharing
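
A quick way to confirm that step 2 took effect is a small sanity script like the sketch below. It assumes the TensorFlowTTS AutoProcessor API; the mapper path is illustrative and should point to the thorsten mapper JSON written during preprocessing:

from tensorflow_tts.inference import AutoProcessor

# load the character-to-ID mapper produced during preprocessing (illustrative path)
processor = AutoProcessor.from_pretrained(pretrained_path="./dump_thorsten/thorsten_mapper.json")

text = "fünf größere Bäume"
ids = processor.text_to_sequence(text)

# if the umlauts were silently dropped, the ID sequence would be noticeably shorter than the text
print(len(text), len(ids))
print(ids)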

You can listen to the audio in the notebook; any help would be appreciated.
[Attachments: TensorBoard screenshot and generated audio sample]

Adding "emotional" sentences to "thorsten" dataset

Hi @monatis.
Thanks for your efforts on training models based on my free German dataset.
This is not an issue, more of an update on what I'm planning to do next with my dataset.

In mid-January 2021 I'll start recording phrases in an emotional manner (https://arxiv.org/pdf/1901.04276.pdf) and publish them as usual. Because I'm not an actor or a professional voice, I hope to read the phrases with a normal amount of emotion rather than overacting them.

Next steps:

  • Choose 300 random phrases (around 50 characters each) from the existing dataset, none of them specifically emotional
  • Record every one of these phrases in the following emotions:
    • Amused (Erfreut)
    • Angry (Wütend)
    • Disgusted (Angeekelt)
    • Sleepy (Schläfrig)
    • Surprised (Überrascht)

See also:
https://discourse.mozilla.org/t/contributing-my-german-voice-for-tts/48150/217?u=mrthorstenm

Maybe that's interesting for you too.

Umlauts are removed

I think umlaut characters (äüößÄÜÖ) are currently just removed from the input texts instead of getting their own symbol IDs or being replaced by similar ASCII encodings ('ae', 'ue', 'oe', 'ss', ...). Even though I guess the neural network learns to pronounce 'fnf' as 'fünf', I think performance could be improved by fixing this.

The background is that german_transliterate doesn't actually change the umlaut characters, even though it states that it 'replaces Unicode symbols with ASCII characters'. They are still in the string afterwards, and since there is no symbol ID for them in symbol_to_id, they are simply left out of the resulting sequence.

A solution could be to append those characters to ALL_SYMBOLS so they get their own IDs; see the illustration below. Unfortunately, the network would probably have to be retrained after this change.
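
As a minimal illustration (simplified stand-in code, not the repository's actual symbol list), this is how a symbol_to_id lookup silently drops umlauts and how extending the symbol set would fix it:

# simplified stand-in for the processor's symbol table (not the real ALL_SYMBOLS)
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
ALL_SYMBOLS = ["_", "~"] + list(_letters)
symbol_to_id = {s: i for i, s in enumerate(ALL_SYMBOLS)}

def text_to_sequence(text):
    # characters without an ID (including ä, ö, ü, ß) are silently skipped
    return [symbol_to_id[c] for c in text if c in symbol_to_id]

print(text_to_sequence("fünf"))  # 'ü' is dropped, so the model effectively sees "fnf"

# proposed fix: give the umlaut characters their own IDs by extending the symbol set
ALL_SYMBOLS += list("äüößÄÜÖ")
symbol_to_id = {s: i for i, s in enumerate(ALL_SYMBOLS)}
print(text_to_sequence("fünf"))  # 'ü' now maps to an ID (the network must be retrained)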

Please don't hesitate to tell me if I got something wrong and umlaut characters are being handled correctly.

[Edit: Regardless of this issue, thank you Monatis and Thorsten for this really great effort!]
