tensorspeech / tensorflowtts

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German; easy to adapt to other languages)

Home Page: https://tensorspeech.github.io/TensorFlowTTS/

License: Apache License 2.0

Python 100.00%
speech-synthesis text-to-speech tensorflow2 melgan fastspeech real-time tts vocoder multi-speaker-tts fastspeech2

tensorflowtts's Introduction

😋 TensorFlowTTS


Real-Time State-of-the-art Speech Synthesis for Tensorflow 2

🤪 TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multi-band MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2 we can speed up training and inference, optimize further using fake-quantization-aware training and pruning, and make TTS models run faster than real-time and deployable on mobile devices or embedded systems.

What's new

Features

  • High performance on speech synthesis.
  • Ability to fine-tune on other languages.
  • Fast, scalable, and reliable.
  • Suitable for deployment.
  • Easy to implement new models based on the abstract classes.
  • Mixed precision to speed up training where supported.
  • Single/multi-GPU gradient accumulation.
  • Both single- and multi-GPU support in the base trainer class.
  • TFLite conversion for all supported models.
  • Android example.
  • Many languages supported (currently Chinese, Korean, English, French, and German).
  • C++ inference.
  • Weight conversion for some models from PyTorch to TensorFlow to speed up development.

Requirements

This repository is tested on Ubuntu 18.04 with:

  • Python 3.7+
  • CUDA 10.1
  • cuDNN 7.6.5
  • TensorFlow 2.2/2.3/2.4/2.5/2.6
  • TensorFlow Addons >= 0.10.0

Other TensorFlow versions should work but have not been tested yet. This repo aims to track the latest stable TensorFlow version. We recommend installing TensorFlow 2.6.0 for training, especially if you want to use multi-GPU.
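For example, to pin the recommended versions (one possible combination based on the list above, not an official lockfile):

$ pip install tensorflow==2.6.0 "tensorflow-addons>=0.10.0"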

Installation

With pip

$ pip install TensorFlowTTS

From source

Examples are included in the repository but are not shipped with the framework. Therefore, to run the latest version of the examples, you need to install from source as shown below.

$ git clone https://github.com/TensorSpeech/TensorFlowTTS.git
$ cd TensorFlowTTS
$ pip install .

If you want to upgrade the repository and its dependencies:

$ git pull
$ pip install --upgrade .

Supported Model architectures

TensorFlowTTS currently provides the following architectures:

  1. MelGAN released with the paper MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis by Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, Aaron Courville.
  2. Tacotron-2 released with the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions by Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu.
  3. FastSpeech released with the paper FastSpeech: Fast, Robust, and Controllable Text to Speech by Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
  4. Multi-band MelGAN released with the paper Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech by Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie.
  5. FastSpeech2 released with the paper FastSpeech 2: Fast and High-Quality End-to-End Text to Speech by Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
  6. Parallel WaveGAN released with the paper Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram by Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim.
  7. HiFi-GAN released with the paper HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis by Jungil Kong, Jaehyeon Kim, Jaekyoung Bae.

We are also implementing some techniques to improve quality and convergence speed from the following papers:

  1. Guided Attention Loss released with the paper Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention by Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara.

Audio Samples

Here are audio samples on the validation set: tacotron-2, fastspeech, melgan, melgan.stft, fastspeech2, multiband_melgan.

Tutorial End-to-End

Prepare Dataset

Prepare a dataset in the following format:

|- [NAME_DATASET]/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...

Where metadata.csv has the following format: id|transcription. This is an LJSpeech-like format; you can skip these preprocessing steps if your dataset is in another format.

Note that NAME_DATASET should be one of [ljspeech/kss/baker/libritts/thorsten/synpaflex], for example.
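For example, a minimal metadata.csv matching the tree above could look like this (the ids and transcriptions are illustrative; ids must match the wav filenames without the extension):

file1|Hello world, this is the first utterance.
file2|And this is the second utterance.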

Preprocessing

The preprocessing has two steps:

  1. Preprocess audio features
    • Convert characters to IDs
    • Compute mel spectrograms
    • Normalize mel spectrograms to [-1, 1] range
    • Split the dataset into train and validation
    • Compute the mean and standard deviation of multiple features from the training split
  2. Standardize mel spectrogram based on computed statistics

To reproduce the steps above:

tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/libritts/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]

Right now we only support ljspeech, kss, baker, libritts, thorsten, and synpaflex for the dataset argument. We intend to support more datasets in the future.

Note: To run libritts preprocessing, please first read the instructions in examples/fastspeech2_libritts. The dataset needs to be reformatted before running preprocessing.

Note: To run synpaflex preprocessing, please first run the notebook notebooks/prepare_synpaflex.ipynb. The dataset needs to be reformatted before running preprocessing.

After preprocessing, the structure of the project folder should be:

|- [NAME_DATASET]/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...
|- dump_[ljspeech/kss/baker/libritts/thorsten]/
|   |- train/
|       |- ids/
|           |- LJ001-0001-ids.npy
|           |- ...
|       |- raw-feats/
|           |- LJ001-0001-raw-feats.npy
|           |- ...
|       |- raw-f0/
|           |- LJ001-0001-raw-f0.npy
|           |- ...
|       |- raw-energies/
|           |- LJ001-0001-raw-energy.npy
|           |- ...
|       |- norm-feats/
|           |- LJ001-0001-norm-feats.npy
|           |- ...
|       |- wavs/
|           |- LJ001-0001-wave.npy
|           |- ...
|   |- valid/
|       |- ids/
|           |- LJ001-0009-ids.npy
|           |- ...
|       |- raw-feats/
|           |- LJ001-0009-raw-feats.npy
|           |- ...
|       |- raw-f0/
|           |- LJ001-0009-raw-f0.npy
|           |- ...
|       |- raw-energies/
|           |- LJ001-0009-raw-energy.npy
|           |- ...
|       |- norm-feats/
|           |- LJ001-0009-norm-feats.npy
|           |- ...
|       |- wavs/
|           |- LJ001-0009-wave.npy
|           |- ...
|   |- stats.npy
|   |- stats_f0.npy
|   |- stats_energy.npy
|   |- train_utt_ids.npy
|   |- valid_utt_ids.npy
|- examples/
|   |- melgan/
|   |- fastspeech/
|   |- tacotron2/
|   ...
  • stats.npy contains the mean and std of the training split's mel spectrograms
  • stats_energy.npy contains the mean and std of the energy values from the training split
  • stats_f0.npy contains the mean and std of the F0 values from the training split
  • train_utt_ids.npy / valid_utt_ids.npy contain the training and validation utterance IDs, respectively

We use a suffix (ids, raw-feats, raw-energy, raw-f0, norm-feats, and wave) for each input type.
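For a sense of how these statistics are used, here is a rough sketch of the standardization applied by tensorflow-tts-normalize (the [mean, std] layout of stats.npy and the exact paths are assumptions for illustration):

import numpy as np

# Hypothetical sketch: z-score a raw mel spectrogram with training statistics.
# Assumes stats.npy stores two rows, [mean, std], each of shape [num_mels].
mean, std = np.load("dump_ljspeech/stats.npy")
raw_mel = np.load("dump_ljspeech/train/raw-feats/LJ001-0001-raw-feats.npy")
norm_mel = (raw_mel - mean) / std  # per-dimension standardization
np.save("dump_ljspeech/train/norm-feats/LJ001-0001-norm-feats.npy", norm_mel)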

IMPORTANT NOTES:

  • This preprocessing step is based on ESPnet, so you can combine all models here with other models from the ESPnet repository.
  • Regardless of how your dataset is formatted, the final structure of the dump folder SHOULD follow the structure above to be usable with the training scripts, or you can modify the scripts yourself 😄.

Training models

To learn how to train models from scratch or fine-tune them on other datasets/languages, please see the details in the examples directory.

Abstract Class Explanation

Abstract DataLoader: TensorFlow-based dataset

A detailed implementation of the abstract dataset class is in tensorflow_tts/dataset/abstract_dataset. There are some functions you need to override and understand (a minimal sketch follows this list):

  1. get_args: returns the arguments for the generator function, normally utt_ids.
  2. generator: takes the inputs produced by get_args and yields inputs for the model. Note that every generator returns a dictionary whose keys exactly match the model's parameter names, because base_trainer uses model(**batch) for the forward step.
  3. get_output_dtypes: returns the dtypes of each element yielded by the generator function.
  4. get_len_dataset: returns the length of the dataset, normally len(utt_ids).
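Below is a hedged sketch of such a subclass; the import path, the file layout, and the dictionary keys (input_ids, mel_gts) are illustrative assumptions, not the repo's exact code:

import numpy as np
import tensorflow as tf

from tensorflow_tts.datasets.abstract_dataset import AbstractDataset  # import path assumed

class CharMelDataset(AbstractDataset):
    """Hypothetical dataset yielding character ids and normalized mels."""

    def __init__(self, root_dir, utt_ids):
        self.root_dir = root_dir
        self.utt_ids = utt_ids

    def get_args(self):
        # Arguments handed to generator(); normally the utterance ids.
        return [self.utt_ids]

    def generator(self, utt_ids):
        for utt_id in utt_ids:
            # tf.data passes strings as bytes; decode before formatting paths.
            utt_id = utt_id.decode("utf-8") if isinstance(utt_id, bytes) else utt_id
            ids = np.load(f"{self.root_dir}/ids/{utt_id}-ids.npy")
            mel = np.load(f"{self.root_dir}/norm-feats/{utt_id}-norm-feats.npy")
            # Keys must exactly match the model's parameter names, because
            # base_trainer calls model(**batch) in its forward step.
            yield {"input_ids": ids, "mel_gts": mel}

    def get_output_dtypes(self):
        # dtype of each element yielded by generator().
        return {"input_ids": tf.int32, "mel_gts": tf.float32}

    def get_len_dataset(self):
        return len(self.utt_ids)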

IMPORTANT NOTES:

  • The pipeline for creating a dataset should be: cache -> shuffle -> map_fn -> get_batch -> prefetch, as in the sketch below.
  • If you shuffle before cache, the dataset will not re-shuffle when you iterate over it again.
  • You should apply map_fn so that every element returned by the generator has the same length before batching and feeding it into the model.
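A hedged tf.data sketch of that ordering, reusing the dataset sketch above (the create() name and arguments here are illustrative, not the repo's exact implementation):

import tensorflow as tf

def create(dataset, map_fn, batch_size):
    ds = tf.data.Dataset.from_generator(
        dataset.generator,
        output_types=dataset.get_output_dtypes(),
        args=dataset.get_args(),
    )
    ds = ds.cache()                             # cache first ...
    ds = ds.shuffle(dataset.get_len_dataset())  # ... so each epoch re-shuffles
    ds = ds.map(map_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.padded_batch(batch_size)            # get_batch: pad to a common length
    return ds.prefetch(tf.data.experimental.AUTOTUNE)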

Some examples that use this abstract_dataset are tacotron_dataset.py, fastspeech_dataset.py, melgan_dataset.py, and fastspeech2_dataset.py.

Abstract Trainer Class

A detailed implementation of base_trainer is in tensorflow_tts/trainers/base_trainer.py. It includes Seq2SeqBasedTrainer and GanBasedTrainer, which inherit from BasedTrainer. All trainers support both single and multi GPU. There are some functions you MUST override when implementing a new trainer (a hedged sketch follows this list):

  • compile: defines the models, optimizer, and losses.
  • generate_and_save_intermediate_result: saves intermediate results such as alignment plots, generated audio, and mel-spectrogram plots.
  • compute_per_example_losses: computes the per-example losses for the model; note that every element of the loss MUST have shape [batch_size].
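A hedged sketch of the three required overrides (the base-class signatures and the two-value return of compute_per_example_losses are assumptions based on the description above):

import tensorflow as tf

from tensorflow_tts.trainers.base_trainer import Seq2SeqBasedTrainer  # import path assumed

class MyTrainer(Seq2SeqBasedTrainer):
    def compile(self, model, optimizer):
        # Define the model, optimizer, and any loss objects used later.
        super().compile(model=model, optimizer=optimizer)

    def compute_per_example_losses(self, batch, outputs):
        # Every loss element MUST keep shape [batch_size]: reduce over the
        # time/feature axes only, never over the batch axis.
        mel_loss = tf.reduce_mean(tf.abs(batch["mel_gts"] - outputs), axis=[1, 2])
        return mel_loss, {"mel_loss": mel_loss}

    def generate_and_save_intermediate_result(self, batch):
        # e.g. plot alignments, save generated audio, plot mel spectrograms.
        pass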

All models in this repo are trained with GanBasedTrainer (see train_melgan.py, train_melgan_stft.py, train_multiband_melgan.py) or Seq2SeqBasedTrainer (see train_tacotron2.py, train_fastspeech.py).

End-to-End Examples

You can learn how to run inference for each model from the notebooks, or see a colab (for English), colab (for Korean), colab (for Chinese), colab (for French), or colab (for German). Here is example code for end-to-end inference with FastSpeech2 and Multi-band MelGAN. We uploaded all our pretrained models to the Hugging Face Hub.

import numpy as np
import soundfile as sf
import yaml

import tensorflow as tf

from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize fastspeech2 model.
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")


# initialize mb_melgan model
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")


# inference
processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

input_ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")
# fastspeech2 inference
mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# melgan inference
audio_before = mb_melgan.inference(mel_before)[0, :, 0]
audio_after = mb_melgan.inference(mel_after)[0, :, 0]

# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")

Contact

License

All models here are licensed under the Apache License 2.0.

Acknowledgement

We want to thank Tomoki Hayashi for the many discussions about MelGAN, Multi-band MelGAN, FastSpeech, and Tacotron. This framework is based on his great open-source ParallelWaveGAN project.

tensorflowtts's People

Contributors

ak391, anon-artist, azraelkuan, cclauss, crux153, dathudeptrai, erogol, ga642381, hertz-pj, jaeyoo, kewlbear, luan78zaoha, machineko, mapledxf, mariamsu, megapanchamz, mokkemeguru, monatis, myagues, neonbohdan, oscarvanl, patrickvonplaten, samuel-lunii, sayakpaul, sjincho, theamdara, trfnhle, tts-nlp, tulasiram58827, zdisket


tensorflowtts's Issues

Saving entire model

Hello, I tried to save the entire Tacotron-2 model, instead of just the weights, as an h5 file. However, I am getting the following error:

NotImplementedError: Saving the model to HDF5 format requires the model to be a Functional model or a Sequential model. It does not work for subclassed models, because such models are defined via the body of a Python method, which isn't safely serializable. Consider saving to the Tensorflow SavedModel format (by setting save_format="tf") or using save_weights.

I used the following code to successfully load the weights:

tacotron2 = TFTacotron2(config=fs_config, name="tacotron2", training=False)
tacotron2._build()  
tacotron2.load_weights("tacotron2.h5") 

Then I tried to call tacotron2.save("full_tacotron2.h5") and got the aforementioned error. Should I modify trainers/base_trainer.py as follows and re-train, or is there another way to save the entire model as an h5 file?

def save_checkpoint(self):
    """Save checkpoint."""
    self.ckpt.steps.assign(self.steps)
    self.ckpt.epochs.assign(self.epochs)
    self.ckp_manager.save(checkpoint_number=self.steps)
    self.model.save_weights(self.saved_path + 'model-{}.h5'.format(self.steps))
    self.model.save_model(self.saved_path + 'model-total{}.h5'.format(self.steps))
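As the NotImplementedError above suggests, one alternative is exporting the subclassed model in the TensorFlow SavedModel format instead of HDF5. A minimal sketch, assuming the tacotron2 object from the snippet above (whether TFTacotron2 serializes cleanly this way is not verified here):

import tensorflow as tf

# SavedModel supports subclassed models, unlike HDF5 (per the error above).
tacotron2.save("full_tacotron2_savedmodel", save_format="tf")

# Reload later without reconstructing the Python class.
reloaded = tf.keras.models.load_model("full_tacotron2_savedmodel")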

fast speech mel normalization

When computing the mean and scaler for the mel spectrograms before normalization, are they computed from the whole dataset, and does the line below use only the first frame of the mel?

mel = mel[0].numpy()

tensorflow-tts-preprocess: assert len(mel) == len(f0) == len(energy) AssertionError

My dataset, which always worked properly before, gave me this error when running the preprocessing step.

2020-06-19 20:34:52.303453: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[Preprocessing]:   0% 6/3431 [00:02<24:16,  2.35it/s]
[Preprocessing]:   0% 13/3431 [00:07<32:39,  1.74it/s]
[Preprocessing]:   0% 6/3431 [00:10<1:35:38,  1.68s/it]
[Preprocessing]:   2% 64/3431 [00:20<18:08,  3.09it/s]
[Preprocessing]:   1% 49/3431 [00:25<29:30,  1.91it/s]
[Preprocessing]:   2% 73/3431 [00:53<40:54,  1.37it/s]
[Preprocessing]:   3% 106/3431 [00:59<31:02,  1.78it/s]
[Preprocessing]:   0% 13/3431 [01:03<4:39:15,  4.90s/it]
[Preprocessing]:   2% 59/3431 [01:12<1:09:08,  1.23s/it]
[Preprocessing]:   6% 215/3431 [01:12<18:09,  2.95it/s]
[Preprocessing]:   6% 215/3431 [01:13<18:12,  2.94it/s]
[Preprocessing]:   0% 12/3431 [01:16<6:05:32,  6.41s/it]
[Preprocessing]:   1% 20/3431 [01:25<4:03:18,  4.28s/it]
[Preprocessing]:   2% 76/3431 [01:35<1:10:16,  1.26s/it]
[Preprocessing]:   5% 178/3431 [01:49<33:13,  1.63it/s]
[Preprocessing]:   6% 209/3431 [01:56<06:16,  8.55it/s]multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/multiprocess/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/multiprocess/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.6/dist-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/bin/preprocess.py", line 217, in save_to_file
    assert len(mel) == len(f0) == len(energy)
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/tensorflow-tts-preprocess", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/bin/preprocess.py", line 250, in main
[Preprocessing]:   6% 209/3431 [01:56<29:58,  1.79it/s]
    p.map(save_to_file, range(len(processor.items)))
  File "/usr/local/lib/python3.6/dist-packages/pathos/multiprocessing.py", line 137, in map
    return _pool.map(star(f), zip(*args)) # chunksize
  File "/usr/local/lib/python3.6/dist-packages/multiprocess/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/lib/python3.6/dist-packages/multiprocess/pool.py", line 644, in get
    raise self._value
AssertionError
[Preprocessing]:   0% 0/3431 [01:56<?, ?it/s]

When I run the normalization step, it only does 1314 iterations instead of the expected 3431. In addition, when trying to train, I get this:

<ipython-input-12-8616bad8c9dc> in dotrain(inargs, ptpath, maxsteps)
    371         energy_stat=args.energy_stat,
    372         mel_length_threshold=mel_length_threshold,
--> 373         return_utt_id=False
    374     ).create(
    375         is_shuffle=config["is_shuffle"],

/content/TensorflowTTS/ttsexamples/fastspeech2/fastspeech2_dataset.py in __init__(self, root_dir, charactor_query, mel_query, duration_query, f0_query, energy_query, f0_stat, energy_stat, max_f0_embeddings, max_energy_embeddings, charactor_load_fn, mel_load_fn, duration_load_fn, f0_load_fn, energy_load_fn, mel_length_threshold, return_utt_id)
    101             mel_files = [mel_files[idx] for idx in idxs]
    102             charactor_files = [charactor_files[idx] for idx in idxs]
--> 103             duration_files = [duration_files[idx] for idx in idxs]
    104             mel_lengths = [mel_lengths[idx] for idx in idxs]
    105             f0_files = [f0_files[idx] for idx in idxs]

/content/TensorflowTTS/ttsexamples/fastspeech2/fastspeech2_dataset.py in <listcomp>(.0)
    101             mel_files = [mel_files[idx] for idx in idxs]
    102             charactor_files = [charactor_files[idx] for idx in idxs]
--> 103             duration_files = [duration_files[idx] for idx in idxs]
    104             mel_lengths = [mel_lengths[idx] for idx in idxs]
    105             f0_files = [f0_files[idx] for idx in idxs]

IndexError: list index out of range

I have no idea why this happens. My dataset is formatted exactly the same (like LJSpeech) as it was back when it was working.

Error when loading the LJSpeech pretrained model to train another language

When I tried to load a pretrained model (model_65000.h5) following the tutorial in examples/tacotron2/README.md to train on another language, I got an error like this:

tensorflow.python.framework.errors_impl.FailedPreconditionError: Error while reading resource variable encoder/embeddings/LayerNorm/beta_164 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/encoder/embeddings/LayerNorm/beta_164/N10tensorflow3VarE does not exist.

And when I used another pretrained model (model_40000.h5), I got a different error:

(0) Failed precondition: Error while reading resource variable encoder/embeddings/LayerNorm/gamma_157 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/encoder/embeddings/LayerNorm/gamma_157/N10tensorflow3VarE does not exist.
[[node encoder/embeddings/LayerNorm/batchnorm/mul/ReadVariableOp (defined at /data/linhld6/Project/TensorflowTTS-master/tensorflow_tts/models/tacotron2.py:163) ]]
[[decoder/while/LoopCond/_71/_86]]
(1) Failed precondition: Error while reading resource variable encoder/embeddings/LayerNorm/gamma_157 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/encoder/embeddings/LayerNorm/gamma_157/N10tensorflow3VarE does not exist.

Can anyone help me fix this error? Thanks!

SqueezeWave Implementation

Hi,
Will it be possible to add a TF2 implementation of the SqueezeWave vocoder to this system? The performance is really fast and promising. I'm working on it myself, but I'm not well versed in TF2 yet. I had quite a struggle trying to train the authors' PyTorch implementation with my custom dataset, even though it has almost the same characteristics as LJSpeech but double the size. I believe TF2 is more suitable for post-training optimization and deployment.
Original Repo: https://github.com/tianrengao/SqueezeWave

FastSpeech 2

Hello, thank you for this project. I'm aware of two different implementations of FastSpeech; are there any plans to support the recent FastSpeech 2 architecture?

Thank you very much.

Error when loading data to train a Tacotron2 model

When I load my own dataset to train a Tacotron2 model, I get this error:

tensorflow.python.framework.errors_impl.DataLossError: Attempted to pad to a smaller size than the input element. [Op:IteratorGetNext]

Can anyone help me fix this error?

Running path problem

[root@localhost TensorflowTTS]# CUDA_VISIBLE_DEVICES=0 python examples/tacotron2/train_tacotron2.py --train-dir ./dump/train/ --dev-dir ./dump/valid/ --outdir ./examples/tacotron2/exp/train.tacotron2.v1/ --config ./examples/tacotron2/conf/tacotron2.v1.yaml --use-norm 1 --mix_precision 0 --resume ""
Traceback (most recent call last):
File "examples/tacotron2/train_tacotron2.py", line 31, in
from examples.tacotron2.tacotron_dataset import CharactorMelDataset
ModuleNotFoundError: No module named 'examples'

Why do I get this error when running from the root directory?

Config for the length of the mel-spectrogram input

When training a Tacotron2 model with my own data, some mel spectrograms have more than 2000 timesteps, so I always get this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [16,2000,80] vs. [16,2023,80]

I don't know where to set the timestep to a higher value to fit my data.
Thanks

Optimizing further by using fake-quantization-aware training and pruning

Thanks for this great job.
I have trained a Tacotron2 model with your repository.
Now I am trying to convert the model to an int8-quantized model with TensorFlow Lite,
but I encountered some errors when I use
converter = tf.lite.TFLiteConverter.from_saved_model("./test_saved")
or
converter = tf.lite.TFLiteConverter.from_keras_model(tacotron2).
Do you have any advice about further optimization using fake-quantization-aware training and pruning?
Thank you very much.

Enabling mixed_precision for training

Hi, I would like to turn on the mixed_precision setting for training, so I set "1" in the command below, but mixed_precision is still False afterwards. Am I missing something?

2020-06-25 06:58:32,264 (train_tacotron2:413) INFO: mixed_precision = False
CUDA_VISIBLE_DEVICES=0 python3 examples_models/tacotron2/train_tacotron2.py \
  --train-dir ./dump_lisa/train/ \
  --dev-dir ./dump_lisa/valid/ \
  --outdir ./examples_models/tacotron2/exp_lisa/train.tacotron2.v2_mixed_precision/ \
  --config ./examples_models/tacotron2/conf/tacotron2.v1.yaml \
  --use-norm 1
  --mixed_precision 1 \
  --resume ""

How to continue training from a checkpoint

Hi, my run stopped unexpectedly, so I need to continue it.

Below is my command.

python3 examples_models/tacotron2/train_tacotron2.py --train-dir ./dump/train/ --dev-dir ./dump/valid/ --outdir ./examples_models/tacotron2/exp/train.tacotron2.v2_mixed_precision/ --config ./examples_models/tacotron2/conf/tacotron2.v1.yaml --use-norm 1 --mixed_precision 0 --resume "./examples_models/tacotron2/exp/train.tacotron2.v1/checkpoints/model-95000.h5"

But it failed, showing the logs below.

2020-06-30 13:26:51.312649: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open ./examples_models/tacotron2/exp/train.tacotron2.v1/checkpoints/model-95000.h5: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line
95, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
RuntimeError: Unable to open table file ./examples_models/tacotron2/exp/train.tacotron2.v1/checkpoints/model-95000.h5: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "examples_models/tacotron2/train_tacotron2.py", line 511, in <module>
    main()
  File "examples_models/tacotron2/train_tacotron2.py", line 504, in main
    resume=args.resume)
  File "examples_models/tacotron2/train_tacotron2.py", line 345, in fit
    self.load_checkpoint(resume)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/trainers/base_trainer.py", line 343, in load_checkpoint
    self.ckpt.restore(pretrained_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/util.py", line 2009, in restore
    status = self._saver.restore(save_path=save_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/util.py", line 1260, in restore
    reader = py_checkpoint_reader.NewCheckpointReader(save_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line
99, in NewCheckpointReader
    error_translator(e)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line
44, in error_translator
    raise errors_impl.DataLossError(None, None, error_message)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file ./examples_models/tacotron2/exp/train.tacotron2.v1/checkpoints/model-95000.h5: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

Am I missing something, or did I run it the wrong way?

fastspeech inference duration_outputs

When running FastSpeech in inference mode, why is an exp op applied to duration_outputs from the duration predictor?

duration_outputs = tf.math.exp(duration_outputs) - 1.0

Optimizer error at the start of training

Hi,
I notice this error at the start of training, every time after the data loads, although training starts and progresses as usual. I'm having a hard time getting the model to converge, which made me wonder whether the optimizer has an issue. Is this normal?

(screenshot of the optimizer error)

Tacotron2 teacher forcing

@dathudeptrai The training sampler uses teacher forcing (ground-truth mel-spectrogram frames) during the whole training period. Did you encounter the exposure bias problem? Is the audio quality always good at inference time?

Pretrained model of MelGAN

Hi, I tried to run MelGAN with your pretrained model, but I found there is no ckpt data, only the discriminator's and generator's h5 files.
I read train_melgan.py and found it needs a --resume parameter, but I think that needs a ckpt file to get the steps and epochs.
Could you upload the ckpt files along with the pre-trained model, or is there another way to do it?

Thank you for your awesome repo. It is really well constructed and I can't wait to see its performance.

Problems creating Tensorflow-based Dataloader

Hi, I encountered the following problem when using it. Do you know how to solve it?
tensorflow-tts-preprocess --rootdir ./datasets/ --outdir ./dump/ --conf preprocess/ljspeech_preprocess.yaml
[Preprocessing]: 0% 0/10000 [00:00<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:01<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:01<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:01<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:01<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:01<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:01<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:02<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:02<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:02<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:02<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:02<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:02<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:03<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:03<?, ?it/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/multiprocess/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.6/dist-packages/multiprocess/pool.py", line 44, in mapstar
return list(map(*args))
File "/usr/local/lib/python3.6/dist-packages/pathos/helpers/mp_helper.py", line 15, in
func = lambda args: f(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/bin/preprocess.py", line 165, in save_to_file
f"{utt_id} seems to have a different sampling rate."
AssertionError: 000001 seems to have a different sampling rate.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/bin/tensorflow-tts-preprocess", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/bin/preprocess.py", line 228, in main
p.map(save_to_file, range(len(processor.items)))
File "/usr/local/lib/python3.6/dist-packages/pathos/multiprocessing.py", line 137, in map
return _pool.map(star(f), zip(*args)) # chunksize
File "/usr/local/lib/python3.6/dist-packages/multiprocess/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/local/lib/python3.6/dist-packages/multiprocess/pool.py", line 644, in get
raise self._value
AssertionError: 000001 seems to have a different sampling rate.
[Preprocessing]: 0% 0/10000 [00:03<?, ?it/s]
[Preprocessing]: 0% 0/10000 [00:03<?, ?it/s]

Double Decoder Consistency for Tacotron 2

Hi, I found this interesting post while regularly checking up on erogol's blog. Any chance of having it here? Having a Tacotron 2 that is practically immune to alignment problems sounds very state of the art.
For reference, Mozilla/TTS has it.

decode_tacotron2.py Problem

Hi~
After training Tacotron2 with my data, the following problem occurs during decoding.

Does anyone know the answer to this problem?

CUDA_VISIBLE_DEVICES=3 python ./decode_tacotron2.py \
  --rootdir ./dump/valid/ \
  --outdir ./prediction/tacotron2-75k/ \
  --checkpoint ./examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-75000.h5 \
  --config ./examples/tacotron2/conf/tacotron2.v1.yaml \
  --batch-size 32
2020-06-30 13:31:07.511973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-30 13:31:07.512004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-06-30 13:31:07.512024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-06-30 13:31:07.516529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21397 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:08:00.0, compute capability: 6.1)
2020-06-30 13:31:28.713565: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-30 13:31:28.964408: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[Decoding]: 0it [00:00, ?it/s]2020-06-30 13:31:34.449436: W tensorflow/core/kernels/data/cache_dataset_ops.cc:794] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
[Decoding]: 0it [00:03, ?it/s]
Traceback (most recent call last):
  File "./decode_tacotron2.py", line 135, in <module>
    main()
  File "./decode_tacotron2.py", line 116, in main
    speaker_ids=tf.zeros(shape=[tf.shape(charactor)[0]]),
  File "/data/nlp/ihkim/miniconda2/envs/data_TensorflowTTS/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/data/nlp/ihkim/miniconda2/envs/data_TensorflowTTS/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 648, in _call
    *args, **kwds)
  File "/data/nlp/ihkim/miniconda2/envs/data_TensorflowTTS/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2238, in canonicalize_function_inputs
    self._flat_input_signature)
  File "/data/nlp/ihkim/miniconda2/envs/data_TensorflowTTS/lib/python3.6/site-
packages/tensorflow/python/eager/function.py", line 2305, in _convert_inputs_to_signature
    format_error_message(inputs, input_signature))
ValueError: Python inputs incompatible with input_signature:
  inputs: (
    tf.Tensor(
[[ 0 13 27 ...  0  0  0]
 [ 0 20 27 ...  0  0  0]
 [ 0  8 26 ...  0  0  0]
 ...
 [ 0  2 29 ...  0  0  0]
 [ 0 11 34 ...  0  0  0]
 [ 0  9 21 ...  0  0  0]], shape=(32, 134), dtype=int32),
    tf.Tensor(
[ 29  25  17  26  23  52  11  21  22  37 102  76  54  45 130  59  39  36
  20  47  31  43  45 134 101  53  80  73 112 114  16 114], shape=(32,), dtype=int32),
    tf.Tensor(
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.], shape=(32,), dtype=float32))
  input_signature: (
    TensorSpec(shape=(None, None), dtype=tf.int32, name=None),
    TensorSpec(shape=(None,), dtype=tf.int32, name=None),
    TensorSpec(shape=(None,), dtype=tf.int32, name=None))

Deployment of frozen models with tf-addons

Hi,
For freezing the graph and deploying on a target device, is it necessary to have TensorFlow Addons installed on the device? Or can the plain tensorflow 2.x package perform inference from the frozen graph? Sorry, but I have never used TensorFlow Addons before and don't have much knowledge about it.

Data augmentation leads to increased training time

The first thing to note is that TensorFlow originally had a very useful library for data augmentation:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

However, in the 2.0 official release this library and the corresponding training method model.fit_generator() had problems, and training time increased by 3-4x. The answer given by the official TensorFlow staff is that this is indeed a bug (issue #33177), and they decided to abandon this method instead of repairing it. The suggested solution is to use model.fit() directly to receive the data generated by ImageDataGenerator, but this creates an additional problem: the program prompts 'Filling up shuffle buffer (this may take a while)' before each epoch.

This is also a considerable extra time overhead. For me, my machine needs 10s, and 10s per epoch is unacceptable. Is there any better way to deal with it?

Difference between decode_tacotron2.py and tacotron2_inference.ipynb

The procedure for running Tacotron2 inference is not clear. It appears to me that running only tacotron2_inference.ipynb is enough. What is decode_tacotron2.py for? Even running this code gives me errors, as shown below.

2020-07-01 01:59:24.756772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 299 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:08:00.0, compute capability: 6.1)
2020-07-01 01:59:43.496259: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-07-01 01:59:43.723366: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 01:59:44.157435: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-07-01 01:59:44.167314: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-07-01 01:59:44.172894: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-07-01 01:59:44.175197: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "examples_models/tacotron2/decode_tacotron2.py", line 136, in <module>
    main()
  File "examples_models/tacotron2/decode_tacotron2.py", line 103, in main
    tacotron2._build()  # build model to be able load_weights.
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/tacotron2.py", line 677, in _build
    self(input_ids, input_lengths, speaker_ids, mel_outputs, mel_lengths, 10, training=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
    self.captured_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[node encoder/conv_batch_norm/tf_tacotron_conv_batch_norm/conv_._0/conv1d (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/tacotron2.py:86) ]]
         [[decoder/while/body/_1/decoder_cell/assert_positive/assert_less/Assert/AssertGuard/pivot_f/_265/_47]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[node encoder/conv_batch_norm/tf_tacotron_conv_batch_norm/conv_._0/conv1d (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_tts/models/tacotron2.py:86) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_call_8385]

Function call stack:
call -> call

pqmf synthesis filter

Hi, why does the synthesis filter still use h_analysis here?

# [subbands, 1, taps + 1] == [filter_width, in_channels, out_channels]
analysis_filter = np.expand_dims(h_analysis, 1)
analysis_filter = np.transpose(analysis_filter, (2, 1, 0))
synthesis_filter = np.expand_dims(h_analysis, 0)
synthesis_filter = np.transpose(synthesis_filter, (2, 1, 0))

melgan residual stack config

Hi, in the original MelGAN paper the kernel size of the second conv1d layer is 3, but I see that in this MelGAN residual config it's 1. Is that a better config?

Error following Tacotron2 tutorial

First, great project. Thanks a ton for maintaining it. I'm learning a lot going through the code.

I tried following the Tacotron2 tutorial -- downloaded the dataset, ran the preprocessing steps, and tried training. I ran into this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError:  [_Derived_]  Trying to access element 156 in a list with 156 elements.
         [[{{node while_21/body/_1/TensorArrayV2Read_1/TensorListGetItem}}]]
         [[tacotron2/StatefulPartitionedCall/encoder/bilstm/forward_lstm/StatefulPartitionedCall]] [Op:__inference__one_step_tacotron2_406287]

Function call stack:
_one_step_tacotron2 -> _one_step_tacotron2 -> _one_step_tacotron2

Is this an error you've seen before? Unless I've made a dumb mistake, I imagine anyone following the tutorial might hit this error. Any pointers on how I can debug it? I don't mind trying to solve it myself, but I'm fairly new to TensorFlow 2.

The full log is below:

$ CUDA_VISIBLE_DEVICES=0 python examples/tacotron2/train_tacotron2.py \
>   --train-dir /media/usb0/tts/ljspeech_dump/train/ \
>   --dev-dir /media/usb0/tts/ljspeech_dump/valid/ \
>   --outdir ./examples/tacotron2/exp/train.tacotron2.v1/ \
>   --config ./examples/tacotron2/conf/tacotron2.v1.yaml \
>   --use-norm 1 \
>   --mixed_precision 0 \
>   --resume ""

2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: hop_size = 256
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: format = npy
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: tacotron2_params = {'embedding_hidden_size': 512, 'initializer_range': 0.02, 'embedding_dropout_prob': 0.1, 'n_speakers': 1, 'n_conv_encoder': 5, 'encoder_conv_filters': 512, 'encoder_conv_kernel_sizes': 5, 'encoder_conv_activation': 'relu', 'encoder_conv_dropout_rate': 0.5, 'encoder_lstm_units': 256, 'n_prenet_layers': 2, 'prenet_units': 256, 'prenet_activation': 'relu', 'prenet_dropout_rate': 0.5, 'n_lstm_decoder': 1, 'reduction_factor': 1, 'decoder_lstm_units': 1024, 'attention_dim': 128, 'attention_filters': 32, 'attention_kernel': 31, 'n_mels': 80, 'n_conv_postnet': 5, 'postnet_conv_filters': 512, 'postnet_conv_kernel_sizes': 5, 'postnet_dropout_rate': 0.1, 'attention_type': 'lsa'}
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: batch_size = 32
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: remove_short_samples = True
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: allow_cache = True
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: mel_length_threshold = 32
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: is_shuffle = True
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: use_fixed_shapes = True
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: optimizer_params = {'initial_learning_rate': 0.001, 'end_learning_rate': 1e-05, 'decay_steps': 150000, 'warmup_proportion': 0.02, 'weight_decay': 0.001}
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: train_max_steps = 200000
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: save_interval_steps = 5000
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: eval_interval_steps = 500
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: log_interval_steps = 100
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: start_schedule_teacher_forcing = 200001
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: start_ratio_value = 0.5
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: schedule_decay_steps = 50000
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: end_ratio_value = 0.0
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: num_save_intermediate_results = 1
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: train_dir = /media/usb0/tts/ljspeech_dump/train/
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: dev_dir = /media/usb0/tts/ljspeech_dump/valid/
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: use_norm = True
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: outdir = ./examples/tacotron2/exp/train.tacotron2.v1/
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: config = ./examples/tacotron2/conf/tacotron2.v1.yaml
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: resume =
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: verbose = 1
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: mixed_precision = False
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: version = 0.6.1
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: max_mel_length = 871
2020-06-27 17:02:49,796 (train_tacotron2:440) INFO: max_char_length = 188
2020-06-27 17:02:49.799192: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-27 17:02:49.827610: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-27 17:02:49.827904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 46 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.23GiB/s
2020-06-27 17:02:49.828043: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-27 17:02:49.828913: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-27 17:02:49.829804: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-27 17:02:49.829957: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-27 17:02:49.830830: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-27 17:02:49.831252: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-27 17:02:49.831346: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-06-27 17:02:49.831352: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-06-27 17:02:49.831534: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-27 17:02:49.835216: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 4200000000 Hz
2020-06-27 17:02:49.835388: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f5318000b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-27 17:02:49.835400: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-27 17:02:49.836162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-27 17:02:49.836171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]
Model: "tacotron2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
encoder (TFTacotronEncoder)  multiple                  8218624
_________________________________________________________________
decoder_cell (TFTacotronDeco multiple                  18246402
_________________________________________________________________
post_net (TFTacotronPostnet) multiple                  5460480
_________________________________________________________________
residual_projection (Dense)  multiple                  41040
=================================================================
Total params: 31,966,546
Trainable params: 31,956,306
Non-trainable params: 10,240
_________________________________________________________________
[train]:   0%|                                                                             | 0/200000 [00:00<?, ?it/s]2020-06-27 17:03:08.327047: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 4125 of 12445
2020-06-27 17:03:18.326677: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 8227 of 12445
2020-06-27 17:03:28.325737: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 12285 of 12445
2020-06-27 17:03:28.724872: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
Traceback (most recent call last):
  File "examples/tacotron2/train_tacotron2.py", line 513, in <module>
    main()
  File "examples/tacotron2/train_tacotron2.py", line 503, in main
    trainer.fit(train_dataset,
  File "examples/tacotron2/train_tacotron2.py", line 343, in fit
    self.run()
  File "/home/caleb/repos/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py", line 72, in run
    self._train_epoch()
  File "/home/caleb/repos/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py", line 94, in _train_epoch
    self._train_step(batch)
  File "examples/tacotron2/train_tacotron2.py", line 116, in _train_step
    self._one_step_tacotron2(charactor, char_length, mel, mel_length, guided_attention)
  File "/home/caleb/repos/TensorflowTTS/venv/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/caleb/repos/TensorflowTTS/venv/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/caleb/repos/TensorflowTTS/venv/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/caleb/repos/TensorflowTTS/venv/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
    return self._call_flat(
  File "/home/caleb/repos/TensorflowTTS/venv/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/caleb/repos/TensorflowTTS/venv/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
    outputs = execute.execute(
  File "/home/caleb/repos/TensorflowTTS/venv/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError:  [_Derived_]  Trying to access element 156 in a list with 156 elements.
         [[{{node while_21/body/_1/TensorArrayV2Read_1/TensorListGetItem}}]]
         [[tacotron2/StatefulPartitionedCall/encoder/bilstm/forward_lstm/StatefulPartitionedCall]] [Op:__inference__one_step_tacotron2_406287]

Function call stack:
_one_step_tacotron2 -> _one_step_tacotron2 -> _one_step_tacotron2

[train]:   0%|                                                                             | 0/200000 [00:36<?, ?it/s]

FastSpeech2 implementation

Hi, I just quickly implemented FastSpeech2 to check the contribution of the F0 and energy embeddings. See PR #45 for details.

Can't train MelGAN-STFT discriminator

I have been training MelGAN-STFT by fine-tuning the LJSpeech model. When it gets to discriminator_train_start_steps, it stops and tells me to restart. When I restart with the discriminator enabled at fewer steps than the latest checkpoint (210k vs 220k), I get this:

ValueError: in user code:

    <ipython-input-13-8bd0e2e9cdea>:110 _one_step_generator  *
        p_hat = self.discriminator(y_hat)
    /content/TensorflowTTS/tensorflow_tts/models/melgan.py:441 call  *
        outs += [f(x)]
    /content/TensorflowTTS/tensorflow_tts/models/melgan.py:386 call  *
        x = f(x)
    /content/TensorflowTTS/tensorflow_tts/utils/group_conv.py:283 call  *
        self._convolution_op = nn_ops.Convolution(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py:1063 __init__  **
        filter_shape[num_spatial_dims]))

    ValueError: number of input channels does not match corresponding dimension of filter, 16 != 4

Also, all my predictions have heavy metallic noise. I'm assuming this is due to the lack of discriminator training.

MelNet & other dataset than ljspeech

Hello,
thank you for this excellent and very intuitive implementation!

I am not familiar with TTS research, so my questions may be quite naive. :)

  1. Do you also plan to implement MelNet? The audio results provided in the paper overview are quite impressive.
  2. Is there any chance that in the long run you will train models with datasets other than LJSpeech? This implementation is great, but the commercial applications (Google, Microsoft) have models trained on much better datasets.

P.S.: As I said, I am not familiar with TTS research, but there are plenty of excellent readers on LibriVox, all of them in the public domain. I have done plenty of audio-text matching tasks with the aeneas library in the past. Would it be enough to emulate the structure of the LJSpeech dataset? With the quantity and length of samples on LibriVox, I could probably create a dataset with 30-50 single-speaker hours of samples...

Once again, thanks for your great work!

Extract duration from tacotron2 model

This is a question rather than a bug report. Following the tutorial for training FastSpeech2, we have to extract durations from the alignments of a Tacotron-2 model (the get_duration_from_alignment function in extract_duration.py).
I just want to know what exactly the term "duration" means here. Can anyone help me figure out this definition?
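
For context: in this pipeline, a token's "duration" is the number of mel-spectrogram frames that the Tacotron-2 attention assigns to that input token, so the durations over a sentence sum to the total frame count. A minimal sketch of the idea (hypothetical, not the repo's exact get_duration_from_alignment):

import numpy as np

def durations_from_alignment(alignment: np.ndarray) -> np.ndarray:
    """alignment: [num_decoder_steps, num_input_tokens] attention weights."""
    # Assign each decoder step (one mel frame) to its most-attended token;
    # a token's duration is how many frames were assigned to it.
    best_token_per_frame = alignment.argmax(axis=1)  # [num_decoder_steps]
    return np.bincount(best_token_per_frame, minlength=alignment.shape[1])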

Fine-tuning does not work

Hi,
I am trying to fine-tune the LJSpeech pretrained Tacotron-2 model on a custom English voice dataset. I made the changes described for rebuilding with a new embedding layer, but the model fails to build the second time. Since my dataset is also English, just with a different voice, I used the same vocab_size.
This is the change made in the fine-tuning code with respect to the training code:
pretrained_config = Tacotron2Config(**config["tacotron2_params"])
tacotron2 = TFTacotron2(pretrained_config, training=True, name='tacotron2')
tacotron2._build()
tacotron2.summary()
tacotron2.load_weights(path)

pretrained_config.vocab_size = len(symbols)
new_embedding_layers = TFTacotronEmbeddings(pretrained_config, name='embeddings')
tacotron2.encoder.embeddings = new_embedding_layers

# re-build model
tacotron2._build()  # BREAKS HERE
tacotron2.summary()

Error:
Total params: 31,966,546
Trainable params: 31,956,306
Non-trainable params: 10,240

2020-06-22 10:26:33.976375: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at resource_variable_ops.cc:637 : Not found: Resource localhost/encoder/embeddings/character_embeddings/weight_147/N10tensorflow3VarE does not exist.
Traceback (most recent call last):
  File "examples/tacotron2/finetune_tacotron.py", line 518, in <module>
    main()
  File "examples/tacotron2/finetune_tacotron.py", line 471, in main
    tacotron2._build()
  File "/anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow_tts/models/tacotron2.py", line 677, in _build
    self(input_ids, input_lengths, speaker_ids, mel_outputs, mel_lengths, 10, training=True)
  File "/anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 611, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
    self.captured_inputs)
  File "/anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: Error while reading resource variable encoder/embeddings/LayerNorm/beta_164 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/encoder/embeddings/LayerNorm/beta_164/N10tensorflow3VarE does not exist.
         [[node encoder/embeddings/LayerNorm/batchnorm/ReadVariableOp (defined at /anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow_tts/models/tacotron2.py:153) ]]
         [[decoder/while/body/_1/decoder_cell/assert_positive/assert_less/Assert/AssertGuard/pivot_f/_289/_53]]
  (1) Failed precondition: Error while reading resource variable encoder/embeddings/LayerNorm/beta_164 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/encoder/embeddings/LayerNorm/beta_164/N10tensorflow3VarE does not exist.
         [[node encoder/embeddings/LayerNorm/batchnorm/ReadVariableOp (defined at /anaconda/envs/py37_tensorflow/lib/python3.7/site-packages/tensorflow_tts/models/tacotron2.py:153) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_call_8492]

Function call stack:
call -> call

Since my end goal is just to change the voice, I initially tried to continue training from the pretrained checkpoint with my dataset, since this worked for me with NVIDIA's PyTorch Tacotron-2 implementation. But here the model's voice doesn't change even as quality improves, which is why I decided to redo the embedding-layer fine-tuning.
Should I train it from scratch?
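
One hypothetical workaround (not the repo's official recipe): rather than swapping the embedding layer inside an already-traced model, build a fresh model with the new vocab_size and copy over every pretrained weight whose name and shape match. This reuses the variable names from the snippet above:

new_config = Tacotron2Config(**config["tacotron2_params"])
new_config.vocab_size = len(symbols)  # size of the new vocabulary

new_tacotron2 = TFTacotron2(new_config, training=True, name="tacotron2")
new_tacotron2._build()

pretrained = {w.name: w for w in tacotron2.weights}
for w in new_tacotron2.weights:
    source = pretrained.get(w.name)
    if source is not None and source.shape == w.shape:
        w.assign(source)  # mismatched shapes (the embedding) keep their fresh init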

I want to create an organization repo and enroll more members :D

After some time as an open-source project, I see a lot of things to do, such as supporting more model families (flow/glow, GAN), TensorRT, ... but 1 or 2 people may not be enough for all of that. I am thinking about creating an organization repo and enrolling more members to develop with us. In the future, I want to change the repo name to TensorSpeech and support other speech-related problems like voice conversion and speech recognition. Does anyone want to join? :D

Error

I ran into an error when running

tensorflow-tts-preprocess --rootdir ./datasets/ --outdir ./dump/ --conf preprocess/ljspeech_preprocess.yaml

The error is

Traceback (most recent call last):
  File "/home/trunk/miniconda3/envs/tts_env/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/trunk/miniconda3/envs/tts_env/lib/python3.8/site-packages/multiprocess/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/trunk/miniconda3/envs/tts_env/lib/python3.8/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/home/trunk/miniconda3/envs/tts_env/lib/python3.8/site-packages/tensorflow_tts/bin/preprocess.py", line 215, in save_to_file
    np.save(os.path.join(args.outdir, subdir, "wavs", f"{utt_id}-wave.npy"),
  File "<__array_function__ internals>", line 5, in save
  File "/home/trunk/miniconda3/envs/tts_env/lib/python3.8/site-packages/numpy/lib/npyio.py", line 541, in save
    fid = open(file, "wb")
FileNotFoundError: [Errno 2] No such file or directory: './dump/train/wavs/LJ001-0001-wave.npy'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/trunk/miniconda3/envs/tts_env/bin/tensorflow-tts-preprocess", line 8, in <module>
    sys.exit(main())
  File "/home/trunk/miniconda3/envs/tts_env/lib/python3.8/site-packages/tensorflow_tts/bin/preprocess.py", line 228, in main
    p.map(save_to_file, range(len(processor.items)))
  File "/home/trunk/miniconda3/envs/tts_env/lib/python3.8/site-packages/pathos/multiprocessing.py", line 137, in map
[Preprocessing]:   0%|                                | 0/13100 [00:04<?, ?it/s]
    return _pool.map(star(f), zip(*args)) # chunksize
  File "/home/trunk/miniconda3/envs/tts_env/lib/python3.8/site-packages/multiprocess/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/trunk/miniconda3/envs/tts_env/lib/python3.8/site-packages/multiprocess/pool.py", line 768, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: './dump/train/wavs/LJ001-0001-wave.npy'

I am running Manjaro Linux 20.3 XFCE.
My folder structure is

β”œβ”€β”€ datasets
β”‚Β Β  β”œβ”€β”€ LJSpeech-1.1
β”‚Β Β  β”œβ”€β”€ metadata.csv
β”‚Β Β  β”œβ”€β”€ README
β”‚Β Β  └── wavs
β”œβ”€β”€ dump
β”‚Β Β  β”œβ”€β”€ train_utt_ids.npy
β”‚Β Β  └── valid_utt_ids.npy
β”œβ”€β”€ examples
β”‚Β Β  β”œβ”€β”€ fastspeech
β”‚Β Β  β”œβ”€β”€ melgan
β”‚Β Β  β”œβ”€β”€ melgan.stft
β”‚Β Β  β”œβ”€β”€ multiband_melgan
β”‚Β Β  └── tacotron2
β”œβ”€β”€ google9b8578adaee731be.html
β”œβ”€β”€ LICENSE
β”œβ”€β”€ notebooks
β”‚Β Β  └── tacotron2_inference.ipynb
β”œβ”€β”€ preprocess
β”‚Β Β  └── ljspeech_preprocess.yaml
β”œβ”€β”€ README.md
β”œβ”€β”€ setup.cfg
β”œβ”€β”€ setup.py
β”œβ”€β”€ tensorflow_tts
β”‚Β Β  β”œβ”€β”€ bin
β”‚Β Β  β”œβ”€β”€ configs
β”‚Β Β  β”œβ”€β”€ datasets
β”‚Β Β  β”œβ”€β”€ __init__.py
β”‚Β Β  β”œβ”€β”€ losses
β”‚Β Β  β”œβ”€β”€ models
β”‚Β Β  β”œβ”€β”€ optimizers
β”‚Β Β  β”œβ”€β”€ processor
β”‚Β Β  β”œβ”€β”€ trainers
β”‚Β Β  └── utils
β”œβ”€β”€ test
β”‚Β Β  β”œβ”€β”€ test_fastspeech.py
β”‚Β Β  β”œβ”€β”€ test_mb_melgan.py
β”‚Β Β  β”œβ”€β”€ test_melgan_layers.py
β”‚Β Β  β”œβ”€β”€ test_melgan.py
β”‚Β Β  └── test_tacotron2.py
└── tts
β”œβ”€β”€ bin
β”œβ”€β”€ include
β”œβ”€β”€ lib
β”œβ”€β”€ lib64 -> lib
└── pyvenv.cfg
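
Note that dump/ contains only the two utt-id files; the train/wavs/ directory from the error path was never created. One possible workaround, assuming the preprocessor expects its output folders to already exist, is to create them up front. Only "wavs" is confirmed by the error message; the other names are guesses and should be checked against tensorflow_tts/bin/preprocess.py:

import os

for split in ("train", "valid"):
    for sub in ("wavs", "raw-feats", "ids"):  # "raw-feats" and "ids" are assumptions
        os.makedirs(os.path.join("./dump", split, sub), exist_ok=True)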

An error was encountered during data preprocessing

[root@localhost TensorflowTTS]# tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --stats ./dump/stats.npy --config preprocess/ljspeech_preprocess.yaml
2020-06-10 11:07:41.220189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-10 11:07:41.472070: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-06-10 11:07:41.472150: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: localhost.localdomain
2020-06-10 11:07:41.472174: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: localhost.localdomain
2020-06-10 11:07:41.472313: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.39.0
2020-06-10 11:07:41.472366: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.39.0
2020-06-10 11:07:41.472382: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 418.39.0
2020-06-10 11:07:41.473387: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-06-10 11:07:41.489617: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2100000000 Hz
2020-06-10 11:07:41.492554: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55abc9916ef0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-10 11:07:41.492595: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/ysj/bin/tensorflow-tts-normalize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/anaconda3/envs/ysj/lib/python3.7/site-packages/tensorflow_tts/bin/normalize.py", line 107, in main
    mel = scaler.transform(mel)
  File "/usr/local/anaconda3/envs/ysj/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 794, in transform
    force_all_finite='allow-nan')
  File "/usr/local/anaconda3/envs/ysj/lib/python3.7/site-packages/sklearn/base.py", line 436, in _validate_data
    self._check_n_features(X, reset=reset)
  File "/usr/local/anaconda3/envs/ysj/lib/python3.7/site-packages/sklearn/base.py", line 373, in _check_n_features
    "The reset parameter is False but there is no "
RuntimeError: The reset parameter is False but there is no n_features_in_
attribute. Is this estimator fitted?
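
This usually means the StandardScaler in normalize.py was restored from stats.npy without the attribute that newer scikit-learn versions check inside transform(). A hedged sketch of the typical fix, assuming stats.npy stores the per-mel-bin mean in row 0 and the scale in row 1, as the preprocessing step saves them:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
stats = np.load("./dump/stats.npy")
scaler.mean_, scaler.scale_ = stats[0], stats[1]
# Newer scikit-learn validates this attribute inside transform(); setting it
# explicitly lets the restored scaler pass the "is this estimator fitted?" check.
scaler.n_features_in_ = scaler.mean_.shape[0]

mel = np.random.randn(120, scaler.n_features_in_).astype(np.float32)  # dummy mel
mel_norm = scaler.transform(mel)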

Tacotron2 end2end sample

Could you provide an end-to-end sample showing how to use the pretrained Tacotron-2 model? I tried to read the source code but couldn't figure out what to feed to each parameter.

Thanks!
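
Until an official notebook lands, here is a minimal sketch of end-to-end inference based on the repo's conventions; the pretrained-model names follow the inference helpers that later versions of the repo expose and may differ from your local checkpoints:

import tensorflow as tf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

# Model names are illustrative; see the README for the current list.
processor = AutoProcessor.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en")
tacotron2 = TFAutoModel.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en")
melgan = TFAutoModel.from_pretrained("tensorspeech/tts-melgan-ljspeech-en")

input_ids = processor.text_to_sequence("Hello, this is a test.")
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    input_lengths=tf.convert_to_tensor([len(input_ids)], tf.int32),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
)
audio = melgan.inference(mel_outputs)[0, :, 0]  # waveform samples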

tensorflow-tts-normalize: "UnboundLocalError: local variable 'subdir' referenced before assignment"

I've formatted my dataset like the LJSpeech one in the README so I can skip writing a dataloader for fine-tuning.
This is my directory:
[screenshot]
And this is my metadata.csv. I made it fileid|transcription|transcription because ljspeech.py contains text = parts[2], which was giving me index-out-of-range errors with just fileid|transcription:
[screenshot]
And this is a small portion of os.listdir("wavs"):

file0816.wav
file0039.wav
file2292.wav
file2433.wav
file0794.wav
file1314.wav
file2486.wav
file0695.wav
file2564.wav

All the preprocessing steps run fine until the normalization one:

Traceback (most recent call last):
  File "/usr/local/bin/tensorflow-tts-normalize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_tts/bin/normalize.py", line 115, in main
    np.save(os.path.join(args.outdir, subdir, "norm-feats", f"{utt_id}-norm-feats.npy"),
UnboundLocalError: local variable 'subdir' referenced before assignment

Am I doing something wrong?
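
An illustrative reconstruction of the failing pattern in normalize.py (not the exact repo code): subdir is only assigned when the utterance id appears in one of the two id lists, so an id in neither list, e.g. from a stale train_utt_ids.npy/valid_utt_ids.npy left over from an earlier run, leaves it unbound:

def get_subdir(utt_id, train_utt_ids, valid_utt_ids):
    # subdir is assigned only inside these branches...
    if utt_id in train_utt_ids:
        subdir = "train"
    elif utt_id in valid_utt_ids:
        subdir = "valid"
    # ...so an utt_id in neither list raises UnboundLocalError here.
    return subdir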

Some questions

  1. Are the mel outputs generated compatible with kan-bayashi's ParallelWaveGAN?
  2. There's a FastSpeech synthesis example, but not a Tacotron-2 one. How can I generate speech with the pretrained Tacotron-2 model and MelGAN-STFT?
