
wavenet_vocoder's Introduction

WaveNet vocoder


NOTE: This is the development version. If you need a stable version, please check out v0.1.1.

The goal of the repository is to provide an implementation of the WaveNet vocoder, which can generate high quality raw speech samples conditioned on linguistic or acoustic features.

Audio samples are available at https://r9y9.github.io/wavenet_vocoder/.

News

Online TTS demo

A notebook intended to be run on https://colab.research.google.com is available.

Highlights

  • Focus on local and global conditioning of WaveNet, which is essential for a vocoder.
  • 16-bit raw audio modeling by mixture distributions: mixture of logistics (MoL), mixture of Gaussians, and single Gaussian distributions are supported.
  • Various audio samples and pre-trained models
  • Fast inference by caching intermediate states in convolutions, similar to arXiv:1611.09482.
  • Integration with ESPNet (https://github.com/espnet/espnet)

Pre-trained models

Note: This is not itself a text-to-speech (TTS) model. With a pre-trained model provided here, you can synthesize a waveform given a mel spectrogram, not raw text. You will need a mel-spectrogram prediction model (such as Tacotron2) to use the pre-trained models for TTS.

Note: The pre-trained model for LJSpeech was fine-tuned multiple times and trained for more than 1000k steps in total. Please refer to issues #1, #75, and #45 for details on how the model was trained.

Model URL | Data       | Hyper params URL | Git commit | Steps
link      | LJSpeech   | link             | 2092a64    | 1000k~ steps
link      | CMU ARCTIC | link             | b1a1076    | 740k steps

To use a pre-trained model, first check out the specific git commit noted above, i.e.,

git checkout ${commit_hash}

Then follow the "Synthesize from a checkpoint" section in the README. Note that an old version of synthesis.py may not accept the --preset=<json> parameter, and you might have to change hparams.py according to the preset (JSON) file.

For example, you could try:

# Assuming you have downloaded LJSpeech-1.1 at ~/data/LJSpeech-1.1
# pretrained model (20180510_mixture_lj_checkpoint_step000320000_ema.pth)
# hparams (20180510_mixture_lj_checkpoint_step000320000_ema.json)
git checkout 2092a64
python preprocess.py ljspeech ~/data/LJSpeech-1.1 ./data/ljspeech \
  --preset=20180510_mixture_lj_checkpoint_step000320000_ema.json
python synthesis.py --preset=20180510_mixture_lj_checkpoint_step000320000_ema.json \
  --conditional=./data/ljspeech/ljspeech-mel-00001.npy \
  20180510_mixture_lj_checkpoint_step000320000_ema.pth \
  generated

You can find the generated wav file in the generated directory. Wondering how it works? Take a look at the code :)
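
For example, a quick way to inspect the generated audio with SciPy (a minimal sketch; the exact output filename depends on the checkpoint and conditioning feature names):

import glob
from scipy.io import wavfile

# List every wav produced under the "generated" directory and print basic info.
for path in sorted(glob.glob("generated/**/*.wav", recursive=True)):
    sr, x = wavfile.read(path)
    print(path, sr, x.dtype, x.shape)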

Repository structure

The repository consists of 1) a PyTorch library, 2) command line tools, and 3) ESPnet-style recipes. The first is a PyTorch library that provides WaveNet functionality. The second is a set of tools to run WaveNet training/inference, data processing, etc. The last is a set of reproducible recipes combining the WaveNet library and the utility tools. Please take a look at them depending on your purpose. If you want to build a WaveNet on your own dataset (probably the most likely case), the recipes are the way to go.

Requirements

  • Python 3
  • CUDA >= 8.0
  • PyTorch >= v0.4.0

Installation

git clone https://github.com/r9y9/wavenet_vocoder && cd wavenet_vocoder
pip install -e .

If you only need the library part, you can install it from PyPI:

pip install wavenet_vocoder
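
As a minimal sketch of the library API (not an official example; the constructor keyword names below are assumed to mirror the hyperparameters listed elsewhere in this README):

from wavenet_vocoder import WaveNet

# A mel-conditioned WaveNet with a 10-component mixture-of-logistics output:
# 30 output channels = 10 mixtures x 3 parameters (pi, mean, log_scale).
model = WaveNet(out_channels=30, layers=24, stacks=4,
                residual_channels=512, gate_channels=512, skip_out_channels=256,
                cin_channels=80,  # local conditioning on 80-dim mel-spectrograms
                gin_channels=-1)  # global conditioning disabled
print(model)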

Getting started

Kaldi-style recipes

The repository provides Kaldi-style recipes to make experiments reproducible and easily manageable. Available recipes are as follows:

  • mulaw256: WaveNet that uses categorical output distribution. The input is 8-bit mulaw quantized waveform.
  • mol: Mixture of Logistics (MoL) WaveNet. The input is 16-bit raw audio.
  • gaussian: Single-Gaussian WaveNet (a.k.a. teacher WaveNet of ClariNet). The input is 16-bit raw audio.

Each recipe has a run.sh, which specifies all the steps to perform WaveNet training/inference, including data preprocessing. Please see run.sh in the egs directory for details.
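
For reference, the 8-bit mu-law companding used by the mulaw256 recipe above follows the standard formula; here is a minimal NumPy sketch (illustrative helpers, not the repository's own util functions):

import numpy as np

def mulaw_encode(x, mu=255):
    # mu-law compress x in [-1, 1], then quantize to integer levels in [0, mu]
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(q, mu=255):
    # invert the quantization and the companding back to [-1, 1]
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) / mu * ((1 + mu) ** np.abs(y) - 1)

x = np.linspace(-1, 1, 5)
print(mulaw_encode(x))                # quantized levels in [0, 255]
print(mulaw_decode(mulaw_encode(x)))  # approximately recovers x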

NOTICE: Global conditioning for multi-speaker WaveNet is not supported in the above recipes (it shouldn't be difficult to implement, though). Please check v0.1.12 for the feature, or if you really need it, please raise an issue.

Apply recipe to your own dataset

The recipes are designed to be generic, so you can use them with any dataset. To apply the recipes to your own dataset, you need to put all the wav files in a single flat directory, i.e.:

> tree -L 1 ~/data/LJSpeech-1.1/wavs/ | head
/Users/ryuichi/data/LJSpeech-1.1/wavs/
├── LJ001-0001.wav
├── LJ001-0002.wav
├── LJ001-0003.wav
├── LJ001-0004.wav
├── LJ001-0005.wav
├── LJ001-0006.wav
├── LJ001-0007.wav
├── LJ001-0008.wav
├── LJ001-0009.wav

That's it! The last step is to modify db_root in run.sh or pass db_root as a command-line argument to run.sh.

./run.sh --stage 0 --stop-stage 0 --db-root ~/data/LJSpeech-1.1/wavs/

Step-by-step

A recipe typically consists of multiple steps. It is strongly recommended to run the recipe step by step the first time to understand how it works. To do so, specify stage and stop_stage as follows:

./run.sh --stage 0 --stop-stage 0
./run.sh --stage 1 --stop-stage 1
./run.sh --stage 2 --stop-stage 2

In typical situations, you'd need to specify CUDA devices explicitly, especially for the training step.

CUDA_VISIBLE_DEVICES="0,1" ./run.sh --stage 2 --stop-stage 2

Docs for command line tools

Command line tools are written with docopt. See each docstring for basic usage.

tojson.py

Dump hyperparameters to a json file.

Usage:

python tojson.py --hparams="parameters you want to override" <output_json_path>

preprocess.py

Usage:

python preprocess.py wavallin ${dataset_path} ${out_dir} --preset=<json>

train.py

Note: for multi-GPU training, make sure that batch_size % num_gpu == 0.

Usage:

python train.py --dump-root=${dump-root} --preset=<json>\
  --hparams="parameters you want to override"

evaluate.py

Given a directory that contains local conditioning features, synthesize waveforms for them.

Usage:

python evaluate.py ${dump_root} ${checkpoint} ${output_dir} --dump-root="data location"\
    --preset=<json> --hparams="parameters you want to override"

Options:

  • --num-utterances=<N>: Number of utterances to be generated. If not specified, all utterances are generated. This is useful for debugging.

synthesis.py

NOTICE: This is probably not working now. Please use evaluate.py instead.

Synthesize a waveform given a conditioning feature.

Usage:

python synthesis.py ${checkpoint_path} ${output_dir} --preset=<json> --hparams="parameters you want to override"

Important options:

  • --conditional=<path>: (Required for conditional WaveNet) Path to local conditioning features (.npy). If this is specified, the number of time steps to generate is determined by the size of the conditioning feature.
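
For example, you can estimate how many samples will be generated from a conditioning file by multiplying the number of mel frames by the hop size (a sketch assuming hop_size=256, as in the presets quoted elsewhere on this page):

import numpy as np

c = np.load("./data/ljspeech/ljspeech-mel-00001.npy")  # shape: (num_frames, 80)
hop_size = 256
print(c.shape, "->", c.shape[0] * hop_size, "samples to generate")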

Training scenarios

Training un-conditional WaveNet

NOTICE: This is probably not working now. Please check v0.1.1 for the working version.

python train.py --dump-root=./data/cmu_arctic/ \
    --hparams="cin_channels=-1,gin_channels=-1"

You have to disable global and local conditioning by setting gin_channels and cin_channels to negative values.

Training WaveNet conditioned on mel-spectrogram

python train.py --dump-root=./data/cmu_arctic/ --speaker-id=0 \
    --hparams="cin_channels=80,gin_channels=-1"

Training WaveNet conditioned on mel-spectrogram and speaker embedding

NOTICE: This is probably not working now. Please check v0.1.1 for the working version.

python train.py --dump-root=./data/cmu_arctic/ \
    --hparams="cin_channels=80,gin_channels=16,n_speakers=7"

Misc

Monitor with Tensorboard

Logs are dumped into the ./log directory by default. You can monitor them with TensorBoard:

tensorboard --logdir=log

List of papers that used the repository

Thank you very much!! If you find a new one, please submit a PR.

Sponsors

References

wavenet_vocoder's People

Contributors

aleksas, azraelkuan, candlewill, cbrom, jasonghent, mdda, ola-vish, petrochukm, r9y9, sadam1195, whyky


wavenet_vocoder's Issues

Tkinter's PhotoImage methods and Thread Safety?

Hi guys! Fantastic work with this implementation.

Hopefully a quick question today; all was training fine over LJSpeech, until 16k steps or so, when this error was thrown.

Save intermediate states at step 10000
Saved checkpoint: checkpoints/checkpoint_step000010000.pth
Saved averaged checkpoint: checkpoints/checkpoint_step000010000_ema.pth
Using averaged model for evaluation
Shape of local conditioning features: torch.Size([1, 80, 31])
Intial value: 0.0
6261it [1:11:40,  1.46it/s]
[train] Loss: 69.05194234611926█████████████| 7936/7936 [02:39<00:00, 49.77it/s]
289it [01:11,  4.05it/s]
[test] Loss: 66.40849115444303
4871it [55:02,  1.47it/s]Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7f409809d160>>
Traceback (most recent call last):
  File "/home/sven/anaconda2/envs/py36/lib/python3.6/tkinter/__init__.py", line 3501, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Traceback (most recent call last):
  File "/home/sven/anaconda2/envs/py36/lib/python3.6/multiprocessing/resource_sharer.py", line 142, in _serve
    with self._listener.accept() as conn:
  File "/home/sven/anaconda2/envs/py36/lib/python3.6/multiprocessing/connection.py", line 455, in accept
    deliver_challenge(c, self._authkey)
  File "/home/sven/anaconda2/envs/py36/lib/python3.6/multiprocessing/connection.py", line 722, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/home/sven/anaconda2/envs/py36/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/sven/anaconda2/envs/py36/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/sven/anaconda2/envs/py36/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Aborted (core dumped)

I've kept in some of the former prints to help contextualise the first error (the exception in tkinter).
I feel rather out of my depth trying to debug this, as I'm not sure where to begin. Is this something people have encountered before?

Thanks :)

Is it wrong for the discretized_mix_logistic_loss?

Hi, in the function discretized_mix_logistic_loss there is:
log_probs = torch.sum(log_probs, dim=-1, keepdim=True) + F.log_softmax(logit_probs, -1)
Why do torch.sum(log_probs, dim=-1, keepdim=True)?
The mixture probability is p = (π1*p1 + π2*p2 + ...) / (π1 + π2 + ...),
and F.log_softmax gives log(π1/(π1+π2+...)), log(π2/(π1+π2+...)), ...
So maybe it should be:
log_probs = log_probs + F.log_softmax(logit_probs, -1)
If I modify it this way, p is much bigger, like exp(-5); otherwise p is very small, like exp(-50), which seems unreasonable.
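
For context, the log-likelihood of a K-component mixture is normally evaluated with a log-sum-exp over the mixture dimension; this is general background, not a verdict on which line of code is correct:

\log p(x) = \log \sum_{k=1}^{K} \pi_k \, p_k(x)
          = \operatorname{logsumexp}_{k} \left( \log \pi_k + \log p_k(x) \right),
\qquad \text{with } \log \pi_k = \mathrm{log\_softmax}(\text{logits})_k .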

Error in extracting melspectrogram

Hi, @r9y9
I am using your tool to preprocess the waveforms of the LJSpeech database. However, I get this assertion error in mel-spectrogram extraction for every utterance: assert S.max <=0 and S.min-hparams.min_level_db >=0. I see you set min_level_db=-100, ref_level_db=20. Should I set different values depending on the speech database, or did I misunderstand something?

Thanks

Ask about Phoneme Segmentation and Phoneme Duration

Hi, @r9y9. First of all, thank you for such a brilliant implementation of WaveNet. I am now studying how to detect phoneme durations (the start and end time of each phoneme) from audio and align them with linguistic features, but I don't know how to do this. Can you share the idea you used to solve this problem and point to where the code in this repo does this job? Thanks!


P.S.: By the way, is it possible to train this repo on another language? Currently I'm working with Vietnamese using my own dataset (7 hours of audio and ARPABET linguistic features extracted from text).

Dilation conv kernel size=3

In hparams, kernel_size=3, while most implementations I came across use kernel size 2.
Does anyone have experience with the implications (mainly output quality) of using kernel size 3 vs 2?
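
For a rough comparison, the receptive field grows linearly with (kernel_size - 1); the small sketch below reproduces the "Receptive field (samples / ms): 505" figure from the training log quoted further down this page (layers=24, stacks=4, kernel_size=3). The function signature is assumed, not the repository's exact API:

def receptive_field_size(total_layers, num_cycles, kernel_size):
    # dilations double within each cycle (1, 2, 4, ...) and reset for every stack
    layers_per_cycle = total_layers // num_cycles
    dilations = [2 ** (i % layers_per_cycle) for i in range(total_layers)]
    return (kernel_size - 1) * sum(dilations) + 1

print(receptive_field_size(24, 4, kernel_size=3))  # 505 samples
print(receptive_field_size(24, 4, kernel_size=2))  # 253 samples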

Wave Noise

Hi,

In your demo waves, I notice some glitch noise; this noise may be generated by WaveNet's random sampling.

Do you have a more recent model without that kind of noise?

Multi speaker embedding not working?!

if g is not None:
    g = self.embed_speakers(g.view(B, -1))
    assert g.dim() == 3
    # (B x gin_channels, 1)
    g = g.transpose(1, 2)
g_bct = _expand_global_features(B, T, g, bct=True)

I am trying to get the multi-speaker conditioning to work (using it as a library). So g.shape is (B x C'') and the embedding_dim is D, so line 191 gives (B x D x C'').
Yet, _expand_global_features expects a g of shape (B x C) or (B x C x 1) to expand C over the whole sequence.

Is an embedding necessary to train multiple speakers? Or could a one_hot encoding be sufficient?
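
For what it's worth, the shapes line up when g is a (B, 1) tensor of speaker ids: the embedding yields (B, 1, gin_channels) and the transpose gives the (B, C, 1) form that _expand_global_features expects. A small standalone sketch (illustrative, not the repository's code):

import torch
import torch.nn as nn

B, T, n_speakers, gin_channels = 2, 100, 7, 16
embed_speakers = nn.Embedding(n_speakers, gin_channels)

g = torch.LongTensor([[0], [3]])      # (B, 1): one speaker id per utterance
g = embed_speakers(g.view(B, -1))     # (B, 1, gin_channels)
g = g.transpose(1, 2)                 # (B, gin_channels, 1)
g_bct = g.expand(B, gin_channels, T)  # broadcast the embedding over all time steps
print(g_bct.shape)                    # torch.Size([2, 16, 100])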

Cannot reproduce as good audios as the demo shows

Hi, @r9y9, thanks for sharing this great repo. I have run your code on the CMU ARCTIC clb data to train a single-speaker WaveNet. I think I have set all the model hparams the same as shown on your demo website, but the test audio coming out of the 170k-step checkpoint still contains a lot of noise. Below are the hparams; I have left the other code untouched.
Hope you can help me, thanks.
input_type="mulaw",
quantize_channels=256, # 65536 or 256

# Audio:
sample_rate=16000,
# this is only valid for mulaw is True
silence_threshold=2,
num_mels=80,
fmin=125,
fmax=7600,
fft_size=1024,
# shift can be specified by either hop_size or frame_shift_ms
hop_size=256,
frame_shift_ms=None,
min_level_db=-100,
ref_level_db=20,
# whether to rescale waveform or not.
# Let x is an input waveform, rescaled waveform y is given by:
# y = x / np.abs(x).max() * rescaling_max
rescaling=True,
rescaling_max=0.999,
# mel-spectrogram is normalized to [0, 1] for each utterance and clipping may
# happen depends on min_level_db and ref_level_db, causing clipping noise.
# If False, assertion is added to ensure no clipping happens.o0
allow_clipping_in_normalization=True,

# Mixture of logistic distributions:
log_scale_min=float(np.log(1e-14)),

# Model:
# This should equal to `quantize_channels` if mu-law quantize enabled
# otherwise num_mixture * 3 (pi, mean, log_scale)
out_channels=10 * 3,
layers=16,
stacks=2,
residual_channels=512,
gate_channels=512,  # split into 2 gropus internally for gated activation
skip_out_channels=256,
dropout=1 - 0.95,
kernel_size=3,
# If True, apply weight normalization as same as DeepVoice3
weight_normalization=True,

# Local conditioning (set negative value to disable))
cin_channels=80,
# If True, use transposed convolutions to upsample conditional features,
# otherwise repeat features to adjust time resolution
upsample_conditional_features=True,
# should np.prod(upsample_scales) == hop_size
upsample_scales=[4, 4, 4, 4],
# Freq axis kernel size for upsampling network
freq_axis_kernel_size=3,

# Global conditioning (set negative value to disable)
# currently limited for speaker embedding
# this should only be enabled for multi-speaker dataset
gin_channels=-1,  # i.e., speaker embedding dim
n_speakers=7,  # 7 for CMU ARCTIC

# Data loader
pin_memory=True,
num_workers=2,

# train/test
# test size can be specified as portion or num samples
test_size=0.0441,  # 50 for CMU ARCTIC single speaker
test_num_samples=None,
random_state=1234,

# Loss

# Training:
batch_size=2,
adam_beta1=0.9,
adam_beta2=0.999,
adam_eps=1e-8,
initial_learning_rate=1e-3,
# see lrschedule.py for available lr_schedule
lr_schedule="noam_learning_rate_decay",
lr_schedule_kwargs={},  # {"anneal_rate": 0.5, "anneal_interval": 50000},
nepochs=2000,
weight_decay=0.0,
clip_thresh=-1,
# max time steps can either be specified as sec or steps
# This is needed for those who don't have huge GPU memory...
# if both are None, then full audio samples are used
max_time_sec=None,
max_time_steps=8000,
# Hold moving averaged parameters and use them for evaluation
exponential_moving_average=True,
# averaged = decay * averaged + (1 - decay) * x
ema_decay=0.9999,

# Save
# per-step intervals
checkpoint_interval=10000,
train_eval_interval=10000,
# per-epoch interval
test_eval_epoch_interval=5,
save_optimizer_state=True,

Training time

How long does it take to train on the LJSpeech dataset up to 800k steps?

Thanks!

KeyError when trying synthesis.py on cmu_arctic

Hi,

After successfully running python preprocess.py cmu_arctic ./../../data/cmu_arctic ./../tmp_cmu_arctic
and
python train.py --data-root=./../tmp_cmu_arctic --hparams="cin_channels=80,gin_channels=16,n_speakers=7"

when I try to use synthesis.py, I get the following KeyError when trying to load the model at line 149:

(wavenet)v-ricardo@gpu14:/work/smg/v-ricardo/EXPERIMENTS/wavenet/wavenet_vocoder$ python synthesis.py ./checkpoints/checkpoint_step000030000.pth ./output_cmu_arctic/
Using TensorFlow backend.
Command line args:
{'--conditional': None,
'--file-name-suffix': '',
'--help': False,
'--hparams': '',
'--initial-value': None,
'--length': '32000',
'--output-html': False,
'--speaker-id': None,
'': './checkpoints/checkpoint_step000030000.pth',
'<dst_dir>': './output_cmu_arctic/'}
Load checkpoint from ./checkpoints/checkpoint_step000030000.pth
Traceback (most recent call last):
File "synthesis.py", line 149, in
model.load_state_dict(checkpoint["state_dict"])
File "/work/smg/v-ricardo/EXPERIMENTS/wavenet/lib/python3.5/site-packages/torch/nn/modules/module.py", line 490, in load_state_dict
.format(name))
KeyError: 'unexpected key "conv_layers.0.conv1x1g.bias" in state_dict'

Has anyone else run into this problem before? Or do you know how I can get around it?
There was no problem whatsoever during preprocessing or training.

Thank you!

+0.01 in audio._amp_to_db(x) ???

What is the reason for raising the floor of the spectrogram in the amp to db function?

Did this give better sound quality?

def _amp_to_db(x):
    return 20 * np.log10(np.maximum(1e-5, x + 0.01))  # why the + 0.01??


def _db_to_amp(x):
    return np.power(10.0, x * 0.05) # - 0.01 should be at end there??

And a small point: shouldn't the inverse function have - 0.01 at the end to compensate for this?

Thanks in advance!
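
For comparison, a strictly symmetric pair would subtract the same offset after inverting, as the question suggests; this is just a sketch, not the repository's code:

import numpy as np

def _amp_to_db_sym(x):
    return 20 * np.log10(np.maximum(1e-5, x + 0.01))

def _db_to_amp_sym(x):
    # undo the +0.01 floor offset so the round trip approximately recovers x
    return np.power(10.0, x * 0.05) - 0.01

x = np.array([0.0, 0.1, 0.5, 1.0])
print(_db_to_amp_sym(_amp_to_db_sym(x)))  # close to the original x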

Hi,is there any proof for how discretized_mix_logistic_loss works?

Hi:
I read the paper that models P(x|π,µ,s) as a discretized mixture of logistic distributions, and I read the PixelCNN++ code and yours, but I'm still not very clear on why it is written like this. In fact, I don't know what the coeffs variable in PixelCNN++ represents and why it is removed in your code.
Is there any mathematical proof for this?

Parallel WaveNet

Hi,

I am currently working on parallel WaveNet and noticed that it is one of your planned TODOS.

So far I can only generate wavs that are distinguishable but with a lot of noise. From my experience, the power loss is really crucial, and we need to do sufficient sampling per step to approximately compute the cross-entropy term (if it is not sufficient, the high variance of the gradients probably makes the model collapse).

Do you have anything in mind that may help? Thanks

RFC: Speech samples and pre-trained models

Dear folks,

I'll be giving a presentation about this project this month and I'm looking for speech samples (or models) I can share in the presentation. If you are doing cool things with the repository and would kindly allow me to introduce your work, please let me know.

I'm afraid I haven't been able to play with Tacotron2 myself yet due to time constraints, so I would be very happy if you could share a Tacotron2 + WaveNet vocoder integration or similar variants, e.g., DeepVoice3 + WaveNet. Others are of course welcome.

Out of Memory on Synthesis

When running python synthesis.py <model_checkpoint_path> <output_dir> --conditional <mel_path>, I consistently run out of GPU memory about 4 minutes into synthesis. I have a GTX 1080 Ti (11 GB memory), and when I watch nvidia-smi while synthesis is running, the memory usage continually increases until it runs out. How much GPU memory is generally required to synthesize a clip?

For reference, here is the progress on ljspeech-mel-00001.npy before it failed most recently:
33249/195328 [03:57<19:18, 139.90it/s]

Running out of GPU memory!

The error below is thrown when I try to train wavenet_vocoder with the default parameters in a Jupyter Notebook on Google's Colaboratory site:

!cd ./wavenet_vocoder && python train.py --data-root=data/ljspeech

/usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Command line args:
 {'--checkpoint': None,
 '--checkpoint-dir': 'checkpoints',
 '--data-root': 'data/ljspeech',
 '--help': False,
 '--hparams': '',
 '--log-event-path': None,
 '--reset-optimizer': False,
 '--restore-parts': None,
 '--speaker-id': None}
Hyperparameters:
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_eps: 1e-08
  allow_clipping_in_normalization: False
  batch_size: 2
  builder: wavenet
  checkpoint_interval: 10000
  cin_channels: 80
  clip_thresh: -1
  dropout: 0.050000000000000044
  ema_decay: 0.9999
  exponential_moving_average: True
  fft_size: 1024
  fmax: 7600
  fmin: 125
  frame_shift_ms: None
  freq_axis_kernel_size: 3
  gate_channels: 512
  gin_channels: -1
  hop_size: 256
  initial_learning_rate: 0.001
  input_type: raw
  kernel_size: 3
  layers: 24
  log_scale_min: -32.23619130191664
  lr_schedule: noam_learning_rate_decay
  lr_schedule_kwargs: {}
  max_time_sec: None
  max_time_steps: 8000
  min_level_db: -100
  n_speakers: 7
  name: wavenet_vocoder
  nepochs: 2000
  num_mels: 80
  num_workers: 2
  out_channels: 30
  pin_memory: True
  preset: 
  presets: {}
  quantize_channels: 65536
  random_state: 1234
  ref_level_db: 20
  rescaling: True
  rescaling_max: 0.999
  residual_channels: 512
  sample_rate: 22050
  save_optimizer_state: True
  silence_threshold: 2
  skip_out_channels: 256
  stacks: 4
  test_eval_epoch_interval: 5
  test_num_samples: None
  test_size: 0.0441
  train_eval_interval: 10000
  upsample_conditional_features: True
  upsample_scales: [4, 4, 4, 4]
  weight_decay: 0.0
  weight_normalization: True
Local conditioning enabled. Shape of a sample: (426, 80).
[train]: length of the dataset is 12522
Local conditioning enabled. Shape of a sample: (539, 80).
[test]: length of the dataset is 578
WaveNet(
  (first_conv): Conv1d (1, 512, kernel_size=(1,), stride=(1,))
  (conv_layers): ModuleList(
    (0): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(2,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (1): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(2,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (2): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(4,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (3): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(8,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (4): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(16,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (5): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(32,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (6): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(2,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (7): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(2,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (8): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(4,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (9): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(8,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (10): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(16,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (11): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(32,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (12): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(2,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (13): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(2,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (14): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(4,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (15): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(8,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (16): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(16,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (17): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(32,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (18): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(2,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (19): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(2,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (20): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(4,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (21): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(8,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (22): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(16,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
    (23): ResidualConv1dGLU(
      (conv): Conv1d (512, 512, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(32,))
      (conv1x1c): Conv1d (80, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_out): Conv1d (256, 512, kernel_size=(1,), stride=(1,))
      (conv1x1_skip): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    )
  )
  (last_conv_layers): ModuleList(
    (0): ReLU(inplace)
    (1): Conv1d (256, 256, kernel_size=(1,), stride=(1,))
    (2): ReLU(inplace)
    (3): Conv1d (256, 30, kernel_size=(1,), stride=(1,))
  )
  (upsample_conv): ModuleList(
    (0): ConvTranspose2d (1, 1, kernel_size=(3, 4), stride=(1, 4), padding=(1, 0))
    (1): ReLU(inplace)
    (2): ConvTranspose2d (1, 1, kernel_size=(3, 4), stride=(1, 4), padding=(1, 0))
    (3): ReLU(inplace)
    (4): ConvTranspose2d (1, 1, kernel_size=(3, 4), stride=(1, 4), padding=(1, 0))
    (5): ReLU(inplace)
    (6): ConvTranspose2d (1, 1, kernel_size=(3, 4), stride=(1, 4), padding=(1, 0))
    (7): ReLU(inplace)
  )
)
Receptive field (samples / ms): 505 / 22.90249433106576
Los event path: log/run-test2018-02-17_03:59:27.681235
0it [00:00, ?it/s]THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory

Traceback (most recent call last):
  File "train.py", line 961, in <module>
    train_loop(model, data_loaders, optimizer, writer, checkpoint_dir=checkpoint_dir)
  File "train.py", line 724, in train_loop
    checkpoint_dir, eval_dir, do_eval, ema)
  File "train.py", line 639, in __train_step
    y_hat = model(x, c=c, g=g, softmax=False)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/wavenet_vocoder/wavenet_vocoder/wavenet.py", line 207, in forward
    x, h = f(x, c, g_bct)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/wavenet_vocoder/wavenet_vocoder/modules.py", line 122, in forward
    return self._forward(x, c, g, False)
  File "/content/wavenet_vocoder/wavenet_vocoder/modules.py", line 140, in _forward
    x = F.dropout(x, p=self.dropout, training=self.training)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 526, in dropout
    return _functions.dropout.Dropout.apply(input, p, training, inplace)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/_functions/dropout.py", line 32, in forward
    output = input.clone()
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

training

Thanks for sharing your comprehensive code!
I've just started reading and running your code. In hparams, nepochs is set to 2000, but training seems to stop at the first epoch, as shown in TensorBoard:
[TensorBoard screenshot]
It seems that the training loop only depends on nepochs. Is there another parameter to set to keep the training process going?

Using DeepVoice3 mels spectrogram as wavenet vocoder's input

I hope someone can help me with a problem I ran into while trying to run both the DeepVoice3 and WaveNet systems.

When I run TTS with DeepVoice3 on LJSpeech, I get a robotic sound.
I know that WaveNet can predict better results.
I extracted the mel spectrogram from the DeepVoice3 system (hopefully I did it right) and tried to use it as input for the WaveNet system.
The results weren't so good, as the voice still has a robotic sound and lots of background whistling.

I am using Ubuntu 16.04.
The model I used is LJ 410k iters.

Hopefully someone has advice on how I can improve the results.

Thanks in advance,
Yishai

preprocess error while pip install -e ".[train]" OK on windows 7

Hi r9y9, thanks for your work!

I only have a borrowed machine with Windows 7 and a 1060 GPU, and I tried pip install -e ".[train]".

After I ran the preprocess step, it popped up this error:

D:\code\wavenet_vocoder-master>python preprocess.py --preset=presets/ljspeech_mixture.json ljspeech ./LJSpeech-1.1/ training_data_ljspeech
C:\Users\butter\AppData\Local\Programs\Python\Python35\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype
from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Sampling frequency: 22050
Traceback (most recent call last):
  File "preprocess.py", line 59, in <module>
    mod = importlib.import_module(name)
  File "C:\Users\butter\AppData\Local\Programs\Python\Python35\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "D:\code\wavenet_vocoder-master\ljspeech.py", line 12, in <module>
    from wavenet_vocoder.util import is_mulaw_quantize, is_mulaw, is_raw
  File "D:\code\wavenet_vocoder-master\wavenet_vocoder\__init__.py", line 6, in <module>
    from .wavenet import receptive_field_size, WaveNet
  File "D:\code\wavenet_vocoder-master\wavenet_vocoder\wavenet.py", line 7, in <module>
    import torch
  File "C:\Users\butter\AppData\Local\Programs\Python\Python35\lib\site-packages\torch\__init__.py", line 78, in <module>
    from torch._C import *
ImportError: DLL load failed: The specified module could not be found.

For PyTorch, I followed the instructions:

pip3 install http://download.pytorch.org/whl/cu90/torch-0.4.0-cp35-cp35m-win_amd64.whl 
pip3 install torchvision

And for bandmat I downloaded the code and built it locally to work around the -Wno-unused-but-set-variable issue.

Do you have any idea how to get past this? PyTorch seems to install fine:

from torch._C import *
ImportError: DLL load failed: The specified module could not be found.

When synthesizing a wav file using synthesis.py, it crashes!

Traceback (most recent call last):
File "synthesis.py", line 151, in
waveform = wavegen(model, length, c=c, g=speaker_id, initial_value=initial_value, fast=True)
File "synthesis.py", line 103, in wavegen
initial_input, c=c, g=g, T=length, tqdm=tqdm, softmax=True, quantize=True)
File "wavenet.py", line 234, in incremental_forward
assert c.size(-1) == T
AssertionError

Music generation

Things I want to try if I get a chance, Comments and requests are welcome.

  • Music generation (unconditional)
  • MIDI-conditioned piano sound generation

Why lws and not regular stft?

Is there a reason for using lws?

Do you use reconstructed phase somewhere to improve quality?

Thanks in advance, Duvte.

About the synthesis time

Hi, r9y9,
Thanks for sharing the great work. Can you share the detailed time it takes to synthesize a wav? Thank you.

Is it wrong for these code?

y_hat = model(x, c=c, g=g, softmax=False)

if is_mulaw_quantize(hparams.input_type):
    # wee need 4d inputs for spatial cross entropy loss
    # (B, C, T, 1)
    y_hat = y_hat.unsqueeze(-1)
    loss = criterion(y_hat[:, :, :-1, :], y[:, 1:, :], mask=mask) # criterion  is nn.CrossEntropyLoss
else:
    loss = criterion(y_hat[:, :, :-1], y[:, 1:, :], mask=mask)

I am confused by this code: why is y_hat's shape (B, C, T, 1) while y's shape is (B, T, 1)? Why isn't y_hat's shape (B, T, 1, C)?
I read PyTorch's docs on the CrossEntropyLoss function.

class torch.nn.CrossEntropyLoss(weight=None, size_average=True, ignore_index=-100, reduce=True)
Input: (N,C) where C = number of classes
Target: (N) where each value is 0≤targets[i]≤C−1
Examples:

loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.LongTensor(3).random_(5)
output = loss(input, target)
output.backward()

Can somebody give me some help?
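
For what it's worth, PyTorch's nn.CrossEntropyLoss also accepts K-dimensional input of shape (N, C, d1, ...) with targets of shape (N, d1, ...), i.e. the class dimension has to stay at dim=1, which matches the (B, C, T, 1) / (B, T, 1) shapes above. A minimal standalone check (plain CrossEntropyLoss, not the masked criterion used in train.py):

import torch
import torch.nn as nn

B, C, T = 2, 256, 100
criterion = nn.CrossEntropyLoss()

y_hat = torch.randn(B, C, T, 1)     # logits, classes at dim=1
y = torch.randint(0, C, (B, T, 1))  # integer class targets, no class dimension
loss = criterion(y_hat, y)
print(loss.item())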

arctic multi speaker generate

Hi, I am working on a WaveNet vocoder using TensorFlow, and I can generate good voices with the ARCTIC, LJSpeech, and a Chinese dataset when not conditioning on speaker id. I want to ask:

If I want to generate the wav of speaker id 0 in the ARCTIC dataset, should I use the local conditioning of speaker id 0, or can I use the local conditioning of any other speaker id?

  • speaker id 0 + local conditioning 0 ==> wav of speaker id 0 (good)
  • speaker id 1 + local conditioning 0 ==> also the wav of speaker id 0

I notice that in the ibab/tensorflow-wavenet repository, one can generate the wav of any speaker id.

Multi-GPU Support

Are there any plans to add multi-gpu support? I didn't see anything in the TODO list. Thanks!

Planned TODOs

This is an umbrella issue to track progress for my planned TODOs. Comments and requests are welcome.

Goal

  • achieve higher speech quality than conventional vocoders (WORLD, Griffin-Lim, etc.)
  • provide pre-trained model of WaveNet-based mel-spectrogram vocoder

Model

  • 1D dilated convolution
  • batch forward
  • incremental inference
  • local conditioning
  • global conditioning
  • upsampling network (by transposed convolutions)

Training script

  • Local conditioning
  • Global conditioning
  • Configurable maximum number of time steps (to avoid out of memory error). 58ad07f

Experiments

  • unconditioned WaveNet trained with CMU Arctic
  • conditioning model on mel-spectrogram (local conditioning) with CMU Arctic
  • conditioning model on mel-spectrogram and speaker id with CMU Arctic
  • conditioning model on mel-spectrogram (local conditioning) with LJSpeech
  • DeepVoice3 + WaveNet vocoder r9y9/deepvoice3_pytorch#21

Misc

  • [ ] Time sliced data generator?
  • Travis CI
  • Train/val split
  • README

Sampling frequency

  • 4kHz
  • 16kHz
  • 22.5kHz
  • 44.1kHz
  • 48kHz

Advanced (lower priority)

error in synthesis

Traceback (most recent call last):
File "synthesis.py", line 181, in
waveform = wavegen(model, length, c=c, g=speaker_id, initial_value=initial_value, fast=True)
File "synthesis.py", line 84, in wavegen
assert c.ndim == 2
AssertionError

Guiding synthesis with low-fidelity waveform?

Hi @r9y9, brilliant work with your implementation!

Could you tell me if I can pass in lo-fi audio at synthesis time to help guide generation?
I have audio at a sample rate of 4kHz, along with Mel-spectrogram predictions of that audio upsampled to 16kHz. I want to decode these spectrograms with Wavenet. Can I pass in the original lo-fi audio so that 1 in 4 synthesized samples will be ground truth, and 3 in 4 will be 'filled in'?

I see you have test_inputs defined as an argument for incremental_forward in wavenet.py. Is this an argument that could be altered for such a purpose?

Thanks very much!

  • Sven

It's not necessary to sum up 'log_probs' at the last dimension in discretized_mix_logistic_loss function

Hi Ryuichi,

Thanks for your great work.

Forgive me if I got this wrong but I think it's not necessary to sum up log_probs at the last dimension here. Saying

log_probs = torch.sum(log_probs, dim=-1, keepdim=True) + F.log_softmax(logit_probs, -1)

should be

log_probs = log_probs + F.log_softmax(logit_probs, -1)

I saw your code is adapted from pixel-cnn, but these two cases are different. Since a pixel has 3 channels (RGB) in PixelCNN and log_probs has shape [B, H, W, C, num_mix] before this line, the mixture model is built per pixel, not per channel, so they get the probability of each pixel with tf.reduce_sum(log_probs, axis=3). But here for WaveNet, the waveform has only one channel, so log_probs has shape [B, T, num_mix] and the mixture model is built on that channel, so there is no need to sum over any dimension. In addition, your code summing over the num_mix dimension is also inconsistent with PixelCNN (which sums over the channel dimension).

Feel free to ignore me if I'm wrong,
Thanks

the problem of pre-model

When I use the pre-trained model, an error appears and I don't know the reason. Can you give me some suggestions? Thank you very much.
yan@yan-Default-string:~/文档/beifen/wavenet_vocoder-master$ python synthesis.py --conditional=./data/ljspeech/ljspeech-mel-00001.npy ./checkpoints/checkpoint_step000410000_ema.pth generated
/home/yan/anaconda3/lib/python3.6/site-packages/h5py/init.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Command line args:
{'--conditional': './data/ljspeech/ljspeech-mel-00001.npy',
'--file-name-suffix': '',
'--help': False,
'--hparams': '',
'--initial-value': None,
'--length': '32000',
'--output-html': False,
'--preset': None,
'--speaker-id': None,
'': './checkpoints/checkpoint_step000410000_ema.pth',
'<dst_dir>': 'generated'}
Load checkpoint from ./checkpoints/checkpoint_step000410000_ema.pth
Traceback (most recent call last):
File "synthesis.py", line 175, in
checkpoint = torch.load(checkpoint_path)
File "/home/yan/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 267, in load
return _load(f, map_location, pickle_module)
File "/home/yan/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 405, in _load
return legacy_load(f)
File "/home/yan/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 341, in legacy_load
tar.extract('storages', path=tmpdir)
File "/home/yan/anaconda3/lib/python3.6/tarfile.py", line 2038, in extract
tarinfo = self.getmember(member)
File "/home/yan/anaconda3/lib/python3.6/tarfile.py", line 1749, in getmember
raise KeyError("filename %r not found" % name)
KeyError: "filename 'storages' not found"

Speech to Speech

Does it make sense to use the WaveNet vocoder as-is for speech-to-speech? For example, can I record my voice, generate a mel-spectrogram, and then use a model pre-trained on the LJSpeech dataset to re-speak it?

I've been trying this and the results don't sound good.
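
If you try this, the mel features generally have to be computed with exactly the same parameters and normalization the checkpoint was trained with. Below is a hedged librosa-based sketch using the LJSpeech-style hyperparameters quoted elsewhere on this page (fft_size=1024, hop_size=256, num_mels=80, fmin=125, fmax=7600, sample_rate=22050, min_level_db=-100, ref_level_db=20); the repository's own audio.py (which uses lws) may differ in detail, and the file names are hypothetical:

import librosa
import numpy as np

def extract_mel(path, sr=22050, fft_size=1024, hop_size=256, num_mels=80,
                fmin=125, fmax=7600, min_level_db=-100, ref_level_db=20):
    x, _ = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=fft_size,
                                       hop_length=hop_size, n_mels=num_mels,
                                       fmin=fmin, fmax=fmax, power=1.0)
    S = 20 * np.log10(np.maximum(1e-5, S)) - ref_level_db  # amplitude -> dB
    S = np.clip((S - min_level_db) / -min_level_db, 0, 1)  # normalize to [0, 1]
    return S.T.astype(np.float32)                          # (frames, num_mels)

np.save("my_voice-mel.npy", extract_mel("my_voice.wav"))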

Taking Tacotron2 output to wavenet vocoder

Has anyone had experience with the above?
I guess the audio hparams need to be the same for both. My intuition for using LJSpeech:

Settings for the Tacotron 2 implementation (https://github.com/Rayhane-mamah/Tacotron-2):
num_mels=80
num_freq=1025; in the WaveNet code fft_size=1024, in T2 fft_size=(1025-1)*2=2048. As far as I understand I can keep this as-is, since it all gets accumulated into mel bands anyway.
sample_rate=22050 (as in the LJSpeech dataset)
frame_length_ms=46.44 (corresponds to WaveNet's fft_size/22050).
frame_shift_ms=11.61 (corresponds to WaveNet's hop_size=256; 256/22050 = 11.61 ms)
preemphasis: not available in the r9y9 WaveNet implementation
Others: in T2 I don't have fmin (125 in WaveNet) and fmax (7600 in WaveNet). Looking into the T2 code,
the spectrogram fmin is set to 0 and fmax to fsample/2 = 22050/2 = 11025 Hz. Since I'm using a pre-trained WaveNet model I guess I'll need to change params in the T2 code.

Any remarks, suggestions?
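
For a quick cross-check of the frame-shift arithmetic above (plain Python; the values are just the hparams quoted in this thread):

sample_rate = 22050
fft_size = 1024
hop_size = 256

frame_length_ms = fft_size / sample_rate * 1000  # ~46.44 ms
frame_shift_ms = hop_size / sample_rate * 1000   # ~11.61 ms
print(round(frame_length_ms, 2), round(frame_shift_ms, 2))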

Google releases Cloud TTS speech engine that uses WaveNet

Google just released a Cloud TTS service that uses wavenet.

Google Cloud Text-to-Speech enables developers to synthesize natural-sounding speech with 30 voices, available in multiple languages and variants. It applies DeepMind’s groundbreaking research in WaveNet and Google’s powerful neural networks to deliver the highest fidelity possible. As an easy-to-use API, you can create lifelike interactions with your users, across many applications and devices.

Good pre-trained weights anyone?

First, thank you very much @r9y9 and everyone for the great work!

Does anyone want to share pre-trained weights that sound good?

Particularly for LJSpeech if possible. My training seems to be converging to a very high loss value. I would love to experiment with some sounds, and maybe figure out where I am going wrong in training.

Thanks in advance,
Duvte.

Quick Feedback

Hi, @r9y9, I have been training a few models over the last couple of days with wavenet_vocoder and I must say the results are impressive.

Here are some learning charts:
[learning-curve screenshots]

The blue line represents learning with 24 layers, 4 stacks.
The red line is 40 layers, 4 stacks.
I also have waveplots for both.
24 layers, 4 stacks:
[waveplot at step 255024]

40 layers, 4 stacks:
[waveplot at step 176640]

My data is roughly five hours of training data in total (split into training and testing). The audio results are quite impressive after just 140k steps; I am currently testing with 250k steps (24 layers, 4 stacks).

At some point, I will (hopefully) be able to test it with 65k quantization instead of 256, provided it fits into my GPUs' RAM.
Since I have a few 1080 Tis, I can train multiple models in parallel, and the training is super efficient (100% GPU utilization) and very, very fast.
What I would like to do is parallel training (on multiple GPUs) as well as (later on) parallel prediction.

The major task for me is to integrate your vocoder into Tacotron-2; hopefully I can show something towards the end of February. The other project is generating music, but that is just a side project.

Again, thanks for such great work, and if you have any pointers regarding multi-GPU training, they would be appreciated...

preprocess error

When I run preprocess.py, lws raises an error:
TypeError: '>' not supported between instances of 'NoneType' and 'int'
Should I use return lws.lws(hparams.fft_size, get_hop_size(), mode="speech", fftsize=hparams.fft_size) at the error line?
