
diffwave's Introduction

DiffWave


We're hiring! If you like what we're building here, come join us at LMNT.

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. log-scaled Mel spectrogram). The model and architecture details are described in DiffWave: A Versatile Diffusion Model for Audio Synthesis.

What's new (2021-11-09)

  • unconditional waveform synthesis (thanks to Andrechang!)

What's new (2021-04-01)

  • fast sampling algorithm based on v3 of the DiffWave paper

What's new (2020-10-14)

  • new pretrained model trained for 1M steps
  • updated audio samples with output from new model

Status (2021-11-09)

  • fast inference procedure
  • stable training
  • high-quality synthesis
  • mixed-precision training
  • multi-GPU training
  • command-line inference
  • programmatic inference API
  • PyPI package
  • audio samples
  • pretrained models
  • unconditional waveform synthesis

Big thanks to Zhifeng Kong (lead author of DiffWave) for pointers and bug fixes.

Audio samples

22.05 kHz audio samples

Pretrained models

22.05 kHz pretrained model (31 MB, SHA256: d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8)

This pre-trained model is able to synthesize speech with a real-time factor of 0.87 (smaller is faster).

Pre-trained model details

  • trained on 4x 1080Ti
  • default parameters
  • single precision floating point (FP32)
  • trained on LJSpeech dataset excluding LJ001* and LJ002*
  • trained for 1000578 steps (1273 epochs)

Install

Install using pip:

pip install diffwave

or from GitHub:

git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .

Training

Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit params.py.
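
For orientation, here is a small sketch (assuming the pip-installed package exposes the params object defined in diffwave/params.py, as shown in the issues further down this page) that prints the defaults your dataset has to match:

from diffwave.params import params

# Defaults from the stock params.py; change sample_rate (and, if needed, the
# related spectrogram settings) in that file before preprocessing and training.
print(params.sample_rate, params.n_fft, params.hop_samples, params.n_mels)
# 22050 1024 256 80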

python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs

# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all

You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).

Multi-GPU training

By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
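
For example, to train on only the first two GPUs:

CUDA_VISIBLE_DEVICES=0,1 python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs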

Inference API

Basic usage:

import numpy as np
import torch

from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
# Get a spectrogram in [N,C,W] format: N = batch size, C = number of mel bands
# (80 by default), W = number of frames. For example, load one produced by
# diffwave.preprocess (the path below is illustrative):
spectrogram = torch.from_numpy(np.load('/path/to/audio.wav.spec.npy')).unsqueeze(0)
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)

# audio is a GPU tensor in [N,T] format.
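
If you want to write the result to disk, a straightforward follow-up (the package itself already uses torchaudio) is:

import torchaudio

torchaudio.save('output.wav', audio.cpu(), sample_rate=sample_rate)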

Inference CLI

python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav

References

  • Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro. DiffWave: A Versatile Diffusion Model for Audio Synthesis. ICLR 2021 (arXiv:2009.09761).

Contributors

  • andrechang
  • shaper
  • sharvil


diffwave's Issues

Loss function

Hi,

First, I would like to thank you for your implementation.
I have a question regarding the optimization. In the original paper the authors propose to minimize the L2 distance between the noise and the network output, but I noticed that the code uses the L1 loss. Is there a reason for that change?

Thank you,
Felix
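
For readers following along: the two objectives under discussion differ only in the distance applied between the injected noise and the network's prediction. A minimal sketch with stand-in tensors:

import torch
import torch.nn as nn

noise = torch.randn(4, 16000)      # stand-in for the Gaussian noise added to the audio
predicted = torch.randn(4, 16000)  # stand-in for the network's noise prediction

l2_loss = nn.MSELoss()(predicted, noise)  # the L2 objective written in the paper
l1_loss = nn.L1Loss()(predicted, noise)   # the L1 objective this implementation uses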

Question in the inference

The required spectrogram format is [N,C,W]:

spectrogram = # get your hands on a spectrogram in [N,C,W] format

Could you please explain these three dimensions?

I use the code from this repo: https://github.com/CorentinJ/Real-Time-Voice-Cloning to produce the mel spectrogram and use diffwave as the vocoder, but I only get audio full of noise.

Generate the mel spectrogram:

import torch

# synthesizer, texts and embeds come from the Real-Time-Voice-Cloning pipeline.
specs = synthesizer.synthesize_spectrograms(texts, embeds)  # len(specs) == 1
spec = specs[0]                                             # numpy.ndarray, float32, shape (80, 314)
spec = torch.tensor(spec)

Generating the waveform

diffwave_dir = "/hdd/haoran_project/diffwave-master/pretrained_models/diffwave-ljspeech-22kHz-1000578.pt"
generated_wav, sample_rate = diffwave_predict(spec, diffwave_dir, fast_sampling=True)

Save it to disk:

import torchaudio

filename = "results/diffwave_Elon.wav"
print(generated_wav.dtype, " ", generated_wav.shape)  # torch.float32 torch.Size([1, 87040])
torchaudio.save(filename, generated_wav.cpu(), sample_rate=sample_rate)
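
For context, the [N, C, W] layout is batch x mel bands x frames, and the model also expects the same normalization that diffwave.preprocess applies (the log-compression and clamping that appear in other snippets on this page). A hedged sketch of the conversion a spectrogram from another pipeline would typically need; the dB floor and offsets are taken from those snippets and are not verified against this particular checkpoint:

import torch

def to_diffwave_input(mel: torch.Tensor) -> torch.Tensor:
    # mel: a linear-power mel spectrogram of shape [n_mels, frames] (n_mels=80 for the LJSpeech model).
    mel = 20 * torch.log10(torch.clamp(mel, min=1e-5)) - 20
    mel = torch.clamp((mel + 100) / 100, 0.0, 1.0)
    return mel.unsqueeze(0)  # [1, n_mels, frames] == [N, C, W]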

Long sentences

Hi,

the model seems to be working fairly well (tested after just 100k steps on a 100-speaker 24 kHz dataset; it starts to sound reasonably good, but I guess it needs more epochs to achieve higher quality).

I just tested it on some random sentences and noticed that the GPU ran out of memory for long sentences. What would be the best approach to synthesizing long sentences? The baseline would be to split the mel spectrogram into parts and synthesize them separately, but I am not sure if this is the only way to go (a sketch of that baseline follows below).

Thank you for your help!

P.S.: I'll report some results after 1M steps.
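
As a baseline for the splitting approach mentioned above, a hedged sketch (the chunk size is arbitrary, and naive concatenation can leave audible seams at the chunk boundaries, so treat this only as the "split and synthesize separately" starting point):

import torch
from diffwave.inference import predict as diffwave_predict

def synthesize_in_chunks(spectrogram, model_dir, frames_per_chunk=400):
    # spectrogram: [1, n_mels, W]; split along W and vocode each piece separately.
    pieces, sample_rate = [], None
    for start in range(0, spectrogram.shape[-1], frames_per_chunk):
        chunk = spectrogram[:, :, start:start + frames_per_chunk]
        audio, sample_rate = diffwave_predict(chunk, model_dir, fast_sampling=True)
        pieces.append(audio.cpu())
    return torch.cat(pieces, dim=-1), sample_rate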

Trying to run training after preprocess

Following the steps in the README, after running

python3.7 -m diffwave ~/tf_runs/diffwave_test/ ~/datasets/R2

we get the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/matt/.local/lib/python3.7/site-packages/diffwave/__main__.py", line 53, in <module>
    main(parser.parse_args())
  File "/home/matt/.local/lib/python3.7/site-packages/diffwave/__main__.py", line 39, in main
    train(args, params)
  File "/home/matt/.local/lib/python3.7/site-packages/diffwave/learner.py", line 169, in train
    dataset = dataset_from_path(args.data_dirs, params)
  File "/home/matt/.local/lib/python3.7/site-packages/diffwave/dataset.py", line 87, in from_path
    drop_last=True)
  File "/home/matt/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 262, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore
  File "/home/matt/.local/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 104, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

Training set: the files in ~/datasets/R2 are approximately 49,000 16-bit mono 44,100 Hz samples that have all been run through the preprocessing step successfully, generating .npy files. The sample rate has been updated accordingly in params.py.

Any help with this problem would be appreciated.

CUDA Memory Error During Training on k80

Hello,

I have been trying to train your awesome work on my custom dataset; however, I get the following error on a K80 irrespective of batch size (I even tried with batch size = 1).

RuntimeError: CUDA out of memory. Tried to allocate 124.00 MiB (GPU 0; 11.17 GiB total capacity; 10.65 GiB already allocated; 94.31 MiB free; 10.68 GiB reserved in total by PyTorch)

Would appreciate any help in this regard so that I might be able to put the model on training as soon as possible.

Thanks.

How much GPU ram? How to change batch size?

When trying to run python -m diffwave I'm getting an out-of-memory error on my 8 GB GPU.

How much memory is required for a given sample length? Most of my samples are <= 10 seconds.

Can the batch size be changed when running the pip-installed version?

environment.yml.zip

(cherokee-diffwave) muksihs@muksihs-omen:~/git/cherokee-diffwave$ python -m diffwave --fp16 --max_steps 5000000 models/ wavs/
Epoch 0:   0%|                                                                                                           | 0/2018 [00:00<?, ?it/s]/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/diffwave/dataset.py:39: UserWarning: torchaudio.backend.sox_io_backend.load_wav has been deprecated and will be removed from 0.9.0 release. Please use "torchaudio.load".
  signal, _ = torchaudio.load_wav(audio_filename)
[the same UserWarning is repeated several more times]

Epoch 0:   0%|                                                                                                           | 0/2018 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/diffwave/__main__.py", line 52, in <module>
    main(parser.parse_args())
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/diffwave/__main__.py", line 39, in main
    train(args, params)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/diffwave/learner.py", line 169, in train
    _train_impl(0, model, dataset, args, params)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/diffwave/learner.py", line 163, in _train_impl
    learner.train(max_steps=args.max_steps)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/diffwave/learner.py", line 108, in train
    loss = self.train_step(features)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/diffwave/learner.py", line 136, in train_step
    predicted = self.model(noisy_audio, spectrogram, t)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/diffwave/model.py", line 139, in forward
    x = torch.sum(torch.stack(skip), dim=0) / sqrt(len(self.residual_layers))
RuntimeError: CUDA out of memory. Tried to allocate 62.00 MiB (GPU 0; 7.79 GiB total capacity; 6.58 GiB already allocated; 10.12 MiB free; 6.60 GiB reserved in total by PyTorch)

Unconditional synthesis

I'm running this command to generate unconditional samples:

python -m diffwave.inference --fast /path/to/model -o output.wav

I've trained for almost 4k epochs on 7k+ sounds. I seem to get the same sound (or a very similar one) regardless of training time.

I have not worked with diffwave before - any tips for debugging this?

Thanks

The audio has some noise

Hi, thanks for your good work.
I trained the model on a single-speaker dataset of 10,000 utterances; the loss curve is shown in the figure below. At inference time the audio clearly contains some noise. Is the dataset too small, or are there other reasons?

[image: training loss curve]

Error in version comparison in dataset.py

Hi.
Thank you for your great work.

I believe there is a better way to compare torchaudio versions: instead of comparing strings, we can use the packaging package and its packaging.version.parse function to compare versions properly.

To reproduce the error, install torchaudio==0.10.0 and run training with it.
'0.10.0' > '0.7.0' evaluates to False, which is incorrect for version comparison.

if torchaudio.__version__ > '0.7.0':
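
A short sketch of the suggested fix:

from packaging import version
import torchaudio

# version.parse compares release numbers numerically, whereas plain string
# comparison fails here ('0.10.0' > '0.7.0' is False lexicographically).
new_enough = version.parse(torchaudio.__version__) > version.parse('0.7.0')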

How to match tacotron2?

I have another problem: I'm trying to match tacotron2 (https://github.com/begeekmyfriend/tacotron2), but the generated audio contains only noise.
The TTS params already match diffwave; the only difference I found is the mel range (the preprocessing is different). Tacotron2's output mel range is [-4, 4], while diffwave's input mel range is [0, 1].
So I tried a few things to solve this problem (see the sketch after this list):

  • Inference-only change: rescale tacotron's mel range to [0, 1], as in the figure below. The result is better in that I can hear a human voice and some of the content, but the speaker's timbre is lost; it sounds like a machine.

[image: rescaled mel spectrogram]

  • Retraining: train diffwave on tacotron2's mels; after 800k steps the output is still only noise.

  • Retraining: rescale tacotron2's mels as in the first attempt and then train diffwave; after 350k steps the output is still only noise.

Do you have any good suggestions?
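
As a concrete version of the first attempt above, here is a linear map from Tacotron2's [-4, 4] range to the [0, 1] range this implementation expects. It is only a sketch: a rescale alone cannot fix differences in the underlying mel filterbank or compression.

import torch

def tacotron_to_diffwave_range(mel: torch.Tensor) -> torch.Tensor:
    # Map values from [-4, 4] to [0, 1] and clamp anything outside that range.
    return torch.clamp((mel + 4.0) / 8.0, 0.0, 1.0)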

Questions about the code

Thanks for your great work!
I'm studying your code and it has been a great help to me.

By the way, I have two questions about your scripts.

In learner.py, the variable 'noise_level' (line 50) takes a square root, so it seems to correspond to 'alpha_cumprod_sqrt' in the paper. But on lines 120-121 a square root is taken again, so 'noise_scale_sqrt' appears to be square-rooted a second time ('noise_scale' seems to equal 'alpha_cumprod_sqrt' from the paper). This made me think the input 'noisy_audio' differs from the original paper.

To confirm whether something was wrong, I trained models both with and without changing the script (removing the **0.5 on line 50), and I found that the unchanged model (your original script) performs better, which suggests your code is right!

I have checked the scripts several times, but I still find something odd (alpha_cumprod appears to be square-rooted twice), so I cannot explain the results.

May I ask you to confirm whether your code exactly matches the paper, and whether I have missed anything?
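
For reference, the forward (noising) step used by the paper and by standard DDPM implementations is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, so whether an extra **0.5 is correct depends on whether the stored variable already holds alpha_bar_t or its square root. A generic sketch (the names below are not the ones in learner.py):

import torch

beta = torch.linspace(1e-4, 0.05, 50)           # the default training noise schedule
alpha_bar = torch.cumprod(1.0 - beta, dim=0)    # cumulative product, i.e. alpha_bar_t

def add_noise(audio, noise, t):
    scale = alpha_bar[t]
    return scale**0.5 * audio + (1.0 - scale)**0.5 * noise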

You used the L1 distance for the objective function, but in the paper the loss is the Euclidean (L2) distance.
Is there a reason for using L1? (I know the WaveGrad paper reports that training with L1 works better!)

Thanks for your work again. :)

Spectrogram Upsample

Hi,

I'm having trouble with the upsampling of the mel spectrograms. How should I change the ConvTranspose2d kernel, stride and padding to match a hop size of 300? Thank you in advance.
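
One way to reason about it: the output width of a transposed convolution is (W - 1) * stride - 2 * padding + kernel, which collapses to W * stride whenever kernel - 2 * padding == stride; the stock upsampler appears to reach a hop of 256 with two stride-16 ConvTranspose2d layers. A hop of 300 can therefore be factored, for example, as 20 x 15. The shapes below follow that rule but are an untested assumption, not a configuration from the authors:

from torch.nn import ConvTranspose2d

# Two layers whose time-axis strides multiply to 300 (20 * 15), each satisfying
# kernel - 2 * padding == stride so the frame count is upsampled exactly by the stride.
conv1 = ConvTranspose2d(1, 1, kernel_size=[3, 40], stride=[1, 20], padding=[1, 10])
conv2 = ConvTranspose2d(1, 1, kernel_size=[3, 31], stride=[1, 15], padding=[1, 8])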

hard coded assumption of 256 for hop_samples

I don't know where in the code it happens, but there is effectively a hard-coded requirement that hop_samples be 256.

Errors for different hop_samples != 256:

hop_samples == 254: tensor a (12700) must match the size of tensor b (12800)
hop_samples == 255: tensor a (12750) must match the size of tensor b (12800)
hop_samples == 257: tensor a (12850) must match the size of tensor b (12800)
hop_samples == 275: tensor a (13750) must match the size of tensor b (12800)

where tensor a's size is 50 * hop_samples and tensor b's size is 50 * 256.

Regular sampling and fast sampling not equivalent in unconditional generation

Hi, thank you so much for your implementation.

I trained an unconditional generator. Fast sampling produces sensible output during inference with the default noise schedule, like this:
[screenshot: waveform generated with fast sampling]

However, when I set fast_sampling to False, still with the default noise schedule, I got this:
[screenshot: waveform generated with regular sampling]

Is this normal? Thanks in advance.

Also, is this setting correct? The maximum beta in the two schedules is different here.

noise_schedule=np.linspace(1e-4, 0.05, 50).tolist()
inference_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5]

tensorboard not specified as a requirement in setup.py


python -m diffwave --help
Traceback (most recent call last):
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/diffwave/__main__.py", line 20, in <module>
    from diffwave.learner import train, train_distributed
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/diffwave/learner.py", line 22, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py", line 1, in <module>
    import tensorboard
ModuleNotFoundError: No module named 'tensorboard'
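
Until setup.py declares the dependency, installing TensorBoard manually works around the import error:

pip install tensorboard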

Error occurred when trying to reload the pretrained model for inference

I downloaded the checkpoint offered in the README and used the Inference API with the params unchanged, but an error occurred.
Here are the code and the error info:

Code:
import numpy as np
import soundfile as sf
import torch

from inference import predict as diffwave_predict

model_dir = './diffwave-ljspeech-22kHz-1000578.pt'
spectrogram = torch.from_numpy(np.load('./upsample_test.wav.spec.npy'))
if len(spectrogram.shape) == 2:
    spectrogram = spectrogram.unsqueeze(0)
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)
sf.write('./output.wav', audio.squeeze(0).cpu().numpy(), sample_rate)

Error info:
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)
  File "/home/work_nfs6/zqwang/workspace/voicefilter/model/diffusion_model/inference.py", line 40, in predict
    model.load_state_dict(checkpoint['model'])
  File "/home/environment/zqwang/anaconda3/envs/py38pt17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DiffWave:
Unexpected key(s) in state_dict: "spectrogram_upsampler.conv1.weight", "spectrogram_upsampler.conv1.bias", "spectrogram_upsampler.conv2.weight", "spectrogram_upsampler.conv2.bias", "residual_layers.0.conditioner_projection.weight", "residual_layers.0.conditioner_projection.bias", "residual_layers.1.conditioner_projection.weight", "residual_layers.1.conditioner_projection.bias", "residual_layers.2.conditioner_projection.weight", "residual_layers.2.conditioner_projection.bias", "residual_layers.3.conditioner_projection.weight", "residual_layers.3.conditioner_projection.bias", "residual_layers.4.conditioner_projection.weight", "residual_layers.4.conditioner_projection.bias", "residual_layers.5.conditioner_projection.weight", "residual_layers.5.conditioner_projection.bias", "residual_layers.6.conditioner_projection.weight", "residual_layers.6.conditioner_projection.bias", "residual_layers.7.conditioner_projection.weight", "residual_layers.7.conditioner_projection.bias", "residual_layers.8.conditioner_projection.weight", "residual_layers.8.conditioner_projection.bias", "residual_layers.9.conditioner_projection.weight", "residual_layers.9.conditioner_projection.bias", "residual_layers.10.conditioner_projection.weight", "residual_layers.10.conditioner_projection.bias", "residual_layers.11.conditioner_projection.weight", "residual_layers.11.conditioner_projection.bias", "residual_layers.12.conditioner_projection.weight", "residual_layers.12.conditioner_projection.bias", "residual_layers.13.conditioner_projection.weight", "residual_layers.13.conditioner_projection.bias", "residual_layers.14.conditioner_projection.weight", "residual_layers.14.conditioner_projection.bias", "residual_layers.15.conditioner_projection.weight", "residual_layers.15.conditioner_projection.bias", "residual_layers.16.conditioner_projection.weight", "residual_layers.16.conditioner_projection.bias", "residual_layers.17.conditioner_projection.weight", "residual_layers.17.conditioner_projection.bias", "residual_layers.18.conditioner_projection.weight", "residual_layers.18.conditioner_projection.bias", "residual_layers.19.conditioner_projection.weight", "residual_layers.19.conditioner_projection.bias", "residual_layers.20.conditioner_projection.weight", "residual_layers.20.conditioner_projection.bias", "residual_layers.21.conditioner_projection.weight", "residual_layers.21.conditioner_projection.bias", "residual_layers.22.conditioner_projection.weight", "residual_layers.22.conditioner_projection.bias", "residual_layers.23.conditioner_projection.weight", "residual_layers.23.conditioner_projection.bias", "residual_layers.24.conditioner_projection.weight", "residual_layers.24.conditioner_projection.bias", "residual_layers.25.conditioner_projection.weight", "residual_layers.25.conditioner_projection.bias", "residual_layers.26.conditioner_projection.weight", "residual_layers.26.conditioner_projection.bias", "residual_layers.27.conditioner_projection.weight", "residual_layers.27.conditioner_projection.bias", "residual_layers.28.conditioner_projection.weight", "residual_layers.28.conditioner_projection.bias", "residual_layers.29.conditioner_projection.weight", "residual_layers.29.conditioner_projection.bias".

Thanks in advance! Looking forward to your reply!

Inference

When running inference with the provided LJSpeech pretrained model and one of the reference audio clips, the output is a very low-amplitude sound (almost silence). When I used a model trained on a custom dataset, the inference result was static noise. What could be going wrong?

Other feature representations besides mel-spect

I'm doing music-related research, and the mel spectrogram doesn't seem to be the best data representation for the task I'm handling, so I'm considering switching to CQT.
I trained DiffWave on music mel spectrograms and it yielded very impressive results. I'm wondering whether it makes sense to use other input representations besides mel spectrograms, such as CQT, provided the representation is informative enough.

torchaudio.__version__ check fails when version >= '0.10' and < '1.0'

FYI:
The check if torchaudio.__version__ > '0.7.0' in dataset.py does not compare versions properly; for example, with torchaudio 0.11.0 it returns False.

Until recently one could use distutils.LooseVersion, but that has been deprecated (around Python 3.8, I think). The proper way of checking versions nowadays is the packaging package:

from packaging import version
if version.parse(torchaudio.__version__) > version.parse("0.7.0"):

I also noticed that in the same file you set num_workers=os.cpu_count().
You may want to change it to something like num_workers=os.cpu_count() // 3: beyond a certain point, adding CPU workers in Python gives diminishing returns and may even be slower. You may want to verify that, but that was my experience when I tested it.

Some Audio Samples

Could you share some audio samples? Thanks. By the way, what is the RTF using GPU or CPU?

Error during inference

I'm trying to run inference using a pretrained diffwave model on the output of a SepFormer model (separating a 2-speaker mixture). Creating the mel spectrogram and calling predict:

from torch import clamp, log10
from torchaudio.transforms import MelSpectrogram

from diffwave.inference import predict

mel_args = {
    'sample_rate': 8000,
    'win_length': 384,
    'hop_length': 192,
    'n_fft': 384,
    'f_min': 80.0,
    'f_max': 3000.0,
    'n_mels': 64,
    'power': 2.0,
    'normalized': False,
}

mel_transform = MelSpectrogram(**mel_args)

mel_spec = mel_transform(det_estimation)                    # [32, 2, 64, 167]; det_estimation.shape is [32, 2, 32000]
mel_spec = 20 * log10(clamp(mel_spec, min=1e-5)) - 20
mel_spec = clamp((mel_spec + 100) / 100, 0.0, 1.0)
mel_trimmed = mel_spec[:, :, :, :-1]                        # [32, 2, 64, 166]
B, C, nmel, ntime = mel_trimmed.shape                       # B: 32  C: 2  nmel: 64  ntime: 166
enlarged_spectrogram = mel_trimmed.view(B * C, nmel, ntime) # [64, 64, 166]

# det_estimation and base_params are defined elsewhere in my script.
gen_estimation, sr = predict(enlarged_spectrogram, 'diffwave-weights-902319.pt', base_params, fast_sampling=True, device='cpu')

leads to the following error:

Traceback (most recent call last):
  File "path/test.py", line 29, in <module>
    gen_estimation, _ = diffwave_predict(enlarged_spectrogram, 'diffwave-weights-902319.pt', base_params, fast_sampling=True,
  File "path/diffwave/inference.py", line 81, in predict
    audio = c1 * (audio - c2 * model(audio, spectrogram, torch.tensor([T[n]], device=audio.device)).squeeze(1))
  File "/home/eitzo/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "path/diffwave/model.py", line 152, in forward
    diffusion_step = self.diffusion_embedding(diffusion_step)
  File "/home/eitzo/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "path/diffwave/model.py", line 50, in forward
    x = self._lerp_embedding(diffusion_step)
  File "path/diffwave/model.py", line 62, in _lerp_embedding
    return low + (high - low) * (t - low_idx)
RuntimeError: The size of tensor a (128) must match the size of tensor b (166) at non-singleton dimension 3

Process finished with exit code 1

Any ideas where I went wrong ? Thanks in advance!

"RuntimeError: CUDA out of memory" when attempting inference

Hello, I was trying to train my own model with this algorithm, but I ran into a problem when running inference with that self-trained model:
RuntimeError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 10.92 GiB total capacity; 8.45 GiB already allocated; 1.80 GiB free; 8.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CON

It seems to be a CUDA memory problem. Thinking at first that I might not have enough memory on my personal GPU, I tried my university's server, which hosts 8 GPUs (the log above is from that server), and ran into the same problem. A Google Colab Pro GPU with 15 GB of VRAM also reported the same problem.

I've tried setting different batch sizes in params.py to see if that would solve the problem, but I can't find anywhere in model.py or inference.py where the batch size is used, and it doesn't seem to affect memory usage.

I've tried adding torch.cuda.empty_cache() in multiple places to see if that could help, but sadly it didn't.

So far I can't find anything in the code that would cause this problem. Does anyone else experience the same problem, or is there a solution or setting I'm not seeing?

Could this perhaps be a training issue as well, meaning I should train with a smaller batch size to make inference easier?

I'll add my parameters as well, just in case that's of any help:

params = AttrDict(
    # Training params
    batch_size=16,
    learning_rate=2e-4,
    max_grad_norm=None,

    # Data params
    sample_rate=22050,
    n_mels=80,
    n_fft=1024,
    hop_samples=256,
    crop_mel_frames=62,  # Probably an error in paper.

    # Model params
    residual_layers=30,
    residual_channels=64,
    dilation_cycle_length=10,
    unconditional = False,
    noise_schedule=np.linspace(1e-4, 0.05, 50).tolist(),
    inference_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],

    # unconditional sample len
    audio_len = 22050*5, # unconditional_synthesis_samples
)

Question about fine-tuning diffwave

Hey, thanks for your amazing contribution implementing the diffwave model. I have a general question about fine-tuning diffwave.
If I fine-tune diffwave on just one prompt voice, is it possible to produce voices that all have the same timbre as the prompt voice? Does this make sense as a way to accomplish voice cloning?
Hoping for your answer.

Trying to use pretrained model but failed

Hi, I'm having trouble using the pre-trained model and would really appreciate your help.

I wanted to check the performance of DiffWave with the pretrained parameters.

Since there was no demo for it, I wrote my own script that imports the pretrained model.

The purpose of the script is to compare the original audio with audio generated from the pretrained vocoder.

First, I generated a mel spectrogram from one of the audio samples provided at https://github.com/lmnt-com/diffwave#audio-samples.

import torch
import torchaudio.transforms as T

# Audio downloaded from the audio samples page:
# https://github.com/lmnt-com/diffwave#audio-samples
# (get_speech_sample, print_stats, plot_spectrogram and plot_waveform are helper
# functions defined elsewhere in my notebook.)
waveform, sample_rate = get_speech_sample()

# define the transformation
hop_length = 256
spectrogram = T.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,
    hop_length=hop_length,
    win_length=hop_length * 4,
    f_min=20.0,
    f_max=sample_rate / 2.,
    n_mels=80,
)

# Perform the transformation
spec = spectrogram(waveform)
spec = 20 * torch.log10(torch.clamp(spec, min=1e-5)) - 20
spec = torch.clamp((spec + 100) / 100, 0.0, 1.0)

print_stats(spec)
plot_spectrogram(spec[0], title="torchaudio")
plot_waveform(waveform, sample_rate)
Shape: (1, 80, 833)
Dtype: torch.float32
 - Max:      1.000
 - Min:      0.280
 - Mean:     0.698
 - Std Dev:  0.171

tensor([[[0.5525, 0.5410, 0.5013,  ..., 0.4834, 0.5863, 0.6569],
         [0.5485, 0.5346, 0.4632,  ..., 0.4242, 0.6327, 0.6866],
         [0.4129, 0.5611, 0.5924,  ..., 0.4228, 0.6652, 0.7208],
         ...,
         [0.5441, 0.6529, 0.7050,  ..., 0.5078, 0.5972, 0.6283],
         [0.5814, 0.6205, 0.6569,  ..., 0.5178, 0.6150, 0.6492],
         [0.5728, 0.6037, 0.6395,  ..., 0.4996, 0.6498, 0.6952]]])

[screenshot: spectrogram and waveform plots]

Using the created spectrogram, spec, I generated an audio file that should sound similar to the original above.

from diffwave.inference import predict as diffwave_predict

# Pretrained checkpoint downloaded from
# https://github.com/lmnt-com/diffwave#pretrained-models
model_dir = './diffwave/'
spectrogram = spec # get your hands on a spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True, 
                                      device='cpu')
plot_waveform(audio, sample_rate)
play_audio(audio, sample_rate)

[screenshot: generated waveform]

However, the result was far from the original: the output is unstable and does not sound like the demo samples.

Is there a problem with my code, or is there a proper way of using the pre-trained parameters?

I would really appreciate an example showing how to use the pre-trained model properly.

Thanks.

Help request: trying to figure out how to match TTS params to the vocoder

I'm using a fork of https://github.com/Tomiinek/Multilingual_Text_to_Speech as the project https://github.com/CherokeeLanguage/Cherokee-TTS.

The TTS project I'm using shows the audio params below, but I don't know what to change in either the TTS params or the vocoder params to make them match. I'm guessing hop_samples somehow corresponds to the stft_* settings, but I'm a bit unsure of what I'm looking at (see the conversion sketch after the settings below). I also think a good start would be to adjust the vocoder settings and train on the domain-specific voices being used for the Tacotron training.

TTS Tacotron Settings

    sample_rate = 22050                  # sample rate of source .wavs, used while computing spectrograms, MFCCs, etc.
    num_fft = 1102                       # number of frequency bins used during computation of spectrograms
    num_mels = 80                        # number of mel bins used during computation of mel spectrograms
    num_mfcc = 13                        # number of MFCCs, used just for MCD computation (during training)
    stft_window_ms = 50                  # size in ms of the Hann window of short-time Fourier transform, used during spectrogram computation
    stft_shift_ms = 12.5                 # shift of the window (or better said gap between windows) in ms   

diffwave Vocoder Settings

# Data params
    sample_rate=22050,
    n_mels=80,
    n_fft=1024,
    hop_samples=256,
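
For what it's worth, the ms-based Tacotron settings convert to diffwave's sample-based ones with plain arithmetic (whether to retrain diffwave with matching values is a separate decision):

sample_rate = 22050
stft_window_ms = 50.0
stft_shift_ms = 12.5

win_samples = sample_rate * stft_window_ms / 1000   # 1102.5 -> matches Tacotron's num_fft = 1102
hop_samples = sample_rate * stft_shift_ms / 1000    # 275.625 -> not an integer, and != diffwave's hop_samples = 256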

High pitched voices when scaling fft size up to 4096

Let me start by saying that this repo is fantastic. I've successfully synthesized voices and would like to experiment with scaling up fft size and other audio parameters.

I'm running with the following:

n_fft: 4096
hop_samples: 256
sample_rate: 32000

I'm able to train and the loss goes down quite a lot, but when I listen to the sample voices they are very high pitched compared to when training with n_fft = 1024. I think somewhere during training the voices are being squeezed together and messing with the pitch.

Are there any modifications that need to be made to make this work? For reference I'm training on the ljspeech dataset.

Thank you!

CUDA Error when loading checkpoint on more than one GPU

Hello.

I am having an issue when using your code: if I try to resume training from a checkpoint with more than one GPU (I am using Docker containers), I get the following error:

  File "__main__.py", line 55, in <module>
    main(parser.parse_args())
  File "__main__.py", line 39, in main
    spawn(train_distributed, args=(replica_count, port, args, params), nprocs=replica_count, join=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/diffwave/src/diffwave/learner.py", line 188, in train_distributed
    _train_impl(replica_id, model, dataset, args, params)
  File "/opt/diffwave/src/diffwave/learner.py", line 163, in _train_impl
    learner.restore_from_checkpoint()
  File "/opt/diffwave/src/diffwave/learner.py", line 95, in restore_from_checkpoint
    checkpoint = torch.load(f'{self.model_dir}/{filename}.pt')
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 584, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 842, in _load
    result = unpickler.load()
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 834, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 823, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 174, in default_restore_location
    result = fn(storage, location)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 156, in _cuda_deserialize
    return obj.cuda(device)
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 77, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

But when starting from scratch, or when using a single GPU, this error does not appear and training runs flawlessly.

I must add that I have checked that the GPUs were completely free when launching the training.

Any advice on this issue?

Thanks in advance.
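
One thing worth noting about the trace: restore_from_checkpoint calls torch.load without a map_location, so every spawned replica tries to materialize the saved tensors on the GPU the checkpoint was written from. A common pattern in similar multi-process setups, offered as a hedged sketch rather than the project's actual fix, is to load onto the CPU first and let each replica move the weights to its own device:

import torch

# 'weights.pt' stands in for f'{model_dir}/{filename}.pt' from learner.py.
checkpoint = torch.load('weights.pt', map_location='cpu')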

How is the value of crop_mel_frames chosen?

In params.py we set crop_mel_frames=62, with the comment "Probably an error in the paper." I wasn't able to find any discussion of this parameter in the DiffWave paper (is this the paper that the comment refers to?) and I'm curious where it comes from. Could someone clarify where this crop length comes from? Apologies if I have overlooked something obvious.
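
One plausible reading, offered as a guess rather than an authoritative answer: with the default hop of 256 samples, 62 conditioning frames correspond to a training crop of 15,872 audio samples (~0.72 s at 22.05 kHz). If the paper's reported training segment length is 16,000 samples, then 16,000 / 256 = 62.5 is not an integer, which may be what the "error in paper" comment alludes to.

crop_mel_frames, hop_samples = 62, 256
print(crop_mel_frames * hop_samples)   # 15872 audio samples per training crop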

Adopting diffusion model on TTS

Hi all,
I'm currently playing with DiffSinger, a TTS system extended with diffusion models. The naive version consists of encoders (embedding text and pitch information) and a denoiser, where the encoders' output conditions the denoiser. Everything is similar to diffwave, including the denoiser's structure and prediction, except that the network predicting epsilon becomes epsilon(noisy_spectrogram, encoder_outputs, diffusion_step) instead of DiffWave's epsilon(noisy_audio, upsampled_spectrogram, diffusion_step).
While I can train the encoders successfully, I ran into an issue while training the denoiser. I used LJSpeech. Here is what I did:

  1. First, as a preliminary experiment, I checked that all modules work by setting the denoiser to epsilon(noisy_spectrogram, clean_spectrogram, diffusion_step) and training it to predict the noisy_spectrogram.
  2. After that model converged, I went back to the denoiser epsilon(noisy_spectrogram, encoder_outputs, diffusion_step) to predict the clean_spectrogram. I detached encoder_outputs from autograd before feeding it to the denoiser (to prevent the encoders from updating), so the conditioner stays fixed while the denoiser converges. The model broke when I did not detach (i.e. when the encoders kept updating during denoiser training).
  3. I found that when the range of the conditioner (encoder_outputs) values is smaller, the model shows better evidence of successful training.

Below are the results I've got so far. In each pair, the upper image is the sampled (synthesized) mel spectrogram and the lower one is the ground truth.

  1. I can see the model converge during the preliminary experiment:
    [image: sampled vs. ground-truth mel spectrogram]
  2. When the encoder's output is fed directly to the denoiser (value range: -9.xxx to 6.xxx):
    [image: sampled vs. ground-truth mel spectrogram]
  3. When the encoder's output is multiplied by 0.01 to shrink the range:
    [image: sampled vs. ground-truth mel spectrogram]

In case 2, there is no sign of training at all. By contrast, case 3 shows 'some' level of training, but it is not what we expected. I double-checked the inference (reverse) part, and it is exactly the same as in case 1 and in diffwave.

So I just want to know whether you have any insight into what makes the denoiser's input conditioner work. Why does the model show such an unsatisfying result above? Am I missing something in how the conditioner should be processed?

I will appreciate all suggestions or sharing of your experience.
Thanks in advance.

preprocess problem

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Code for evaluation in paper

I found some automatic evaluation metrics mentioned in the paper. Where can I find the corresponding scripts so that I can reproduce the results and compare with other methods?

[image: evaluation metrics table from the paper]

Unconditional Generation Training Time

Hi @sharvil @Andrechang @JCBrouwer thanks for this implementation.

My issue is about the training time for unconditional generation. It takes me about 5 hours/epoch on 1x RTX 8000, and most of the time is spent in loss.backward(). With the unconditional setting from #5, I wonder:

  1. Is this common?
  2. Any suggestions for acceleration please?
  3. From how many epochs that you start to have good-quality generations?

Thanks in advance.

Trouble starting training

Hi,

I'm just starting some experiments with diffwave and so far have run into this persistent error when trying to train:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/diffwave/__main__.py", line 52, in <module>
    main(parser.parse_args())
  File "/usr/local/lib/python3.6/dist-packages/diffwave/__main__.py", line 39, in main
    train(args, params)
  File "/usr/local/lib/python3.6/dist-packages/diffwave/learner.py", line 169, in train
    _train_impl(0, model, dataset, args, params)
  File "/usr/local/lib/python3.6/dist-packages/diffwave/learner.py", line 163, in _train_impl
    learner.train(max_steps=args.max_steps)
  File "/usr/local/lib/python3.6/dist-packages/diffwave/learner.py", line 104, in train
    for features in tqdm(self.dataset, desc=f'Epoch {self.step // len(self.dataset)}') if self.is_master else self.dataset:
ZeroDivisionError: integer division or modulo by zero

I'm running on Google Colab; my dataset consists of ~6 pieces of audio, each ~4 min long, already preprocessed with the diffwave.preprocess module.

Any advice would be of great help, thanks!
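
A possible explanation, pieced together from details visible elsewhere on this page rather than verified against this exact setup: dataset.py builds its DataLoader with drop_last=True and params.py defaults to batch_size=16, so a dataset with fewer records than one batch yields zero batches, len(self.dataset) becomes 0, and the epoch counter divides by zero. The arithmetic for ~6 preprocessed files:

num_records = 6      # preprocessed audio files
batch_size = 16      # default batch_size in params.py
print(num_records // batch_size)   # 0 full batches with drop_last=True -> ZeroDivisionError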
