
tt-vae-gan's People

Contributors

russellsb


tt-vae-gan's Issues

Exploding loss during voice-conversion training

Thanks for the great repository.
Unfortunately, I have a problem during voice-conversion training.

After the first two epochs I get exploding losses.
[attached: screenshot of the loss plot]
What is the reason for that, and how can I solve it?

I would appreciate any tips.

Thanks in advance!
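
A common first mitigation for exploding losses (an illustration only, not code from this repository) is to clip gradient norms before each optimizer step; lowering the learning rate or an earlier decay_epoch are other usual knobs to try:

```python
import torch
import torch.nn as nn

# Stand-ins for the VAE-GAN networks and their Adam optimizer in train.py.
model = nn.Linear(10, 10)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(4, 10)
loss = model(x).pow(2).mean()
loss.backward()

# Cap the global gradient norm at 1.0 so a single bad batch cannot blow up
# the parameters; the function returns the pre-clipping norm, handy to log.
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```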

problem with tutorial steps, short files from wavenet

Hi, I'm trying to reproduce your tutorial with the pretrained models, but there is a problem with the files output by the WaveNet: after starting infer.sh I get files only one second long. Please tell me what I am doing wrong and how I can get fully processed files.

Request to transfer 2 voices over another

Hi Russell,

I've been studying for months how to do voice style transfer, but it's too difficult for me.
Maybe you can help me transfer two voices onto another to fulfill a long-held wish from my youth.
Attached is a 5 MB ZIP file with the WAV files of the voices.
You don't have to do it for nothing; we can agree on a price.
My email is [email protected]

Kind regards,
Berend.
VOICES.zip

KeyError: 2

Hello,
Congratulations, and thank you for sharing this very interesting project. We are trying to run this project, and have completed the data preparation and preprocessing steps. But at the training step we run into an issue. We have 2 speakers and are otherwise using the default settings. The printout from the run is pasted below.

Namespace(b1=0.5, b2=0.999, batch_size=4, channels=1, checkpoint_interval=1, dataset='../data/data_urmp/', decay_epoch=50, dim=32, epoch=0, img_height=128, img_width=128, lr=0.0001, model_name='test', n_cpu=8, n_downsample=2, n_epochs=100, n_spkrs=2, plot_interval=1)
2 2
Cuda found.
/path_cropped/venv/pytorch/lib/python3.8/site-packages/torch/optim/adam.py:48: UserWarning: optimizer contains a parameter group with duplicate parameters; in future, this will cause an error; see github.com/pytorch/pytorch/issues/40967 for more information
  super(Adam, self).__init__(params, defaults)
  0%|                                                                                        | 0/2136 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 287, in <module>
    train_global()
  File "train.py", line 267, in train_global
    losses = train_local(i, epoch, batch, pair[0], pair[1], losses)
  File "train.py", line 151, in train_local
    X1 = Variable(batch[id_1].type(Tensor))
KeyError: 2
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/path_cropped/venv/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/path_cropped/venv/pytorch/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.8/multiprocessing/reduction.py", line 157, in recvfds
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer

The machine we use has an RTX 3080 GPU, and a CPU with 14 cores / 28 threads.
We have also tried decreasing the number of CPUs to 1 and 2, for example, and the error is similar, but shorter:

Namespace(b1=0.5, b2=0.999, batch_size=4, channels=1, checkpoint_interval=1, dataset='../data/data_urmp/', decay_epoch=50, dim=32, epoch=0, img_height=128, img_width=128, lr=0.0001, model_name='test', n_cpu=1, n_downsample=2, n_epochs=100, n_spkrs=2, plot_interval=1)
2 2
Cuda found.
/path_cropped/venv/pytorch/lib/python3.8/site-packages/torch/optim/adam.py:48: UserWarning: optimizer contains a parameter group with duplicate parameters; in future, this will cause an error; see github.com/pytorch/pytorch/issues/40967 for more information
  super(Adam, self).__init__(params, defaults)
  0%|                                                                                        | 0/2136 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 287, in <module>
    train_global()
  File "train.py", line 267, in train_global
    losses = train_local(i, epoch, batch, pair[0], pair[1], losses)
  File "train.py", line 151, in train_local
    X1 = Variable(batch[id_1].type(Tensor))
KeyError: 2
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner

Just in case, we have also reduced the batch size to 1 and 2, and then it stops at KeyError: 2 right away.
Can you, or someone else, please help with what the problem is and how to resolve it?
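
The traceback shows that the collated batch dict has no entry for speaker id 2 when train_local indexes it. A toy reproduction of that failure mode (illustrative only, not the repo's actual dataloader):

```python
# The batch is a dict keyed by speaker index, so looking up a speaker id
# that the dataset never populated raises KeyError, as in the traceback.
batch = {0: "mel_spk0", 1: "mel_spk1"}  # only two speaker entries loaded
id_1 = 2                                 # id requested by train_local

try:
    X1 = batch[id_1]
except KeyError as err:
    print("KeyError:", err)
```

One thing worth checking, then, is that the preprocessed dataset directory really contains one subfolder per speaker, numbered the way train.py expects, and that n_spkrs matches that count.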

Is this correct? Griffin Lim output almost unintelligible

This is a great project, congrats!

I have been trying to get it to run, and so far so good: I get to the "Infer with VAE-GAN and Griffin Lim audio reconstruction" step, using the Flickr dataset with speakers 4 and 7.

My output is a quite low-quality spectrogram and an almost unintelligible Griffin Lim reconstruction. I was wondering whether I should change some of the training parameters to increase resolution, whether I am doing something wrong, or whether this is expected and WaveNet will fix it.

I have a feeling I am doing something wrong. There is a clear correlation between the input and output plots, but the resolution of the output just seems too small. If you can point me in the right direction I would be extremely grateful.

Attached is the mel spectrogram after step 1.4. It seems quite wrong.


Quick Question regarding convert mel_spectrogram to wav

Hello, Thank you for your great implementation work!

I have a quick question about converting from mel-spectrogram to wav.

I have checked that you use the librosa library to convert from the frequency domain back to the time domain.

Have you tried any other libraries, such as torchaudio, which supports GPU? It takes a very long time to convert mel spectrograms to wav...

Thank you!

Two small issues in the script for training the WaveNet vocoder (run.sh)

Hi, two fairly minor questions.

  1. Within the WaveNet vocoder training section, for the preprocessing step (step 2.2), the tutorial gives the example
    spk="[name]_[id]" ./run.sh --stage 1 --stop-stage 1

Do I need to also pass hparams (à la step 2.3)? I seem to get an unbound-variable error if not, and I assume this is the reason.

  2. The first time I run the training for the WaveNet vocoder, I get a FileNotFoundError along the lines of:
    FileNotFoundError: [Errno 2] No such file or directory: 'exp/flickr_1_train_no_dev_flickr/checkpoint_latest.pth'

Is this because wavenet_vocoder/egs/gaussian/run.sh passes the --checkpoint=${expdir}/checkpoint_latest.pth argument to train.py even though, on a fresh run, there isn't any latest checkpoint saved yet? If I edit that argument out of that line, training at least starts.
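
On the second point, a guard along these lines (a sketch only, not a verified patch of run.sh) would let a fresh run start from scratch while still resuming when a checkpoint exists:

```shell
# Only pass --checkpoint to train.py when a previous checkpoint actually
# exists; ${expdir} is the experiment directory run.sh already defines
# (the value below is just the example from the error above).
expdir="exp/flickr_1_train_no_dev_flickr"
resume=""
if [ -e "${expdir}/checkpoint_latest.pth" ]; then
    resume="--checkpoint=${expdir}/checkpoint_latest.pth"
fi
# On a fresh run ${resume} expands to nothing and train.py starts anew.
echo "python train.py ${resume}"
```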

Improvements to results

General question, not an issue; apologies if this is the wrong place for such queries. I was wondering a couple of things:

  1. In general, are there more or less promising ways to get better results? Many of the voice conversions I've tried via this repo have had strange artifacts. Even in the core VAE-GAN demo, I'd say (subjectively) that the male-to-female conversions sound a lot better than the female-to-male, with the latter having lots of warbled speech. Maybe too broad a question, but based on your experience I'd be curious how you'd go about improving this, e.g. are there specific hyperparameters you'd change, and/or is it due to the nature of the training data?

  2. How good have you found MelGAN vs WaveNet? I'm wondering whether to dive more into training WaveNet or not given what appears to be MelGAN's speed benefits (both training and inference). And along the lines of MelGAN, I'm curious whether you've found any pretrained models (whether the implementation you link or the official one) that you think are good enough or whether you typically train MelGAN yourself.

Appreciate any thoughts here.

KeyError: 2

Hi,

Great work on this! I am trying to replicate it on my local machine, but I am having some issues when training the model. Could you please advise what might cause this error?

Traceback (most recent call last):
  File "..\tt-vae-gan\voice_conversion\src\train.py", line 275, in <module>
    train_global()
  File "..\tt-vae-gan\voice_conversion\src\train.py", line 255, in train_global
    losses = train_local(i, epoch, batch, pair[0], pair[1], losses)
  File "..\tt-vae-gan\voice_conversion\src\train.py", line 139, in train_local
    X1 = Variable(batch[id_1].type(Tensor))
KeyError: 2

I have tried to play with n_epochs but it seems to fail at the very first one as shown below:

Namespace(epoch=0, n_epochs=2, model_name='test_1', dataset='../data/data_flickr', n_spkrs=4, batch_size=4, lr=0.0001, b1=0.5, b2=0.999, decay_epoch=1, n_cpu=6, img_height=128, img_width=128, channels=1, plot_interval=1, checkpoint_interval=2, n_downsample=2, dim=32)
..\ttvaegan\lib\site-packages\torch\optim\adam.py:48: UserWarning: optimizer contains a parameter group with duplicate parameters; in future, this will cause an error; see github.com/pytorch/pytorch/issues/40967 for more information

I am running this on an NVIDIA RTX 3000 with 6 GB of dedicated memory. Could it be a hardware limitation? It fails exactly when GPU memory usage reaches around 6 GB.

Best,

How to resume training?

It would be desirable to be able to load a saved checkpoint to resume training; help with this would be welcome.
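
Since no confirmed resume path is documented here, this is only a generic PyTorch sketch (the object names are hypothetical stand-ins for train.py's networks) of what saving and restoring a training state usually looks like:

```python
import torch
import torch.nn as nn

# Stand-ins for the real networks/optimizers in train.py.
model = nn.Linear(8, 8)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Save: bundle everything needed to continue, not just the weights.
torch.save({"epoch": 5,
            "model": model.state_dict(),
            "opt": opt.state_dict()}, "ckpt.pth")

# Load: restore both states and continue from the next epoch.
ckpt = torch.load("ckpt.pth")
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["opt"])
start_epoch = ckpt["epoch"] + 1
```

train.py already exposes an epoch argument (epoch=0 in the printouts above), so wiring a loader like this to it could be a starting point.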

not an issue - fyi

Hi @RussellSB

there was a Voice Conversion Challenge 2020 baseline: CycleVAE with a PWG (Parallel WaveGAN) vocoder
Official homepage: http://www.vc-challenge.org/

Some code was provided by @bigpon; perhaps it could help with the training / getting-started stuff:
https://github.com/bigpon/vcc20_baseline_cyclevae/tree/master/baseline
https://github.com/bigpon/vcc20_baseline_cyclevae/blob/master/baseline/src/parallel_wavegan/models/melgan.py

In the paper it mentions Google's Parrotron.
This was implemented by @fd873630; his models are here (could this help?):
https://github.com/fd873630/Parrotron/tree/master/models

I wonder if a bit of detective work can piece things together to avoid the mode collapse.
Otherwise we need to wait for @ebadawy to release code.

about calculate KLD

Hello! Your excellent work has helped me a lot as an introduction to timbre conversion, but there is one thing I don't understand well: when you calculate the KLD, you take the square of each latent variable and then average to get the KLD loss. I haven't understood this part. Could you explain it? Thank you very much!
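
For what it's worth, here is the usual reading of that simplification (my own derivation, not the author's notes): if the encoder's latent is modelled as N(mu, I) with fixed unit variance, the KL divergence to the standard normal prior reduces to 0.5 * mu^2 per dimension, so averaging the squared latents is the KL loss up to a constant factor:

```python
import math

# KL divergence between N(mu, sigma^2) and N(0, 1), per latent dimension:
#   KL = 0.5 * (mu^2 + sigma^2 - 1 - log(sigma^2))
def kl_to_standard_normal(mu, sigma):
    return 0.5 * (mu**2 + sigma**2 - 1.0 - math.log(sigma**2))

mus = [0.5, -1.0, 2.0]  # example latent means

# With sigma fixed at 1, the sigma terms cancel and only 0.5 * mu^2 remains.
full = sum(kl_to_standard_normal(m, 1.0) for m in mus) / len(mus)
simplified = sum(m * m for m in mus) / len(mus)  # mean(mu^2), as in the loss
assert math.isclose(full, 0.5 * simplified)
```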
