
dyganvc's Introduction

Efficient Non-Autoregressive GAN Voice Conversion using VQWav2vec Features and Dynamic Convolution

Mingjie Chen, Yanghao Zhou, Heyan Huang, Thomas Hain


It was shown recently that a combination of ASR and TTS models yields highly competitive performance on standard voice conversion tasks such as the Voice Conversion Challenge 2020 (VCC2020). To obtain good performance, both models require pretraining on large amounts of data, resulting in large models that are potentially inefficient in use. In this work we present a model that is significantly smaller, and thereby faster in processing, while obtaining equivalent performance. To achieve this, the proposed model, Dynamic-GAN-VC (DYGAN-VC), uses a non-autoregressive structure and makes use of vector-quantised embeddings obtained from a VQWav2vec model. Furthermore, dynamic convolution is introduced to improve speech content modelling while requiring a small number of parameters. Objective and subjective evaluation was performed on the VCC2020 task, yielding MOS scores of up to 3.86 and character error rates as low as 4.3%. This was achieved with approximately half the number of model parameters, and up to 8 times faster decoding speed.

[paper] [demo]
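The dynamic convolution blocks that appear in the model printout further down this page (hidden size 256, 4 heads, kernel size 9, GLU-gated value projection) can be sketched roughly as follows. This is a hedged illustration of the general lightweight/dynamic convolution idea, not the authors' implementation; the class and variable names are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvSketch(nn.Module):
    """Per-timestep convolution kernels are predicted from the input and
    applied as a depthwise, head-shared lightweight convolution."""

    def __init__(self, dim=256, kernel_size=9, num_heads=4):
        super().__init__()
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        self.k_layer = nn.Linear(dim, 2 * dim)              # value projection, GLU-gated back to dim
        self.conv_kernel_layer = nn.Linear(dim, num_heads * kernel_size)  # 4 * 9 = 36 weights per frame
        self.ln = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, dim)
        B, T, D = x.shape
        H, K = self.num_heads, self.kernel_size
        v = F.glu(self.k_layer(x), dim=-1)                  # (B, T, D)
        w = torch.softmax(self.conv_kernel_layer(x).view(B, T, H, K), dim=-1)
        # gather K-frame windows centred on each timestep
        v = v.transpose(1, 2).unsqueeze(-1)                 # (B, D, T, 1)
        win = F.unfold(v, (K, 1), padding=((K - 1) // 2, 0))  # (B, D*K, T)
        win = win.view(B, H, D // H, K, T)                  # split channels into heads
        out = torch.einsum('bhdkt,bthk->bthd', win, w)      # per-frame weighted sum over the window
        out = out.reshape(B, T, D)
        return self.ln(out + x)                             # residual + layer norm

# shape check: a 100-frame, 256-dim feature sequence keeps its shape
y = DynamicConvSketch()(torch.randn(2, 100, 256))
print(y.shape)  # torch.Size([2, 100, 256])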

Dataset

VCC2020 track 1

Dependencies

  1. fairseq
  2. Parallel WaveGAN vocoder

How to run

  1. Clone the repository
git clone https://github.com/MingjieChen/DYGANVC.git
cd DYGANVC
  2. Create a conda env and install PyTorch 1.7 through conda
conda create --name torch_1.7 python==3.7
conda activate torch_1.7
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
conda install librosa -c conda-forge

Choose the cudatoolkit version that matches your GPU and driver.

  3. Install packages
pip install fairseq parallel_wavegan munch pyyaml SoundFile tqdm scikit-learn tensorboardX webrtcvad
  4. Download the dataset and unzip it to vcc20/
./data_download.sh vcc20
  5. Extract speaker embeddings
./extract_speaker_embed.sh
  6. Download the vq-wav2vec checkpoint
mkdir vqw2v
cd vqw2v
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/vq-wav2vec_kmeans.pt
  7. Extract vq-wav2vec features
./vqw2v_feat_extract.sh
  8. Start training
./run_train.sh
  9. Run inference
python inference.py

dyganvc's People

Contributors

keithcallenberg, mingjiechen


dyganvc's Issues

Error in solve mismatch between vq feats and mels

Running train.py gives me an error in

vqw2v_dense = np.concatenate((vqw2v_dense, np.repeat(pad_vec, mel_length - vq_length, 0)),1)

with message ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

Output of extract_speaker_embed.py
$ python -W ignore::FutureWarning  extract_speaker_embed.py
Loaded the voice encoder model on cpu in 0.01 seconds.
SpeakerEncoder(
(lstm): LSTM(40, 256, num_layers=3, batch_first=True)
(linear): Linear(in_features=256, out_features=256, bias=True)
(relu): ReLU()
)
encoder
The number of parameters: 1423616
spk_emb SEF1 (256,)
spk_emb SEF2 (256,)
spk_emb SEM1 (256,)
spk_emb SEM2 (256,)
spk_emb TEF1 (256,)
spk_emb TEF2 (256,)
spk_emb TEM1 (256,)
spk_emb TEM2 (256,)
Output of train.py
$ python train.py 
{'log_dir': 'exp', 'model_name': 'dygan_vc', 'exp_name': 'dygan_vc_vq_spkemb', 'pretrained_model': '', 'fp16_run': False, 'trainer': 'VQMelSpkEmbLSTrainer', 'epochs': 100, 'num_speakers': 8, 'save_freq': 20, 'load_only_params': False, 'data_loader': {'dataset': 'VQMelSpkEmbDataset', 'data_dir': 'vcc2020', 'vq_dir': 'dump/vqw2v_feat_test', 'batch_size': 8,'speakers': 'speaker.json', 'spk_emb_dir': 'dump/ppg-vc-spks', 'shuffle': True, 'drop_last': False, 'num_workers': 4, 'min_length': 128, 'stats': 'vocoder/stats.npy'}, 'model': {'generator': {'model_name': 'Generator0', 'in_feat_dim': 512, 'out_feat_dim': 80, 'kernel': '9_9_9_9_9_9', 'num_heads': '4_4_4_4_4_4', 'num_res_blocks': 6, 'hidden_size': '256_256_256_256_256_256', 'spk_emb_dim': 256, 'hid2_factor': 1, 'res_wff_kernel1': 3, 'res_wff_kernel2': 3, 'res_wadain_use_ln': False, 'res_wff_use_res': True, 'res_wff_use_act2': True, 'res_use_ln': True, 'use_kconv': False, 'wadain_beta': False, 'ff_block': 'WadainFF', 'conv_block': 'DynamicConv', 'scale': 1.0, 'out_kernel': 1}, 'discriminator': {'model_name': 'Discriminator128', 'num_speakers': 8, 'kernel_size': 3, 'padding': 1}}, 'loss': {'g_loss': {'lambda_cyc': 0.0, 'lambda_id': 5.0, 'lambda_adv': 1.0}, 'd_loss': {'lambda_reg': 1.0, 'lambda_con_reg': 5.0}, 'con_reg_epoch': 50000}, 'optimizer': {'discriminator': {'lr': 2e-05, 'weight_decay': 0.0001, 'betas': [0.5, 0.999]}, 'generator': {'lr': 0.0001, 'weight_decay': 0.0001, 'betas': [0.5, 0.999]}}}
SEF1
60
SEF2
60
SEM1
60
SEM2
60
TEF1
60
TEF2
60
TEM1
60
TEM2
60
loading files 480
SEF1
10
SEF2
10
SEM1
10
SEM2
10
TEF1
10
TEF2
10
TEM1
10
TEM2
10
loading files 80
Generator0(
(conv1): Sequential(
(0): Conv1d(512, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(1): LeakyReLU(negative_slope=0.2)
)
(res_blocks): Sequential(
(0): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(1): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
(2): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(3): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
(4): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(5): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
(6): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(7): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
(8): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(9): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
(10): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(11): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
)
(out): Sequential(
(0): Conv1d(256, 80, kernel_size=(1,), stride=(1,))
)
)
generator
The number of parameters: 4281384
Discriminator128(
(conv_layer_1): Sequential(
(0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(down_sample_1): DisRes(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv2): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv1x1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(act): LeakyReLU(negative_slope=0.2)
)
(down_sample_2): DisRes(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): LeakyReLU(negative_slope=0.2)
)
(down_sample_3): DisRes(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): LeakyReLU(negative_slope=0.2)
)
(down_sample_4): DisRes(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): LeakyReLU(negative_slope=0.2)
)
(blocks): Sequential(
(0): LeakyReLU(negative_slope=0.2)
(1): Conv2d(128, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(2): LeakyReLU(negative_slope=0.2)
(3): AdaptiveAvgPool2d(output_size=1)
)
(dis_conv): Conv2d(128, 8, kernel_size=(1, 1), stride=(1, 1))
)
discriminator
The number of parameters: 1415880
Traceback (most recent call last):
File "train.py", line 96, in <module>
main(args.config_path)
File "train.py", line 75, in main
train_results = trainer._train_epoch()
File "/deepmind/experiments/mingjiechen/dyganvc/vqmel_spkemb_ls_trainer.py", line 228, in _train_epoch
for train_steps_per_epoch, batch in enumerate(self.train_dataloader, 1):
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
return self._process_data(data)
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
data.reraise()
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/deepmind/experiments/mingjiechen/dyganvc/data_loader.py", line 117, in __getitem__
vqw2v_dense = np.concatenate((vqw2v_dense, np.repeat(pad_vec, mel_length - vq_length, 0)),1)
File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

Any idea on how to fix this?
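A hedged sketch of one way to make this padding dimension-consistent, assuming vqw2v_dense has shape (num_frames, feature_dim) and pad_vec is a single frame of shape (feature_dim,); the axis convention in the actual data_loader.py may differ, so treat this as an illustration rather than the repository's fix:

import numpy as np

# Illustration only: pad the vq-wav2vec feature sequence up to the mel length
# by repeating a single padding frame along the time axis, keeping both
# operands 2-D so np.concatenate sees matching dimensions.
def pad_to_mel_length(vqw2v_dense, mel_length, pad_vec=None):
    vq_length, feat_dim = vqw2v_dense.shape
    if pad_vec is None:
        pad_vec = vqw2v_dense[-1]                     # reuse the last frame as padding
    n_pad = mel_length - vq_length
    if n_pad > 0:
        pad_block = np.repeat(pad_vec[np.newaxis, :], n_pad, axis=0)  # (n_pad, feat_dim)
        vqw2v_dense = np.concatenate((vqw2v_dense, pad_block), axis=0)
    return vqw2v_dense

# example: 3 feature frames padded out to 5 mel frames
print(pad_to_mel_length(np.zeros((3, 512)), 5).shape)  # (5, 512)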

training recipe

Hi, thanks for publishing the code!

Can you please add a recipe for training?

own data

Hi, thanks for the code!
I just want to add my own datasets. Could you please tell me how I can do that?
Thanks :)

pretrained weights

Thanks for your amazing work! Could you please publish the pretrained weights of this model? Thanks!

File not found: 'exp/dygan_vc/dygan_vc_vq_spkemb/epoch_100.pth'

I'm getting a file not found error when I run

python inference.py

it returns:

FileNotFoundError: [Errno 2] No such file or directory: 'exp/dygan_vc/dygan_vc_vq_spkemb/epoch_100.pth'

I've searched but can't find where to get those files. Any guidance?
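For context, the config dump in the first issue on this page shows save_freq 20 and an experiment path built from log_dir 'exp', model_name 'dygan_vc' and exp_name 'dygan_vc_vq_spkemb', so epoch_100.pth only exists after ./run_train.sh has saved it; pretrained weights do not appear to be distributed with the repo (see the "pretrained weights" issue above). A small hedged snippet to check what has been saved so far:

# Hedged helper (not part of the repo): list whatever checkpoints exist under
# the experiment directory implied by the training config.
from pathlib import Path

ckpt_dir = Path("exp/dygan_vc/dygan_vc_vq_spkemb")
ckpts = sorted(ckpt_dir.glob("epoch_*.pth"),
               key=lambda p: int(p.stem.split("_")[1]))
print(ckpts[-1] if ckpts else "no checkpoints yet; run ./run_train.sh first")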

How should I prepare the speech in the *.wav files used for training?

I've gotten DYGANVC to train and run inference properly using the VCC2020 dataset, and I'm now trying to record audio for and put together my own two-speaker dataset (of myself and a friend) to train on.

I think I understand everything I need to follow specification-wise (and I've read #6), but I'm not 100% sure what should be in the audio files themselves to produce the best results:

  1. Do both speakers need to say the same transcript for the training to work properly? If it's not necessary, does it still help or does it not matter?

  2. Does it matter how much silence is in the audio files? If a person stops speaking for one second or so in the middle of the wav file, will that confuse the training?

  3. Should the lengths of the audio files be relatively consistent? If most of the WAVs in my corpus end up being 1 to 5 seconds long, but I have one rambling 15 second long sentence to train, should I chop it into multiple clips or leave it as is?

  4. Is there any benefit to expanding the corpus to more speakers even though I only need to convert between two of them? Or does that just add confounding variables?

  5. If some of the WAV files have audible background noises while the speakers talk, does that interfere with training? (i.e., would it train the algorithm to be more resilient to background sounds, or would it just start mistaking those sounds for speech?)

  6. What do you think is the minimum number of minutes of audio per speaker that would still produce mostly passable results? And at what point do diminishing returns set in? (i.e., do you think there would be a significant quality improvement from 30 minutes per speaker over 10 or 15 minutes?)

  7. Do I need to normalize my WAV training files to the same volume or does the algorithm handle that well?

Sorry for all the questions. Even if you can only answer some, it would be a great help, and hopefully other people will find it useful as well. Thank you.

What do these indices mean?

When training, the log shows real:0.32081 fake:0.19402 reg:0.00009 adv:0.30563 id:0.30306, and these values confuse me.
Could you please tell me which one(s) relate to the performance of the generator model? Thank you.
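Based only on the loss weights visible in the training config printed earlier on this page (g_loss: lambda_adv = 1.0, lambda_id = 5.0, lambda_cyc = 0.0; d_loss: lambda_reg = 1.0), a hedged reading is that adv and id are the generator-side terms while real, fake and reg belong to the discriminator objective. A toy sketch with illustrative function names that are not the repo's:

# Hedged sketch, not the repo's code: how the logged terms likely combine,
# using the weights from the training config shown earlier on this page.
def generator_loss(adv, id_loss, lambda_adv=1.0, lambda_id=5.0):
    # adv: adversarial loss against the discriminator; id_loss: identity term
    return lambda_adv * adv + lambda_id * id_loss

def discriminator_loss(real, fake, reg, lambda_reg=1.0):
    # real/fake: discriminator losses on real and generated mels; reg: gradient regulariser
    return real + fake + lambda_reg * reg

# the values quoted in the question
print(generator_loss(adv=0.30563, id_loss=0.30306))                  # generator-side objective
print(discriminator_loss(real=0.32081, fake=0.19402, reg=0.00009))   # discriminator-side objective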

extract_speaker_embed.sh returns an error: "ValueError: Improper number of dimensions to norm."

I'm at step 5 of the "How to run" process from the main Git page, and I'm getting an error. Here is my output when I execute

./extract_speaker_embed.sh

It returns:

./extract_speaker_embed.sh: line 5: activate: No such file or directory
Loaded the voice encoder model on cpu in 0.09 seconds.
SpeakerEncoder(
(lstm): LSTM(40, 256, num_layers=3, batch_first=True)
(linear): Linear(in_features=256, out_features=256, bias=True)
(relu): ReLU()
)
encoder
The number of parameters: 1423616
/home/lhill/anaconda3/envs/torch_1.7/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3441: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/home/lhill/anaconda3/envs/torch_1.7/lib/python3.7/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
File "extract_speaker_embed.py", line 31, in
spk_emb = encoder.embed_speaker(audios)
File "/mnt/w/python stuff/dyganvc/speaker_encoder/voice_encoder.py", line 173, in embed_speaker
return raw_embed / np.linalg.norm(raw_embed, 2)
File "<array_function internals>", line 6, in norm
File "/home/lhill/anaconda3/envs/torch_1.7/lib/python3.7/site-packages/numpy/linalg/linalg.py", line 2611, in norm
raise ValueError("Improper number of dimensions to norm.")
ValueError: Improper number of dimensions to norm.
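The "Mean of empty slice" warning just before the failure suggests the encoder received no audio for a speaker, so the raw embedding ends up zero-dimensional by the time it reaches np.linalg.norm (and the earlier "activate: No such file or directory" line hints the conda env was not activated inside the script). A hedged diagnostic sketch, assuming the dataset was unzipped to vcc20/ as in step 4:

# Illustrative check only (not part of the repo): confirm each speaker
# directory under the dataset root actually contains wav files before
# extract_speaker_embed.py tries to embed them.
from pathlib import Path

data_root = Path("vcc20")  # assumed location from step 4 of "How to run"
for spk_dir in sorted(p for p in data_root.iterdir() if p.is_dir()):
    wavs = list(spk_dir.rglob("*.wav"))
    print(f"{spk_dir.name}: {len(wavs)} wav files")
    if not wavs:
        print("  -> no audio found here; embed_speaker() would receive an empty list")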

Error in extract vqwav2vec features

cp in

model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp])

is not defined. I suppose it should be the path to the checkpoint vq-wav2vec_kmeans.pt.

However, my attempt at loading the checkpoint gives KeyError: 'binary_cross_entropy'.

File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 279, in load_model_ensemble_and_task
state = load_checkpoint_to_cpu(filename, arg_overrides)
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 232, in load_checkpoint_to_cpu
state = _upgrade_state_dict(state)
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 439, in _upgrade_state_dict
cls = REGISTRY["registry"][choice]
KeyError: 'binary_cross_entropy'
$ python -c "import fairseq;print(fairseq.__version__)"
0.10.2

Any idea on how to fix this?
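For context, a minimal sketch of how the checkpoint would typically be loaded with fairseq.checkpoint_utils, assuming cp points at the file downloaded in step 6. The KeyError itself looks like a version mismatch between the installed fairseq and the vq-wav2vec checkpoint (the 'binary_cross_entropy' criterion recorded in the checkpoint is not registered in that release), so the remedy is likely a compatible fairseq version rather than a code change:

# Hedged sketch: standard fairseq checkpoint loading; cp is assumed to be the
# path created in step 6 of "How to run". Whether this loads cleanly depends
# on the installed fairseq version matching the vq-wav2vec checkpoint format.
import fairseq

cp = "vqw2v/vq-wav2vec_kmeans.pt"
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp])
model = models[0]
model.eval()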

train Chinese data

Great work! I trained a model on Chinese data; the speaker embedding uses an nn.Embedding layer rather than the speaker encoder network. At test time, the inter-gender similarity is very poor. Any suggestions?

Why not train like StarGAN

The StarGAN training scheme supports many-to-many VC, so why do you only use the identity loss and not a cycle-consistency loss?
Also, is the identity loss necessary in StarGAN training?
Thank you.
