
dyganvc's Introduction

Efficient Non-Autoregressive GAN Voice Conversion using VQWav2vec Features and Dynamic Convolution

Mingjie Chen, Yanghao Zhou, Heyan Huang, Thomas Hain


It was shown recently that a combination of ASR and TTS models yields highly competitive performance on standard voice conversion tasks such as the Voice Conversion Challenge 2020 (VCC2020). To obtain good performance, both models require pretraining on large amounts of data, resulting in large models that are potentially inefficient in use. In this work we present a model that is significantly smaller, and thereby faster in processing, while obtaining equivalent performance. To achieve this, the proposed model, Dynamic-GAN-VC (DYGAN-VC), uses a non-autoregressive structure and makes use of vector-quantised embeddings obtained from a VQWav2vec model. Furthermore, dynamic convolution is introduced to improve speech content modelling while requiring a small number of parameters. Objective and subjective evaluation was performed on the VCC2020 task, yielding MOS scores of up to 3.86 and character error rates as low as 4.3%. This was achieved with approximately half the number of model parameters, and up to 8 times faster decoding speed.

[paper] [demo]
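The dynamic convolution blocks that appear in the model printout further down this page (hidden size 256, 4 heads, kernel size 9, GLU-gated value projection) can be sketched roughly as follows. This is a hedged illustration of the general lightweight/dynamic convolution idea, not the authors' implementation; the class and variable names are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvSketch(nn.Module):
    """Per-timestep convolution kernels are predicted from the input and
    applied as a depthwise, head-shared lightweight convolution."""

    def __init__(self, dim=256, kernel_size=9, num_heads=4):
        super().__init__()
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        self.k_layer = nn.Linear(dim, 2 * dim)              # value projection, GLU-gated back to dim
        self.conv_kernel_layer = nn.Linear(dim, num_heads * kernel_size)  # 4 * 9 = 36 weights per frame
        self.ln = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, dim)
        B, T, D = x.shape
        H, K = self.num_heads, self.kernel_size
        v = F.glu(self.k_layer(x), dim=-1)                  # (B, T, D)
        w = torch.softmax(self.conv_kernel_layer(x).view(B, T, H, K), dim=-1)
        # gather K-frame windows centred on each timestep
        v = v.transpose(1, 2).unsqueeze(-1)                 # (B, D, T, 1)
        win = F.unfold(v, (K, 1), padding=((K - 1) // 2, 0))  # (B, D*K, T)
        win = win.view(B, H, D // H, K, T)                  # split channels into heads
        out = torch.einsum('bhdkt,bthk->bthd', win, w)      # per-frame weighted sum over the window
        out = out.reshape(B, T, D)
        return self.ln(out + x)                             # residual + layer norm

# shape check: a 100-frame, 256-dim feature sequence keeps its shape
y = DynamicConvSketch()(torch.randn(2, 100, 256))
print(y.shape)  # torch.Size([2, 100, 256])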

Dataset

VCC2020 track 1

Dependencies

  1. fairseq
  2. Parallel WaveGAN vocoder

How to run

  1. Clone the repository
git clone https://github.com/MingjieChen/DYGANVC.git
cd DYGANVC
  2. Create a conda env and install PyTorch 1.7 through conda
conda create --name torch_1.7 python==3.7
conda activate torch_1.7
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
conda install librosa -c conda-forge

Choose the cudatoolkit version that matches your GPU and driver.

  3. Install packages
pip install fairseq parallel_wavegan munch pyyaml SoundFile tqdm scikit-learn tensorboardX webrtcvad
  4. Download the dataset and unzip it to vcc20/
./data_download.sh vcc20
  5. Extract speaker embeddings
./extract_speaker_embed.sh
  6. Download the vq-wav2vec checkpoint
mkdir vqw2v
cd vqw2v
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/vq-wav2vec_kmeans.pt
  7. Extract vq-wav2vec features
./vqw2v_feat_extract.sh
  8. Start training
./run_train.sh
  9. Run inference
python inference.py

dyganvc's People

Contributors

keithcallenberg, mingjiechen


dyganvc's Issues

Error in solve mismatch between vq feats and mels

Running train.py gives me an error in

vqw2v_dense = np.concatenate((vqw2v_dense, np.repeat(pad_vec, mel_length - vq_length, 0)),1)

with message ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

Output of extract_speaker_embed.py
$ python -W ignore::FutureWarning  extract_speaker_embed.py
Loaded the voice encoder model on cpu in 0.01 seconds.
SpeakerEncoder(
(lstm): LSTM(40, 256, num_layers=3, batch_first=True)
(linear): Linear(in_features=256, out_features=256, bias=True)
(relu): ReLU()
)
encoder
The number of parameters: 1423616
spk_emb SEF1 (256,)
spk_emb SEF2 (256,)
spk_emb SEM1 (256,)
spk_emb SEM2 (256,)
spk_emb TEF1 (256,)
spk_emb TEF2 (256,)
spk_emb TEM1 (256,)
spk_emb TEM2 (256,)
Output of train.py
$ python train.py 
{'log_dir': 'exp', 'model_name': 'dygan_vc', 'exp_name': 'dygan_vc_vq_spkemb', 'pretrained_model': '', 'fp16_run': False, 'trainer': 'VQMelSpkEmbLSTrainer', 'epochs': 100, 'num_speakers': 8, 'save_freq': 20, 'load_only_params': False, 'data_loader': {'dataset': 'VQMelSpkEmbDataset', 'data_dir': 'vcc2020', 'vq_dir': 'dump/vqw2v_feat_test', 'batch_size': 8,'speakers': 'speaker.json', 'spk_emb_dir': 'dump/ppg-vc-spks', 'shuffle': True, 'drop_last': False, 'num_workers': 4, 'min_length': 128, 'stats': 'vocoder/stats.npy'}, 'model': {'generator': {'model_name': 'Generator0', 'in_feat_dim': 512, 'out_feat_dim': 80, 'kernel': '9_9_9_9_9_9', 'num_heads': '4_4_4_4_4_4', 'num_res_blocks': 6, 'hidden_size': '256_256_256_256_256_256', 'spk_emb_dim': 256, 'hid2_factor': 1, 'res_wff_kernel1': 3, 'res_wff_kernel2': 3, 'res_wadain_use_ln': False, 'res_wff_use_res': True, 'res_wff_use_act2': True, 'res_use_ln': True, 'use_kconv': False, 'wadain_beta': False, 'ff_block': 'WadainFF', 'conv_block': 'DynamicConv', 'scale': 1.0, 'out_kernel': 1}, 'discriminator': {'model_name': 'Discriminator128', 'num_speakers': 8, 'kernel_size': 3, 'padding': 1}}, 'loss': {'g_loss': {'lambda_cyc': 0.0, 'lambda_id': 5.0, 'lambda_adv': 1.0}, 'd_loss': {'lambda_reg': 1.0, 'lambda_con_reg': 5.0}, 'con_reg_epoch': 50000}, 'optimizer': {'discriminator': {'lr': 2e-05, 'weight_decay': 0.0001, 'betas': [0.5, 0.999]}, 'generator': {'lr': 0.0001, 'weight_decay': 0.0001, 'betas': [0.5, 0.999]}}}
SEF1
60
SEF2
60
SEM1
60
SEM2
60
TEF1
60
TEF2
60
TEM1
60
TEM2
60
loading files 480
SEF1
10
SEF2
10
SEM1
10
SEM2
10
TEF1
10
TEF2
10
TEM1
10
TEM2
10
loading files 80
Generator0(
(conv1): Sequential(
(0): Conv1d(512, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(1): LeakyReLU(negative_slope=0.2)
)
(res_blocks): Sequential(
(0): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(1): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
(2): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(3): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
(4): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(5): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
(6): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(7): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
(8): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(9): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
(10): DynamicConv(
(k_layer): Linear(in_features=256, out_features=512, bias=True)
(conv_kernel_layer): Linear(in_features=256, out_features=36, bias=True)
(lconv): LightConv(
(unfold1d): Unfold(kernel_size=[9, 1], dilation=1, padding=[4, 0], stride=1)
)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(act): GLU(dim=-1)
)
(11): WadainFF(
(conv1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): WadaIN(
(act): LeakyReLU(negative_slope=0.2)
(style_linear): EqualLinear()
)
(act): ReLU()
)
)
(out): Sequential(
(0): Conv1d(256, 80, kernel_size=(1,), stride=(1,))
)
)
generator
The number of parameters: 4281384
Discriminator128(
(conv_layer_1): Sequential(
(0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(down_sample_1): DisRes(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv2): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv1x1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(act): LeakyReLU(negative_slope=0.2)
)
(down_sample_2): DisRes(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): LeakyReLU(negative_slope=0.2)
)
(down_sample_3): DisRes(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): LeakyReLU(negative_slope=0.2)
)
(down_sample_4): DisRes(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(act): LeakyReLU(negative_slope=0.2)
)
(blocks): Sequential(
(0): LeakyReLU(negative_slope=0.2)
(1): Conv2d(128, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(2): LeakyReLU(negative_slope=0.2)
(3): AdaptiveAvgPool2d(output_size=1)
)
(dis_conv): Conv2d(128, 8, kernel_size=(1, 1), stride=(1, 1))
)
discriminator
The number of parameters: 1415880
Traceback (most recent call last):
File "train.py", line 96, in <module>
main(args.config_path)
File "train.py", line 75, in main
train_results = trainer._train_epoch()
File "/deepmind/experiments/mingjiechen/dyganvc/vqmel_spkemb_ls_trainer.py", line 228, in _train_epoch
for train_steps_per_epoch, batch in enumerate(self.train_dataloader, 1):
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
return self._process_data(data)
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
data.reraise()
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/deepmind/experiments/mingjiechen/dyganvc/data_loader.py", line 117, in __getitem__
vqw2v_dense = np.concatenate((vqw2v_dense, np.repeat(pad_vec, mel_length - vq_length, 0)),1)
File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

Any idea on how to fix this?
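A hedged sketch of one way to make this padding dimension-consistent, assuming vqw2v_dense has shape (num_frames, feature_dim) and pad_vec is a single frame of shape (feature_dim,); the axis convention in the actual data_loader.py may differ, so treat this as an illustration rather than the repository's fix:

import numpy as np

# Illustration only: pad the vq-wav2vec feature sequence up to the mel length
# by repeating a single padding frame along the time axis, keeping both
# operands 2-D so np.concatenate sees matching dimensions.
def pad_to_mel_length(vqw2v_dense, mel_length, pad_vec=None):
    vq_length, feat_dim = vqw2v_dense.shape
    if pad_vec is None:
        pad_vec = vqw2v_dense[-1]                     # reuse the last frame as padding
    n_pad = mel_length - vq_length
    if n_pad > 0:
        pad_block = np.repeat(pad_vec[np.newaxis, :], n_pad, axis=0)  # (n_pad, feat_dim)
        vqw2v_dense = np.concatenate((vqw2v_dense, pad_block), axis=0)
    return vqw2v_dense

# example: 3 feature frames padded out to 5 mel frames
print(pad_to_mel_length(np.zeros((3, 512)), 5).shape)  # (5, 512)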

training recipe

Hi, thanks for publishing the code!

Can you please add a recipe for training?

own data

Hi, thanks for the code!
I just want to add my own datasets. Could you please tell me how I can do that?
Thanks :)

pretrained weights

Thanks for your amazing work! Could you please publish the pretrained weights of this model? Thanks!

File not found: 'exp/dygan_vc/dygan_vc_vq_spkemb/epoch_100.pth'

I'm getting a file not found error when I run

python inference.py

it returns:

FileNotFoundError: [Errno 2] No such file or directory: 'exp/dygan_vc/dygan_vc_vq_spkemb/epoch_100.pth'

I've searched but can't find where to get those files. Any guidance?
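For context, the config dump in the first issue on this page shows save_freq 20 and an experiment path built from log_dir 'exp', model_name 'dygan_vc' and exp_name 'dygan_vc_vq_spkemb', so epoch_100.pth only exists after ./run_train.sh has saved it; pretrained weights do not appear to be distributed with the repo (see the "pretrained weights" issue above). A small hedged snippet to check what has been saved so far:

# Hedged helper (not part of the repo): list whatever checkpoints exist under
# the experiment directory implied by the training config.
from pathlib import Path

ckpt_dir = Path("exp/dygan_vc/dygan_vc_vq_spkemb")
ckpts = sorted(ckpt_dir.glob("epoch_*.pth"),
               key=lambda p: int(p.stem.split("_")[1]))
print(ckpts[-1] if ckpts else "no checkpoints yet; run ./run_train.sh first")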

How should I prepare the speech in the *.wav files used for training?

I've gotten DYGANVC to train and run inference properly using the VCC2020 dataset, and I'm now trying to record audio for and put together my own two-speaker dataset (of myself and a friend) to train on.

I think I understand everything I need to follow specification-wise (and I've read #6), but I'm not 100% sure what should be in the audio files themselves to produce the best results:

  1. Do both speakers need to say the same transcript for the training to work properly? If it's not necessary, does it still help or does it not matter?

  2. Does it matter how much silence is in the audio files? If a person stops speaking for one second or so in the middle of the wav file, will that confuse the training?

  3. Should the lengths of the audio files be relatively consistent? If most of the WAVs in my corpus end up being 1 to 5 seconds long, but I have one rambling 15 second long sentence to train, should I chop it into multiple clips or leave it as is?

  4. Is there any benefit to expanding the corpus to more speakers even though I only need to convert between two of them? Or does that just add confounding variables?

  5. If some of the WAV files have audible background noises while the speakers talk, does that interfere with training? (i.e., would it train the algorithm to be more resilient to background sounds, or would it just start mistaking those sounds for speech?)

  6. What do you think is the minimum number of minutes of audio per speaker that would still produce mostly passable results? And at what point do diminishing returns set in? (i.e., do you think there would be a significant quality improvement from 30 minutes per speaker over 10 or 15 minutes?)

  7. Do I need to normalize my WAV training files to the same volume or does the algorithm handle that well?

Sorry for all the questions. Even if you can only answer some, it would be a great help, and hopefully other people will find it useful as well. Thank you.

What do these indices mean?

When training, the log shows real:0.32081 fake:0.19402 reg:0.00009 adv:0.30563 id:0.30306, and these values confuse me.
Could you please tell me which one(s) relate to the performance of the generator model? Thank you.
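Based only on the loss weights visible in the training config printed earlier on this page (g_loss: lambda_adv = 1.0, lambda_id = 5.0, lambda_cyc = 0.0; d_loss: lambda_reg = 1.0), a hedged reading is that adv and id are the generator-side terms while real, fake and reg belong to the discriminator objective. A toy sketch with illustrative function names that are not the repo's:

# Hedged sketch, not the repo's code: how the logged terms likely combine,
# using the weights from the training config shown earlier on this page.
def generator_loss(adv, id_loss, lambda_adv=1.0, lambda_id=5.0):
    # adv: adversarial loss against the discriminator; id_loss: identity term
    return lambda_adv * adv + lambda_id * id_loss

def discriminator_loss(real, fake, reg, lambda_reg=1.0):
    # real/fake: discriminator losses on real and generated mels; reg: gradient regulariser
    return real + fake + lambda_reg * reg

# the values quoted in the question
print(generator_loss(adv=0.30563, id_loss=0.30306))                  # generator-side objective
print(discriminator_loss(real=0.32081, fake=0.19402, reg=0.00009))   # discriminator-side objective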

extract_speaker_embed.sh returns an error: "ValueError: Improper number of dimensions to norm."

I'm at step 5 of the "How to run" process from the main Git page, and I'm getting an error. Here is my output when I execute

./extract_speaker_embed.sh

It returns:

./extract_speaker_embed.sh: line 5: activate: No such file or directory
Loaded the voice encoder model on cpu in 0.09 seconds.
SpeakerEncoder(
(lstm): LSTM(40, 256, num_layers=3, batch_first=True)
(linear): Linear(in_features=256, out_features=256, bias=True)
(relu): ReLU()
)
encoder
The number of parameters: 1423616
/home/lhill/anaconda3/envs/torch_1.7/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3441: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/home/lhill/anaconda3/envs/torch_1.7/lib/python3.7/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
File "extract_speaker_embed.py", line 31, in
spk_emb = encoder.embed_speaker(audios)
File "/mnt/w/python stuff/dyganvc/speaker_encoder/voice_encoder.py", line 173, in embed_speaker
return raw_embed / np.linalg.norm(raw_embed, 2)
File "<array_function internals>", line 6, in norm
File "/home/lhill/anaconda3/envs/torch_1.7/lib/python3.7/site-packages/numpy/linalg/linalg.py", line 2611, in norm
raise ValueError("Improper number of dimensions to norm.")
ValueError: Improper number of dimensions to norm.
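The "Mean of empty slice" warning just before the failure suggests the encoder received no audio for a speaker, so the raw embedding ends up zero-dimensional by the time it reaches np.linalg.norm (and the earlier "activate: No such file or directory" line hints the conda env was not activated inside the script). A hedged diagnostic sketch, assuming the dataset was unzipped to vcc20/ as in step 4:

# Illustrative check only (not part of the repo): confirm each speaker
# directory under the dataset root actually contains wav files before
# extract_speaker_embed.py tries to embed them.
from pathlib import Path

data_root = Path("vcc20")  # assumed location from step 4 of "How to run"
for spk_dir in sorted(p for p in data_root.iterdir() if p.is_dir()):
    wavs = list(spk_dir.rglob("*.wav"))
    print(f"{spk_dir.name}: {len(wavs)} wav files")
    if not wavs:
        print("  -> no audio found here; embed_speaker() would receive an empty list")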

Error in extract vqwav2vec features

cp in

model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp])

is not defined. I suppose it should be the path to the checkpoint vq-wav2vec_kmeans.pt.

However, my attempt at loading the checkpoint gives KeyError: 'binary_cross_entropy'.

File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 279, in load_model_ensemble_and_task
state = load_checkpoint_to_cpu(filename, arg_overrides)
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 232, in load_checkpoint_to_cpu
state = _upgrade_state_dict(state)
File "/storage/usr/conda/envs/dyganvc/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 439, in _upgrade_state_dict
cls = REGISTRY["registry"][choice]
KeyError: 'binary_cross_entropy'
$ python -c "import fairseq;print(fairseq.__version__)"
0.10.2

Any idea on how to fix this?
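For context, a minimal sketch of how the checkpoint would typically be loaded with fairseq.checkpoint_utils, assuming cp points at the file downloaded in step 6. The KeyError itself looks like a version mismatch between the installed fairseq and the vq-wav2vec checkpoint (the 'binary_cross_entropy' criterion recorded in the checkpoint is not registered in that release), so the remedy is likely a compatible fairseq version rather than a code change:

# Hedged sketch: standard fairseq checkpoint loading; cp is assumed to be the
# path created in step 6 of "How to run". Whether this loads cleanly depends
# on the installed fairseq version matching the vq-wav2vec checkpoint format.
import fairseq

cp = "vqw2v/vq-wav2vec_kmeans.pt"
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp])
model = models[0]
model.eval()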

train Chinese data

Great work! I trained a model on Chinese data; the speaker embedding uses an nn.Embedding layer rather than the speaker encoder network. At test time, the inter-gender similarity is very poor. Any suggestions?

Why not train like StarGAN

The StarGAN training scheme supports many-to-many VC, so why do you only use the identity loss and not a cycle-consistency loss?
Also, is the identity loss necessary in StarGAN training?
Thank you.
