
assem-vc's People

Contributors

wookladin


assem-vc's Issues

Speaker encoder

Hello. Reading the paper, I see that you used a speaker encoder instead of the commonly used lookup table/embedding to obtain the speaker representation, but there is no further detail on this part. Could you share the exact architecture or the paper you referenced? Also, similar approaches such as Attentron and DeepSinger support zero-shot/any-to-any conversion, where the target does not need to be in the training set and a single reference utterance is enough; did you experiment with that as well?

One-to-Many

Can this be used as a "one-to-many" conversion model? I have a few unpaired datasets, about 1 hour each, 1k+ sentences, SNR > 40 dB. Can this project be trained as a "one-to-many" (one-vs-rest) model?

Cross-lingual supported?

Hey, thank you for sharing your great work!
I'm a student currently working on cross-lingual VC. Your Assem-VC framework produces excellent results for VC within a single language. I'm wondering whether this framework also works for cross-lingual VC. Is there any way to adapt the Cotatron model to support multiple languages?

Training HIFI-GAN faster

Hi @wookladin ,

I was trying to fine-tune HiFi-GAN on a single-speaker dataset (20 mins of audio), and the training time per epoch was around 35 seconds. This seems too long. Any ideas on how to make it faster? I'm using a single T4 GPU machine; I could use a bigger machine with V100 GPUs based on your suggestion.

Build custom non-English dataset with ARPABET

Hi, thanks for open-sourcing this project. I'm a newbie in VC and am trying to add a new speaker to assem-vc.
In the Prepare Metadata section, @wookladin uses python datasets/g2p.py to convert the transcription into ARPABET.
For a custom dataset in a language other than English, e.g. Mandarin Chinese, how should the metadata be built?
I searched for g2p and found https://github.com/kakaobrain/g2pM, a grapheme-to-phoneme conversion tool for Chinese. But the generated results are pinyin, not ARPABET. This really confuses me. Can we use pinyin to build the metadata for Chinese?
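Whether pinyin works directly depends on the text frontend (the symbol list and cleaners would also need to cover the pinyin tokens); the repo does not document this. As a purely illustrative sketch, here is a hypothetical helper that takes g2pM-style pinyin output and formats a metadata line using the same curly-brace phoneme convention the English ARPABET metadata uses:

```python
# Sketch: build a "path|phonemes|speaker_id" metadata line from
# g2pM-style pinyin output. Hypothetical helper, not part of assem-vc;
# the symbols list / cleaners would also need to be extended so the
# model recognizes these tokens.

def pinyin_to_metadata_line(wav_path, pinyin_tokens, speaker_id):
    """Wrap each pinyin syllable in curly braces, mirroring the
    {...}-style phoneme convention of the English ARPABET metadata."""
    phonemes = " ".join("{" + tok.upper() + "}" for tok in pinyin_tokens)
    return f"{wav_path}|{phonemes}|{speaker_id}"

line = pinyin_to_metadata_line(
    "wavs/0001.wav",
    ["ni3", "hao3"],          # e.g. g2pM output for a two-syllable phrase
    "spk01",
)
print(line)  # wavs/0001.wav|{NI3} {HAO3}|spk01
```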

Changing the sampling rate

Hi, I have a voice dataset that is sampled at 16 kHz. I saw that inside the config file there is a specific instruction not to change the audio part of the config, including the sampling rate. Is there a way for me to adapt a different sampling rate into the code?
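Since the audio parameters (sampling rate, hop size, mel filterbank) are tied together, the usual route is to resample the data to the expected rate rather than edit the config. For illustration only, here is a toy linear-interpolation resampler; in practice one would use a proper resampler such as librosa.resample or torchaudio.transforms.Resample, which apply anti-aliasing filters:

```python
# Toy linear-interpolation resampler, for illustration only.
# Real pipelines should use librosa.resample or
# torchaudio.transforms.Resample instead.

def resample_linear(samples, sr_in, sr_out):
    """Resample a list of floats from sr_in to sr_out by linear interpolation."""
    n_out = int(len(samples) * sr_out / sr_in)
    out = []
    for i in range(n_out):
        pos = i * sr_in / sr_out            # fractional position in the input
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + frac * nxt)
    return out

x = [0.0, 1.0, 0.0, -1.0] * 4000            # 16000 samples = 1 s at 16 kHz
y = resample_linear(x, 16000, 22050)
print(len(y))  # 22050
```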

Trouble importing AttrDict

Hi there - I'm having trouble running the inference code. Can't seem to import AttrDict from env.

ImportError Traceback (most recent call last)
<ipython-input-1-7f2d8b5f7f6d> in <module>
15
16 from omegaconf import OmegaConf
---> 17 from env import AttrDict
18
19 from synthesizer import Synthesizer

ImportError: cannot import name 'AttrDict'

Things I've tried:

  • !pip install env
  • Installed AttrDict=2.0.0=py36_0 via conda
  • Made sure Python=3.6.8

Any help would be appreciated :)
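Note that `from env import AttrDict` refers to the local `env.py` file shipped with HiFi-GAN, not a pip package, so `pip install env` or the conda `attrdict` package won't help. If the file is missing from the working directory, a minimal stand-in (a sketch matching what HiFi-GAN's `env.py` defines, to the best of my knowledge) is:

```python
# Minimal stand-in for HiFi-GAN's env.AttrDict: a dict whose keys are
# also accessible as attributes. Place this in env.py next to the
# inference script if the original file is missing.

class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self  # attribute access and item access share storage

h = AttrDict({"sampling_rate": 22050})
print(h.sampling_rate)  # 22050
```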

Best way to extend the model to a new speaker

Hi,
I have a 15 min recording of a new speaker. I'd like to train assem-vc to perform any-to-one voice conversion. Based on my previous experience, the best and fastest way to do it would be to create a single speaker dataset and further fine-tune both the pre-trained VC decoder and pre-trained HIFI-GAN vocoder.

  1. Am I correct or is there a better way to do it?
  2. How would the training steps change to fine-tune the given pre-trained models?

Audio samples sound great! I have a few questions.

Hello!

Very excited for the code release this June, I'm interested to see how my datasets hold up when trained with your model. I've got a few questions so that I can adequately prep my datasets:

  1. Do I only need audio to train a speaker, or do I need audio + text transcriptions?
  2. What specific format and specifications should the audio be in?
  3. Do the speakers need to have the exact same utterances to train, or can they be different utterances?
  4. Do you think this model would be able to perform well doing Any to Many or Many to Many real-time? For example, if the input speaker was a microphone feed.

Thanks for reading, good luck w/ the code release!

How can I train the model on Korean phonemes?

I'm an undergraduate who recently got interested in speech synthesis and started studying it; this project is really impressive.

I heard the Korean speech synthesis in the demo, but there is no Korean example in the 'singer' demo, so I'd like to build the model myself.

I changed the config to kor and korean_cleaners.
Below is part of the metadata.txt I have built so far.

  • wavs\나타내고 싶지 않다고_ 생각하고 있습니다..wav|나타내고 싶지 않다고? 생각하고 있습니다.|001
  • wavs\나타내고 싶지 않다고_ 생각하고 있습니다..wav|나타네고 십찌 안타고? 셍가카고 읻씀니다.|001
  • wavs\나타내고 싶지 않다고_ 생각하고 있습니다..wav|{NA TA NAE GO} {SIP JI} {AN TA GO} {SAENG GA KA GO}? {IT SEUP NI DA}.|001

Is this the right approach?
Any advice would be appreciated.

Questions about many-to-many and any-to-any

Hello! I'm a student studying speech synthesis.

I enjoyed the paper, and I was amazed by the performance while listening to the sample audio.

This is the first VC paper I've read, so I have a couple of questions.

  1. I'm not sure what many-to-many and any-to-many mean.
    My understanding is that
    many-to-many uses source speakers that were seen during training at inference time, while
    any-to-many uses source speakers that were not seen during training. Is that correct?

  2. I looked up GTA fine-tuning but couldn't find a clear explanation. From my reading of the paper,
    it means additionally fine-tuning the originally trained model on mels that have passed through Assem-VC. Is that right?

Thank you.

Teacher forcing

Have you tried setting the teacher forcing rate to 1.0 while training Cotatron?

Extending to n+1 target speakers using pretrained Cotatron

Hello,

How would I extend this model to n+1 target speakers to perform any/many-to-many conversion? When I increase the number of speakers to include the speakers in LibriTTS + our dataset and use the pretrained Cotatron weights, I get an embedding mismatch error when attempting to train the decoder, because the embedding dimensions are derived from the speakers_list in the global config.yaml. Should I simply keep the speakers_list the same, i.e. not include our dataset's speaker names (include only LibriTTS + VCTK), but train the decoder/synthesizer on the combined data, which includes LibriTTS + our dataset?

Thanks
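If the mismatch comes from a speaker layer whose size is derived from speakers_list, one common workaround (a sketch of the general idea, not the authors' recommendation) is to load the pretrained checkpoint, copy the old speaker rows into a larger table, and initialize only the new rows. Plain Python lists stand in here for the checkpoint's weight matrix:

```python
import random

# Sketch: extend a pretrained speaker table from n to n+k speakers by
# copying the existing rows and randomly initializing only the new ones.
# In a real checkpoint this would operate on the embedding weight tensor.

def extend_speaker_table(old_table, n_new, dim, seed=0):
    rng = random.Random(seed)
    new_rows = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(n_new)]
    return old_table + new_rows

pretrained = [[1.0, 2.0], [3.0, 4.0]]       # 2 known speakers, dim=2
extended = extend_speaker_table(pretrained, n_new=1, dim=2)
print(len(extended))  # 3
print(extended[0])    # [1.0, 2.0]  (pretrained rows preserved)
```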

How to split singing voices

Hi, I am trying to reproduce the results presented in the paper "Controllable and Interpretable Singing Voice Decomposition via Assem-VC" with CSD, NUS-48E, and also with custom datasets. The paper says that "all singing voices are split between 1-12 seconds and used for training with corresponding lyrics". I understand that the original .wav files of the datasets need to be split into shorter .wav files before building the metadata files in the format "path_to_wav|transcription|speaker_id". However, I can't find any code in the repository for doing this. How is this splitting process done? Is it done manually for all the datasets?

Thanks!
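The paper does not describe the exact procedure, but one plausible strategy (purely a sketch) is to greedily merge consecutive annotated segments, e.g. note or lyric-line boundaries, into chunks that stay under the 12-second cap:

```python
# Sketch of one possible splitting strategy (the paper does not specify
# the actual procedure): greedily merge consecutive (start, end) segments
# in seconds into chunks no longer than max_len.

def merge_segments(segments, max_len=12.0):
    """segments: list of (start, end) in seconds, in chronological order."""
    chunks = []
    cur_start, cur_end = segments[0]
    for start, end in segments[1:]:
        if end - cur_start <= max_len:
            cur_end = end                   # extend the current chunk
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    chunks.append((cur_start, cur_end))
    return chunks

segs = [(0.0, 4.0), (4.0, 9.0), (9.0, 15.0), (15.0, 18.0)]
print(merge_segments(segs))  # [(0.0, 9.0), (9.0, 18.0)]
```

The resulting chunk boundaries would then be used to cut the .wav files and the corresponding lyrics.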

Reason to use speaker encoder over speaker embeddings?

What was the reason you switched from speaker embeddings (Cotatron) to a speaker encoder (this repo)? Was it because it worked better, or was it to support any-to-any voice conversion? I'm curious because I am currently designing my own architecture and can't really decide between the two.
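For intuition, here is a toy contrast between the two conditioning schemes (illustrative only, not assem-vc's actual modules): a lookup table only has vectors for speakers seen in training, whereas an encoder computes the vector from reference audio features, which is what makes zero-shot conversion possible in principle:

```python
# Toy contrast between speaker conditioning schemes.

# (a) Lookup table: one trained vector per seen speaker; unseen speakers
#     have no entry, so conversion is limited to the training set.
speaker_table = {"spk01": [0.1, 0.2], "spk02": [0.3, 0.4]}

def embed_lookup(speaker_id):
    return speaker_table[speaker_id]        # KeyError for unseen speakers

# (b) Encoder: computes the vector from reference frames (here a trivial
#     mean-pool), so any speaker with a reference clip gets an embedding.
def embed_encoder(ref_frames):
    dim = len(ref_frames[0])
    return [sum(f[d] for f in ref_frames) / len(ref_frames) for d in range(dim)]

print(embed_lookup("spk01"))                    # [0.1, 0.2]
print(embed_encoder([[0.0, 2.0], [2.0, 0.0]]))  # [1.0, 1.0]
```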

Question about GTA mel-spectrograms

I'm trying to understand the GTA part of your paper, which seems to have a huge influence, and I'm unsure whether I understood it correctly. This much I got: you have two networks, one that maps a source mel-spectrogram, a target speaker mel-spectrogram, and the transcription to a transformed spectrogram, and the vocoder, which maps the transformed spectrogram to a waveform.

You first train the first network. Then, instead of converting the waveform to a spectrogram and using that as input to train the vocoder, you pass the audio through your proposed network and use its output as the input for training the vocoder. Is that correct?
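My reading of GTA fine-tuning, sketched below with stub functions standing in for the real models: the vocoder is fine-tuned on mels reconstructed by the synthesizer (under teacher forcing), paired with the original ground-truth waveforms, so the vocoder learns on the same slightly imperfect mels it will receive at conversion time:

```python
# Sketch of building a ground-truth-aligned (GTA) fine-tuning set.
# `mel` and `synthesize_gta` are stubs standing in for the real models;
# the point is the pairing: (reconstructed mel, ORIGINAL waveform).

def mel(wav):
    return [s * 0.5 for s in wav]                # stub "mel-spectrogram"

def synthesize_gta(mel_spec, text):
    return [m * 1.01 for m in mel_spec]          # stub teacher-forced reconstruction

def build_gta_pairs(dataset):
    pairs = []
    for wav, text in dataset:
        gta_mel = synthesize_gta(mel(wav), text) # vocoder input: model output
        pairs.append((gta_mel, wav))             # vocoder target: the real wav
    return pairs

pairs = build_gta_pairs([([1.0, -1.0], "hi")])
print(pairs[0][1])  # [1.0, -1.0]  (target waveform unchanged)
```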

Possible bottleneck?

I got this warning:

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: Dataloader(num_workers>0) and ddp_spawn do not mix well! Your performance might suffer dramatically. Please consider setting distributed_backend=ddp to use num_workers > 0 (this is a bottleneck of Python .spawn() and PyTorch

Is this OK?

Slurred speech when training on Korean data

Hello, I'm a developer who has been testing various things with the code you released.
I've been testing on Korean data such as KSS and the National Institute of Korean Language corpora,
but the conversion output sounds slurred. Why might that be?
I did change the text settings to kor and korean_cleaners correctly before training.
Could it be that the Cotatron or the Synthesizer is undertrained?

Speech+Transcript conditioned phoneme recognition as an alternative to G2P

Hi @wookladin ,
While creating the training data, G2P gives phonemes based on how a particular word is supposed to be pronounced, but the audio might have a slightly different pronunciation due to various accents. I understand that you've used a proprietary G2P for better results, but G2P models only utilize transcript information.

  1. A speech+transcript-conditioned phoneme recognizer would give better results, wouldn't it?
  2. Phoneme error rates are still high in the latest ASR acoustic models. Usually, ASR acoustic models predict not-so-accurate phonemes, and ASR language models predict the transcript from the phonemes. But here we want to improve the accuracy of the phonemes given both the audio and the transcript. I couldn't find any literature on that. Any leads/ideas?

Code release?

When can we expect a code release? Any timetable for that?

Regarding teacher forcing to calculate alignment

Hi,
The alignment for the i-th step does not use the mel frame of the i-th step. Even with teacher forcing, we are essentially predicting the alignment at every step from the previous mel frames, without utilizing the actual ground-truth mel frame.

We could do one of these instead:

  • Since M_i depends on A_i, and since we know the ground truth M_i, we could freeze the network weights and find A_i using autograd.

  • We could also use a forced aligner such as the Montreal Forced Aligner (MFA) to obtain the alignment matrix directly.
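The second option can be sketched concretely: given MFA-style phoneme intervals, a hard (binary) alignment matrix follows directly from the frame timing. This is a hypothetical helper; parsing the actual TextGrid output is omitted:

```python
# Sketch: turn forced-aligner phoneme intervals into a hard (binary)
# alignment matrix A[phoneme][frame]. Hypothetical helper; real
# TextGrid parsing is omitted.

def intervals_to_alignment(intervals, hop_sec, n_frames):
    """intervals: list of (phoneme, start_sec, end_sec)."""
    A = [[0.0] * n_frames for _ in intervals]
    for p, (_, start, end) in enumerate(intervals):
        for t in range(n_frames):
            center = (t + 0.5) * hop_sec     # time at the frame's center
            if start <= center < end:
                A[p][t] = 1.0
    return A

A = intervals_to_alignment(
    [("HH", 0.0, 0.02), ("AY", 0.02, 0.04)],
    hop_sec=0.01,
    n_frames=4,
)
print(A)  # [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
```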

Other models

I see that there is a Korean model in the samples; how does this step work?
python datasets/g2p.py -i < input_metadata_filename_with_graphemes > -o < output_filename >

I see something like this in the dataset. How should Korean and English be handled?
1/1_0000.wav|그는 괜찮은 척하려고 애쓰는 것 같았다.|그는 괜찮은 척하려고 애쓰는 것 같았다.|그는 괜찮은 척하려고 애쓰는 것 같았다.|3.5|He seemed to be pretending to be okay.

Pre-trained model

Could you please tell us when you plan to release the pre-trained models?
Could you also provide some kind of loss graph, or just the number of training steps needed for each module to converge on the LibriTTS+VCTK dataset, so we could estimate whether it is possible for mere mortals to train the model without multiple advanced GPUs?
Could you elaborate on the audio normalization mentioned in your paper? Is it implemented somewhere in your project, or should we process the audio files by some other means?
Thank you!
