rongjiehuang / generspeech
PyTorch Implementation of GenerSpeech (NeurIPS'22): a text-to-speech model towards zero-shot style transfer of OOD custom voice.
License: MIT License
@Rongjiehuang The links to the pretrained models are down. I can host them if you need; let me know!
I'm trying to run inference towards style transfer of custom voice in Colab. I got this error:
Traceback (most recent call last):
File "GenerSpeech.py", line 35, in <module>
GenerSpeechInfer.example_run()
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/inference/base_tts_infer.py", line 170, in example_run
out = infer_ins.infer_once(inp)
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/inference/base_tts_infer.py", line 153, in infer_once
inp = self.preprocess_input(inp)
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/inference/base_tts_infer.py", line 82, in preprocess_input
EmotionEncoder.load_model(self.hparams['emotion_encoder_path'])
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/data_gen/tts/emotion/inference.py", line 33, in load_model
checkpoint = torch.load(weights_fpath)
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 771, in load
with _open_file_like(f, 'rb') as opened_file:
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 270, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 251, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home1/huangrongjie/Project/Emotion_encoder/1121_emotion_encoder.pt'
I put the emotion encoder in 'checkpoints/Emotion_encoder.pt', as the README says.
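The traceback shows torch.load receiving the absolute path baked into the author's config ('/home1/huangrongjie/...') rather than the local checkpoint. A minimal sketch of fallback logic that avoids this failure mode (the function name and fallback path are illustrative, not the repo's actual API):

```python
import os

def resolve_checkpoint(configured_path, fallback="checkpoints/Emotion_encoder.pt"):
    """Prefer the configured checkpoint path, but fall back to the local
    copy when the configured one (e.g. an absolute path from another
    machine) does not exist."""
    if configured_path and os.path.exists(configured_path):
        return configured_path
    return fallback

# The stale path from the traceback would resolve to the local checkpoint:
path = resolve_checkpoint("/home1/huangrongjie/Project/Emotion_encoder/1121_emotion_encoder.pt")
```

In practice the same effect is achieved by editing the emotion encoder path in the YAML config, as a later comment in this thread points out.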
Hi,
Thank you for your work. I noticed a mismatch between your implementation and the paper in the postnet. In the paper, the postnet is conditioned on the decoder input and the mel-decoder output, while in the implementation you condition the postnet on the decoder input and other conditions (speaker, emotion and prosody); you don't condition on the output of the transformer decoder. Is there any reason for this mismatch?
Kind regards
I would like to reproduce this on a new dataset. Please provide detailed steps for generating a new dictionary.
Hi, can I ask if you're using v1 of MFA? I tried running the inference, but when it comes to the MFA portion it generates only an empty .txt file, and no TextGrid file at all.
It would be great if you could provide the MFA instructions.
Hello authors,
I have read your paper and find it very promising. I have a few questions about the pretraining stage mentioned in the paper. Specifically, I am wondering how important the finetuning of wav2vec 2.0 is for the speaker representation.
I also have a question about the VQ codebook. As I understand it, a 1-way, 128-token codebook is used in this work. As mentioned in the paper, VQ is prone to index collapse. Have you considered using a Gaussian-based VAE or a vanilla autoencoder-like bottleneck instead?
Finally, could you explain the function of the shuffle operation? Does it work similarly to an entire channel dropout, or does it have other advantages over dropout?
Thank you in advance for your time and consideration.
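For reference, the nearest-neighbour lookup at the heart of the VQ bottleneck discussed above (the step where rarely-selected codes lead to index collapse) can be sketched as follows; this is a toy illustration, not the repo's implementation:

```python
def vq_lookup(frame, codebook):
    """Nearest-neighbour vector quantization: map one feature vector to
    the index of the closest codebook entry (squared Euclidean distance).
    A real model would use a (128, D) codebook of learned embeddings."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # toy 3-token codebook
idx = vq_lookup((0.9, 0.1), codebook)  # closest to (1.0, 0.0)
```

Index collapse corresponds to most frames mapping to a handful of indices, leaving the rest of the codebook unused.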
Thanks for the amazing work.
I followed the instructions to generate a Non-Parallel Transfer output, with reference audio from VCTK from the demo page (https://generspeech.github.io/#non-parallel-transfer, ref text: "When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.").
the expected output should be: https://generspeech.github.io/wavs/NonParallelTransfer/VCTK/GenerSpeech/001.wav
but this is what I got: https://drive.google.com/file/d/1yRiW6TRlUcwwbs4MlS33VCmDQmj0pVA2/view?usp=share_link
The command I used to run inference:
PYTHONPATH=. CUDA_VISIBLE_DEVICES=0 python inference/GenerSpeech.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --hparams="text='We also need a small plastic snake and a big toy frog for the kids.',ref_audio='vctk_001.wav'"
Did I miss something or do anything wrong?
Hello, I am trying to set this up on WSL2; not sure if that is the issue or not. However, my problem output is below.
building 'dfftpack' library
<string>:111: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
Running from scipy source directory.
/tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:624: UserWarning:
Atlas (http://math-atlas.sourceforge.net/) libraries not found.
Directories to search for the libraries can be specified in the
numpy/distutils/site.cfg file (section [atlas]) or by setting
the ATLAS environment variable.
self.calc_info()
/tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:716: UserWarning: Specified path /tmp/pip-build-env-ihzusus3/overlay/include/python3.8 is invalid.
return self.get_paths(self.section, key)
/tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:716: UserWarning: Specified path /usr/local/include/python3.8 is invalid.
return self.get_paths(self.section, key)
/tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:716: UserWarning: Specified path /usr/include/python3.8 is invalid.
return self.get_paths(self.section, key)
error: library dfftpack has Fortran sources but no Fortran compiler found
----------------------------------------
ERROR: Failed building wheel for scipy
ERROR: Could not build wheels for scipy which use PEP 517 and cannot be installed directly
failed
CondaEnvException: Pip failed
any help would be appreciated.
Installing MFA and the Python libraries on macOS is giving a lot of trouble.
Would it be possible to use any other tool/library or already aligned files for the wav files in assets/ folder?
Would really appreciate this.
Thanks in advance.
Thank you for sharing this great work. Could you please share the weights? The link has expired.
Thank you!!
When I ran python train_mfa_align.py --config modules/GenerSpeech/config/generspeech.yaml, I got this error:
train_and_align - DEBUG - ERROR (gmm-est-fmllr[5.5]:FindKeyInternal():util/kaldi-table-inl.h:2106) You provided the "cs" option but are not calling with keys in sorted order: 101287-7565-101287-000007-000001 < 110-1-000039-000003: rspecifier is ark,s,cs:apply-cmvn --utt2spk=ark:data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/corpus_data/subset_10000/utt2spk.0 scp:data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/corpus_data/subset_10000/cmvn.0.scp scp:data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/corpus_data/subset_10000/feats.0.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/sat1/lda.mat ark:- ark:- |
How can I solve this?
Thanks for your response.
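The Kaldi error above means the ",cs" (called-sorted) reader received utterance keys out of byte-wise sorted order: as C strings, "101287-…" sorts before "110-…" because '0' < '1' at the third character. A standalone sketch of that ordering check (not part of the repo):

```python
def first_unsorted_pair(keys):
    """Return the first adjacent pair of keys that violates byte-wise
    (C-locale) sorted order, or None if the keys are sorted. Kaldi's
    ',cs' rspecifier option requires this ordering."""
    for a, b in zip(keys, keys[1:]):
        if a > b:
            return (a, b)
    return None

# The pair from the error message arrives in the wrong order:
bad = first_unsorted_pair(["110-1-000039-000003",
                           "101287-7565-101287-000007-000001"])
```

One common workaround is renaming utterances so string order matches the order they are fed in, e.g. zero-padding numeric IDs to a fixed width.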
@Rongjiehuang Could you please specify the versions of all the libraries and tools used, so that we can run everything correctly and reproduce the results?
Thanks in advance.
FileNotFoundError: [Errno 2] No such file or directory: '/home/junjun.guo/.cache/huggingface/hub/models--facebook--wav2vec2-base-960h/refs/main'
Thank you for your work. The pre-trained model you provided has expired; can you update it again? Thank you very much!
@Rongjiehuang I got this repo to work, but I had to correct some things. Hope it helps someone else.
- Run sudo apt install gfortran libopenblas-base. These packages are required but not specified.
- Edit environment.yaml to remove the duplicate scipy and numpy entries, and remove the version requirements on scipy and numba (old versions cause conflicts with numpy).
- Run pip uninstall nvidia_cublas_cu11 (or whatever version you have).
- In modules/GenerSpeech/config/generspeech.yaml, change emotion_encoder_path to checkpoints/Emotion_encoder.pt.
- Fix sys.path, either by moving GenerSpeech.py to the GenerSpeech dir or by adding these lines at the top of GenerSpeech.py (otherwise Python can't find the imports):
  import sys, os
  sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
- Run mfa thirdparty download.
- In utils/hparams.py, lines 29 and 32 should remove help='location of the data corpus' because it's misleading. Line 41 needs to include remove=False.
- data_gen_utils line 299 trips if there is a word missing from mfa_dict.txt, because the TextGrid will skip the phones of the missing word. Actually, some common words are not in the dictionary, like "her" (HH_ER1) and "processing" (P_R_AA1_S_EH0_S_IH0_NG). You have to add them to the dictionary yourself. The correct way is to run mfa validate and append to mfa_dict.txt first (see this script).
- Use praatio as the standard TextGrid parser.

What is the dataset satisfying the requirement?
Traceback (most recent call last):
File "inference/GenerSpeech.py", line 2, in <module>
from inference.base_tts_infer import BaseTTSInfer
ModuleNotFoundError: No module named 'inference'
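This error appears when the script is launched as python inference/GenerSpeech.py without the repo root on sys.path. Running with PYTHONPATH=. from the repo root (as in the inference command quoted earlier in this thread) avoids it; alternatively, a small helper at the top of the script does the same. This is a sketch; add_repo_root is an illustrative name, not a repo function:

```python
import os
import sys

def add_repo_root(script_path):
    """Insert the parent of the script's directory at the front of
    sys.path so sibling packages such as 'inference' and 'modules'
    become importable."""
    root = os.path.abspath(os.path.join(os.path.dirname(script_path), os.pardir))
    if root not in sys.path:
        sys.path.insert(0, root)
    return root

# At the top of inference/GenerSpeech.py, before any repo imports:
# add_repo_root(__file__)
```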
/content/GenerSpeech
Traceback (most recent call last):
File "inference/GenerSpeech.py", line 35, in <module>
GenerSpeechInfer.example_run()
File "/content/GenerSpeech/inference/base_tts_infer.py", line 164, in example_run
set_hparams()
File "/content/GenerSpeech/inference/utils/hparams.py", line 94, in set_hparams
if v in ['True', 'False'] or type(config_node[k]) in [bool, list, dict]:
KeyError: 'text'
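The KeyError: 'text' comes from set_hparams indexing config_node[k] for an override key that is not in the loaded config (here, text from --hparams). A tolerant version of that override loop might look like this; it is a simplified sketch, and the repo's real set_hparams handles quoting and nesting differently:

```python
def apply_overrides(config, overrides):
    """Apply comma-separated 'key=value' overrides to a config dict,
    accepting keys absent from the config instead of raising KeyError.
    Assumes values contain no commas or '=' characters."""
    for pair in overrides.split(","):
        key, value = pair.split("=", 1)
        value = value.strip("'\"")
        if key in config and isinstance(config[key], bool):
            config[key] = value in ("True", "true", "1")
        else:
            config[key] = value
    return config

cfg = apply_overrides({"use_gpu": True}, "text='hello world',ref_audio='a.wav'")
```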
Hello, have you tried training the model on a Chinese dataset? How does it perform?
I have encountered some OOVs when building the alignment for the reference audio. It's probably due to the inconsistent lexicon used by the ASR and aligner, and it will trigger the following assertion:
GenerSpeech/data_gen/tts/data_gen_utils.py
Line 298 in 2979dd2
Do you know if there is any workaround for this? Many thanks.
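Until the lexicon mismatch is fixed upstream, one workaround is to find the OOV words before alignment and add pronunciations for them to the MFA dictionary (mfa validate reports them too). A minimal sketch of the check, assuming each dictionary line is 'WORD PHONE PHONE ...':

```python
def find_oov_words(corpus_words, dict_lines):
    """Return corpus words that have no entry in an MFA pronunciation
    dictionary, where each dictionary line is 'WORD PHONE PHONE ...'."""
    known = {line.split()[0].lower() for line in dict_lines if line.strip()}
    return sorted({w.lower() for w in corpus_words} - known)

dict_lines = ["HELLO HH AH0 L OW1", "WORLD W ER1 L D"]
oov = find_oov_words(["hello", "her", "processing"], dict_lines)
```

Pronunciations for the reported words should then come from mfa validate or a G2P model, not be invented by hand.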
The link for downloading the models (Emotion Encoder, HiFi-GAN) is not working.
I tried some audio with your zero-shot English TTS model. The similarity is good, but the output has some background noise, and the first word doesn't come out in full.
text: For the twentieth time that evening the two men shook hands
媒體1.zip
So I want to know how to improve it with adaptation.
Hi, is it possible to synthesize Tom Cruise's voice with just the pretrained model and a single real sample of his voice?
Hey, thank you for the great work! Very innovative.
I installed the exact library versions listed in environment.yaml and ran inference on the audios provided in the demo. However, the result is very different from the results on the demo page.