rongjiehuang / generspeech
PyTorch Implementation of GenerSpeech (NeurIPS'22): a text-to-speech model towards zero-shot style transfer of OOD custom voice.
License: MIT License
@Rongjiehuang The links to the pretrained models are down. I can host them if you need; let me know!
I'm trying to run inference towards style transfer of custom voice in Colab. I got this error:
Traceback (most recent call last):
File "GenerSpeech.py", line 35, in <module>
GenerSpeechInfer.example_run()
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/inference/base_tts_infer.py", line 170, in example_run
out = infer_ins.infer_once(inp)
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/inference/base_tts_infer.py", line 153, in infer_once
inp = self.preprocess_input(inp)
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/inference/base_tts_infer.py", line 82, in preprocess_input
EmotionEncoder.load_model(self.hparams['emotion_encoder_path'])
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/data_gen/tts/emotion/inference.py", line 33, in load_model
checkpoint = torch.load(weights_fpath)
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 771, in load
with _open_file_like(f, 'rb') as opened_file:
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 270, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 251, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home1/huangrongjie/Project/Emotion_encoder/1121_emotion_encoder.pt'
I put the emotion encoder in 'checkpoints/Emotion_encoder.pt', as the README says.
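The traceback shows torch.load receiving the absolute path baked into the author's config ('/home1/huangrongjie/...') rather than the local checkpoint. A minimal sketch of fallback logic that avoids this failure mode (the function name and fallback path are illustrative, not the repo's actual API):

```python
import os

def resolve_checkpoint(configured_path, fallback="checkpoints/Emotion_encoder.pt"):
    """Prefer the configured checkpoint path, but fall back to the local
    copy when the configured one (e.g. an absolute path from another
    machine) does not exist."""
    if configured_path and os.path.exists(configured_path):
        return configured_path
    return fallback

# The stale path from the traceback would resolve to the local checkpoint:
path = resolve_checkpoint("/home1/huangrongjie/Project/Emotion_encoder/1121_emotion_encoder.pt")
```

In practice the same effect is achieved by editing the emotion encoder path in the YAML config, as a later comment in this thread points out.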
Hi,
Thank you for your work. I noticed a mismatch between your implementation and the paper in the postnet. In the paper, the postnet is conditioned on the decoder input and the mel-decoder output, while in the implementation you condition the postnet on the decoder input and other conditions (speaker, emotion and prosody); you don't condition on the output of the transformer decoder. Is there any reason for this mismatch?
Kind regards
I would like to reproduce this on a new dataset. Please provide detailed steps for generating a new dictionary.
Hi, can I ask if you're using v1 of MFA? I tried running the inference, but when it comes to the MFA portion it generates only an empty .txt file, and no TextGrid file at all.
It would be great if you could provide the MFA instructions.
Hello authors,
I have read your paper and find it very promising. I have a few questions about the pretraining stage mentioned in the paper. Specifically, I am wondering how important the finetuning of wav2vec 2.0 is for the speaker representation.
I also have a question about the VQ codebook. As I understand it, a 1-way, 128-token codebook is used in this work. As mentioned in the paper, VQ is prone to index collapse. Have you considered using a Gaussian-based VAE or a vanilla autoencoder-like bottleneck instead?
Finally, could you explain the function of the shuffle operation? Does it work similarly to an entire channel dropout, or does it have other advantages over dropout?
Thank you in advance for your time and consideration.
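For reference, the nearest-neighbour lookup at the heart of the VQ bottleneck discussed above (the step where rarely-selected codes lead to index collapse) can be sketched as follows; this is a toy illustration, not the repo's implementation:

```python
def vq_lookup(frame, codebook):
    """Nearest-neighbour vector quantization: map one feature vector to
    the index of the closest codebook entry (squared Euclidean distance).
    A real model would use a (128, D) codebook of learned embeddings."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # toy 3-token codebook
idx = vq_lookup((0.9, 0.1), codebook)  # closest to (1.0, 0.0)
```

Index collapse corresponds to most frames mapping to a handful of indices, leaving the rest of the codebook unused.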
Thanks for the amazing work.
I followed the instructions to generate a Non-Parallel Transfer output, with reference audio from VCTK from the demo page (https://generspeech.github.io/#non-parallel-transfer, ref text: "When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.").
the expected output should be: https://generspeech.github.io/wavs/NonParallelTransfer/VCTK/GenerSpeech/001.wav
but this is what I got: https://drive.google.com/file/d/1yRiW6TRlUcwwbs4MlS33VCmDQmj0pVA2/view?usp=share_link
The command I used to run inference:
PYTHONPATH=. CUDA_VISIBLE_DEVICES=0 python inference/GenerSpeech.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --hparams="text='We also need a small plastic snake and a big toy frog for the kids.',ref_audio='vctk_001.wav'"
Did I miss something or do anything wrong?
Hello, I am trying to set this up on WSL2; not sure if that is the issue or not. However, my problem output is below.
building 'dfftpack' library
<string>:111: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
Running from scipy source directory.
/tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:624: UserWarning:
Atlas (http://math-atlas.sourceforge.net/) libraries not found.
Directories to search for the libraries can be specified in the
numpy/distutils/site.cfg file (section [atlas]) or by setting
the ATLAS environment variable.
self.calc_info()
/tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:716: UserWarning: Specified path /tmp/pip-build-env-ihzusus3/overlay/include/python3.8 is invalid.
return self.get_paths(self.section, key)
/tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:716: UserWarning: Specified path /usr/local/include/python3.8 is invalid.
return self.get_paths(self.section, key)
/tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:716: UserWarning: Specified path /usr/include/python3.8 is invalid.
return self.get_paths(self.section, key)
error: library dfftpack has Fortran sources but no Fortran compiler found
----------------------------------------
ERROR: Failed building wheel for scipy
ERROR: Could not build wheels for scipy which use PEP 517 and cannot be installed directly
failed
CondaEnvException: Pip failed
any help would be appreciated.
Installing MFA and the Python libraries on macOS is giving a lot of trouble.
Would it be possible to use any other tool/library or already aligned files for the wav files in assets/ folder?
Would really appreciate this.
Thanks in advance.
Thank you for sharing this great work. Could you please share the weights? The link has expired.
Thank you!!
When I ran python train_mfa_align.py --config modules/GenerSpeech/config/generspeech.yaml, I got this error:
train_and_align - DEBUG - ERROR (gmm-est-fmllr[5.5]:FindKeyInternal():util/kaldi-table-inl.h:2106) You provided the "cs" option but are not calling with keys in sorted order: 101287-7565-101287-000007-000001 < 110-1-000039-000003: rspecifier is ark,s,cs:apply-cmvn --utt2spk=ark:data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/corpus_data/subset_10000/utt2spk.0 scp:data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/corpus_data/subset_10000/cmvn.0.scp scp:data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/corpus_data/subset_10000/feats.0.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/sat1/lda.mat ark:- ark:- |
How can I solve this?
Thanks for your response.
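The Kaldi error above means the ",cs" (called-sorted) reader received utterance keys out of byte-wise sorted order: as C strings, "101287-…" sorts before "110-…" because '0' < '1' at the third character. A standalone sketch of that ordering check (not part of the repo):

```python
def first_unsorted_pair(keys):
    """Return the first adjacent pair of keys that violates byte-wise
    (C-locale) sorted order, or None if the keys are sorted. Kaldi's
    ',cs' rspecifier option requires this ordering."""
    for a, b in zip(keys, keys[1:]):
        if a > b:
            return (a, b)
    return None

# The pair from the error message arrives in the wrong order:
bad = first_unsorted_pair(["110-1-000039-000003",
                           "101287-7565-101287-000007-000001"])
```

One common workaround is renaming utterances so string order matches the order they are fed in, e.g. zero-padding numeric IDs to a fixed width.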
@Rongjiehuang Could you please specify the versions of all the libraries and tools used, so that we can run everything correctly and reproduce the results?
Thanks in advance.
FileNotFoundError: [Errno 2] No such file or directory: '/home/junjun.guo/.cache/huggingface/hub/models--facebook--wav2vec2-base-960h/refs/main'
Thank you for your work. The pre-trained model you provided has expired; can you update it again? Thank you very much!
@Rongjiehuang I got this repo to work, but I had to correct some things. Hope it helps someone else.
- Run sudo apt install gfortran libopenblas-base. These packages are required but not specified.
- Edit environment.yaml to remove the duplicate scipy and numpy entries, and remove the version requirements on scipy and numba (old versions cause conflicts with numpy).
- Run pip uninstall nvidia_cublas_cu11 (or whatever version you have).
- In modules/GenerSpeech/config/generspeech.yaml, change emotion_encoder_path to checkpoints/Emotion_encoder.pt.
- Fix sys.path, either by moving GenerSpeech.py to the GenerSpeech dir or by adding these lines at the top of GenerSpeech.py (otherwise Python can't find the imports):
  import sys, os
  sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
- Run mfa thirdparty download.
- In utils/hparams.py, lines 29 and 32 should remove help='location of the data corpus' because it's misleading. Line 41 needs to include remove=False.
- data_gen_utils line 299 trips if there is a word missing from mfa_dict.txt, because the TextGrid will skip the phones of the missing word. Actually, some common words are not in the dictionary, like "her" (HH_ER1) and "processing" (P_R_AA1_S_EH0_S_IH0_NG). You have to add them to the dictionary yourself. The correct way is to run mfa validate and append to mfa_dict.txt first (see this script).
- Use praatio as the standard TextGrid parser.

What is the dataset satisfying the requirement?
Traceback (most recent call last):
File "inference/GenerSpeech.py", line 2, in <module>
from inference.base_tts_infer import BaseTTSInfer
ModuleNotFoundError: No module named 'inference'
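This error appears when the script is launched as python inference/GenerSpeech.py without the repo root on sys.path. Running with PYTHONPATH=. from the repo root (as in the inference command quoted earlier in this thread) avoids it; alternatively, a small helper at the top of the script does the same. This is a sketch; add_repo_root is an illustrative name, not a repo function:

```python
import os
import sys

def add_repo_root(script_path):
    """Insert the parent of the script's directory at the front of
    sys.path so sibling packages such as 'inference' and 'modules'
    become importable."""
    root = os.path.abspath(os.path.join(os.path.dirname(script_path), os.pardir))
    if root not in sys.path:
        sys.path.insert(0, root)
    return root

# At the top of inference/GenerSpeech.py, before any repo imports:
# add_repo_root(__file__)
```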
/content/GenerSpeech
Traceback (most recent call last):
File "inference/GenerSpeech.py", line 35, in <module>
GenerSpeechInfer.example_run()
File "/content/GenerSpeech/inference/base_tts_infer.py", line 164, in example_run
set_hparams()
File "/content/GenerSpeech/inference/utils/hparams.py", line 94, in set_hparams
if v in ['True', 'False'] or type(config_node[k]) in [bool, list, dict]:
KeyError: 'text'
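The KeyError: 'text' comes from set_hparams indexing config_node[k] for an override key that is not in the loaded config (here, text from --hparams). A tolerant version of that override loop might look like this; it is a simplified sketch, and the repo's real set_hparams handles quoting and nesting differently:

```python
def apply_overrides(config, overrides):
    """Apply comma-separated 'key=value' overrides to a config dict,
    accepting keys absent from the config instead of raising KeyError.
    Assumes values contain no commas or '=' characters."""
    for pair in overrides.split(","):
        key, value = pair.split("=", 1)
        value = value.strip("'\"")
        if key in config and isinstance(config[key], bool):
            config[key] = value in ("True", "true", "1")
        else:
            config[key] = value
    return config

cfg = apply_overrides({"use_gpu": True}, "text='hello world',ref_audio='a.wav'")
```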
Hello, have you tried training the model on a Chinese dataset? How does it perform?
I have encountered some OOVs when building the alignment for the reference audio. It's probably due to the inconsistent lexicon used by the ASR and aligner, and it will trigger the following assertion:
GenerSpeech/data_gen/tts/data_gen_utils.py
Line 298 in 2979dd2
Do you know if there is any workaround for this? Many thanks.
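Until the lexicon mismatch is fixed upstream, one workaround is to find the OOV words before alignment and add pronunciations for them to the MFA dictionary (mfa validate reports them too). A minimal sketch of the check, assuming each dictionary line is 'WORD PHONE PHONE ...':

```python
def find_oov_words(corpus_words, dict_lines):
    """Return corpus words that have no entry in an MFA pronunciation
    dictionary, where each dictionary line is 'WORD PHONE PHONE ...'."""
    known = {line.split()[0].lower() for line in dict_lines if line.strip()}
    return sorted({w.lower() for w in corpus_words} - known)

dict_lines = ["HELLO HH AH0 L OW1", "WORLD W ER1 L D"]
oov = find_oov_words(["hello", "her", "processing"], dict_lines)
```

Pronunciations for the reported words should then come from mfa validate or a G2P model, not be invented by hand.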
The link for downloading the models (Emotion Encoder, HiFi-GAN) is not working.
I tried some audio with your zero-shot English TTS model. The similarity is good, but the output has some background noise, and the first word doesn't come out in full.
text: For the twentieth time that evening the two men shook hands
媒體1.zip
So I want to know how to improve it with adaptation.
Hi, is it possible to synthesize Tom Cruise's voice with just the pretrained model and a single real sample of his voice?
Hey, thank you for the great work! Very innovative.
I installed the exact library versions listed in environment.yaml and ran inference on the audios provided in the demo. However, the result is very different from the results on the demo page.