
generspeech's Introduction

Hi there 👋

I am Rongjie Huang (黄融杰). I completed my graduate study at the College of Computer Science and Software, Zhejiang University, supervised by Prof. Zhou Zhao. I also obtained my Bachelor's degree at Zhejiang University. During my graduate study, I was lucky to collaborate with the CMU Speech Team led by Prof. Shinji Watanabe and the Audio Research Team at Zhejiang University. I was grateful to intern or collaborate at TikTok, Shanghai AI Lab (OpenGV Lab), Tencent Seattle Lab, and Alibaba DAMO Academy, with Yi Ren, Jinglin Liu, Chunlei Zhang, and Dong Yu.

My research interests include Multi-Modal Generative AI, Multi-Modal Language Processing, and AI4Science. I have published first-author papers at top international AI conferences such as NeurIPS, ICLR, ICML, ACL, and IJCAI.

I am actively looking for academic collaborations; feel free to drop me an email.

📎 Homepages

💻 Selected Research Papers

Generative AI for Speech, Singing, and Audio: Spoken Large Language Models, Text-to-Audio Synthesis, Text-to-Speech Synthesis, Singing Voice Synthesis

Audio-Visual Language Processing: Audio-Visual Speech-to-Speech Translation, Self-Supervised Learning

My full paper list is available on my personal homepage.

Spoken Large Language Model

Text-to-Speech Synthesis

Text-to-Audio Synthesis

Audio-Visual Language Processing

Singing Voice Synthesis

generspeech's People

Contributors

rongjiehuang


generspeech's Issues

MFA inference problem

Hi, may I ask whether you are using v1 of MFA? I tried running inference, but at the MFA step it generates only an empty .txt file and no TextGrid file at all.

It would be great if you could provide MFA instructions.

Mismatch between paper and the implementation

Hi,
Thank you for your work. I noticed a mismatch between your implementation and the paper in the postnet. In the paper, the postnet is conditioned on the decoder input and the mel-decoder output, while in the implementation the postnet is conditioned on the decoder input and other conditions (speaker, emotion, and prosody); the output of the transformer decoder is not used. Is there any reason for this mismatch?

Kind regards

Could not reproduce the result

Thanks for the amazing work.
I followed the instructions to try generating Non-Parallel Transfer output, using reference audio from VCTK on the demo page (https://generspeech.github.io/#non-parallel-transfer, ref text: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.).
the expected output should be: https://generspeech.github.io/wavs/NonParallelTransfer/VCTK/GenerSpeech/001.wav
but this is what I got: https://drive.google.com/file/d/1yRiW6TRlUcwwbs4MlS33VCmDQmj0pVA2/view?usp=share_link

The command I used to run inference:

PYTHONPATH=. CUDA_VISIBLE_DEVICES=0 python inference/GenerSpeech.py --config modules/GenerSpeech/config/generspeech.yaml  --exp_name GenerSpeech --hparams="text='We also need a small plastic snake and a big toy frog for the kids.',ref_audio='vctk_001.wav'"

Did I miss something or do something wrong?

Some questions after reading the paper

Hello authors,

I have read your paper and find it to be very promising. I have a few questions about the pretraining stage mentioned in the paper. Specifically, I am wondering how important the finetuning of wav2vec2 for the speaker representation $\mathcal{G}_s$ and the emotion representation $\mathcal{G}_e$ is. Is it possible to directly use the features generated by wav2vec2 for training, or would this significantly harm the style similarity? If we had a model without the pretraining stage, how would the CSMOS compare to the results shown in Table 3?

I also have a question about the VQ codebook. As I understand, a 1-way 128 tokens codebook is used in this work. As mentioned in the paper, VQ is prone to index collapse. Have you considered using a Gaussian-based VAE or a vanilla autoencoder-like bottleneck instead?

Finally, could you explain the function of the shuffle operation? Does it work similarly to an entire channel dropout, or does it have other advantages over dropout?

Thank you in advance for your time and consideration.

How to use the adaptation part of your model?

I tried some audio with your zero-shot English TTS model. The similarity is good, but there is some background noise in the output, and the first word is not synthesized in full.
text : For the twentieth time that evening the two men shook hands
媒體1.zip
So I would like to know how to improve this with the adaptation variant.

where is the dataset

Thanks for helping me!
I ran into difficulty at these steps:

Prepare dataset: Download and put statistical files at data/binary/training_set
Prepare path/to/reference_audio (16k): By default, GenerSpeech uses ASR + MFA to obtain the text-speech alignment from reference.

What dataset satisfies these requirements?

Alternative to MFA installation

Installing MFA and the Python libraries on macOS is giving me a lot of trouble.

Would it be possible to use another tool/library, or already-aligned files for the wav files in the assets/ folder?

Would really appreciate this.

Thanks in advance.

bug of inference

Traceback (most recent call last):
File "inference/GenerSpeech.py", line 2, in <module>
from inference.base_tts_infer import BaseTTSInfer
ModuleNotFoundError: No module named 'inference'

How to solve this error?
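This error means Python cannot see the repository root on its module search path, so the `inference` package cannot be imported. Running from the repo root with `PYTHONPATH=.` (as in the README's inference command) fixes it. As an alternative, a minimal sketch of a helper (hypothetical, not part of the repo) that could be placed at the top of the entry script:

```python
import os
import sys

def add_repo_root(script_path: str) -> str:
    """Prepend the repository root (one level above the script's directory)
    to sys.path so packages like `inference` become importable."""
    repo_root = os.path.abspath(os.path.join(os.path.dirname(script_path), os.pardir))
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
    return repo_root
```

Calling `add_repo_root(__file__)` at the top of inference/GenerSpeech.py would make the script runnable from any working directory.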

When I ran python train_mfa_align.py --config modules/GenerSpeech/config/generspeech.yaml, I got this error:
train_and_align - DEBUG - ERROR (gmm-est-fmllr[5.5]:FindKeyInternal():util/kaldi-table-inl.h:2106) You provided the "cs" option but are not calling with keys in sorted order: 101287-7565-101287-000007-000001 < 110-1-000039-000003: rspecifier is ark,s,cs:apply-cmvn --utt2spk=ark:data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/corpus_data/subset_10000/utt2spk.0 scp:data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/corpus_data/subset_10000/cmvn.0.scp scp:data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/corpus_data/subset_10000/feats.0.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats data/processed/LibriSpeech/train-other-500/mfa_tmp/mfa_inputs/sat1/lda.mat ark:- ark:- |
How can I solve this?
Thanks for your response.

bug

FileNotFoundError: [Errno 2] No such file or directory: '/home/junjun.guo/.cache/huggingface/hub/models--facebook--wav2vec2-base-960h/refs/main'
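The missing refs/main file indicates that the wav2vec2 model was never fully downloaded into the local Hugging Face cache (for example, an interrupted download). A hedged sketch for checking whether the cached snapshot is complete before deleting the model folder and re-downloading; the path layout is inferred from the error message above:

```python
import os

def hub_ref_path(cache_dir: str, model_id: str, revision: str = "main") -> str:
    """Build the path of the cached ref file for a Hugging Face model id,
    e.g. 'facebook/wav2vec2-base-960h' ->
    '<cache>/models--facebook--wav2vec2-base-960h/refs/main'."""
    folder = "models--" + model_id.replace("/", "--")
    return os.path.join(cache_dir, folder, "refs", revision)

def cache_is_complete(cache_dir: str, model_id: str) -> bool:
    # If the ref file is missing, the snapshot was not fully downloaded;
    # removing the model folder and re-running the script will re-fetch it.
    return os.path.isfile(hub_ref_path(cache_dir, model_id))
```

If `cache_is_complete` returns False, deleting the `models--facebook--wav2vec2-base-960h` folder and re-running inference with network access should repopulate the cache.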

Proper installation steps

@Rongjiehuang I got this repo to work, but I had to correct some things. Hope it helps someone else.

  • Before starting the install, you should sudo apt install gfortran libopenblas-base. These are required but not specified.
  • Change environment.yaml to remove the duplicate entries for scipy and numpy, and remove the version requirements on scipy and numba (old versions cause conflicts with numpy).
  • If you already installed CUDA yourself, remove the one installed with environment.yaml with pip uninstall nvidia_cublas_cu11 (or whatever version you have).
  • In modules/GenerSpeech/config/generspeech.yaml, change emotion_encoder_path to checkpoints/Emotion_encoder.pt
  • Add the GenerSpeech repo root to your sys.path, either by moving GenerSpeech.py to the repo root or by adding these lines at the top of GenerSpeech.py (otherwise Python can't find the imports):
import sys, os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
  • Run mfa thirdparty download
  • In utils/hparams.py, lines 29 and 32 should remove help='location of the data corpus' because it's misleading. Line 41 needs to include remove=False.
  • Preprocessing fails at data_gen_utils line 299 if a word is missing from mfa_dict.txt, because the TextGrid will skip the phones of the missing word. Some common words are not in the dictionary, such as "her" (HH_ER1) and "processing" (P_R_AA1_S_EH0_S_IH0_NG); you have to add them to the dictionary yourself. The correct way is to run mfa validate and append the missing words to mfa_dict.txt first (see this script).
  • Also, you may want to use praatio as the standard TextGrid parser.
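The dictionary-append step above can be sketched as follows. This is a minimal illustration, not the script referenced in the list: it assumes the MFA dictionary format of one word per line followed by its space-separated ARPABET phones, and skips words that are already present:

```python
def append_missing_words(dict_path: str, oov_prons: dict) -> None:
    """Append OOV words with their space-separated ARPABET phones to an
    MFA dictionary file, skipping words that are already present.
    Example oov_prons: {"her": "HH ER1", "processing": "P R AA1 S EH0 S IH0 NG"}"""
    with open(dict_path, "r", encoding="utf-8") as f:
        # The first whitespace-separated token on each line is the word.
        known = {line.split()[0] for line in f if line.strip()}
    with open(dict_path, "a", encoding="utf-8") as f:
        for word, phones in oov_prons.items():
            if word not in known:
                f.write(f"{word}\t{phones}\n")
```

The word list to feed in would come from the OOV report produced by mfa validate.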

MFA dictionary

I would like to reproduce this on a new dataset. Please provide detailed steps for generating a new dictionary.

Conda installation issue

Hello, I am trying to set this up on WSL2 (not sure if that is the issue or not); my problem output is below.

building 'dfftpack' library
  <string>:111: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  Running from scipy source directory.
  /tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:624: UserWarning:
      Atlas (http://math-atlas.sourceforge.net/) libraries not found.
      Directories to search for the libraries can be specified in the
      numpy/distutils/site.cfg file (section [atlas]) or by setting
      the ATLAS environment variable.
    self.calc_info()
  /tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:716: UserWarning: Specified path /tmp/pip-build-env-ihzusus3/overlay/include/python3.8 is invalid.
    return self.get_paths(self.section, key)
  /tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:716: UserWarning: Specified path /usr/local/include/python3.8 is invalid.
    return self.get_paths(self.section, key)
  /tmp/pip-build-env-ihzusus3/overlay/lib/python3.8/site-packages/numpy/distutils/system_info.py:716: UserWarning: Specified path /usr/include/python3.8 is invalid.
    return self.get_paths(self.section, key)
  error: library dfftpack has Fortran sources but no Fortran compiler found
  ----------------------------------------
  ERROR: Failed building wheel for scipy
ERROR: Could not build wheels for scipy which use PEP 517 and cannot be installed directly

failed

CondaEnvException: Pip failed

Any help would be appreciated.


Handling of OOVs

I have encountered some OOVs when building the alignment for the reference audio. It's probably due to an inconsistent lexicon between the ASR system and the aligner, and it triggers the following assertion:

assert tg_len == ph_len, (tg_len, ph_len, tg_align, ph_list, tg_fn)

Do you know if there is any workaround for this? Many thanks.

Pre-trained model

Thank you for your work. The pre-trained model link you provided has expired; could you update it again? Thank you very much!

Pretrained models URL not working

Hi, this model pipeline is amazing. However, the URL given for the pretrained models is not working now.

Also, earlier when it worked, there were issues such as mismatched configurations between the YAML files and the checkpoint files.


Thanks a ton!

What exactly is wrong with the key here?

/content/GenerSpeech
Traceback (most recent call last):
File "inference/GenerSpeech.py", line 35, in <module>
GenerSpeechInfer.example_run()
File "/content/GenerSpeech/inference/base_tts_infer.py", line 164, in example_run
set_hparams()
File "/content/GenerSpeech/inference/utils/hparams.py", line 94, in set_hparams
if v in ['True', 'False'] or type(config_node[k]) in [bool, list, dict]:
KeyError: 'text'
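The KeyError means the parsed overrides never contained a 'text' key — often because the shell swallowed the quotes around the --hparams argument, so the key=value pairs were split incorrectly. The repo's real parsing lives in inference/utils/hparams.py; as an illustration of the expected format only, here is a simplified stand-in parser (hypothetical, not the repo's code) that honours single quotes so commas inside a quoted value survive:

```python
def parse_overrides(s: str) -> dict:
    """Split a comma-separated key=value override string into a dict,
    keeping commas that appear inside single-quoted values.
    e.g. "text='Hello, world.',ref_audio='a.wav'" -> {'text': ..., 'ref_audio': ...}"""
    items, buf, quoted = [], "", False
    for ch in s:
        if ch == "'":
            quoted = not quoted
        if ch == "," and not quoted:
            items.append(buf)
            buf = ""
        else:
            buf += ch
    if buf:
        items.append(buf)
    out = {}
    for item in items:
        key, _, value = item.partition("=")
        out[key.strip()] = value.strip().strip("'")
    return out
```

If `parse_overrides` applied to your --hparams string has no 'text' key, check the shell quoting of the command line.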

emotion_encoder_path

I'm trying to run inference for style transfer of a custom voice in Colab. I got this error:

Traceback (most recent call last):
File "GenerSpeech.py", line 35, in <module>
GenerSpeechInfer.example_run()
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/inference/base_tts_infer.py", line 170, in example_run
out = infer_ins.infer_once(inp)
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/inference/base_tts_infer.py", line 153, in infer_once
inp = self.preprocess_input(inp)
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/inference/base_tts_infer.py", line 82, in preprocess_input
EmotionEncoder.load_model(self.hparams['emotion_encoder_path'])
File "/content/drive/MyDrive/TTS/colab/GenerSpeech/GenerSpeech/data_gen/tts/emotion/inference.py", line 33, in load_model
checkpoint = torch.load(weights_fpath)
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 771, in load
with _open_file_like(f, 'rb') as opened_file:
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 270, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 251, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home1/huangrongjie/Project/Emotion_encoder/1121_emotion_encoder.pt'

I put the emotion encoder at 'checkpoints/Emotion_encoder.pt', as you said in the README.
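The traceback shows the config still points at the author's absolute path ('/home1/huangrongjie/...'), so editing emotion_encoder_path in modules/GenerSpeech/config/generspeech.yaml to checkpoints/Emotion_encoder.pt is the usual fix. A defensive sketch of that resolution logic (hypothetical helper, not repo code) that falls back to the local checkpoint when the configured path does not exist:

```python
import os

def resolve_encoder_path(configured: str,
                         fallback: str = "checkpoints/Emotion_encoder.pt") -> str:
    """Prefer the configured checkpoint path; if it is missing (e.g. an
    absolute path baked into the YAML on another machine), fall back to
    the local checkpoint. Raises FileNotFoundError if neither exists."""
    for path in (configured, fallback):
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"Emotion encoder not found: {configured!r} or {fallback!r}")
```

This makes the error explicit about both locations that were tried instead of failing on the stale absolute path alone.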
